# Non-specific filtering of beta-distributed data

- Xinhui Wang
^{1}, - Peter W Laird
^{2}, - Toshinori Hinoue
^{2}, - Susan Groshen
^{1}and - Kimberly D Siegmund
^{1}Email author

**15**:199

**DOI: **10.1186/1471-2105-15-199

© Wang et al.; licensee BioMed Central Ltd. 2014

**Received: **3 November 2013

**Accepted: **12 June 2014

**Published: **19 June 2014

## Abstract

### Background

Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most varying are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation biases the selection of probes to those with mean values near 0.5. We explore the effect this has on clustering, and develop alternate filter methods that utilize a variance stabilizing transformation for Beta distributed data and do not share this bias.

### Results

We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found that for data sets having a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer, a novel filter statistic that utilized a variance-stabilizing transformation for Beta distributed data outperformed the common filter of using standard deviation of the DNA methylation proportion, or its log-transformed M-value, in its ability to detect the cancer subtype in a cluster analysis. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and standard deviation filter tended to favour features in different genome contexts; for the same data set, the novel filter always selected more features from CpG island promoters and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find sample subsets that overlapped for some real data sets.

### Conclusions

We found two different filter statistics that tended to prioritize features with different characteristics, each performed well for identifying clusters of cancer and non-cancer tissue, and identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we would suggest trying both filters on any new data sets, evaluating the overlap of features selected and clusters discovered.

## Background

Non-specific filtering of variables, the selection of a subset of variables based on a characteristic unrelated to the outcome of interest, is often applied for dimension reduction in high-dimensional data sets. The approach can be used for both supervised and unsupervised analysis. For differential expression, non-specific filtering of features prior to hypothesis testing can increase power to detect differentially expressed genes [1]. For cluster analysis, it can improve sensitivity of finding disease clusters [2]. The most common filter methods use a measure of variance to rank variables, hoping to enrich for features that vary due to biological signal [3–5]. In these settings, the variance of the (possibly log-transformed) data is usually independent of the mean. However, when studying DNA methylation measured as a proportion, this may not always be the case, and alternate filter statistics may be preferred.

DNA methylation is an epigenetic mark that varies between different cell types, correlating with DNA packaging within the cell and facilitating cell-type specific function. Today, Illumina’s DNA methylation BeadArrays allow for high-throughput analysis, with their most recent platform measuring hundreds of thousands of targeted loci for large numbers of samples. On Illumina’s platform, DNA methylation is measured by the percentage of total fluorescence due to methylation. This value is bounded between 0 and 1, and can be modelled using a Beta distribution. To perform cluster analysis on such data, Houseman et al. (2008) [5] developed a recursive partitioning mixture model (RPMM) for Beta-distributed data. They applied a non-specific filter prior to cluster analysis, using the standard deviation of the Beta values. However, for Beta-distributed data, the variance is a function of the mean, and a standard deviation filter will bias the selection of most variable features towards those having a mean near the middle of the 0 to 1 scale. This bias could favour selecting features that show cell-type specific methylation and be desirable for clustering subgroups of normal tissue, as CpG methylation at cell-type specific marks can be sensitive to shifts in the distributions of cells driving the associations of CpG methylation with sample characteristic (e.g. disease state, or age) [6–8]. However, at the same time, a bias that favours selecting features with mean DNA methylation levels near 0.5 can be problematic in studies to discover cancer subtypes where aberrant DNA methylation may only be observed in a small fraction of tumors [9, 10]. When the subset of tumors with aberrant profiles is rare, the average DNA methylation level across a set of tumors for sites that are normally unmethylated is closer to the normal state of 0 than 0.5. At the same time, the average level for sites that are normally methylated is closer to 1. When filtering variables based on standard deviation, clusters having only a few samples may not separate distinctly from the rest. To decrease the association between the mean and variance of methylation proportions measured on the Illumina platform, Du et al. (2010) [11] propose using a logit transformation (on the log2 scale). We explore alternate transformations that take the Beta-distribution explicitly into account. In particular, we consider methods making use of the cumulative distribution function, the variance stabilizing transformation for a Beta distribution [12].

We compare different filtering methods in a collection of real data sets generated on either Illumina’s HumanMethylation27 or HumanMethylation450 platform. The variety of examples considered will allow us to evaluate filter methods across different data distributions and structures. We find that the properties of the data set, specifically the fraction of samples in a subtype, or the variation of features within groups, can lead to very different behaviour of some filtering methods.

## Methods

### Illumina HumanMethylation BeadArrays

Illumina’s BeadArray technology analyzes more than 27,000 targeted CpGs on the HumanMethylation27 (HM27) platform, and over 480,000 on the HumanMethylation450 (HM450) [13]. Whereas the HM27 array primarily targets CpGs in promoter regions, the HM450 expands coverage of exons, gene bodies, and 3′UTRs, targeting sites in 99% of RefSeq genes [13]. At each targeted position, the quantity of methylated (M) and unmethylated (U) DNA is measured by fluorescence intensity, and the proportion of methylated target is summarized by the average Beta value = M/(M + U), a value bounded by 0 and 1. Many such targets show skewed distributions near the boundary conditions, motivating the use of a Beta distribution for statistical modelling [5]. As not all targeted sites show variation in proportion of DNA methylation, our goal is to use non-specific filtering to reduce the dimension of variables in our analysis.

We refer to each targeted CpG site on the array as a feature, and evaluate a number of methods for ranking features and filtering for dimension reduction. A number of methods explicitly use parameters from, or variance-stabilizing transformations of, Beta distributions.

### Beta distribution

For a single feature, we model the distribution of DNA methylation across independent samples using a Beta distribution. Let X ~ Beta(α, β), $\mathit{f}\left(\mathit{x}\right)=\frac{1}{\mathit{B}\left(\mathit{\alpha},\mathit{\beta}\right)}{\mathit{x}}^{\mathit{\alpha}-1}{\left(1-\mathit{x}\right)}^{\mathit{\beta}-1}$, where $\mathit{B}\left(\mathit{\alpha},\mathit{\beta}\right)={\displaystyle {\int}_{0}^{1}{\mathit{u}}^{\mathit{\alpha}-1}}{\left(1-\mathit{u}\right)}^{\mathit{\beta}-1}$ and α, β > 0. The mean and variance are given by $\mathit{u}=\frac{\mathit{\alpha}}{\mathit{\alpha}+\mathit{\beta}}$ and ${\mathit{\sigma}}^{2}=\frac{\mathit{\mu}\left(1-\mathit{\mu}\right)}{\left(\mathit{\alpha}+\mathit{\beta}+1\right)}$, respectively, with the variance a function of the mean. A useful reparameterization is *Beta*(*μ*, *ϕ*), with *ϕ* = *α* + *β* a precision parameter independent of the mean.

Transformations of the data can also lead to independence of the mean and variance. Du et al. (2010) [11] proposed the M-value, a (log2) logit transformation of the methylation proportion. However, the true variance stabilizing transformation for the Beta distribution is the cumulative distribution function (CDF), $\mathit{Y}={\mathit{f}}_{\mathit{B}}\left(\mathit{X};\mathit{\alpha},\mathit{\beta}\right)=\frac{1}{\mathit{B}\left(\mathit{\alpha},\mathit{\beta}\right)}{\displaystyle {\int}_{0}^{\mathit{X}}{\mathit{t}}^{\mathit{\alpha}-1}}{\left(1-\mathit{t}\right)}^{\mathit{\beta}-1}\mathit{dt}$[12]. If the original data X follow a Beta distribution, the data after transformation (Y) will follow a Uniform distribution with mean 1/2 and variance 1/12. Any lack of fit of a single Beta distribution would suggest that the data arise from a mixture of Betas. We measure lack of fit using the distance of our transformed data from their expected distribution. Two filters below rank features using the CDF-transformed data, Y = CDF(X).

### Filter methods

**Description of feature filtering methods**

Method | Description | How to calculate the statistic* |
---|---|---|

SD-b | Standard deviation based on beta values | SD = sqrt(1/N ∑(Xi-mean(X)) |

SD-m | Standard deviation based on M values | SD = sqrt(1/N ∑(Xi-mean(X)) |

MAD | Median absolute deviation of beta values | median(|Xi-median(X)|) |

DIP | Measure of unimodality in a sample | The max difference, over all sample points, between the empirical distribution function and the unimodal distribution function that minimizes that maximum difference |

Precision | Inverse precision parameter | 1/phi = 1/(mean(X)(1-mean(X))/SD |

BQ-GOF | Beta Quantile Goodness-of-fit | Sum the absolute differences, over 25 quantile points, between the empirical distribution function and the expected beta distribution function |

TM-GOF | Transformed Moment Goodness-of-fit | The Euclidean distance between the empirical standardized transformed moments and the expected center of the transformed moments (1/2,sqrt(1/12)) |

TQ-GOF | Transformed Quantile Goodness-of-fit | Sum the absolute differences, over 25 quantile points, between the empirical cumulative distribution function and the expected cumulative beta distribution function (uniform distribution function) |

BR | Best rank of 8 single filter methods | The best rank value of 8 single filter methods (the highest rank value) |

AR | Average of the top 2 ranks | The average of the best two rank values of 8 single filter methods |

WAR | Weighted average of the top 4 ranks | The weighted average of the best four rank values of 8 single filter methods (weight = 4:3:2:1) |

where $\overline{\mathit{x}}$ and *s*^{2} are the mean and variance for a given feature. For this method the features with lower precision have higher variation. Filter 6, Beta-quantile Goodness-of-Fit (BQ-GOF), is a comparison of the observed to theoretical quantiles from a Beta distribution. It sums for each feature, the absolute difference between corresponding quantiles from the observed cumulative distribution function and the theoretical one obtained using the estimated parameters $\widehat{\mathit{\alpha}},\widehat{\mathit{\beta}}$.

Filters 7 and 8 measure goodness-of-fit on the CDF-transformed data, Y = CDF(X). Filter 7, Beta-Transformed Moments Goodness-of-Fit (TM-GOF). The CDF-transformed data are ranked using the distance of the mean, $\overline{\mathit{y}}$, and standard deviation, s_{y}, from their expected values. The complete procedure is summarized as follows: 1. For each feature, estimate $\widehat{\mathit{\alpha}},\widehat{\mathit{\beta}}$; 2. Compute Y = CDF(X); 3. Compute, $\overline{\mathit{y}}$ and s_{y} the mean and standard deviation of the transformed data; 4. Calculate *s*_{
y
} and ${\mathit{s}}_{{\mathit{s}}_{\mathit{y}}}$, standard deviation for $\overline{\mathit{y}}$ and s_{y} across all features; 5. Rank features by their standardized Euclidean distance $\sqrt{{\left(\frac{\overline{\mathit{y}}-1/2}{{\mathit{s}}_{\overline{\mathit{y}}}}\right)}^{2}+{\left(\frac{{\mathit{s}}_{\mathit{y}}-\sqrt{1/12}}{{\mathit{s}}_{{\mathit{s}}_{\mathit{y}}}}\right)}^{2}}$. The features containing a mixture of Betas will have larger Euclidean distances compared to features that are from a single Beta distribution. Filter 8, Beta Transformed Quantiles Goodness-of-Fit (TQ-GOF), besides using the mean and SD of the CDF-transformed data as a pair of statistics to measure lack of fit, we can use the quantile differences between the observed CDF of Y and the theoretical CDF. Here, we rank the features by the sum of the absolute difference of the corresponding quantiles. This is similar to the BQ-GOF (Filter 6) except the quantiles, instead of the cumulative quantiles, are compared for the CDF-transformed data.

Filters 9 through 11 are summaries of the ranks from the individual statistics used above. Filter 9, Best Rank (BR) selects as the statistic the top rank across the eight statistics, Filter 10, Average Rank (AR), averages the top 2 ranks, and Filter 11, Weighted Average Rank (WAR), averages the top four ranks using weights 4:3:2:1, respectively.

### Simulation study

We perform a simulation study to evaluate the ability of the eleven non-specific filters to enrich a ranked list of features with those informative of subgroups. The data are simulated from distributions observed in our colon cancer data set (data set #1; see Real data sets below). In this data set, 6 out of 26 subjects (23%) contain a hypermethylation profile known as the CpG island methylator phenotype (CIMP), determined using a separate technology [10].

We simulate DNA methylation data from Beta distributions with parameters (μ_{ij},ϕ_{ij}), where i = 1 or 2, for the CIMP and non-CIMP subsets, and j = 1,…,2000 indicates the feature. A random 10% of features are selected to be informative, with (μ_{1j},ϕ_{1j}) and (μ_{2j},ϕ_{2j}) estimated from the CIMP and non-CIMP subgroups, respectively. For the non-informative features, a single set of parameters (μ_{.j},ϕ_{.j}) are estimated from the non-CIMP subgroup only, and used for both subgroups (i = 1,2). We simulate 200 samples, considering sample size ratios of 1:9, 1:1, and 9:1. As the feature characteristics vary between the two groups (e.g., CIMP cancers shows higher mean and variance of measures on average compared to non-CIMP cancers, Additional file 1: Figure S1A-C), the ratios 1:9 and 9:1 can represent very different scenarios. For 100 replicate data sets, we rank the 2000 features based on the different filter methods. For each data set, the same 1800 distributions are used for the non-informative features, and a new random sample of 200 features is selected for the informative features. For each data set and each filter statistic, we rank the features by the statistic, and compute sensitivity and specificity for identifying the 200 differentially methylated features, for feature lists of all possible lengths. The average sensitivity and specificity is computed over the 100 replicate data sets for each filter, and presented using receiver operating characteristic (ROC) curves.

In addition, we compare for the different filtering methods the sample misclassification rates when performing cluster analysis using a Recursive-Partitioning Mixture Model (RPMM) [5]. RPMM was designed for clustering DNA methylation signatures, and clusters samples using a mixture of Beta distributions in a recursive partitioning routine. For each filter method, cluster analysis of the samples is performed on the top 100, 200, and 400 ranked features (5%, 10%, 20%, respectively), when the true percentage of informative probes was 10% of all features. In the results we will see that all filter methods performed well when the distributions of the informative features mimicked the distributions observed in the real data set. We attributed this to a very strong cluster signal from the subset of informative features. In an attempt to differentiate the performance of the filter methods, we restricted the distributions of the informative features to those from a subset of features showing a smaller effect size. We defined the effect size of a feature by $\widehat{\mathit{\theta}}=ln\frac{{\widehat{\mathit{\mu}}}_{2\mathit{j}}/\left(1-{\widehat{\mathit{\mu}}}_{2\mathit{j}}\right)}{{\widehat{\mathit{\mu}}}_{1\mathit{j}}/\left(1-{\widehat{\mathit{\mu}}}_{1\mathit{j}}\right)}$, and sampled 200 informative features from the subset with $\left|\widehat{\mathit{\theta}}\right|\le 1$, or $\left|\widehat{\mathit{\theta}}\right|\le 0.5$. The 1800 non-informative features remain unchanged from the earlier ROC-curve evaluation. The limit on the effect size of the informative features reduced the cluster signal in the data, and resulted in greater variation in performance among the different filtering approaches.

### Real data sets

**Description of data sets used in application analysis**

Data set | Description | Platform | Source | # of probes after preprocessing | # of samples |
---|---|---|---|---|---|

1 | Colon cancer | HM27 | Local | 19,965 | 20 NONCIMP vs. 6 CIMP |

2 | Glioblastoma | HM27 | TCGA, plate 1,2,3,10 | 20,549 | 74 NONCIMP vs. 12 CIMP |

3 | Glioblastoma | HM450 | TCGA, plate 79,111,130 | 374,601 | 93 NONCIMP vs. 6 CIMP |

4 | Kidney | HM27 | TCGA, plate 64 | 21,624 | 50 KIRC vs. 45 normal |

5 | Kidney | HM450 | TCGA, all KIRC | 374,708 | 283 KIRC vs.160 normal |

6 | Breast | HM27 | TCGA, plate 93 | 21,787 | 37 Infiltrating Ductal Carcinoma vs. 20 normal |

7 | Breast | HM450 | TCGA, plate 109 | 377,853 | 56 Infiltrating Ductal Carcinoma vs. 17 normal |

8 | Normal blood | HM450 | GEO:GSE40279, plate 2 | 383,911 | 84 blood samples |

We perform RPMM cluster analysis on different lists of filtered features to assess the ability of different filtering approaches to identify (1) cancer subtypes, (2) cancer/normal tissue types, or (3) young from old individuals. For each data set we applied the 11 filtering methods, each time selecting the top 1000 features for RPMM clustering. We also considered a hybrid variable selection approach, in which we pool the top 500 features selected from two filters methods above (SD-b and TM-GOF). These filters are chosen because they each perform well for different biological conditions studied (see Results below). For data sets 1–7 the misclassification rate was computed by comparing the top two clusters to the known tissue types. We also evaluated the cluster agreement between the clusters identified by the RPMM routine (typically varying from 2–6 groups) with our two known tissue classes using the adjusted rand index. For data set 8, we evaluated the differences in mean age between the two major cluster groups. To evaluate the general effect of filtering the data, we analyzed the HM27 data sets without any filtering, and the HM450 data sets after selecting a random subset of 1000 features. Feature reduction for the HM450 data was necessary, as the cluster analysis software required a lower dimensional data set in order to run. Also, since the HM450 clustering results varied with random selection of features, we repeated the analysis ten times and report the average misclassification error rate over the ten replicates (data sets 1–7).

For two filter methods that show good performance (SD-b and TM-GOF, filters #1 and #7), we report the frequency of the top selected features by genomic context. On the HM27 array the selected probes are characterized using the UCSC definition of CpG island and the gene-based definitions provided by Illumina: “promoter”, “transcribed region”, “exonic region”, and “intronic region”. For the HM450 array we stratified four gene-based categories (Promoter/Exon/Intron/Intergenic) [16] by their position relative to a CpG island (hg19 UCSC definition).

All analyses are performed using the R programming language 2.15.2 (http://www.r-project.org). Infinium data were processed using the methylumi package in Bioconductor, using a combination of Normal-Exponential background correction, dye bias equalization, and beta-mixture quantile normalization (BMIQ) to remove technical artifacts [17, 18]. With a goal of discovering latent disease subtypes, we removed features occurring on the X and Y chromosomes which would be enriched for sex-related variation, and features with other data quality issues (e.g. contain common SNP within 10 bp of the target CpG that may misrepresent DNA methylation level, or map to multiple regions of genome hg19 and lack target specificity. Common SNPs are defined as having MAF > 0.01 in dbSNP build 135 per the UCSC snp135common track.) After pre-filtering, 22,198 CpG targets remain on the HM27 array, and 384,310 on the HM450. Sample R code for the filtering methods is provided in Additional file 4.

## Results

### Colon cancer data

### Simulation study

#### Enrichment of ranked lists

#### Cluster analysis

### Real data application

*except*TM-GOF and TQ-GOF performed well, including no filtering for the HM27 data and random filtering for the HM450. Interestingly, for the breast tissues (data sets #6 and #7), SD-b and SD-m performed best. None of the summary-based filter methods ever performed better than the best single-filter method. They also performed inconsistently across the different data sets. The hybrid approach that combined an equal number of top SD-b probes with TM-GOF probes showed some merit. Sometimes the hybrid selection scheme behaved as well as the best single filter method (Colon data set #1, Kidney data sets #4, #5), and at other times it resulted in error rates that were intermediate between the other two (Glioblastoma data set #3, Breast cancer data sets #6, #7). Overall, the TM-GOF and TQ-GOF methods consistently performed best for identifying the CIMP subgroup in cancer data, while the SD-b filter performed best at distinguishing cancer from non-cancer tissue.

**Misclassification rate of RPMM cluster analysis to find 2 groups using different variable filtering methods (top 1000 features)**

Data set 1 | Data set 2 | Data set 3 | Data set 4 | Data set 5 | Data set 6 | Data set 7 | |
---|---|---|---|---|---|---|---|

| Colon cancer | Glioblastoma | Glioblastoma | Kidney | Kidney | Breast | Breast |

| HM27 | HM27 | HM450 | HM27 | HM450 | HM27 | HM450 |

| 20 non-CIMP vs. 6 CIMP | 74 non-CIMP vs. 12 CIMP | 93 non-CIMP vs. 6 CIMP | 50 KIRC vs. 45 non-cancer | 283 KIRC vs. 160 non-cancer | 37 Breast cancer vs. 20 non-cancer | 56 Breast cancer vs. 17 non-cancer |

| 0.31 | 0.22 | NA | 0 | NA | 0.12 | NA |

| |||||||

| 0.34 | 0.27 | 0.40 | 0.004 | 0.005 | 0.12 | 0.20 |

| 0.19 | 0.07 | 0.49 | 0 | 0.02 | 0 | 0.12 |

| 0.12 | 0.07 | 0.42 | 0.02 | 0.03 | 0.12 | 0.08 |

| 0.38 | 0.35 | 0.49 | 0 | 0.005 | 0 | 0.14 |

| 0.23 | 0.36 | 0.45 | 0 | 0.005 | 0 | 0.14 |

| 0.08 | 0 | 0.10 | 0.03 | 0.01 | 0.11 | 0.22 |

| 0.19 | 0 | 0.07 | 0 | 0.01 | 0.25 | 0.23 |

| 0.08 | 0.02 | 0.06 | 0.36 | 0.47 | 0.44 | 0.49 |

| 0.08 | 0.03 | 0.06 | 0.35 | 0.47 | 0.44 | 0.48 |

| 0.12 | 0.02 | 0.11 | 0.02 | 0.02 | 0.23 | 0.19 |

| 0.08 | 0.06 | 0.11 | 0.02 | 0.02 | 0.25 | 0.19 |

| 0.12 | 0.07 | 0.45 | 0.02 | 0.01 | 0.11 | 0.10 |

| 0.08 | 0.07 | 0.20 | 0.05 | 0.01 | 0.26 | 0.36 |

This same removal of features with outliers, followed by filtering the top 1000 features and cluster analysis, was performed on data sets #5-#7. Each time we found TM-GOF was able to find disease state clusters at the first division just as well as SD-b (results not shown). However, for the CIMP cancer data sets (data sets #1-3), a pre-filtering of features with outliers (using median+/−3*IQR criteria) resulted in equally bad clustering for all filtering methods (data not shown). These results suggest that broad pre-filtering using this definition removed features informative for a CIMP classification.Next we asked if we could find the same tumor substructure in the kidney data set if we had started with the 50 kidney tumors only. In general, this appeared to be the case. Using TM-GOF as our filter, we identified a distinct subgroup of five tumors, four of which were those previously identified when clustering the larger set of tumor and non-tumor tissue (Figure 5C). A subset of 50 features (5%) is shared by the two analyses, selected in the top 1000 features of both data sets (Figure 5A and C). When using features selected using SD-b, the tumors are separated into subgroups of size 12 and 38 (Figure 5D). The cluster of five tumors identified from the TM-GOF filter is a subset of the cluster of 12 identified using SD-b filter; a subset of 44 of the 1000 features (4.4%) were selected by both filter methods (Figure 5C and D).

**Mean age in two clusters identified by RPMM using different filtering methods on blood samples of a normal population**

Filter method | Sample size group 1 | Sample size group 2 | Mean age group 1 | Mean age group 2 | Difference in mean ages | T test p-value |
---|---|---|---|---|---|---|

| 84 | 0 | 59.54 | 0 | NA | NA |

| 84 | 0 | 59.54 | 0 | NA | NA |

| 52 | 32 | 57.10 | 63.50 | 6.40 | 0.03 |

| 64 | 20 | 57.92 | 64.70 | 6.78 | 0.04 |

| 40 | 44 | 57.73 | 61.18 | 3.46 | 0.25 |

| 62 | 22 | 58.21 | 63.27 | 5.06 | 0.09 |

| 75 | 9 | 59.11 | 63.11 | 4.00 | 0.38 |

| 75 | 9 | 59.11 | 63.11 | 4.00 | 0.38 |

| 57 | 27 | 57.89 | 63.00 | 5.11 | 0.08 |

| 56 | 28 | 57.38 | 63.86 | 6.48 | 0.03 |

| 55 | 29 | 57.04 | 64.28 | 7.24 | 0.01 |

| 56 | 28 | 57.09 | 64.43 | 7.34 | 0.02 |

We report the genomic context of the features selected by TM-GOF and SD-b, our two best performing filter methods (Additional file 7: Table S3a and 3b). For the colon cancer and glioblastoma data sets (#1-3), both filter methods enriched for features in CpG islands, which makes sense for detecting a CpG island methylator phenotype cancer subtype. In the cancer versus normal tissue comparisons, the different filters prioritized different feature subsets. For the HM27 data, SD-b prioritized features from non-CpG island regions while TM-GOF still prioritized features from CpG island regions. Both filters enriched for features from exonic regions, with only SD-b from the Kidney data set (#4) preferentially selecting features in promoters. For the HM450 Kidney cancer vs non-cancer tissue comparison, both filter methods over-represented non-CpG island features, however TM-GOF selected more from exons and SD-b selected more from introns and intergenic regions. The Breast cancer versus non-cancer tissue (data set #7) showed the greatest variation in enrichment categories by the TM-GOF and SD-b filter methods. The better performing SD-b filter selected intergenic features, both inside and outside CpG islands. In general, across all HM450 data sets the TM-GOF filter selected more features from CpG island promoters than the SD-b filter (Additional file 7: Table S3b). At the same time, the SD-b filter selected more features from non-CpG island intergenic regions. Thus the two filters were sensitive to prioritizing features in different regions of the genome. This likely explains their different performance for clustering samples from different biological conditions.

We comment briefly on the effect processing HM450 data has on feature selection. We present results for data processed using a combination of background correction, dye-bias [17] and BMIQ normalization [18]. We performed analyses both with and without BMIQ normalization and saw a huge enrichment of design type 1 features prior to BMIQ normalization. Following BMIQ normalization the distribution of selected design type 1 and type 2 features aligned more closely to the distribution on the array. Interestingly, despite the different probe types being selected after normalization, the distribution of features by genomic context varied little (results not shown). Thus we believe the genomic context of the feature is a stronger predictor of feature selection than the platform feature design type.

## Discussion

We used both simulated and real data to evaluate the performance of a number of variable filtering methods for cluster analysis, when the variables are proportions that are bounded on the 0 to 1 scale. Both the simulated and real data show that TM-GOF and TQ-GOF are the best at identifying a subset of cancers having the CpG island methylator phenotype (CIMP). The new filters that use a Beta variance stabilizing transformation are very sensitive to outlier measurements. This may benefit the search for low frequency cancer subtypes that have extreme values occurring across a large number of features (e.g. CpG island methylation phenotype), but may not translate to an ability to directly cluster cancer versus normal tissues well. For clustering cancer versus normal tissue, an outlier removal step was required before the tissue clusters could be properly recovered. In general, the cancer-based simulation study results were not generalizable to the clustering of normal tissue, or tumor versus non-tumor tissue, suggesting that the filter methods are sensitive to the variation in observed DNA methylation distributions due to the underlying biology.

Overall, SD-b performed very well in the real data examples including normal tissues. One explanation could be that the SD-b filter enriches for features in regions having cell-type specific DNA methylation differences. It tended to find groups of approximately equal size, finding a separation of groups by mean age for the normal blood samples that was statistically significant (difference = 6.4 years, p = 0.03); in the cancer and non-cancer studies it identified clusters based on disease state in the first partitioning of samples. Although at first glance it appeared to perform poorly in detecting the CIMP subtype in the colon cancer data, upon further partitioning of the data the sub-cluster of interest appeared. It is unknown if this is a coincidence from the data selected in this study. Although the SD-b filter did not show a high AUC in the simulation study, it did cluster the samples perfectly using RPMM when we did not set boundaries on the largest effect size, suggesting that the cluster analysis can be strongly influenced by a small number of highly informative features.

We found the SD-b and TM-GOF filters tended to prioritize features in different areas of the genome. For each HM450 data set, TM-GOF selected a higher number of features in CGI promoters compared to SD-b, while SD-b selected a larger number of features in non-CGI intergenic regions. The better performance of TM-GOF for detecting a CIMP subtype in cancer, and SD-b for clustering normal tissues, suggests that features in different regions of the genome are not equally informative for all biological conditions. A recent study reported a novel classification of breast cancer using markers of normal-breast epithelial cell subtypes [19], and might explain our superior classification of cancer versus normal breast using SD-b, the filter performing best in normal tissues.

The SD-m filter was a close competitor to SD-b, but Precision showed an unexplained sensitivity to tissue type. Both Precision and SD-m performed slightly better than SD-b in the analysis of CIMP cancers and nearly as well in clustering cancer and non-cancer kidney and breast. In the analysis of whole blood, SD-m found clusters that correlated with age, but Precision did not. Because of this unexplained sensitivity of Precision to non-cancer tissue we do not recommend its general use.

One filter method not included in our comparison is arcsine-square-root transformation, also a good variance stabilization method suitable for data bounded between 0 and 1. Similar to the logit transformation, the arcsine-square-root transformation can be written as an incomplete beta function having only one parameter [12]. Thus we would expect a filter based on its standard deviation to behave similarly to our filter using the logit transformed data (SD-m). Another statistic that could be used as a filter method is the SD ratio, $\mathrm{SD}\left(\mathrm{X}\right)/\sqrt{\mathrm{mean}\left(\mathrm{X}\right)\left(1-\mathrm{mean}\left(\mathrm{X}\right)\right)}$. For a Beta distributed variable X, this ratio is equivalent to the (inverse) precision method used in this paper (filter #5).

Although the summary-based filtering methods take advantage of using the top ranked filter methods, they are not always more robust than the single-filter methods. This is because sometimes the best rank of a feature can be affected by a single non-informative filtering method. Thus, due to the different (and somewhat complimentary) characteristics of the features enriched by SD-b and TM-GOF methods, we prefer to use both SD-b and TM-GOF methods for any data when our main purpose of cluster analysis is to identify novel subgroups. Our results suggest that SD-b is very robust in enriching for features that identify big subgroups, while TM-GOF and TQ-GOF are very sensitive in enriching features to identify low frequency cancer subtypes that have outlying values occurring across a large number of features (e.g. CpG island methylation phenotype).

We noticed that in the data sets comparing DNA methylation in cancer to non-cancer tissue, the differences in standard deviation (SD) between sample groups are not symmetrically distributed. The majority of features have a much higher SD in cancer samples than in normal samples. However, in data sets with non-CIMP vs. CIMP cancer subtypes, the differences in SD are symmetrically distributed with mean around zero (data not shown). This suggests that SD on beta values may be more informative in data sets with huge SD differences between subgroups than in data sets with balanced SD differences around zero.

One limitation of our simulation design is that for each tissue subgroup our measures are simulated to follow a beta distribution and the best performing filter methods make proper use of this knowledge. In reality, a mixture of betas might yield a more realistic measure from a population of mixed cell types. However, instead of simulating this added complexity which would require additional model assumptions, we chose to look for patterns of behaviour from the analysis of a variety of real data sets that spanned different biological conditions (e.g. tumor only, tumor versus normal, or single normal tissue). This evaluation shows: (1) the top two filter methods, TM-GOF and SD-b, prioritize features from different parts of the genome, (2) TM-GOF is much more susceptible to outlier measures, and (3) that the underlying biology can drive their performance.

One question not addressed in this study is the number of features to carry forward to the cluster analysis. One might plot the filter statistics to see if they show a bimodal distribution, suggesting subgroups of features with different behaviour. In our experience, the statistics are unimodal so we tend to use a number of thresholds for features selection (e.g. top 1000, top 2000, top 5000 features), and evaluate the robustness of our groups across the different feature lists.

## Conclusions

We found two filter statistics, SD-b and TM-GOF, outperform the rest in different data sets with different nature. We would suggest using each one, as cluster analysis is for the purpose of class discovery and the two methods tend to prioritize different sets of features, thus serving as complimentary feature enrichment methods for DNA methylation data.

### Availability of supporting data

The Cancer Genome Atlas data (data sets 2–7) are publicly available from the TCGA data portal (https://tcga-data.nci.nih.gov/tcga/). The blood samples (data set 8) are the subset of samples from plate 2 of GEO data set GSE40279 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40279). The exact samples included in our analysis are provided in Additional file 3.

## Declarations

### Acknowledgements

This work was supported by NCI grant numbers R01 CA097346, 5P30 CA014089, NHGRI grant number R01 HG006705, and NIEHS grant number P30 ES07048. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.

## Authors’ Affiliations

## References

- Bourgon R, Gentleman R, Huber W: Independent filtering increases detection power for high-throughput experiments. Proc Natl Acad Sci U S A. 2010, 107 (21): 9546-9551.View ArticlePubMed CentralPubMedGoogle Scholar
- Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics. Bioinformatics. 2007, 23 (19): 2507-2517.View ArticlePubMedGoogle Scholar
- Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, Johnson DS, Trivett MK, Etemadmoghadam D, Locandro B, Traficante N, Fereday S, Hung JA, Chiew YE, Haviv I, Gertig D, DeFazio A, Bowtell DD: Novel molecular subtypes of serous and endometrioid ovarian cancer linked to. Clin Cancer Res. 2008, 14 (16): 5198-5208.View ArticlePubMedGoogle Scholar
- Kim EY, Kim SY, Ashlock D, Nam D: MULTI-K: accurate classification of microarray subtypes using ensemble k-means. BMC Bioinformatics. 2009, 10: 260-View ArticlePubMed CentralPubMedGoogle Scholar
- Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, Nelson HH, Wiemels J, Zheng S, Wiencke JK, Kelsey KT: Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions. BMC Bioinformatics. 2008, 9: 365-View ArticlePubMed CentralPubMedGoogle Scholar
- Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, Wiencke JK, Kelsey KT: DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012, 13: 86-View ArticlePubMed CentralPubMedGoogle Scholar
- Koestler DC, Christensen B, Karagas MR, Marsit CJ, Langevin SM, Kelsey KT, Wiencke JK, Houseman EA: Blood-based profiles of DNA methylation predict the underlying distribution of cell types: a validation analysis. Epigenetics. 2013, 8 (8): 816-826.View ArticlePubMed CentralPubMedGoogle Scholar
- Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahlen SE, Greco D, Soderhall C, Scheynius A, Kere J: Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PloS one. 2012, 7 (7): e41361-View ArticlePubMed CentralPubMedGoogle Scholar
- Noushmehr H, Weisenberger DJ, Diefes K, Phillips HS, Pujara K, Berman BP, Pan F, Pelloski CE, Sulman EP, Bhat KP, Verhaak RGW, Hoadley KA, Hayes DN, Perou CM, Schmidt HK, Ding L, Wilson RK, Van Den Berg D, Shen H, Bengtsson H, Neuvial P, Cope LM, Buckley J, Herman JG, Baylin SB, Laird PW, Aldape K: Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer Cell. 2010, 17: 510-522.View ArticlePubMed CentralPubMedGoogle Scholar
- Weisenberger DJ, Siegmund KD, Campan M, Young J, Long TI, Faasse MA, Kang GH, Widschwendter M, Weener D, Buchanan D, Koh H, Simms L, Barker M, Leggett B, Levine J, Kim M, French AJ, Thibodeau SN, Jass J, Haile R, Laird PW: CpG island methylator phenotype underlies sporadic microsatellite instability and is tightly associated with BRAF mutation in colorectal cancer. Nat Genet. 2006, 38: 787-793.View ArticlePubMedGoogle Scholar
- Du P, Zhang X, Huang CC, Jafari N, Kibbe WA, Hou L, Lin SM: Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010, 11: 587-View ArticlePubMed CentralPubMedGoogle Scholar
- Rocke DM: On the beta transformation family. Technometrics. 1993, 35: 72-81.View ArticleGoogle Scholar
- Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, Le JM, Delano D, Zhang L, Schroth GP, Gunderson KL, Fan JB, Shen R: High density DNA methylation array with single CpG site resolution. Genomics. 2011, 98 (4): 288-295.View ArticlePubMedGoogle Scholar
- Hartigan J, Hartigan P: The dip test of unimodality. Ann Stat. 1985, 13: 70-84.View ArticleGoogle Scholar
- Bell D, Berchuck A, Birrer M, Chien J, Cramer D, Dao F, Dhir R, DiSaia P, Gabra H, Glenn P, Godwin A, Gross J, Hartmann L, Huang M, Huntsman D, Iacocca M, Imielinski M, Kalloger S, Karlan B, Levine D, Mills G, Morrison C, Mutch D, Olvera N, Orsulic S, Park K, Petrelli N, Rabeno B, Rader J, Sikic B, et al: Integrated genomic analyses of ovarian carcinoma. Nature. 2011, 474 (7353): 609-615.View ArticleGoogle Scholar
- Weinstein JN, Akbani R, Broom BM, Wang W, Verhaak RG, McConkey D, Lerner S, Morgan M, Creighton CJ, Smith C, Kwiatkowski DJ, Cherniack AD, Kim J, Sekhar Pedamallu C, Noble MS, Al-Ahmadie HA, Reuter VE, Rosenberg JE, Bajorin DF, Bochner BH, Solit DB, Koppie T, Robinson B, Gordenin DA, Fargo D, Klimczak LJ, Roberts SA, Au J, Laird PW, Hinoue T, et al: Comprehensive molecular characterization of urothelial bladder carcinoma. Nature. 2014, 507 (7492): 315-322.View ArticleGoogle Scholar
- Triche TJ, Weisenberger DJ, Van Den Berg D, Laird PW, Siegmund KD: Low-level processing of illumina infinium DNA methylation BeadArrays. Nucleic Acids Res. 2013, 41 (7): e90-View ArticlePubMed CentralPubMedGoogle Scholar
- Teschendorff AE, Marabita F, Lechner M, Bartlett T, Tegner J, Gomez-Cabrero D, Beck S: A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics. 2013, 29 (2): 189-196.View ArticlePubMed CentralPubMedGoogle Scholar
- Santagata S, Thakkar A, Ergonul A, Wang B, Woo T, Hu R, Harrell JC, McNamara G, Schwede M, Culhane AC, Kindelberger D, Rodig S, Richardson A, Schnitt SJ, Tamimi RM, Ince TA: Taxonomy of breast cancer based on normal cell phenotype predicts outcome. J Clin Invest. 2014, 124 (2): 859-870.View ArticlePubMed CentralPubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.