- Methodology article
- Open Access
Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions
- E Andres Houseman^{1, 2}Email author,
- Brock C Christensen^{2},
- Ru-Fang Yeh^{3},
- Carmen J Marsit^{4},
- Margaret R Karagas^{5},
- Margaret Wrensch^{6},
- Heather H Nelson^{7},
- Joseph Wiemels^{3},
- Shichun Zheng^{6},
- John K Wiencke^{6} and
- Karl T Kelsey^{2, 4}
https://doi.org/10.1186/1471-2105-9-365
© Houseman et al; licensee BioMed Central Ltd. 2008
Received: 23 April 2008
Accepted: 09 September 2008
Published: 09 September 2008
Abstract
Background
Epigenetics is the study of heritable changes in gene function that cannot be explained by changes in DNA sequence. One of the most commonly studied epigenetic alterations is cytosine methylation, which is a well recognized mechanism of epigenetic gene silencing and often occurs at tumor suppressor gene loci in human cancer. Arrays are now being used to study DNA methylation at a large number of loci; for example, the Illumina GoldenGate platform assesses DNA methylation at 1505 loci associated with over 800 cancer-related genes. Model-based cluster analysis is often used to identify DNA methylation subgroups in data, but it is unclear how to cluster DNA methylation data from arrays in a scalable and reliable manner.
Results
We propose a novel model-based recursive-partitioning algorithm to navigate clusters in a beta mixture model. We present simulations that show that the method is more reliable than competing nonparametric clustering approaches, and is at least as reliable as conventional mixture model methods. We also show that our proposed method is more computationally efficient than conventional mixture model approaches. We demonstrate our method on the normal tissue samples and show that the clusters are associated with tissue type as well as age.
Conclusion
Our proposed recursively-partitioned mixture model is an effective and computationally efficient method for clustering DNA methylation data.
Keywords
- Mixture Model
- Bayesian Information Criterion
- Beta Distribution
- Illumina GoldenGate
- Random Forest Analysis
Background
Epigenetics is the study of heritable changes in gene function that cannot be explained by changes in DNA sequence [1]. One of the most commonly studied epigenetic alterations is cytosine methylation, which occurs in the context of a CpG dinucleotide. Concentrations of CpGs known as CpG islands, when sufficiently methylated, are associated with transcriptional gene silencing tantamount to "one hit" as part of Knudson's two hit hypothesis of tumor suppressor gene inactivation [2]. DNA methylation associated gene silencing is a well recognized epigenetic mechanism that often occurs at tumor suppressor gene loci in human cancer. Hundreds of reports of methylation induced silencing at tumor suppressor genes in virtually all types of human cancer have been published [3, 4].
While there has been a tremendous effort to characterize epigenetic alterations in cancer, surprisingly little work has been done in disease-free tissues. There is a basic need for epigenetic profiling of normal tissues to better understand the contribution of these profiles to tissue specificity, especially in the context of phenotypically important CpGs, where deregulation is associated with human diseases such as cancer. While efforts to characterize the methylation profiles of normal tissues in humans and mice have begun, and certain themes are slowly becoming apparent, relatively few reports have emerged [4–11]. Most CpGs or CpG regions have been found to have a bimodal distribution of methylation profiles, either hypo- or hypermethylated. Another theme is the disproportionate location of cell or tissue type dependent differentially methylated regions in non-CpG island sites [4, 8]. Furthermore, the Human Epigenome Project showed that the human major histocompatability loci have differential methylation across various human tissues, but that differential methylation does not necessarily lead to differential expression [8]. It is therefore critical to first outline the basal-state of phenotypically important epigenetic marks that are known to contribute to cancer in order to have a background for comparison to other normal and diseased tissues. An approach which characterizes the basal epigenetic state is best suited to foster the discovery of epigenetic profiles that are associated with particular disease states, patient characteristics, exposures or other covariates that may contribute to pathogenesis.
Cluster analysis is often used to identify methylation subgroups in data [12, 13] and, in particular, Siegmund et al. (2004) argue that model-based clustering techniques are often superior to nonparametric approaches [13]. Large-scale methylation arrays are now available for studying methylation genome-wide; the GoldenGate methylation platform from the manufacturer Illumina (San Diego, CA) simultaneously measures cytosine methylation at 1505 phenotypically-important loci associated with over 800 cancer-related genes. The result of the array is a sequence of "beta" values, one for each locus, calculated as the average of approximately 30 replicates (approximately 30 beads per site per sample) of the quantity max(M, 0)/(|U| + |M| + Q), where U is the fluorescent signal from an unmethylated allele on a single bead, M is that from a methylated allele, and Q is a constant chosen to ensure that the quantity is well-defined; an absolute value is used in the denominator of the formula to compensate for negative signals due to background subtraction. Note that each beta value is an approximately continuous variable lying between zero and one, where zero represents an unmethylated locus and one represents a methylated locus. As such, the beta value is appropriately modeled with a beta distribution; we further motivate the choice of beta distribution in the Methods section below. A data set consisting of such sequences produces a high-dimensional data-analysis problem which poses challenges for traditional clustering approaches. In addition, analysis of heterogeneous tissue data can lead to a large number of clusters, as we demonstrate below, which presents further challenges for clustering techniques. For example, nonparametric approaches rely on a choice of metric, which may be difficult to justify in the context of high dimensions and numerous clusters. On the other hand, in model-based clustering, multi-modality of the data likelihood may lead to numerical instability or difficulty in determining the best solution [14].
We propose a novel method for model-based clustering of data of the type produced by Illumina GoldenGate arrays. Our method makes use of a beta mixture model [15]. Although one could use BIC (or similar quantities) to select the number of clusters in the data set, we propose a recursive-partitioning algorithm that provides the number of clusters and a reliable solution in a shorter amount of time than sequential attempts with different numbers of assumed clusters. This is similar in spirit to the idea of recursive partitioning used in Hierarchical Ordered Partitioning and Collapsing Hybrid (HOPACH, [16]), in which clusters are recursively partitioned using a nonparametric algorithm such as PAM [17]. Our method is also an unsupervised variant of Hierarchical Mixtures of Experts [18], a fuzzy version of CART [19]. We also propose a method for reducing the number of loci considered in the analysis, and selecting the optimal number using an "augmented" BIC statistic. We also present a simulation study comparing its properties to those of competitor methods. Finally, we demonstrate the methodology on GoldenGate methylation array data obtained from 217 normal tissue samples.
Results
Simulation
Classification error and computation time for various clustering methods applied to simulated data.
Classification Error (%) | ||||||||
---|---|---|---|---|---|---|---|---|
J | HC | DynTree | HOPACH (best) | HOPACH (greedy) | MM(1–6) | RPMM (ICL-BIC) | RPMM (BIC) | |
Case 1 | 25 | 33.2 | 44.7 | 9.9 | 16.4 | 12.6 | 15.5 | 15.4 |
50 | 32.5 | 43.8 | 5.0 | 10.0 | 6.2 | 5.5 | 5.5 | |
500 | 33.9 | 38.4 | 3.5 | 11.3 | 1.5 | 0.1 | 0.1 | |
1000 | 34.0 | 38.5 | 9.2 | 14.4 | 1.1 | 0.1 | 0.1 | |
Case 2 | 5 | 59.4 | 60.5 | 65.1 | 65.8 | 59.4 | 59.4 | 59.4 |
10 | 58.9 | 60.0 | 66.9 | 67.5 | 59.2 | 59.2 | 59.2 | |
25 | 30.0 | 39.6 | 4.1 | 8.1 | 0.0 | 0.0 | 0.0 | |
50 | 29.9 | 39.6 | 3.6 | 6.4 | 0.3 | 0.3 | 0.3 | |
Computation Time (seconds) | ||||||||
J | HC | DynTree | HOPACH (best) | HOPACH (greedy) | MM(1–6) | RPMM (ICL-BIC) | RPMM (BIC) | |
Case 1 | 25 | 0.00 | 0.04 | 4.15 | 1.18 | 36.39 | 13.80 | 13.83 |
50 | 0.01 | 0.05 | 3.29 | 1.09 | 51.14 | 14.23 | 14.23 | |
500 | 0.03 | 0.08 | 2.98 | 1.04 | 436.82 | 90.99 | 91.05 | |
1000 | 0.06 | 0.11 | 3.05 | 1.10 | 848.10 | 176.99 | 176.81 | |
Case 2 | 5 | 0.00 | 0.04 | 2.80 | 1.21 | 29.73 | 5.14 | 6.09 |
10 | 0.00 | 0.04 | 2.01 | 1.13 | 46.48 | 9.69 | 10.05 | |
25 | 0.00 | 0.01 | 3.33 | 1.23 | 34.56 | 8.85 | 8.86 | |
50 | 0.01 | 0.01 | 2.63 | 1.16 | 47.52 | 10.90 | 10.86 |
Number of classes obtained for various clustering methods applied to simulated data
Case 1 (5 true classes) | Case 2 (4 true classes) | |||||||
---|---|---|---|---|---|---|---|---|
Method | J | Median | Mean | SD | J | Median | Mean | SD |
DynTree | 25 | 3 | 2.5 | 0.50 | 25 | 2 | 2.0 | 0.00 |
50 | 3 | 2.5 | 0.50 | 50 | 2 | 2.0 | 0.00 | |
500 | 3 | 2.7 | 0.58 | 500 | 2 | 2.0 | 0.00 | |
1000 | 3 | 2.8 | 0.59 | 1000 | 2 | 2.0 | 0.00 | |
HOPACH (best) | 25 | 40 | 38.0 | 12.10 | 5 | 17 | 18.9 | 9.10 |
50 | 35 | 35.4 | 11.38 | 10 | 14 | 15.0 | 8.27 | |
500 | 23 | 23.0 | 9.52 | 25 | 25 | 24.7 | 9.80 | |
1000 | 23 | 23.1 | 9.47 | 50 | 25 | 25.3 | 7.34 | |
HOPACH (greedy) | 25 | 8 | 13.4 | 14.41 | 5 | 5 | 7.1 | 6.35 |
50 | 6 | 11.9 | 12.66 | 10 | 5 | 7.1 | 7.11 | |
500 | 5 | 6.6 | 5.19 | 25 | 7.5 | 10.8 | 8.52 | |
1000 | 4 | 6.2 | 4.41 | 50 | 8 | 10.1 | 7.85 | |
RPMM | 25 | 8 | 7.7 | 2.00 | 5 | 2 | 2.0 | 0.10 |
50 | 5 | 5.6 | 1.32 | 10 | 2 | 2.4 | 2.28 | |
500 | 5 | 5.0 | 0.22 | 25 | 4 | 4.0 | 0.20 | |
1000 | 5 | 5.0 | 0.00 | 50 | 4 | 4.1 | 0.58 |
In the Methods section we propose a variant of BIC to compare model fit for different numbers J of loci. For Case I, the mixture models always minimized the "augmented BIC" at J = 1000, while for Case II, the mixture models always minimized the augmented BIC at J = 25. For Case I, nearly all 1413 dimensions were at least somewhat informative; it is interesting to note that J was always minimized at its highest value for this case. For Case II, the number of informative dimensions was J = 20, so the minimum J was closest to the true number of informative markers among the J considered in this simulation. In additional simulations that used a finer mesh of J, J was minimized at 20. Similar results were obtained when the classes were less balanced (e.g. Case I with class probabilities respectively 0.15, 0.30, 0.2, 0.25, and 0.1).
Normal Tissue
Cross-classification of sample type with latent classes obtained from proposed method
Class | bladder | blood (ad) | blood (nb) | brain | cervical | H & N | kidney | lung | placenta | pleura | sm intestine | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|
000 | 3 | 2 | 12 | 8 | 3 | 28 | ||||||
0010 | 19 | 5 | 24 | |||||||||
0011 | 20 | 2 | 1 | 23 | ||||||||
0100 | 2 | 2 | 1 | 4 | 2 | 2 | 1 | 14 | ||||
01010 | 1 | 4 | 5 | |||||||||
0101100 | 3 | 3 | ||||||||||
0101101 | 3 | 3 | ||||||||||
010111 | 2 | 2 | ||||||||||
01100 | 1 | 1 | 2 | |||||||||
01101 | 5 | 5 | ||||||||||
0111 | 13 | 13 | ||||||||||
1000 | 3 | 3 | ||||||||||
100100 | 2 | 2 | ||||||||||
100101 | 4 | 4 | ||||||||||
1001100 | 3 | 3 | ||||||||||
1001101 | 4 | 4 | ||||||||||
100111 | 5 | 5 | ||||||||||
101 | 34 | 34 | ||||||||||
1100 | 18 | 18 | ||||||||||
1101 | 12 | 12 | ||||||||||
11100 | 5 | 5 | ||||||||||
11101 | 3 | 3 | ||||||||||
1111 | 1 | 1 | 2 | |||||||||
Total | 5 | 30 | 55 | 12 | 3 | 11 | 6 | 53 | 19 | 18 | 5 | 217 |
Using a permutation test with chi-square statistic, the P value for a hypothesis of no association between class and sample type was less than 0.0001. Thus, our proposed method found clusters relevant to sample type. In addition, a permutation test using a Kruskal-Wallis test statistic produced P < 0.0001 for a hypothesis of no difference in mean age among the classes. Interestingly, when the clustering and subsequent hypothesis test was restricted to blood, the P < 0.0001 for a hypothesis of no difference in mean age among latent classes. Between the two classes found among the adult liquid blood samples, age was significantly different (P < 0.0039), consistent with known associations between age and methylation.
Note that the recursion path contains information about relative similarity. For example, in Figure 1, the classes (0...) seem rather distinct from the classes (1...). There is a band of loci that show high methylation in two classes at the top of the figure, indexed as 11101 and 11100, corresponding to brain tissues, but not in the other classes in the left node (1...), mostly corresponding to blood samples; this band also occurs at the bottom of Figure 1, in classes such as 01010, which also include brain tissues. However, as in any hierarchical clustering procedure, the ordering of the children within a node is meaningless, so that classes (11...) and (10...) could have their positions swapped. There is a wider band of loci to the right, which though the distinction may be subtle, appears to drive the distinction between (0...) and (1...). However, because the initial levels of recursion may constitute only a crude splitting of the data, the tree representing the recursion path likely has greater meaning only at lower levels of recursion.
Cross-classification of sample type with clusters obtained from HOPACH
Class | bladder | blood (ad) | blood (nb) | brain | cervical | H & N | kidney | lung | placenta | pleura | sm intestine | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 30 | 1 | 31 | |||||||||
2 | 55 | 55 | ||||||||||
3 | 10 | 10 | ||||||||||
4 | 2 | 1 | 10 | 13 | ||||||||
5 | 5 | 2 | 6 | 53 | 18 | 5 | 89 | |||||
6 | 1 | 1 | ||||||||||
7 | 16 | 16 | ||||||||||
8 | 1 | 1 | ||||||||||
9 | 1 | 1 | ||||||||||
Total | 5 | 30 | 55 | 12 | 3 | 11 | 6 | 53 | 19 | 18 | 5 | 217 |
Cross-classification of latent classes obtained from proposed method with clusters obtained from HOPACH
Class | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Total |
---|---|---|---|---|---|---|---|---|---|---|
000 | 28 | 28 | ||||||||
0010 | 24 | 24 | ||||||||
0011 | 23 | 23 | ||||||||
0100 | 2 | 12 | 14 | |||||||
01010 | 5 | 5 | ||||||||
0101100 | 3 | 3 | ||||||||
0101101 | 3 | 3 | ||||||||
010111 | 1 | 1 | 2 | |||||||
01100 | 1 | 1 | 2 | |||||||
01101 | 4 | 1 | 5 | |||||||
0111 | 12 | 1 | 13 | |||||||
1000 | 3 | 3 | ||||||||
100100 | 2 | 2 | ||||||||
100101 | 4 | 4 | ||||||||
1001100 | 3 | 3 | ||||||||
1001101 | 4 | 4 | ||||||||
100111 | 5 | 5 | ||||||||
101 | 34 | 34 | ||||||||
1100 | 18 | 18 | ||||||||
1101 | 12 | 12 | ||||||||
11100 | 5 | 5 | ||||||||
11101 | 3 | 3 | ||||||||
1111 | 1 | 1 | 2 | |||||||
Total | 31 | 55 | 10 | 13 | 89 | 1 | 16 | 1 | 1 | 217 |
Cross-classification of sample type with latent classes obtained from proposed method among subjects within the 5^{th} class obtained by HOPACH
Class | bladder | cervical | kidney | lung | pleura | sm intestine | Total |
---|---|---|---|---|---|---|---|
000 | 3 | 2 | 12 | 8 | 3 | 28 | |
0010 | 19 | 5 | 24 | ||||
0011 | 20 | 2 | 1 | 23 | |||
0100 | 2 | 1 | 4 | 2 | 2 | 1 | 12 |
010111 | 1 | 1 | |||||
01100 | 1 | 1 | |||||
Total | 5 | 2 | 6 | 53 | 18 | 5 | 89 |
Discussion
Our proposed method is a model-based version of the HOPACH algorithm [16]: we recursively use a beta-mixture model [15] to propose a split of an existing cluster, preserving the split only when it is judged on the basis of BIC to better fit the data. We have found that the integrated-classification-likelihood BIC (ICL-BIC) [15] and BIC produce practically identical results; this result is not unexpected, since the two differ only in a term involving entropy and a perfect classification results in zero entropy. A high-dimensional data set will tend to result in approximately perfect classification of most subjects, even though each subject has the opportunity to be portioned out over multiple clusters (a distinction between our method and nonparametric methods such as HOPACH). Consequently, in the deeper levels of recursion, the algorithm will not depend on the distinction between BIC and ICL-BIC. For the earlier recursion stages, we assume that there is a strong preference for splitting, which overcomes the entropy penalty in the ICL-BIC.
Siegmund et al. (2004) argue that model-based clustering is preferred in this context over hierarchical clustering [13], a finding that bears out in our simulations. One reason for the superior performance, at least in a high-dimensional context, is that the metric used to characterize the differences in nonparametric contexts may be relatively insensitive to differences in particular dimensions. This may play a role in the apparent differences in classification of normal tissue between our proposed method and HOPACH.
K-means have been used recently to cluster methylation outcomes [12], though the work of van der Laan and Pollard (2003) seems to suggest that HOPACH may yield results that are superior to K-means. In particular, with K-means it is difficult to know how many classes are inherent in the data without resampling-based methods such as the gap statistic [21], with implications for scalability. Also, the "curse of dimensionality" would tend to degrade the performance of procedures such as K-means when there are a large number of clusters and the observed data is of high dimension. In general, nonparametric methods such as the fanny algorithm [17] rely on tuning parameters that are difficult to optimize without resampling. An additional problem with non-parametric procedures is that they typically consider only the first moment (mean) of the underlying distributions, ignoring the second-moment (variance) which for DNA methylation as measured by the GoldenGate assay, may play a critical role in distinguishing tissues.
We propose a dimension-reduction strategy which simply ranks candidate dimensions on the basis of some criterion such as variance, fits the top J dimensions in a mixture model, and employs an augmented version of BIC to compare model fit across different values of J. This is a departure from the penalized-likelihood methods of the kind described in [22], which would become computationally difficult for truly high-dimensional data. Our approach is similar in spirit to supervised principal components methods such as [23]. Interestingly, for the normal tissue data, all 1413 loci were found to be informative. The implication is that methylation at even the least variable locus, COL6A1_P283_F (a locus on chromosome 21 within a gene that encodes the extracellular matrix component collagen VI), contains information about tissue type. In fact, box-plots showing the distribution of COL6A1_P283_F methylation demonstrate great heterogeneity in apparent distribution by tissue type [see Additional file 1], even though all methylation average beta values were less than 0.05. The Kruskal-Wallis P value for testing the association of COL6A1_P283_F methylation with tissue type was P < 0.0001. While this difference could be a subtle artifact of plate-to-plate variability and heterogeneity in plate sample composition, it also may be that the average beta measured by the GoldenGate assay is in fact an average of methylation status over different cell types: that is, tissue samples may consist of cells having heterogeneous methylation states, with each tissue type having a unique mixture. A separate paper describing analysis of normal tissue in detail is under preparation.
We remark that we did not normalize the different plates used in laboratory analysis, for reasons described below. However, when we used the analytical methods of this paper to classify methylation profiles within plate strata and within tissue type strata, we found much greater variability between tissue types than between plates (results not shown). Our results are consistent with expectations developed from an understanding of the current literature on DNA methylation in normal tissues, though we cannot entirely rule out subtle biases introduced by variation between plates. It may be possible to incorporate plate effects in our proposed methodology by modeling the mean response (e.g., see [22]), but the computational considerations are beyond the scope of the present paper.
Conclusion
In summary, our method appears to have good properties both with respect to classification error and computation time. It achieves these properties by combining the strengths of model-based and hierarchical methods, navigating the underlying clusters quickly through recursive partitioning, but doing so in a way that makes use of a reasonable probability model. This model is also used to compare different dimensions J of input, thus refining the discriminative ability in a scalable manner.
Methods
Normal Tissue Data
These data are part of a larger project comparing normal tissue to tumors for many different types of tumor (e.g. bladder, brain, leukemia, lung, mesothelioma, etc.); each of these comparisons constitutes a distinct set of analyses deserving of a separate, focused analysis, so we omit the details from the present manuscript. In addition, a more comprehensive analysis of normal tissue will appear in a separate manuscript targeted to biologists. Within logistical constraints (e.g. availability of normal tissue, which is difficult to obtain for some tissue types, and time and budget constraints) we randomized tissue type to 14 plates (96 wells each). However, because of the great number of different tissue types, the difficulty in collecting normal tissue of specific types (recall that methylation profiles differ profoundly by tissue type), and heterogeneity of both normal and tumor samples, we expect the plate-to-plate heterogeneity to be dominated the heterogeneity of sample type on each plate. When we used the methods of this paper to classify methylation profiles, stratifying first by plate and examining associations with sample type, then stratifying by distinct tissue-type and examining associations with plate, we found that there are unequivocal differences between sample type on a single plate, and generally no differences between plate within a given sample type, with the possible exception of placenta, whose methylation values have greater variability and therefore may have more sensitivity to random heterogeneity with small sample size. On the other hand, attempts to normalize plates using standard techniques may result in overcorrection, since plate-to-plate variability is mostly driven by sample heterogeneity, and corrections will tend to average out the distinctions of interest.
Note that in addition to providing average "beta" values, the GoldenGate platform also provides averages $\overline{M}$ and $\overline{U}$ (average fluorescence for methylated and unmethylated alleles). For our data, the quantities $\overline{M}/(\overline{U}+\overline{M})$ were close to the reported average "beta" values, with the standard deviation of the difference between the two less than 0.01, a fact which has implications below.
Recursive-partitioning for a Beta Mixture Model
with respect to the set of all parameters (α, β, η) to be estimated:
and selecting the value of K corresponding to the minimum BIC. In the context of beta mixture models, a modified version of BIC is available [15]: the ICL-BIC is defined as the sum of (4) and the entropy term $-2{\displaystyle {\sum}_{i=1}^{n}{\displaystyle {\sum}_{k=1}^{K}{w}_{ik}\mathrm{log}\phantom{\rule{0.1em}{0ex}}({w}_{ik})}}$ (where 0 log (0) = 0). The entire operation has approximate complexity $nJ{K}_{\mathrm{max}\phantom{\rule{0.1em}{0ex}}}^{2}$, where K_{max} is the maximum number of classes attempted. The square term arises under the assumption that for a single model with K classes, the complexity will be of order nJK.
Because likelihoods for model-based clustering algorithms can be multi-modal [14, 26], commercial mixture model software packages often use multiple starting values for fitting the model, and subsequently choose the estimates corresponding to the maximum likelihood. However, careful choice of starting values can often minimize the effort [22, 26]. One option is to use hierarchical clustering to find K clusters (cutting the clustering dendrogram at the appropriate height), and constructing a weight matrix W corresponding to these clusters. Another, similar, option is to use a fuzzy clustering algorithm such as the fanny algorithm [17] available in the R package cluster.
When ω_{ i }≡ 1 for all i, (2) and (5) are equivalent. When 0 <ω_{ i }< 1, subject i only partially contributes to estimation, and when ω_{ i }= 0, subject i is excluded entirely from consideration in the model. The EM algorithm described above is easily modified by multiplying each w_{ ik }by ω_{ i }in (3), where the interpretation is that the classes under consideration are only a partial set, and that subject i belongs to one of these classes only with probability ω_{ i }. If we begin by fitting a 2-class model to the entire data set, the result is two sets of posterior weights representing the posterior probabilities of membership in each of the two classes. Under the assumption that each of these classes can be further split, and that each subject belongs to the subsequent splits only with probability equal to the weight assigned to the un-split class, we apply the weighted-likelihood EM algorithm to obtain the two classes corresponding to the new split.
where the first set of parameters, defining wtdBIC_{2}, are obtained from the two-class mixture model and the second set of parameters, defining wtdBIC_{1}, are obtained from a one-class model. Alternatively, one may add an entropy term to wtdBIC_{2}(r) to produce the ICL version of this BIC value; note that for a one-class model, entropy is always zero. If wtdBIC_{2}(r) is greater than wtdBIC_{1}(r), we terminate the recursion at node r. The worst-case complexity of this algorithm is n log(n)J. However, at deeper levels of recursion, two-class models will tend to fit poorly relative to single-class models, and most nodes will terminate before descending to the deepest levels. We have demonstrated empirically that the proposed method tends to terminate with the number of leaf classes equal to the true number of classes, so that the average complexity is typically of approximate order nJK log(K). Furthermore, in the deeper classes, subjects whose weights are negligible can be dropped entirely from the weighted EM algorithm, so that the complexity of the node-level fit at deeper levels is much less than n.
Dimension reduction
where BIC_{ J }is the BIC computed for just the J selected loci. The augmented BIC is now comparable across different values of J. As we have demonstrated, augBIC_{ J }leads to sensible dimension reduction. Again, an ICL version of augBIC_{ J }may be used instead.
Simulation
For our second case (Case II), which represented a lower-dimensional setting (L = 200) with greater variation in variance of individual beta distributions, we considered 100 subjects from 4 classes, described as follows. Four sets of 10 "informative" beta parameters were drawn randomly at the beginning of the simulation study: a_{1j}~ Gamma(10,10), b_{1j}~ Gamma(10,10); a_{2j}~ Gamma(400,10), b_{2j}~ Gamma(100,10); a_{3j}~ Gamma(100,10), b_{3j}~ Gamma(400,10); and a_{4j}~ Gamma(100,1), b_{4j}~ Gamma(250,1). These were used to construct four classes of 20 informative dimensions: α_{1} = (a_{2}, a_{1}), α_{2} = (a_{2}, a_{4}), α_{1} = (a_{3}, a_{1}), α_{1} = (a_{3}, a_{4}), where a_{1} = (a_{1j}), and similarly for the β_{ k }parameters with b_{ l }= (b_{ lj }). Each such 20-dimensional parameter was augmented with a set of 180 "noninformative" parameters, constructed as 60 copies of the vector (100,1,50) for α_{ k }and 60 copies of the vector (1,100,50) for β_{ k }. The class probabilities were respectively 0.2,0.3,0.2, and 0.3. Although the pattern corresponding to this collection of parameters may be difficult to visualize from the description, Figure 3(B) shows a typical data set generated under these conditions, and reveals a small set of informative markers, some having distinctions in mean and others in variability. Similar analyses were conducted forth is simulation, except with J ∈ {5, 10, 25, 50}, and 4 classes assumed for hierarchical clustering.
Misclassification rate was assessed for all simulated data sets and analyses. Each estimated class was matched to true class by minimizing the distance between the J means calculated according to (1). When the number of estimated classes was greater than the true number, multiple estimated classes were assigned to a single matching true class, thus generating no misclassification error when the estimated class merely partitioned the true class. When the number of estimated classes was fewer than the true number, subjects within true classes that failed to match to an estimated class were considered misclassified. In the latter case, coarsening of the true classes would lead to the smaller absorbed class being judged as misclassified. In the Results section, we showed that HOPACH tends to overestimate the number of classes for the cases we considered, so our strategy, which favors inappropriate partitioning over inappropriate coarsening, is conservative with respect to comparison with HOPACH in this set of simulations.
We conducted additional simulations, for which we omit detailed results. We used normal distributions having the same mean and standard deviation implied by the beta-based simulation described in the text. For assessing classification error, estimated classes were matched to their true classes by matching the fitted alpha and beta parameters to the corresponding parameterization of mean and standard deviation, i.e. inverting (1), in the manner described above. We conducted another simulation based on a different configuration: 200 items, each generated as a random normal with mean and standard deviations as described below, but subsequently transformed as n^{-1} {r_{ i }- 0.5}, where r_{ i }is the rank of the i th value, to obtain a rank-based transformation of the data. The means were generated as 25 random standard normal variables (once at the start of the simulation study) for each of 4 classes, concatenated with 175 zeroes; and the standard deviations were generated as a Gamma(5, 500) (once at the start of the simulation study) for each of 4 classes, concatenated with 175 repeats of 0.01. Note that this configuration represents a substantial departure from the beta distribution, since the marginal distribution of the data are forced to have a uniform distribution. For assessing classification error, estimated classes were matched to their true classes by matching the fitted alpha and beta parameters to the corresponding parameterization of conditional mean and standard deviation, obtained via simulation.
Validation of normal tissue results by Random Forest
To validate results, we conducted a supervised Random Forest analysis [20] using the R package randomForest with 20,000 trees and a value of $38\approx \sqrt{1413}$ (the default) for the mtry parameter (number of randomly selected variables per bootstrap). We also tried values of 14 and 76 for mtry, obtaining slightly greater classification error. Both tissue type (categorical with 11 levels) and age (continuous) were used as response variables. Detailed discussion of these results will appear in a separate manuscript written for biologists. In addition, we conducted a Random Forest analysis on the classification indices for the 9 HOPACH classes and 23 mixture model classes described in the Results section, with mtry = 9 and mtry = 23, respectively, and 20,000 trees.
Availability
An implementation of the methods described in this paper is available [see Additional file 2]. The R environment is required for the software. Additional supporting functions are available from the authors upon request.
Declarations
Acknowledgements
Supported by: Mesothelioma Applied Research Foundation grant, NIH/NIEHS Training grant in environmental health sciences T32ES007155, NIH/NIEHS grant P42ES05947, and NCI grants CA48061, CA126939, CA078609, CA100679, CA121147, and FAMRI.
Authors’ Affiliations
References
- Russo V, Martienssen RA, Riggs AD: Epigenetic mechanisms of gene regulation. Cold Spring Harbor Laboratory Press; 1996.Google Scholar
- Knudson AG: Chasing the cancer demon. Annu Rev Genet 2000, 34: 1–19. 10.1146/annurev.genet.34.1.1View ArticlePubMedGoogle Scholar
- Jones PA, Baylin SB: The fundamental role of epigenetic events in cancer. Nat Rev Genet 2002, 3: 415–428. 10.1038/nrg962View ArticlePubMedGoogle Scholar
- Sakamoto H, Suzuki M, Abe T, Hosoyama T, Himeno E, Tanaka S, Greally JM, Hattori N, Yagi S, Shiota K: Cell type-specific methylation profiles occurring disproportionately in CpG-less regions that delineate developmental similarity. Genes Cells 2007, 12: 1123–1132. 10.1111/j.1365-2443.2007.01120.xView ArticlePubMedGoogle Scholar
- Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, Burton J, Cox TV, Davies R, Down TA, et al.: DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006, 38: 1378–1385. 10.1038/ng1909PubMed CentralView ArticlePubMedGoogle Scholar
- Fraga MF, Ballestar E, Paz MF, Ropero S, Setien F, Ballestar ML, Heine-Suner D, Cigudosa JC, Urioste M, Benitez J, et al.: Epigenetic differences arise during the lifetime of monozygotic twins. Proc Natl Acad Sci USA 2005, 102: 10604–10609. 10.1073/pnas.0500398102PubMed CentralView ArticlePubMedGoogle Scholar
- Frigola J, Song J, Stirzaker C, Hinshelwood RA, Peinado MA, Clark SJ: Epigenetic remodeling in colorectal cancer results in coordinate gene suppression across an entire chromosome band. Nat Genet 2006, 38: 540–549. 10.1038/ng1781View ArticlePubMedGoogle Scholar
- Rakyan VK, Hildmann T, Novik KL, Lewin J, Tost J, Cox AV, Andrews TD, Howe KL, Otto T, Olek A, et al.: DNA methylation profiling of the human major histocompatibility complex: a pilot study for the human epigenome project. PLoS Biol 2004, 2: e405. 10.1371/journal.pbio.0020405PubMed CentralView ArticlePubMedGoogle Scholar
- Schilling E, Rehli M: Global, comparative analysis of tissue-specific promoter CpG methylation. Genomics 2007, 90: 314–323. 10.1016/j.ygeno.2007.04.011View ArticlePubMedGoogle Scholar
- Shann YJ, Cheng C, Chiao CH, Chen DT, Li PH, Hsu MT: Genome-Wide Mapping and Characterization of Hypomethylated Sites in Human Tissues and Breast Cancer Cell Lines. Genome Res 2008.Google Scholar
- Song F, Smith JF, Kimura MT, Morrow AD, Matsuyama T, Nagase H, Held WA: Association of tissue-specific differentially methylated regions (TDMs) with differential gene expression. Proc Natl Acad Sci USA 2005, 102: 3336–3341. 10.1073/pnas.0408436102PubMed CentralView ArticlePubMedGoogle Scholar
- Shen L, Kondo Y, Guo Y, Zhang J, Zhang L, Ahmed S, Shu J, Chen X, Waterland RA, Issa J-PJ: Genome-wide profiling of DNA methylation reveals a class of normally methylated CpG island promoters. PLOS Genetics 2007, 3: e181. 10.1371/journal.pgen.0030181PubMed CentralView ArticleGoogle Scholar
- Siegmund KD, Laird PW, Laird-Offringa IA: A comparison of cluster analysis methods using DNA methylation data. Bioinformatics 2004, 20: 1896–1904. 10.1093/bioinformatics/bth176View ArticlePubMedGoogle Scholar
- Stephens M: Dealing with label switching in mixture models. Journal of the Royal Statistical Society Series B 2000, 62: 795–809. 10.1111/1467-9868.00265View ArticleGoogle Scholar
- Ji Y, Wu C, Liu P, Wang J, Coombes KR: Applications of beta-mixture models in bioinformatics. Bioinformatics 2005, 21: 2118–2122. 10.1093/bioinformatics/bti318View ArticlePubMedGoogle Scholar
- Laan MJ, Pollard KS: A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. Journal of Statistical Planning and Inference 2003, 117: 275–303. 10.1016/S0378-3758(02)00388-9View ArticleGoogle Scholar
- Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley; 1990.View ArticleGoogle Scholar
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001.View ArticleGoogle Scholar
- Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. Boca Raton, Florida: Chapman & Hall; 1984.Google Scholar
- Breiman L: Random Forests. Machine Learning 2001, 45: 5–32. 10.1023/A:1010933404324View ArticleGoogle Scholar
- Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a dataset via the gap statistic. J Royal Statist Soc B 2001, 63: 411–423. 10.1111/1467-9868.00293View ArticleGoogle Scholar
- Houseman EA, Coull BA, Betensky RA: Feature-specific penalized latent class analysis for genomicdata. Biometrics 2006, 62: 1062–1070. 10.1111/j.1541-0420.2006.00566.xView ArticlePubMedGoogle Scholar
- Bair E, Tibshirani R: Semi-Supervised Methods to Predict Patient Survival from Gene Expression Data. PLoS Biol 2004, 2: 1544–9173. 10.1371/journal.pbio.0020108View ArticleGoogle Scholar
- Lefkopoulou M, Moore D, Ryan L: The analysis of multiple correlated binary outcomes: application to rodent teratology experiments. Journal of the American Statistical Association 1989, 84: 810–815. 10.2307/2289671View ArticleGoogle Scholar
- Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Statist Soc B 1977, 39: 1–38.Google Scholar
- Leroux BG, Puterman ML: Maximum-Penalized-Likelihood Estimation for Independent and Markov-Dependent Mixture Models. Biometrics 1992, 48: 545–558. 10.2307/2532308View ArticlePubMedGoogle Scholar
- Fraley C, Raftery AE: Bayesian regularization for normal mixture estimation and model-based clustering. Department of Statistics, University of Washington; 2005.Google Scholar
- Langfelder P, Zhang B, Horvath S: Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut library for R. Bioinformatics 2008, 24: 719–720. 10.1093/bioinformatics/btm563View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.