A Beta-mixture model for dimensionality reduction, sample classification and analysis
© Laurila et al; licensee BioMed Central Ltd. 2011
Received: 8 October 2010
Accepted: 27 May 2011
Published: 27 May 2011
Patterns of genome-wide methylation vary between tissue types. For example, cancer tissue shows markedly different patterns from those of normal tissue. In this paper we propose a beta-mixture model to describe genome-wide methylation patterns based on probe data from methylation microarrays. The model takes dependencies between neighbour probe pairs into account and assumes three broad categories of methylation, low, medium and high. The model is described by 37 parameters, which reduces the dimensionality of a typical methylation microarray significantly. We used methylation microarray data from 42 colon cancer samples to assess the model.
Based on data from colon cancer samples we show that our model captures genome-wide characteristics of methylation patterns. We estimate the parameters of the model and show that they vary between different tissue types. Further, for each methylation probe the posterior probability of a methylation state (low, medium or high) is calculated and the probability that the state is correctly predicted is assessed. We demonstrate that the model can be applied to classify cancer tissue types accurately and that the model provides accessible and easily interpretable data summaries.
We have developed a beta-mixture model for methylation microarray data. The model substantially reduces the dimensionality of the data. It can be used for further analysis, such as sample classification or to detect changes in methylation status between different samples and tissues.
Interest in understanding the effects of epigenetics in relation to different complex diseases is increasing. One epigenetic mechanism of particular interest is DNA methylation at cytosines in CpG dinucleotides. The methylation patterns of genes may change and these alterations have been shown to be related to complex diseases, such as heart diseases , schizophrenia  and different cancers [3, 4]. In cancer, several methylation changes are detectable at the early stages of cancer or even in pre-cancerous tissues or blood [5, 6]. In addition, other methylation alterations have been shown to be specific to cancer type and stage [7, 8].
reduction of the dimensionality of a methylation profile
for each probe, computation of the posterior probability of a methylation state,
computation of a posterior probability that the latter state was correctly predicted.
We apply the model to a set of methylation microarray measurements from colon cancer samples and show that the model parameters reflect global patterns in the data. Based on the estimated parameters, we are able to classify the samples with high accuracy and to exhibit global differences between the cancer samples. Furthermore, the model assigns a methylation state to each probe value. Using the states, accessible data summaries are provided.
Results and Discussion
The Methylation Array and the Number of CpGs in Probes in Different Genomic Regions
Parameter Estimates and Interpretation
We used the proposed method to analyze methylation microarray data from normal and colon cancer tumor samples. Colon cancer can be divided into different types and here we study patients with microsatellite instable (MSI) and microsatellite stable (MSS) tumors. In addition, we have samples of benign colon adenomas that are not considered as cancer tumors but are classified as MSS-type adenomas. Our data set contains 42 Illumina methylation 27k microarray samples, divided into 6 normal, 6 adenoma, 6 MSI and 24 MSS-samples.
The basic idea behind the model is the assumption that the methylation level of a CpG (probe) can be divided into three different states, low (L), medium (M) and high (H) methylation. The three states are biologically motivated in the following sense. L corresponds to the situation, where (almost) all cells in a sample are unmethylated and H to the situation, where (almost) all cells are methylated, irrespectively of the composition of the cells in the sample. M captures the situation in which the cells are only partly methylated (e.g. hemi-methylation), or some cell types in the sample are methylated while others are not.
We assume that statistical properties of these different states are the same throughout the genome and that methylation of a CpG site depends on its location (I, S or O) in the genome in relation to the nearest CpG-island. We concentrate on modelling genes with two probes; however, our model can also be extended to genes with more than two methylation probes (for the genes with more than two probes the two first probes were used). The beta-values of a gene's probe pairs are dependent with a correlation coefficient of r = 0.668 between the probes. A similar degree of correlation has been reported for other data sets, while the probes measuring methylation levels of different genes have shown no dependency . We built a model that can be considered a hidden Markov model (HMM) of probe pairs within a gene. The hidden states are the methylation levels, L, M and H. Further, the CpG probe pairs are classified by their locations into classes (I,I), (S,S), (O,O) and (I,S). The latter includes both (I,S) and (S,I) pairs. Other cases were omitted as they contained only a few or no probe pairs. We built one model for each of the classes.
We fitted a mixture of three beta-distributions (distributions corresponding to the low, medium and high methylation states, respectively) for each sample in the colon cancer data set, such that the beta-distribution corresponding to the medium methylation state is symmetrical; as described in greater detail in Methods. The beta-distribution gives the density of the beta-value given the hidden state. That is, for each sample we estimate beta-distribution parameters α and β, mixture proportions ω and a transition probability matrix T for the HMM (see Methods for further details). The mixture proportions are the a priori probabilities that a probe is found in a given hidden state and the transition matrix gives the probabilities that a probe in some hidden state k1 = L, M, or H, is followed by a probe in state k2 = L, M, or H. For the class (I,I) we set the mixture coefficient of the high methylation state to 0, i.e. this state could not be reached as high methylation appears to be very rare in this class. The same is assumed for the I probes in the class (I,S). Figure 2 shows the empirical and the fitted mixture distributions for one MSI-sample for different classes. We also built a model with only two states for every class and a model where the high methylation state was included in all classes but these did not reflect the data properties equally well as the model presented.
Regarding the mixture proportions, the group of MSI cancers is most easily discriminated from the other samples (some examples are shown in Figure 3). Indeed, for the (I,I) probe pair class, the MSI mixture proportions for the low methylation state are clearly lower and the mixture proportions of the medium methylation state are higher than in the other classes. Similarly, but less clearly, differences for the shore (S) region probe pairs could be detected, e.g. medium and high methylation state mixture proportions are higher for MSI cancers compared with the other samples (results not shown). Furthermore, many mixture proportions vary in the same range when studying normals, adenomas and MSS cancers. However, normals have slightly higher proportions for low methylation in (I,I) class than the other samples.
The probes in the outside regions show the biggest variations between the groups. Normals have small coefficients for low methylation whereas the high and medium methylation states are almost equally probable. On the contrary, for all tumor samples medium methylation is clearly the most evident and the proportions of the low and high methylation vary greatly. In addition, for some adenoma and MSS cancer samples, low methylation was more probable than high methylation while in MSI cancer samples the low methylation state always had the smallest coefficient. Mixture proportions in the outside regions also distinguished the two clusters of adenomas and MSS cancers well. Overall, MSS cancers and adenomas share similar proportions in all the classes. Class (I,S) mixture proportions do not show as large differences between groups as other classes.
Numbers of probe pairs.
We use the posterior mixture proportions to calculate the false annotation rate, FAR (see Section Data Analysis). FAR measures how often a probe pair is assigned to the wrong state: To each probe pair, we assign the hidden state (k1, k2), with k1, k2 = L, M, or H, with the highest posterior proportion. The probability that this assignment is incorrect is given by FAR (see Section Data Analysis). For the colon data set, FAR = 0.128, implying that about 1 in 8 probe pairs should have a wrong annotation. For example, in Figure 2, the bottom left plot, wrong annotations are likely to occur when the probability of the low and the medium states are similar (e.g. beta-value around 0.1), while confident annotations are made when e.g. the beta-value is around 0.5.
Methylation state changes.
Norm vs Aden
Norm vs MSI
Norm vs MSS
Aden vs MSI
Aden vs MSS
MSS vs MSI
L → M
M → L
M → H
H → M
L → H
H → L
For example, when comparing normals with MSI cancers, over 80% of changes were from low to medium methylation. In comparison, between normals and adenomas and normals and MSS cancers, the proportions were 42% and 37%, respectively.
Number of changes in different regions in group comparisons.
Norm vs Aden
Norm vs MSI
Norm vs MSS
Aden vs MSI
Aden vs MSS
MSS vs MSI
In this paper, we have proposed a model for microarray methylation data. The model uses four different probe pair classes and three different methylation states. It is motivated by the empirical distribution of beta-values and knowledge of the genomic content of CpG dinucleotides. It reduces the dimensionality of a microarray data set to 37, the number of parameters in the model. The model allows us to assign one of three broad classes (low, medium or high) to each methylation probe value and assess the correctness of the assignment.
Further, we illustrate the use of the model by analysing a colon cancer data set. Normal and MSI samples could easily be distinguished from the other samples, but adenomas and MSS cancers were mixed together. However, the hierarchial clustering based on all beta-values (27k probes) also mixed these two groups. This suggests that the methylation patterns in adenomas and MSS cancers are very similar, which is in agreement with previous studies [17, 18]. In addition, we identified differences in the genomic localisation of methylation changes. This observation may be used to discriminate between adenomas and MSS cancers from genome-wide methylation data.
In the future it would be interesting to integrate information from different data sources, such as methylation, gene expression and copy numbers, into one model. It may also be beneficial to take the full step and model at the level of the DNA sequence directly, anticipating the rapidly growing interest in next-generation sequencing.
Data Sets and Preprocessing
where M is the value of the methylated bead type probe and U is the value of the unmethylated bead type probe. Beta-values vary between zero and one. In the analysis, we omit probes with β = 0, these are generally of bad quality.
Here, k = L, M, H can be considered hidden (unknown) states and ω aki , the a priori probability that a probe from sample i and class a is in state k. The ω aki 's are called mixture proportions for sample i and class a, and fulfill ∑ k ω aki = 1. If a probe is in hidden state k, it emits a methylation value according to a beta-distribution with parameter (α aki , β aki ) and normalizing constant B(α aki , β aki ). We assume that the beta-distribution of the medium methylation state is symmetric, i.e. αa Mi= βa Mi. Throughout the paper k refers to the methylation state, k = L, M, H.
Here, (a1, a2) denotes the class of the probe pair, is short for the normalizing constant of the corresponding beta-distribution, and is the a priori probability (mixture proportion) that the probe pair has (hidden) methylation state (k1, k2). There are nine such hidden states.
The bottom line in Equation 1 gives the density of the paired methylation values (x ij , xi(j+1)) as a mixture distribution over the nine hidden states. Given the probe pair is in hidden state (k1, k2), the methylation values are emitted independently according to two beta distributions. The parameters of the beta-distribution are assumed to depend only on the corresponding probe, not the probe pair, and again we assume symmetry for the medium methylation state, i.e. , l = 1, 2. The middle line shows the same density written as a Hidden Markov Model. The first probe is in state k1 with probability , while the second probe is in state k2 with probability (given the first is in k1). Thus, is a 3 × 3 transition matrix for each class (a1, a2) and sample i.
If Equation 1 is marginalized to obtain the density for a single probe, we find Equation 1 with , where is the empirical frequency of class (a1, a2) among the probe pairs. Similarly for the second probe in the pair.
Model parameters are estimated using maximum likelihood. Briefly, first the beta-distribution parameters are defined for each probe class (CI, CS, CO) and for each state (L,M,H) using R's optim-function and EM-algorithm, the parameters are obtained according to Equation 1. Secondly, for each probe pair, the obtained beta-distribution parameters are used to estimate the mixture parameters and transition probabilities of the matrix T . In this step, the Baum-Welch algorithm is used. After the model estimation, parameters are obtained that include 13 beta-distribution paratemers (medium methylation distribution is symmetric, i.e., only one parameter is needed) for L, M and H states, 7 mixture proportion parameters for probe classes (I,I), (S,I), (S,S) and (O,O) (one for the class (I,I) as high methylation cannot be obtained and two for the other classes) and 17 transition probabilities for the four probe pair classes (6 parameters for (S,S) and (O,O), 3 for (S,I) and 2 for (I,I)).
The most likely methylation states for each probe pair are computed with the Viterbi algorithm, in addition, we compute posterior probabilities for each possible state combination for each probe pair. We exclude the classes (S,O), (O,S), (I,O) and (O,I) from the analysis because there are very few probe pairs in these classes (121 in total). For convenience, we group (S,I) and (I,S) together by swapping the probes of the latter. In this way we are left with four classes, (I,I), (S,S), (O,O), and (S,I).
where ν ti s, t = (a1, a2, k1, k2), are the mixture proportions of the different two probe classes, μ tg and are the empirical mean and variance of the ν ti s over all samples i in group g. If sample i belongs to group g, then it is left out when calculating the mean and the variance of that particular group. In addition, we performed the same procedure using beta-values from 100, 5000, 1000, 2000, and 5000 probes. Also we used k-means with the same number of beta-values. These were selected as those having the largest variance among all the probes. Note that the first approach assumes we know the groups, while in the second approach k-means finds the optimal division of samples into four groups.
are the posterior mixture proportions given (x j , xj+1), calculated with the Viterbi algorithm. The FAR is a natural measure here as it provides the posterior probability (i.e. given the methylation values) that a probe pair is classified as being in hidden state (k1, k2), when in fact it is in .
We used Fischer's linear discriminant analysis  to test for differences in posterior mixture proportions between groups. To assess the significance of the test statistics we permuted group labels 10 000 times and redid the analysis. We used a significance level of 1%.
Programs used for statistical analysis were written in R http://www.r-project.org/ and are available upon request.
KL was supported by Tampere Graduate School in Information Science and Engineering (TISE) and the Otto A. Malm foundation; KL and OY-H by the Academy of Finland, (application number 129657, Finnish Programme for Centres of Excellence in Research 2006-2011); CW by the Danish Cancer Society; BO, PL, CLA and TO by the John and Birthe Meyer Foundation, The Danish Council for Independent Research Medical Sciences, the Lundbeck Foundation, and The Danish Ministry of the Interior and Health.
- Movassagh M, Choy MK, Goddard M, Bennett MR, Down TA, Foo RS: Differential DNA methylation correlates with differential expression of angiogenic factor in human heart failure. PLoS ONE 2010, 5: e8564. 10.1371/journal.pone.0008564PubMed CentralView ArticlePubMedGoogle Scholar
- Iwamoto K, Kato T: Epigenetic profiling in schizophrenia and major mental disorders. Neuropsychobiology 2009, 60: 5–11. 10.1159/000234811View ArticlePubMedGoogle Scholar
- Menigatti M, Cattaneo E, Sabates-Bellver J, Ilinsky W, Went P, Buffoli F, Marquez VE, Jiricny J, Marra G: The protein tyrosine phosphatase receptor type R gene is an early and frequent target of silencing in human colorectal tumorigenesis. Mol Cancer 2009, 8: 124. 10.1186/1476-4598-8-124PubMed CentralView ArticlePubMedGoogle Scholar
- Seidel C, Schagdarsurengin U, Blümke K, Würl P, Pfeifer GP, Hauptmann S, Taubert H, Dammann R: Frequent hypermethylation of MST1 and MST2 in soft tissue sarcoma. Mol Carcinog 2007, 46(10):865–871. 10.1002/mc.20317View ArticlePubMedGoogle Scholar
- Ibanez de Caceres I, Battagli C, Esteller M, Herman JG, Edelson MI, D E, Bergman C, Ehya H, Eisenberg BL, Cairns P: Tumor cell-specific BRCA1 and RASSF1A hypermethylation in serum, plasma, and peritoneal fluid from ovarian cancer patients. Cancer Res 2004, 64(18):6476–6481. 10.1158/0008-5472.CAN-04-1529View ArticlePubMedGoogle Scholar
- Kanai Y: Genome-wide DNA methylation profiles in precancerous conditions and cancers. Cancer Sci 2010, 101: 36–45. 10.1111/j.1349-7006.2009.01383.xView ArticlePubMedGoogle Scholar
- Yang HJ, Liu VW, Wang Y, Tsang PC, Ngan HY: Differential DNA methylation profiles in gynecological cancers and correlation with clinico-pathological data. BMC Cancer 2006, 6: 212. 10.1186/1471-2407-6-212PubMed CentralView ArticlePubMedGoogle Scholar
- Watts GS, Futscher BW, Holtan N, Degeest K, Domann FE, Rose SL: DNA methylation changes in ovarian cancer are cumulative with disease progression and identify tumor stage. BMC Med Genomics 2008, 1: 47. 10.1186/1755-8794-1-47PubMed CentralView ArticlePubMedGoogle Scholar
- Meng H, Murrelle EL, Li G: Identification of a small optimal subset of CpG sites as bio-markes from high-throughput DNA methylation profiles. BMC Bioinformatics 2008, 9: 457. 10.1186/1471-2105-9-457PubMed CentralView ArticlePubMedGoogle Scholar
- Sun S, Yan PS, Huang TH, Lin S: Identifying differentially methylated genes using mixed effect and generalized least square models. BMC Bioinformatics 2009, 10: 404. 10.1186/1471-2105-10-404PubMed CentralView ArticlePubMedGoogle Scholar
- Teschendorff AE, Menon U, Gentry-Maharaj A, Ramus SJ, Gayther SA, Apostolidou S, Jones A, Lechner M, Beck S, Jacobs IJ, Widschwendter M: An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One 2009, 4(12):e8274. 10.1371/journal.pone.0008274PubMed CentralView ArticlePubMedGoogle Scholar
- Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, Nelson HH, Wiemels J, Zheng S, Wiencke JK, Kelsey KT: Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions. BMC Bioinformatics 2008, 9: 365. 10.1186/1471-2105-9-365PubMed CentralView ArticlePubMedGoogle Scholar
- Lacey MR, Ehrlich M: Modeling dependence in methylation patterns with application to ovarian carcinomas. Stat Appl Genet Mol Biol 2009, 8: 40.Google Scholar
- Heijmans BT, Kremer D, Tobi EW, Boomsma DI, Slagboom PE: Heritable rather than age-related environmental and stochastic factors dominate variation in DNA methylation of the human IGF2/H19 locus. Human mol Genet 2007, 16(5):547–554. 10.1093/hmg/ddm010View ArticleGoogle Scholar
- Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K, Rongione M, Webster M, Ji H, Potash JB, Sabunciyan S, Feinberg AP: The human colon cancer methylome shows similar hypo-and hypermethylation at conserved tissue-specific CpG island shores. Nat Genet 2009, 41(2):178–186. 10.1038/ng.298PubMed CentralView ArticlePubMedGoogle Scholar
- Houshdaran S, Hawley S, Palmer C, Campan M, Olsen MN, Ventura AP, Knudsen BS, Drescher CW, Urban ND, Brown PO, Laird PW: DNA methylation profiles of ovarian epithelial carcinoma tumors and cell lines. PLoS One 2010, 5: e9359. 10.1371/journal.pone.0009359PubMed CentralView ArticlePubMedGoogle Scholar
- Rashid A, Shen L, Morris JS, Issa JP, Hamilton SR: CpG island methylation in colorectal adenomas. Am J Pathol 2001, 159(3):1129–1135. 10.1016/S0002-9440(10)61789-0PubMed CentralView ArticlePubMedGoogle Scholar
- Kim HC, Roh SA, Ga IH, Kim JS, Yu CS, Kim JC: CpG island methylation as an early event during adenoma progression in carcinogenesis of sporadic colorectal cancer. J Gastroenterol Hepatol 2005, 20(12):1920–1926. 10.1111/j.1440-1746.2005.03943.xView ArticlePubMedGoogle Scholar
- Dyrskjøt L, Thykjaer T, Kruhøffer M, Jensen JL, Marcussen N, Hamilton-Dutoit S, Wolf H, Orntoft TF: Identifying distinct classes of bladder carcinoma using microarrays. Nat Genet 2003, 33: 90–96. 10.1038/ng1061View ArticlePubMedGoogle Scholar
- Broët P, Lewin A, Richardson S, Dalmasso C, Magdelenat H: A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics 2004, 20(16):2562–2571. 10.1093/bioinformatics/bth285View ArticlePubMedGoogle Scholar
- Mardia KV, Kent JT, Bibby JM: Multivariate analysis. London: Academic Press Inc; 1979.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.