A Beta-mixture model for dimensionality reduction, sample classification and analysis
© Laurila et al; licensee BioMed Central Ltd. 2011
Received: 8 October 2010
Accepted: 27 May 2011
Published: 27 May 2011
Skip to main content
© Laurila et al; licensee BioMed Central Ltd. 2011
Received: 8 October 2010
Accepted: 27 May 2011
Published: 27 May 2011
Patterns of genome-wide methylation vary between tissue types. For example, cancer tissue shows markedly different patterns from those of normal tissue. In this paper we propose a beta-mixture model to describe genome-wide methylation patterns based on probe data from methylation microarrays. The model takes dependencies between neighbour probe pairs into account and assumes three broad categories of methylation, low, medium and high. The model is described by 37 parameters, which reduces the dimensionality of a typical methylation microarray significantly. We used methylation microarray data from 42 colon cancer samples to assess the model.
Based on data from colon cancer samples we show that our model captures genome-wide characteristics of methylation patterns. We estimate the parameters of the model and show that they vary between different tissue types. Further, for each methylation probe the posterior probability of a methylation state (low, medium or high) is calculated and the probability that the state is correctly predicted is assessed. We demonstrate that the model can be applied to classify cancer tissue types accurately and that the model provides accessible and easily interpretable data summaries.
We have developed a beta-mixture model for methylation microarray data. The model substantially reduces the dimensionality of the data. It can be used for further analysis, such as sample classification or to detect changes in methylation status between different samples and tissues.
Interest in understanding the effects of epigenetics in relation to different complex diseases is increasing. One epigenetic mechanism of particular interest is DNA methylation at cytosines in CpG dinucleotides. The methylation patterns of genes may change and these alterations have been shown to be related to complex diseases, such as heart diseases , schizophrenia  and different cancers [3, 4]. In cancer, several methylation changes are detectable at the early stages of cancer or even in pre-cancerous tissues or blood [5, 6]. In addition, other methylation alterations have been shown to be specific to cancer type and stage [7, 8].
High-throughput technologies, such as microarrays and large-scale sequencing, allow genome-wide methylation measurements. Analysis of methylation data requires efficient statistical methods to be able to identify potential methylation biomarkers and differential methylation patterns across sample types. Several methods have been proposed to pinpoint significant methylation differences in patients with cancer and to classify different tissue types. Examples include feature selection methods , mixed effect and generalized least square methods  and singular value decomposition-based methods . In addition, methylation patterns of distinct microarray probes have also been modeled with beta-mixture models, which subsequently are used with partitioning algorithms to separate different tissue types into clusters . Each probe is modelled with its own beta-distribution that depends on the tissue type only. In the present paper, we propose a beta-mixture model to describe genome-wide methylation. We assume that methylation levels of nearby probes are dependent, as demonstrated previously by  and , and that methylation can be categorized into three different broad categories or states, low, medium and high methylation. In addition, we include knowledge about the genomic background (proximity to CpG-islands) of the probes following suggestions in . In total, our model has 37 parameters (see Section Data Analysis) per sample compared with approximately 27k probes in Illumina methylation arrays. The parameters comprise genomic background, methylation state and level, and dependency between probes. In short, the model facilitates
reduction of the dimensionality of a methylation profile
for each probe, computation of the posterior probability of a methylation state,
computation of a posterior probability that the latter state was correctly predicted.
We apply the model to a set of methylation microarray measurements from colon cancer samples and show that the model parameters reflect global patterns in the data. Based on the estimated parameters, we are able to classify the samples with high accuracy and to exhibit global differences between the cancer samples. Furthermore, the model assigns a methylation state to each probe value. Using the states, accessible data summaries are provided.
We used the proposed method to analyze methylation microarray data from normal and colon cancer tumor samples. Colon cancer can be divided into different types and here we study patients with microsatellite instable (MSI) and microsatellite stable (MSS) tumors. In addition, we have samples of benign colon adenomas that are not considered as cancer tumors but are classified as MSS-type adenomas. Our data set contains 42 Illumina methylation 27k microarray samples, divided into 6 normal, 6 adenoma, 6 MSI and 24 MSS-samples.
The basic idea behind the model is the assumption that the methylation level of a CpG (probe) can be divided into three different states, low (L), medium (M) and high (H) methylation. The three states are biologically motivated in the following sense. L corresponds to the situation, where (almost) all cells in a sample are unmethylated and H to the situation, where (almost) all cells are methylated, irrespectively of the composition of the cells in the sample. M captures the situation in which the cells are only partly methylated (e.g. hemi-methylation), or some cell types in the sample are methylated while others are not.
We assume that statistical properties of these different states are the same throughout the genome and that methylation of a CpG site depends on its location (I, S or O) in the genome in relation to the nearest CpG-island. We concentrate on modelling genes with two probes; however, our model can also be extended to genes with more than two methylation probes (for the genes with more than two probes the two first probes were used). The beta-values of a gene's probe pairs are dependent with a correlation coefficient of r = 0.668 between the probes. A similar degree of correlation has been reported for other data sets, while the probes measuring methylation levels of different genes have shown no dependency . We built a model that can be considered a hidden Markov model (HMM) of probe pairs within a gene. The hidden states are the methylation levels, L, M and H. Further, the CpG probe pairs are classified by their locations into classes (I,I), (S,S), (O,O) and (I,S). The latter includes both (I,S) and (S,I) pairs. Other cases were omitted as they contained only a few or no probe pairs. We built one model for each of the classes.
We fitted a mixture of three beta-distributions (distributions corresponding to the low, medium and high methylation states, respectively) for each sample in the colon cancer data set, such that the beta-distribution corresponding to the medium methylation state is symmetrical; as described in greater detail in Methods. The beta-distribution gives the density of the beta-value given the hidden state. That is, for each sample we estimate beta-distribution parameters α and β, mixture proportions ω and a transition probability matrix T for the HMM (see Methods for further details). The mixture proportions are the a priori probabilities that a probe is found in a given hidden state and the transition matrix gives the probabilities that a probe in some hidden state k 1 = L, M, or H, is followed by a probe in state k 2 = L, M, or H. For the class (I,I) we set the mixture coefficient of the high methylation state to 0, i.e. this state could not be reached as high methylation appears to be very rare in this class. The same is assumed for the I probes in the class (I,S). Figure 2 shows the empirical and the fitted mixture distributions for one MSI-sample for different classes. We also built a model with only two states for every class and a model where the high methylation state was included in all classes but these did not reflect the data properties equally well as the model presented.
Regarding the mixture proportions, the group of MSI cancers is most easily discriminated from the other samples (some examples are shown in Figure 3). Indeed, for the (I,I) probe pair class, the MSI mixture proportions for the low methylation state are clearly lower and the mixture proportions of the medium methylation state are higher than in the other classes. Similarly, but less clearly, differences for the shore (S) region probe pairs could be detected, e.g. medium and high methylation state mixture proportions are higher for MSI cancers compared with the other samples (results not shown). Furthermore, many mixture proportions vary in the same range when studying normals, adenomas and MSS cancers. However, normals have slightly higher proportions for low methylation in (I,I) class than the other samples.
The probes in the outside regions show the biggest variations between the groups. Normals have small coefficients for low methylation whereas the high and medium methylation states are almost equally probable. On the contrary, for all tumor samples medium methylation is clearly the most evident and the proportions of the low and high methylation vary greatly. In addition, for some adenoma and MSS cancer samples, low methylation was more probable than high methylation while in MSI cancer samples the low methylation state always had the smallest coefficient. Mixture proportions in the outside regions also distinguished the two clusters of adenomas and MSS cancers well. Overall, MSS cancers and adenomas share similar proportions in all the classes. Class (I,S) mixture proportions do not show as large differences between groups as other classes.
Numbers of probe pairs.
We use the posterior mixture proportions to calculate the false annotation rate, FAR (see Section Data Analysis). FAR measures how often a probe pair is assigned to the wrong state: To each probe pair, we assign the hidden state (k 1, k 2), with k 1, k 2 = L, M, or H, with the highest posterior proportion. The probability that this assignment is incorrect is given by FAR (see Section Data Analysis). For the colon data set, FAR = 0.128, implying that about 1 in 8 probe pairs should have a wrong annotation. For example, in Figure 2, the bottom left plot, wrong annotations are likely to occur when the probability of the low and the medium states are similar (e.g. beta-value around 0.1), while confident annotations are made when e.g. the beta-value is around 0.5.
Methylation state changes.
Norm vs Aden
Norm vs MSI
Norm vs MSS
Aden vs MSI
Aden vs MSS
MSS vs MSI
L → M
M → L
M → H
H → M
L → H
H → L
For example, when comparing normals with MSI cancers, over 80% of changes were from low to medium methylation. In comparison, between normals and adenomas and normals and MSS cancers, the proportions were 42% and 37%, respectively.
Number of changes in different regions in group comparisons.
Norm vs Aden
Norm vs MSI
Norm vs MSS
Aden vs MSI
Aden vs MSS
MSS vs MSI
In this paper, we have proposed a model for microarray methylation data. The model uses four different probe pair classes and three different methylation states. It is motivated by the empirical distribution of beta-values and knowledge of the genomic content of CpG dinucleotides. It reduces the dimensionality of a microarray data set to 37, the number of parameters in the model. The model allows us to assign one of three broad classes (low, medium or high) to each methylation probe value and assess the correctness of the assignment.
Further, we illustrate the use of the model by analysing a colon cancer data set. Normal and MSI samples could easily be distinguished from the other samples, but adenomas and MSS cancers were mixed together. However, the hierarchial clustering based on all beta-values (27k probes) also mixed these two groups. This suggests that the methylation patterns in adenomas and MSS cancers are very similar, which is in agreement with previous studies [17, 18]. In addition, we identified differences in the genomic localisation of methylation changes. This observation may be used to discriminate between adenomas and MSS cancers from genome-wide methylation data.
In the future it would be interesting to integrate information from different data sources, such as methylation, gene expression and copy numbers, into one model. It may also be beneficial to take the full step and model at the level of the DNA sequence directly, anticipating the rapidly growing interest in next-generation sequencing.
where M is the value of the methylated bead type probe and U is the value of the unmethylated bead type probe. Beta-values vary between zero and one. In the analysis, we omit probes with β = 0, these are generally of bad quality.
Here, k = L, M, H can be considered hidden (unknown) states and ω aki , the a priori probability that a probe from sample i and class a is in state k. The ω aki 's are called mixture proportions for sample i and class a, and fulfill ∑ k ω aki = 1. If a probe is in hidden state k, it emits a methylation value according to a beta-distribution with parameter (α aki , β aki ) and normalizing constant B(α aki , β aki ). We assume that the beta-distribution of the medium methylation state is symmetric, i.e. α aMi= β aMi. Throughout the paper k refers to the methylation state, k = L, M, H.
Here, (a 1, a 2) denotes the class of the probe pair, is short for the normalizing constant of the corresponding beta-distribution, and is the a priori probability (mixture proportion) that the probe pair has (hidden) methylation state (k 1, k 2). There are nine such hidden states.
The bottom line in Equation 1 gives the density of the paired methylation values (x ij , x i(j+1)) as a mixture distribution over the nine hidden states. Given the probe pair is in hidden state (k 1, k 2), the methylation values are emitted independently according to two beta distributions. The parameters of the beta-distribution are assumed to depend only on the corresponding probe, not the probe pair, and again we assume symmetry for the medium methylation state, i.e. , l = 1, 2. The middle line shows the same density written as a Hidden Markov Model. The first probe is in state k 1 with probability , while the second probe is in state k 2 with probability (given the first is in k 1). Thus, is a 3 × 3 transition matrix for each class (a 1, a 2) and sample i.
If Equation 1 is marginalized to obtain the density for a single probe, we find Equation 1 with , where is the empirical frequency of class (a 1, a 2) among the probe pairs. Similarly for the second probe in the pair.
Model parameters are estimated using maximum likelihood. Briefly, first the beta-distribution parameters are defined for each probe class (C I, C S, C O) and for each state (L,M,H) using R's optim-function and EM-algorithm, the parameters are obtained according to Equation 1. Secondly, for each probe pair, the obtained beta-distribution parameters are used to estimate the mixture parameters and transition probabilities of the matrix T . In this step, the Baum-Welch algorithm is used. After the model estimation, parameters are obtained that include 13 beta-distribution paratemers (medium methylation distribution is symmetric, i.e., only one parameter is needed) for L, M and H states, 7 mixture proportion parameters for probe classes (I,I), (S,I), (S,S) and (O,O) (one for the class (I,I) as high methylation cannot be obtained and two for the other classes) and 17 transition probabilities for the four probe pair classes (6 parameters for (S,S) and (O,O), 3 for (S,I) and 2 for (I,I)).
The most likely methylation states for each probe pair are computed with the Viterbi algorithm, in addition, we compute posterior probabilities for each possible state combination for each probe pair. We exclude the classes (S,O), (O,S), (I,O) and (O,I) from the analysis because there are very few probe pairs in these classes (121 in total). For convenience, we group (S,I) and (I,S) together by swapping the probes of the latter. In this way we are left with four classes, (I,I), (S,S), (O,O), and (S,I).
where ν ti s, t = (a 1, a 2, k 1, k 2), are the mixture proportions of the different two probe classes, μ tg and are the empirical mean and variance of the ν ti s over all samples i in group g. If sample i belongs to group g, then it is left out when calculating the mean and the variance of that particular group. In addition, we performed the same procedure using beta-values from 100, 5000, 1000, 2000, and 5000 probes. Also we used k-means with the same number of beta-values. These were selected as those having the largest variance among all the probes. Note that the first approach assumes we know the groups, while in the second approach k-means finds the optimal division of samples into four groups.
are the posterior mixture proportions given (x j , x j+1), calculated with the Viterbi algorithm. The FAR is a natural measure here as it provides the posterior probability (i.e. given the methylation values) that a probe pair is classified as being in hidden state (k 1, k 2), when in fact it is in .
We used Fischer's linear discriminant analysis  to test for differences in posterior mixture proportions between groups. To assess the significance of the test statistics we permuted group labels 10 000 times and redid the analysis. We used a significance level of 1%.
Programs used for statistical analysis were written in R http://www.r-project.org/ and are available upon request.
KL was supported by Tampere Graduate School in Information Science and Engineering (TISE) and the Otto A. Malm foundation; KL and OY-H by the Academy of Finland, (application number 129657, Finnish Programme for Centres of Excellence in Research 2006-2011); CW by the Danish Cancer Society; BO, PL, CLA and TO by the John and Birthe Meyer Foundation, The Danish Council for Independent Research Medical Sciences, the Lundbeck Foundation, and The Danish Ministry of the Interior and Health.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.