A Beta-mixture model for dimensionality reduction, sample classification and analysis

Background Patterns of genome-wide methylation vary between tissue types. For example, cancer tissue shows markedly different patterns from those of normal tissue. In this paper we propose a beta-mixture model to describe genome-wide methylation patterns based on probe data from methylation microarrays. The model takes dependencies between neighbour probe pairs into account and assumes three broad categories of methylation, low, medium and high. The model is described by 37 parameters, which reduces the dimensionality of a typical methylation microarray significantly. We used methylation microarray data from 42 colon cancer samples to assess the model. Results Based on data from colon cancer samples we show that our model captures genome-wide characteristics of methylation patterns. We estimate the parameters of the model and show that they vary between different tissue types. Further, for each methylation probe the posterior probability of a methylation state (low, medium or high) is calculated and the probability that the state is correctly predicted is assessed. We demonstrate that the model can be applied to classify cancer tissue types accurately and that the model provides accessible and easily interpretable data summaries. Conclusions We have developed a beta-mixture model for methylation microarray data. The model substantially reduces the dimensionality of the data. It can be used for further analysis, such as sample classification or to detect changes in methylation status between different samples and tissues.


Background
Interest in understanding the effects of epigenetics in relation to different complex diseases is increasing. One epigenetic mechanism of particular interest is DNA methylation at cytosines in CpG dinucleotides. The methylation patterns of genes may change and these alterations have been shown to be related to complex diseases, such as heart diseases [1], schizophrenia [2] and different cancers [3,4]. In cancer, several methylation changes are detectable at the early stages of cancer or even in pre-cancerous tissues or blood [5,6]. In addition, other methylation alterations have been shown to be specific to cancer type and stage [7,8].
High-throughput technologies, such as microarrays and large-scale sequencing, allow genome-wide methylation measurements. Analysis of methylation data requires efficient statistical methods to be able to identify potential methylation biomarkers and differential methylation patterns across sample types. Several methods have been proposed to pinpoint significant methylation differences in patients with cancer and to classify different tissue types. Examples include feature selection methods [9], mixed effect and generalized least square methods [10] and singular value decomposition-based methods [11]. In addition, methylation patterns of distinct microarray probes have also been modeled with beta-mixture models, which subsequently are used with partitioning algorithms to separate different tissue types into clusters [12]. Each probe is modelled with its own beta-distribution that depends on the tissue type only. In the present paper, we propose a beta-mixture model to describe genome-wide methylation. We assume that methylation levels of nearby probes are dependent, as demonstrated previously by [13] and [14], and that methylation can be categorized into three different broad categories or states, low, medium and high methylation. In addition, we include knowledge about the genomic background (proximity to CpG-islands) of the probes following suggestions in [15]. In total, our model has 37 parameters (see Section Data Analysis) per sample compared with approximately 27k probes in Illumina methylation arrays. The parameters comprise genomic background, methylation state and level, and dependency between probes. In short, the model facilitates (i) reduction of the dimensionality of a methylation profile (ii) for each probe, computation of the posterior probability of a methylation state, (iii) computation of a posterior probability that the latter state was correctly predicted. We apply the model to a set of methylation microarray measurements from colon cancer samples and show that the model parameters reflect global patterns in the data. Based on the estimated parameters, we are able to classify the samples with high accuracy and to exhibit global differences between the cancer samples. Furthermore, the model assigns a methylation state to each probe value. Using the states, accessible data summaries are provided.

Results and Discussion
The Methylation Array and the Number of CpGs in Probes in Different Genomic Regions The Illumina human methylation 27k array consists of 27,578 probes that measure the methylation status of CpGs in the human genome at single nucleotide resolution. The array measures genome-wide methylation and the probes target over 14,000 genes. The great majority of the genes included in the array have two methylation probes (80.9%), while 17.6% of the genes have one methylation probe and only 1.5% have more than two probes. The degree of a methylation of each probe is measured by the beta-value which is a continuous variable varying between zero and one, where one means full methylation. The probes of the Illumina human methylation 27k array are 50 nucleotides long and contain different numbers of CpGs. Probe locations can be divided into CpG islands (I), CpG island shore regions (within 2000 bp from CpG-islands) (S) and outside regions (O) (these definitions are adapted from [15]). About one quarter of the probes (6,150) are located in the shore regions, one sixth (4,922) in outside regions and the rest (13,819) in the CpG-islands. Depending on the region, methylation appears to happen at different rates such that CpG-islands are usually less methylated than the CpGs in other genomic regions [15]. Logically, this results in an uneven distribution of methylation in regions with different numbers of CpGs because more CpGs occur in CpG-islands than in outside regions; see Figure 1. As a consequence, the number of CpGs affects the amount of methylation. In fact, we found a negative correlation between the number of CpGs in a probe and the beta-value of the probe (Pearson's correlation coefficient r = -0.28). However, this correlation is reduced markedly when studying the probes in I, S or O regions separately (correlation coefficients r = -0.049, r = -0.053 and r = 0.083, respectively). The probe distance from the CpG-island in the O class showed no correlation with the strength of methylation.

Parameter Estimates and Interpretation
We used the proposed method to analyze methylation microarray data from normal and colon cancer tumor samples. Colon cancer can be divided into different types and here we study patients with microsatellite instable (MSI) and microsatellite stable (MSS) tumors. In addition, we have samples of benign colon adenomas that are not considered as cancer tumors but are classified as MSS-type adenomas. Our data set contains 42 Illumina methylation 27k microarray samples, divided into 6 normal, 6 adenoma, 6 MSI and 24 MSS-samples. The basic idea behind the model is the assumption that the methylation level of a CpG (probe) can be divided into three different states, low (L), medium (M) and high (H) methylation. The three states are biologically motivated in the following sense. L corresponds to the situation, where (almost) all cells in a sample are unmethylated and H to the situation, where (almost) all cells are methylated, irrespectively of the composition of the cells in the sample. M captures the situation in which the cells are only partly methylated (e.g. hemimethylation), or some cell types in the sample are methylated while others are not.
The latter could be the case if the sample consists of different cell types (sample purity) or one cell type shows heterogeneity, as would be expected for cancer cells. The three states are further empirically motivated ( Figure 2).
We assume that statistical properties of these different states are the same throughout the genome and that methylation of a CpG site depends on its location (I, S or O) in the genome in relation to the nearest CpGisland. We concentrate on modelling genes with two probes; however, our model can also be extended to genes with more than two methylation probes (for the genes with more than two probes the two first probes were used). The beta-values of a gene's probe pairs are dependent with a correlation coefficient of r = 0.668 between the probes. A similar degree of correlation has been reported for other data sets, while the probes measuring methylation levels of different genes have shown no dependency [16]. We built a model that can be con- and (I,S). The latter includes both (I,S) and (S,I) pairs. Other cases were omitted as they contained only a few or no probe pairs. We built one model for each of the classes.
We fitted a mixture of three beta-distributions (distributions corresponding to the low, medium and high methylation states, respectively) for each sample in the colon cancer data set, such that the beta-distribution corresponding to the medium methylation state is symmetrical; as described in greater detail in Methods. The betadistribution gives the density of the beta-value given the hidden state. That is, for each sample we estimate betadistribution parameters a and b, mixture proportions ω and a transition probability matrix T for the HMM (see Methods for further details). The mixture proportions are the a priori probabilities that a probe is found in a given hidden state and the transition matrix gives the probabilities that a probe in some hidden state k 1 = L, M, or H, is followed by a probe in state k 2 = L, M, or H. For the class (I,I) we set the mixture coefficient of the high methylation state to 0, i.e. this state could not be reached as high methylation appears to be very rare in this class. The same is assumed for the I probes in the class (I,S). Figure 2 shows the empirical and the fitted mixture distributions for one MSI-sample for different classes. We also built a model with only two states for every class and a model where the high methylation state was included in all classes but these did not reflect the data properties equally well as the model presented.
The mixture model parameters in different classes varied between samples, but some general trends in the sample groups (normal, adenoma, MSI, MSS) could be seen; see Figure 3 for examples. In the figure, the hierarchial clustering of the samples based on the methylation array data (all probes) is also shown. The adenomas and the MSS cancers are mixed in two big clusters and the differences between these clusters can be seen in the parameters as well, cf. Figure 3

and 4.
Regarding the mixture proportions, the group of MSI cancers is most easily discriminated from the other samples (some examples are shown in Figure 3). Indeed, for the (I,I) probe pair class, the MSI mixture proportions for the low methylation state are clearly lower and the mixture proportions of the medium methylation state are higher than in the other classes. Similarly, but less clearly, differences for the shore (S) region probe pairs could be detected, e.g. medium and high methylation state mixture proportions are higher for MSI cancers compared with the other samples (results not shown). Furthermore, many mixture proportions vary in the same range when studying normals, adenomas and MSS cancers. However, normals have slightly higher proportions for low methylation in (I,I) class than the other samples. The probes in the outside regions show the biggest variations between the groups. Normals have small coefficients for low methylation whereas the high and medium methylation states are almost equally probable. On the contrary, for all tumor samples medium methylation is clearly the most evident and the proportions of the low and high methylation vary greatly. In addition, for some adenoma and MSS cancer samples, low methylation was more probable than high methylation while in MSI cancer samples the low methylation state always had the smallest coefficient. Mixture proportions in the outside regions also distinguished the two clusters of adenomas and MSS cancers well. Overall, MSS cancers and adenomas share similar proportions in all the classes. Class (I,S) mixture proportions do not show as large differences between groups as other classes.
We can see similar differences between groups in the estimated parameters of the beta-distribution as in the mixture proportions. We performed a principal component analysis (PCA) on the vector of beta-distribution parameters (there are 13 for each sample) to illustrate how the parameters vary across samples and tissues. In Figure 4 all samples are plotted based on the three first principal components. Again, normals and MSI cancers are clustered into distinct groups while adenomas and MSS cancers overlap. However, the two clusters of the hierarchial clustering ( Figure 3) encompassing adenomas and MSS cancers could be distinguished by the principal components. To further illustrate the use of the model we classified the samples using the mixture proportions and a leave-one-out procedure (Equation 1 in Section Data Sets). We used as classes the four biggest clusters that can be seen in Figure 3 (clusters 1-4). These were obtained by hierarchial clustering. The fifth cluster was omitted as it contained only two samples i.e., altogether 40 samples of the clusters 1-4 were used for the classification analysis. Overall, 32 out of the remaining 40 samples (80%) were classified correctly, see Table 1. For comparison, we followed the same procedure using betavalues from 100, 500, 1000, 2000, and 5000 probes and found similar classification results though with lower performance (see Additional file 1 Tables S6-S10). Using k-means only results generally in better performance: 82.5% using mixture proportions and 90% when using 5000 probes; for fewer probes the performance was less than when using the mixture proportions (see Additional file 1 Tables S1-S6 and Table  S11). Note that k-means find the best division of the samples into four clusters, whereas the leave-one-out method assumes clusters are defined and classifies the samples one at a time.

Data Summaries
In this section we further illustrate the use of the model. For each probe pair, we computed the posterior mixture proportions and the most likely state. The most frequent state in a group was defined as the group's overall state.
In Table 2, a summary of the results is shown. There are clear differences between groups which we expect from the previous analysis. Normal samples have almost twice as many HH probe pairs as the other groups. Likewise, MSI cancers have fewer probe pairs with low methylation states, while the medium methylation state is far more common than in the other groups. The distributions of methylation states for adenomas and MSS cancers resemble each other. We use the posterior mixture proportions to calculate the false annotation rate, FAR (see Section Data Analysis). FAR measures how often a probe pair is assigned to the wrong state: To each probe pair, we    Next, we selected the probe pairs where the overall group state differed between groups. We further reduced the number of probe pairs by only choosing those for which the posterior mixture proportions showed a significant difference between groups using Fischer's linear discriminant analysis. Table 3 shows the results. For each probe pair, one or both probes might differ between two groups; in the table we count how many probes show a given change, e.g. L M. Most differences were found between normals and MSI cancers with almost 2000 differences while only 35 changed probes are detected between adenoma and MSS samples. This again shows that based on methylation data, adenomas and MSS cancers are difficult to distinguish. In addition to the number of changes, also the type of changes differed between comparisons.
For example, when comparing normals with MSI cancers, over 80% of changes were from low to medium methylation. In comparison, between normals and adenomas and normals and MSS cancers, the proportions were 42% and 37%, respectively.
The characteristics of the changed probes also differed between comparisons; see Table 4. For all comparisons except between normals and MSS cancers the majority of methylation changes happened at CpG-islands. Further, there are differences between adenomas and MSS cancers which may be used to distinguish between the two tissue types.

Conclusions
In this paper, we have proposed a model for microarray methylation data. The model uses four different probe pair classes and three different methylation states. It is motivated by the empirical distribution of beta-values and knowledge of the genomic content of CpG dinucleotides. It reduces the dimensionality of a microarray data set to 37, the number of parameters in the model. The model allows us to assign one of three broad classes (low, medium or high) to each methylation probe value and assess the correctness of the assignment.
Further, we illustrate the use of the model by analysing a colon cancer data set. Normal and MSI samples could easily be distinguished from the other samples, but adenomas and MSS cancers were mixed together. However, the hierarchial clustering based on all betavalues (27k probes) also mixed these two groups. This suggests that the methylation patterns in adenomas and MSS cancers are very similar, which is in agreement with previous studies [17,18]. In addition, we identified differences in the genomic localisation of methylation changes. This observation may be used to discriminate between adenomas and MSS cancers from genome-wide methylation data.
In the future it would be interesting to integrate information from different data sources, such as methylation, gene expression and copy numbers, into one model. It may also be beneficial to take the full step and model at the level of the DNA sequence directly, anticipating the rapidly growing interest in next-generation sequencing.

Data Sets and Preprocessing
We used a data set that consists of 42 Illumina methylation 27k microarray samples, that can be divided into 6 normal, 6 adenoma, 6 MSI and 24 MSS-samples. Raw data was preprocessed and normalized with the background method using Illumina's BeadStudio software. After this the methylation value for each probe (the beta-value) was computed with the formula where M is the value of the methylated bead type probe and U is the value of the unmethylated bead type probe. Beta-values vary between zero and one. In the analysis, we omit probes with b = 0, these are generally of bad quality.

Model
Let s j be the location (I, S or O ) of the jth probe in the genome and let x ij be the beta-value computed for the probe j of sample i. Based on the locations, the data is divided into classes C a = {j|s j = a}, where a {I, S, O}. For each class, we assume that the data follows a beta mixture model: Here, k = L, M, H can be considered hidden (unknown) states and ω aki , the a priori probability that a probe from sample i and class a is in state k. The ω aki 's are called mixture proportions for sample i and class a, and fulfill ∑ k ω aki = 1. If a probe is in hidden state k, it emits a methylation value according to a betadistribution with parameter (a aki , b aki ) and normalizing constant B(a aki , b aki ). We assume that the beta-distribution of the medium methylation state is symmetric, i.e. a aMi = b aMi . Throughout the paper k refers to the methylation state, k = L, M, H.
To take dependencies between neighboring probe into account we do the following (for an illustration, see Figure 5). For a probe pair (x ij , x i(j+1) ), i.e. the first two probes in a gene, we model the two probes assuming Markov dependency between them,  Here, (a 1 , a 2 ) denotes the class of the probe pair, B a l k l i is short for the normalizing constant B(α a l k l i , β a l k l i ) of the corresponding beta-distribution, and ν a 1 a 2 k 1 ,k 2 i = ω a 1 a 2 k 1 i T a 1 a 2 k 1 ,k 2 i is the a priori probability (mixture proportion) that the probe pair has (hidden) methylation state (k 1 , k 2 ). There are nine such hidden states.
The bottom line in Equation 1 gives the density of the paired methylation values (x ij , x i(j+1) ) as a mixture distribution over the nine hidden states. Given the probe pair is in hidden state (k 1 , k 2 ), the methylation values are emitted independently according to two beta distributions. The parameters of the beta-distribution are assumed to depend only on the corresponding probe, not the probe pair, and again we assume symmetry for the medium methylation state, i.e. α a l Mi = β a l Mi, l = 1, 2. The middle line shows the same density written as a Hidden Markov Model. The first probe is in state k 1 with probability ω a 1 a 2 k 1 i, while the second probe is in state k 2 with probability T a 1 a 2 k 1 ,k 2 i (given the first is in k 1 ). Thus, T a 1 a 2 k 1 ,k 2 i is a 3 × 3 transition matrix for each class (a 1 , a 2 ) and sample i.
If Equation 1 is marginalized to obtain the density for a single probe, we find Equation 1 with ω aki = k 2 ν aa 2 k,k 2 i π aa 2 , where π a 1 a 2 is the empirical frequency of class (a 1 , a 2 ) among the probe pairs. Similarly for the second probe in the pair.

Data Analysis
Model parameters are estimated using maximum likelihood. Briefly, first the beta-distribution parameters are defined for each probe class (C I , C S , C O ) and for each state (L,M,H) using R's optim-function and EM-algorithm, the parameters are obtained according to Equation 1. Secondly, for each probe pair, the obtained betadistribution parameters are used to estimate the mixture parameters ω a 1 a 2 k 1 k 2 and transition probabilities of the matrix T . In this step, the Baum-Welch algorithm is used. After the model estimation, parameters are obtained that include 13 beta-distribution paratemers (medium methylation distribution is symmetric, i.e., only one parameter is needed) for L, M and H states, 7 mixture proportion parameters for probe classes (I,I For the classification of the samples using estimated parameters, we apply a leave-one-out method [19]. If there are G different groups, then sample i is classified as belonging to the group that minimizes the Mahanalobis distance where ν ti s, t = (a 1 , a 2 , k 1 , k 2 ), are the mixture proportions of the different two probe classes, μ tg and σ 2 tg are the empirical mean and variance of the ν ti s over all samples i in group g. If sample i belongs to group g, then it is left out when calculating the mean and the variance of that particular group. In addition, we performed the same procedure using beta-values from 100, 5000, 1000, 2000, and 5000 probes. Also we used k-means with the same number of beta-values. These were selected as those having the largest variance among all the probes. Note that the first approach assumes we know the groups, while in the second approach k-means finds the optimal division of samples into four groups.
We compute a false annotation rate (FAR) for the data which we define similarly to the false discovery and the false negative rates in [20]; that is, for a set J of probe pairs (with cardinality #J) the FAR is defined by where γ (x j , x j+1 ) = max k 1 ,k 2 ∈{L,M,H} is the posterior probability of the most probable state of probe pair (j, j + 1) with methylation values (x j , x j+1 ) and P(k 1 k 2 |x j , x j+1 ), k 1 , k 2 ∈ {L,M,H}, are the posterior mixture proportions given (x j , x j+1 ), calculated with the Viterbi algorithm. The FAR is a natural measure here as it provides the posterior probability (i.e. given the methylation values) that a probe pair is classified as being in hidden state (k 1 , k 2 ), when in fact it is in (k 1 , k 2 ).
We used Fischer's linear discriminant analysis [21] to test for differences in posterior mixture proportions between groups. To assess the significance of the test statistics we permuted group labels 10 000 times and redid the analysis. We used a significance level of 1%.
Programs used for statistical analysis were written in R http://www.r-project.org/ and are available upon request.

Additional material
Additional file 1: Tables S1-S11. Classification results using k-means clustering and the leave-one-out method with Mahalanobis distance. In Tables S1-S10, 100-5000 probes with the highest variance across samples were used in the analysis. Table S11 shows the results of k-means classification using the mixture proportions only (as in Table 1). Clusters are the same as in Figure 3 and 3 in Table 1.