Mixture models
Formally, a mixture model is defined as follows. Let X = X1,..., X
p
denote random variables (RVs) representing the features of a p dimensional data set D with N samples x
i
, i = 1,..., N where each x
i
consists of a realization (xi 1,..., x
ip
) of (X1,..., X
p
). A K component mixture distribution is given by
where the π
k
≥ 0 are the mixture coefficients with
. Each of the K clusters is identified with one of the components. In the most straightforward case, the component distributions P(x
i
|θ
k
) are given by a product distribution over X1,..., X
p
parameterized by parameters θ
k
= (θk 1,..., θ
kp
),
This is the well known naïve Bayes model (e.g. [6–8]). However, the formulation can accommodate more complex component distributions, including any multivariate distribution from the exponential family. One advantage of adopting naïve Bayes models as component disributions is that they allow the seamless integration of heterogeneous sources of data (say discrete and continuous features) in a single model. This has been made use of, for instance, for the joined analysis of gene expression and transcription factor binding data [9] or geno- and phenotype data of complex diseases [10].
When using mixtures in a clustering context, the aim is to find the assignment of data points to components which maximizes the likelihood of the data. The classical algorithm for obtaining the maximum likelihood parameters, which is also employed in PyMix, is the Expectation Maximization (EM) algorithm [11]. The basic idea of the EM procedure is to use the current model parameters to obtain conditional expectations of the component memberships of the data points. These expectations in turn can be used to update the model parameters in a maximum likelihood fashion. Iteration over these two steps can be shown to converge to a local maximum in the likelihood. The central quantity of EM for mixtures is the component membership posterior, i.e. the probability that a data point x
i
was generated by component k. By applying Bayes' rule, this posterior is given by
The final cluster assignments are then also obtained based on this posterior. Each data point is assigned to the component which generated it with the highest probability, i.e. x
i
is assigned to k* = argmax
k
P (k|x
i
, Θ).
Method Extensions
PyMix includes several theoretical extensions of the basic mixture framework which can be employed to adapt the method for specific applications.
Context-specific independence mixtures
In the case of naïve Bayes component distributions, the component parameterization consists of a set of
parameters θ
kj
for each component k and feature X
j
. This can be visualized in a matrix as shown in Figure 1a). Here each cell in the matrix represent one of the θ
kj
. The different values of the parameters for each feature and component express the regularities in the data which characterize and distinguish the components. The basic idea of the context-specific independence (CSI) [9] extension to the mixture framework is that very often the regularities found in the data do not require a separate set of parameters for all features in every component. Rather there will be features where several component share a parameterization.
This leads to the CSI parameter matrix shown in Figure 1b). As before each cell in the matrix represents a set of parameters, but now several component might share a parameterization for a feature, as indicated by the cells spanning several rows. The CSI structure conveys a number of desirable properties to the model. First of all, it reduces the number of free parameters which have to be estimated from the data. This leads to more robust parameter estimates and reduces the risk of overfitting. Also, the CSI structure makes explicit which features characterize respectively, discriminate between clusters. In the example in Figure 1b), one can see that for feature X1 components (C1, C2) and (C4, C5) share characteristics and are represented by one set of parameters. On the other hand component C3 does not share its parameterization for X1. Moreover, if components share the same group in the CSI structure for all positions, they can be merged thus reducing the number of components in the model. Therefore learning of a CSI structure can amount to an automatic reduction of the number of components as an integral part of model training. Such a CSI structure can be inferred from the data by an application of the structural EM [12] framework. In terms of the complexity of the mixture distribution, a CSI mixture can be seen as lying in between a conventional naive Bayes mixture (i.e. a CSI structure as shown in Figure 1a) and a single naive Bayes model (i.e. the structure where the parameters of all components are identified for all features). A comparison of the performance of CSI mixtures and these two extreme cases in a classical model selection setup can be found in [13].
The basic idea of the CSI formalism to fit model complexity to the data is also shared by approaches such as variable order Markov chains [14] or topology learning for hidden Markov models [15]. The main difference of CSI to these approaches is that CSI identifies parameters across components (i.e. clusters) and the resulting structure therefore carries information about the regularities captured by a clustering. In this CSI mixtures bear some similarity to clustering methods such as the shrunken nearest centroid (SNC) algorithm [16]. However the CSI structure allows for a richer, more detailed representation of the regularities characterizing a cluster.
Another important aspect is the relation of the selection of variables implicit in the CSI structure to pure feature selection methods (e.g. [17]). These methods typically make a binary decision about the relevance or irrelevance of variables for the clustering. In contrast to that, CSI allows for more fine grained models where a variable is of relevance to distinguish subsets of clusters.
Dependence trees
The dependence tree (DTree) model extends the Naive Bayes model (Eq. 2) by assuming first-order dependencies between features. Given a directed tree, where nodes of the tree represent the features (X1,..., X
p
) and a map pa represents the parent relationships between features, the DTree model assumes that the distribution of feature X
j
is conditional on the feature Xpa(j). For a given tree topology defined by pa, the joint distribution of a DTree is defined as [18],
where P(·|·, θ) is a conditional distribution, such as conditional Gaussians [19], and θ
jk
are the parameters of the conditional distribution. See Figure 2 for an example of a DTree and its distribution.
One important question is how to obtain the tree structure. For some particular applications, the structure is given by prior knowledge. In the analysis of gene expression of developmental processes for instance, the tree structure is given by the tree of cell development [20]. When the structure is unknown, the tree structure with maximum likelihood can be estimated from the data [18]. The method works by applying a maximum weight spanning tree algorithm on a fully connected, undirected graph, where vertices represent the variables and the weights of edges are equal to the mutual information between variables [18]. When the DTree models are used as components in a mixture model, a tree structure can be inferred for each component model [21]. This can be performed by applying the tree structure estimation method for each model at each EM iteration.
When conditional normals are used in Eq. 4, the DTree can be seen as a particular type of covariance matrix parameterization for a multivariate normal distribution [21]. There, the number of free parameters is linear to the number of variables, as in the case of multivariate normals with diagonal covariance matrices, while it models first-order variable dependencies. In experiments performed in [21], dependence trees compares favorably to Naive Bayes models (multivariate normal with diagonal covariance matrices) and full dependence models (multivariate normal with full covariance matrices) for finding groups of co-expressed genes and even for simulated data arising from variable dependence structures. In particular, dependence trees are not susceptible to over-fitting, which is otherwise a frequent problem in the estimation of mixture models with full diagonal matrices from sparse data.
In summary, the DTree model yields a better approximation of joint distribution in relation to the the simple Naive Bayes Model. Furthermore, the estimated structure can be useful in the indication of important dependencies between features in a cluster [21].
Semi-supervised learning
In classical clustering the assignments of samples to clusters is completely unknown and has to be learned from the data. This is referred to as unsupervised learning. However, in practice there is often prior information for at least a subset of samples. This is especially true for biological data, where there is often detailed expert annotation for at least some of the samples in a data set. Integrating this partial prior knowledge into the clustering can potentially greatly increase the performance of the clustering. This leads to a semi-supervised learning setup.
PyMix includes semi-supervised learning with mixtures for both hard labels as well as a formulation of prior knowledge in form of soft pairwise constraints between samples [22, 23]. In this context, in addition to the data x
i
, we have a set of positive (and negative) constraints
∈ [0, 1] (
∈ [0, 1]), where x
i
, x
j
, 1 ≤ i <j ≤ N. The constraints indicate a preference for pair of data points to be clustered together (positive constraints), or not clustered together (negative constraints).
The main idea behind the semi-supervised method implemented in Pymix is to find a clustering solution Y where the least number of constraints are violated [22] (see Figure 3). This can be achieved by redefining the posterior assignment rule of the EM (Eq. 3), as
where r
jk
= P (k|x
j
, Θ, W), W is the set of all positive and negative constraints and (λ+, λ-) are Lagrange parameters defining the penalty weights of positive and negative constraint violations.
Hard labels arise as a special case of the above, when the constraints are binary (
∈ {0, 1},
∈ {0, 1}) such that there is no overlap in the constraints, and the penalty parameters are set to high values λ+ = λ-~∞. In this scenario, only solutions in full accordance with the constraints are possible [23].
There are several semi-supervised/constraint-based clustering methods described in the literature [24]. We choose to implement the method proposed in [22], because it can be used within the mixture model framework and supports the use of soft-constraints. Therefore, we can take simultaneous advantage of the characteristics of the probability distribution offered in Pymix and the semi-supervised framework. Moreover, the soft-constraints allow for the inclusion of our prior believes into the constraints, an important aspect in error prone biological data.
Dependencies
PyMix is written in the Python programming language http://www.python.org. It includes a custom written C extension and makes use of the numpy http://numpy.scipy.org array package, the GNU Scientific library (GSL) http://www.gnu.org/software/gsl/ and matplotlib http://matplotlib.sourceforge.net plotting capabilities.
Overall architecture
PyMix represents finite mixture models in an object oriented framework. Figure 4 shows the class tree for the MixtureModel class. MixtureModel represents the standard finite mixture model. All the extensions to the mixture framework implemented in PyMix mentioned previously, are derived as specialized classes from MixtureModel. The CSI structure learning is implemented as part of BayesMixtureModel.
LabeledMixtureModel and ConstrainedMixtureModel represent two different approaches to semi-supervised learning. The former implements semi-supervised learning with hard labels, the latter with pairwise constraints on the data points. Full documentation of the objects hierarchy in PyMix (including the dependence trees) can be found on the PyMix website http://www.pymix.org.
Currently, the framework supports mixtures of Gaussians, discrete distributions and exponential distributions. Furthermore, the framework has an extension that allows hidden Markov models as components by the use of the GHMM library http://www.ghmm.org and is also used by the GQL tool [25]. Moreover, due to the object oriented setup, it is easily extendable and more distributions will be added in the future.
Prior work
Due to the popularity and versatility of the mixture approach, there is a considerable number of software packages available. For instance the R package MCLUST http://www.stat.washington.edu/mclust/ implements algorithms for Gaussian mixture estimation. The MIXMOD http://www-math.univ-fcomte.fr/mixmod/ C++ package contains algorithms for conventional mixtures of Gaussian and multinomial distributions with MATLAB bindings. Another R package for conventional mixture analysis is the MIX http://icarus.math.mcmaster.ca/peter/mix/mix.html package. An example for a rather specialized package would be Mtreemix http://mtreemix.bioinf.mpi-sb.mpg.de/ which allows estimation of mixtures of mutagenic trees. In general it can be said that most of these packages focus rather narrowly on specific model types. The main advantage of PyMix is that the general, object oriented approach allows for a wide variety of mixture variants to be integrated in a single, unified framework. The different advanced models (CSI or semi-supervised) and component distributions (e.g. dependence trees) available in PyMix, make the software applicable for many applications. Also, the object orientation means that the software can be straightforwardly extended with additional model types by advanced users.