Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis [1–3], little attention has been paid to uncertainty in the results obtained. In clustering, the patterns of expression of different genes across time, treatments, and tissues are grouped into distinct clusters (perhaps organized hierarchically), in which genes in the same cluster are assumed to be potentially functionally related or to be influenced by a common upstream factor. Such cluster structure is often used to aid the elucidation of regulatory networks. Agglomerative hierarchical clustering [1] is one of the most frequently used methods for clustering gene expression profiles. However, commonly used methods for agglomerative hierarchical clustering rely on the setting of some score threshold to distinguish members of a particular cluster from non-members, making the determination of the number of clusters arbitrary and subjective. The algorithm provides no guide to choosing the "correct" number of clusters or the level at which to prune the tree. It is often difficult to know which distance metric to choose, especially for structured data such as gene expression profiles. Moreover, these approaches do not provide a measure of uncertainty about the clustering, making it difficult to compute the predictive quality of the clustering and to make comparisons between clusterings based on different model assumptions (e.g. numbers of clusters, shapes of clusters, etc.). Attempts to address these problems in a classical statistical framework have focused on the use of bootstrapping [4, 5] or the use of permutation procedures to calculate local p-values for the significance of branching in a dendrogram produced by agglomerative hierarchical clustering [6, 7].
A commonly used computational method for non-hierarchical clustering, based on measuring Euclidean distances between feature vectors, is the k-means algorithm [8, 9]. However, the k-means algorithm requires the number of clusters to be predefined, and has been shown to be inadequate for describing clusters of unequal size or shape [10], which limits its applicability to many biological datasets.
Bayesian methods provide a principled approach to these types of analyses and are becoming increasingly popular in a variety of problems across many disciplines: clustering stocks with different price dynamics in finance [11], clustering regions with different growth patterns in economics [12], in signal processing applications [13], as well as in computational biology and genetics [14].
Bayesian approaches to hierarchical clustering of gene expression data have been described by Neal [15], who used a Dirichlet diffusion tree model, and by Heard et al. [16, 17] who describe a Bayesian model-based approach for clustering time series, based on regression models and nonlinear basis functions. In previous work [18] we have also described an approach to the problem of automatically clustering gene expression profiles, based on the theory of Dirichlet process (i.e. countably infinite) mixtures. However, all this work, like most Bayesian approaches, is based on sampling using Markov Chain Monte Carlo (MCMC) methods. While MCMC has useful theoretical guarantees, its applicability to large post-genomic datasets is limited by its speed.
In this paper, we present an R/Bioconductor port of the novel, fast algorithm for Bayesian agglomerative hierarchical clustering (BHC) introduced by Heller and Ghahramani [19]. This algorithm is based on evaluating the marginal likelihoods of a probabilistic model, and may be interpreted as a bottom-up approximate inference method for a Dirichlet process mixture model (DPM). A DPM is a widely used model for clustering [20] which has the interesting property that the prior probability of a new data point joining an existing cluster is proportional to the number of points already in that cluster, while the probability that the (n + 1)th data point forms a new cluster is proportional to α. Here α is a hyperparameter controlling the expected number of clusters as a function of the number of data points n. The BHC algorithm uses a model-based criterion derived from the marginal likelihoods of a DPM to decide which clusters to merge, rather than an ad-hoc distance metric. Bayesian hypothesis testing is used to decide which cluster merges increase the tree quality. Importantly, the optimum tree depth is also calculated, resulting in the best number and size of clusters to fit the data.
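To make the role of α concrete, the following minimal R sketch (illustrative only; it is not part of the package) computes these prior predictive probabilities for the (n + 1)th data point, given the current cluster sizes.

# Illustrative sketch: DPM prior predictive probabilities for the (n + 1)th point.
crpPredictive <- function(clusterSizes, alpha) {
  n <- sum(clusterSizes)
  probs <- c(clusterSizes, alpha) / (n + alpha)        # existing clusters, then a new one
  names(probs) <- c(paste0("cluster", seq_along(clusterSizes)), "new")
  probs
}
crpPredictive(c(5, 3, 2), alpha = 0.001)
# with a small alpha, essentially all prior mass falls on the existing clusters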
Implementation
The BHC algorithm is similar to traditional agglomerative clustering in that it is a one-pass, bottom-up method which initializes each data point in its own cluster and iteratively merges pairs of clusters. However, instead of distance, the algorithm uses a statistical hypothesis test to choose which clusters to merge.
Let D = {x^(1),..., x^(n)} denote the entire data set, and D_i the set of data points at the leaves of the subtree T_i. The algorithm is initialized with n trivial trees, {T_i : i = 1,..., n}, each containing a single data point D_i = {x^(i)}. At each stage the algorithm considers merging all pairs of existing trees. In considering each merge, two hypotheses are compared. The first hypothesis, denoted H_1^k, is that all the data in D_k were in fact generated independently and identically from the same probabilistic model, p(x|θ), with unknown parameters θ. The alternative hypothesis, denoted H_2^k, is that the data in D_k contain two or more clusters.

To evaluate the probability of the data under hypothesis H_1^k, we need to specify a prior over the parameters of the model, p(θ|β), with hyperparameters β. We now have the ingredients to compute the probability of the data D_k under H_1^k:

p(D_k | H_1^k) = ∫ p(D_k | θ) p(θ | β) dθ = ∫ [ ∏_{x^(i) ∈ D_k} p(x^(i) | θ) ] p(θ | β) dθ

This calculates the probability that all the data in D_k were generated from the same parameter values, assuming a model of the form p(x|θ). This is a natural model-based criterion for measuring how well the data fit into one cluster.
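As a concrete illustration for the multinomial-Dirichlet model used later in this paper, the following R sketch computes the logarithm of this marginal likelihood for a single candidate cluster; the assumption of independent discretised features within a cluster and the default symmetric Dirichlet hyperparameters are simplifications made here for illustration, not the package code.

# Sketch: log marginal likelihood p(D_k | H1) for a multinomial-Dirichlet model,
# assuming independent discretised features within the cluster.
logMarginalH1 <- function(clusterData, beta = c(1, 1, 1)) {
  # clusterData: rows = data points in the cluster, columns = features,
  # entries in 1..length(beta)
  perFeature <- apply(clusterData, 2, function(col) {
    counts <- tabulate(col, nbins = length(beta))
    lgamma(sum(beta)) - lgamma(sum(beta) + sum(counts)) +
      sum(lgamma(beta + counts) - lgamma(beta))
  })
  sum(perFeature)
}
# toy example: five points, four ternary features
logMarginalH1(matrix(sample(1:3, 20, replace = TRUE), nrow = 5))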
The probability of the data under the alternative hypothesis, H_2^k (if we restrict ourselves to clusterings that partition the data in a manner consistent with the subtrees T_i and T_j, where T_i and T_j are the two subtrees of T_k), is simply a product over the subtrees:

p(D_k | H_2^k) = p(D_i | T_i) p(D_j | T_j)

where the probability of a data set under a tree (e.g. p(D_i | T_i)) is defined below. Combining the probability of the data under hypotheses H_1^k and H_2^k, weighted by the prior that all points in D_k belong to one cluster, π_k ≡ p(H_1^k), we obtain the marginal probability of the data in tree T_k:

p(D_k | T_k) = π_k p(D_k | H_1^k) + (1 - π_k) p(D_i | T_i) p(D_j | T_j)
The prior for the merged hypothesis, π_k, can be defined in such a manner that BHC efficiently computes probabilities of clusterings consistent with the widely used Dirichlet process mixture model. Note that π_k is not an estimated parameter but rather a deterministic function of α and the number of points in a given subtree. It is computed bottom-up as the tree is built, as described in [19].
The posterior probability of the merged hypothesis H_1^k is then obtained using Bayes' rule:

r_k = π_k p(D_k | H_1^k) / p(D_k | T_k)

If this posterior probability r_k > 0.5, the merged hypothesis is more probable than the alternative partitionings and the sub-trees should therefore be left intact. Conversely, if r_k < 0.5, the branches constitute separate clusters.
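The quantities defined above can be combined numerically as in the following R sketch (illustrative, not the package implementation), which evaluates a proposed merge in log space and returns p(D_k | T_k) together with r_k.

# Sketch: evaluate a proposed merge given log pi_k, log p(D_k | H1),
# log p(D_i | T_i) and log p(D_j | T_j).
mergeStatistics <- function(logPik, logPH1, logPTi, logPTj) {
  logMerged <- logPik + logPH1                        # log[ pi_k p(D_k | H1) ]
  logSplit  <- log1p(-exp(logPik)) + logPTi + logPTj  # log[ (1 - pi_k) p(D_i|T_i) p(D_j|T_j) ]
  m <- max(logMerged, logSplit)
  logPTk <- m + log(exp(logMerged - m) + exp(logSplit - m))   # log p(D_k | T_k), via log-sum-exp
  list(logPTk = logPTk, rk = exp(logMerged - logPTk))
}
mergeStatistics(logPik = log(0.5), logPH1 = -10, logPTi = -6, logPTj = -5.5)
# a merge is favoured (the node is not cut) when rk > 0.5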
The BHC algorithm is very simple and is shown in the Appendix. Full details of the algorithm and underlying theory, as well as validation results based on synthetic and real non-biological datasets (including comparisons to traditional agglomerative hierarchical clustering using a Euclidean distance metric and average, single and complete linkage methods) can be found in [19].
Evaluating the Quality of Clustering
For a data set which has labelled classes, it is possible to compare the quality of hierarchical clusterings obtained from different methods to these known classes. However, the literature is notably lacking in quantitative measures of dendrogram quality suitable for use with the BHC algorithm.
For instance, most of the quality indices implemented in the clValid package [21] require a distance metric: since BHC does not use a distance metric these indices are unsuitable for our comparisons. Another commonly used index for measuring the agreement between two clusterings is the adjusted Rand index [22]: large values for the adjusted Rand index mean better agreement between two clusterings. A value of unity would indicate a perfect match between the clustering partition and ground truth, with zero being the expected result for a random partition. However, this index is only really of use if the true clustering structure is known. In most real-world applications of clustering to microarray data, the biological ground truth is unknown. Nevertheless, the adjusted Rand index has been used to evaluate the performance of a variety of clustering algorithms on experimental microarray data by Yeung et al [23]. These authors used a subset of the data described by Ideker et al. [24], a set of 997 mRNA profiles across 20 experiments representing systematic perturbations of the yeast galactose-utilization pathway. A subset of 205 of these genes were assigned to four functional categories (biosynthesis, protein metabolism and modification; energy pathways, carbohydrate metabolism, catabolism; nucleobase, nucleoside, nucleotide and nucleic acid metabolism; transport), based on Gene Ontology (GO) annotations. However, in their supplementary material, Yeung et al. note that since this array data may not fully reflect these functional categories, this classification should be used with caution.
For the purposes of comparison, we have applied our BHC algorithm to this data set, treating the four assigned classes as "ground truth", with the caveat above. The BHC algorithm automatically and correctly identifies four classes in the data, as shown by the dendrogram [see Additional file 1]. The adjusted Rand index is 0.955, which is in the upper range of those calculated by Yeung et al. [23]. For comparison, standard hierarchical clustering using average linkage and a correlation distance metric gives an adjusted Rand index of 0.866. The shrinkage correlation coefficient (SCC) of Yao et al. [25], which used the same data set as a benchmark, gives an adjusted Rand index of 0.876.
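For readers wishing to reproduce this type of comparison, the adjusted Rand index can be computed in R with, for example, mclust::adjustedRandIndex; the cluster and class labels below are toy placeholders rather than the labels used in this analysis.

# Hedged example: adjusted Rand index between a clustering and a reference partition.
library(mclust)
trueClasses <- rep(1:4, times = c(60, 50, 45, 50))                # stand-in for the 205 GO-derived labels
bhcClusters <- sample(1:4, length(trueClasses), replace = TRUE)   # replace with the BHC cluster labels
adjustedRandIndex(bhcClusters, trueClasses)                       # 1 = perfect agreement, ~0 = random partition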
Quality Measures
In order to compare two dendrograms produced by different clustering methods, we have devised a new quantitative measure, DendrogramPurity, which takes as input a dendrogram tree structure 𝒯 and a set of class labels 𝒞 for the leaves of the tree, and outputs a single number measuring how "pure" the subtrees of 𝒯 are with respect to the class labels 𝒞.

DendrogramPurity(𝒯, 𝒞): 𝒯 is a binary tree (dendrogram) with set of leaves ℒ = {1,..., L} and 𝒞 = {c_1,..., c_L} is the set of known class assignments for each leaf. The DendrogramPurity is defined to be the measure obtained from the following random process: pick a leaf ℓ uniformly at random; pick another leaf j in the same class, i.e. c_ℓ = c_j; find the smallest subtree containing both ℓ and j; measure the fraction of leaves in that subtree which are in the same class, c_ℓ. The expected value of this fraction is the DendrogramPurity. This measure can be computed efficiently using a bottom-up recursion (without needing to resort to sampling). The overall tree purity is 1 if and only if all leaves in each class are contained within some pure subtree.
For each leaf of the tree it is also useful to measure how well it fits in with the labels of the leaves in the surrounding subtree. Leaves which do not fit well contribute to decreasing the overall dendrogram purity. These may highlight unusual or misclassified genes, drugs or cell lines. We define the LeafHarmony of a leaf ℓ as a measure of how well that leaf fits in.
LeafHarmony(ℓ, 𝒯, 𝒞): pick a random leaf j in the same class as leaf ℓ, i.e. c_j = c_ℓ, j ≠ ℓ; find the smallest subtree containing ℓ and j; measure the fraction of leaves in that subtree which are in class c_ℓ. The expected value of this fraction is the LeafHarmony for ℓ, and it measures the contribution of that leaf to the DendrogramPurity.
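A minimal R sketch of both measures, computed directly from the random-process definitions above rather than the more efficient bottom-up recursion, is given below; the set-of-sets input format (one vector of leaf labels per internal node) is an assumption made for illustration.

# Sketch of DendrogramPurity and LeafHarmony from their definitions.
# Assumed inputs: 'subtrees' is a list of character vectors, one per internal
# node, giving the leaf labels under that node; 'labels' is a named vector of
# class assignments for the leaves.
smallestSubtreeFraction <- function(l, j, subtrees, labels) {
  containing <- Filter(function(s) all(c(l, j) %in% s), subtrees)
  s <- containing[[which.min(vapply(containing, length, integer(1)))]]
  mean(labels[s] == labels[[l]])    # fraction of leaves sharing the class of l
}
leafHarmony <- function(l, subtrees, labels) {
  same <- setdiff(names(labels)[labels == labels[[l]]], l)
  if (length(same) == 0) return(NA_real_)
  mean(vapply(same, smallestSubtreeFraction, numeric(1),
              l = l, subtrees = subtrees, labels = labels))
}
dendrogramPurity <- function(subtrees, labels) {
  mean(vapply(names(labels), leafHarmony, numeric(1),
              subtrees = subtrees, labels = labels), na.rm = TRUE)
}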
For data sets where there are no clearly defined class labels these measures are not applicable, so we have defined a third measure, the LeafDisparity, which highlights differences between two hierarchical clusterings (i.e. dendrograms) of the same data. Intuitively, this measures, for each leaf of one dendrogram, how similar the surrounding subtree is to the corresponding subtree in the other dendrogram. Define the correlation c(τ, ρ) between two sets τ ∈ 𝒯 and ρ ∈ ℛ as a normalised measure of the number of elements the two sets have in common, where |·| denotes the number of elements in a set. Note that a tree can be converted into a set-of-sets representation 𝒯 = {τ_1,..., τ_k}: for each node j in the tree, τ_j is the set of the leaves in the subtree descending from j. (Thus a binary tree with n leaves contains n - 1 non-leaf internal nodes, so k = 2n - 1.)
LeafDisparity(ℓ, 𝒯, ℛ): convert each tree into a set-of-sets representation. Align the trees: for each set τ_j in 𝒯, find the set ρ_k in ℛ for which the correlation is greatest, r_j = max_k c(τ_j, ρ_k). For each leaf ℓ, find the average of r_j over all sets that contain ℓ, calling this r̄(ℓ). If the element ℓ appears in both 𝒯 and ℛ, let its disparity be the minimum of 1 - r̄(ℓ) over the two trees. This measure is therefore symmetric and sensitive to disagreement between the hierarchical clusterings given by the two trees.
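The alignment step can be sketched in R as follows; the Jaccard-style overlap used here for the set correlation c(τ, ρ) is an assumption (the precise normalisation is not fixed by the description above), and the inputs are the set-of-sets representations of the two dendrograms.

# Sketch of the LeafDisparity alignment; 'subtreesT' and 'subtreesR' are
# set-of-sets representations (lists of leaf-label vectors) of the two trees.
setCorrelation <- function(tau, rho) {
  length(intersect(tau, rho)) / length(union(tau, rho))   # assumed overlap measure
}
oneWayDisparity <- function(leaf, subtreesA, subtreesB) {
  r <- vapply(subtreesA, function(tau)
        max(vapply(subtreesB, setCorrelation, numeric(1), tau = tau)),
        numeric(1))                                        # best-match r_j for each set in A
  containsLeaf <- vapply(subtreesA, function(tau) leaf %in% tau, logical(1))
  1 - mean(r[containsLeaf])                                # 1 minus the average r_j over sets containing the leaf
}
leafDisparity <- function(leaf, subtreesT, subtreesR) {
  min(oneWayDisparity(leaf, subtreesT, subtreesR),
      oneWayDisparity(leaf, subtreesR, subtreesT))
}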
Software Implementation
The R/Bioconductor port consists of two functions, bhc and WriteOutClusterLabels. The bhc function calls efficient C++ routines for the special case of the BHC algorithm described in this paper. The algorithm has a computational complexity of order N² for N data points, and runs in about 8 minutes on a 2 GHz MacBook Pro laptop for a data set of size 880 and dimensionality 31 (i.e. the NASC data set used in this paper). The reverse clustering (i.e. size 31 with dimensionality 880) runs in approximately a minute. Runtimes for data sets of various sizes are given in [Additional file 2].
The WriteOutClusterLabels function outputs the resulting cluster labels to an ASCII file. Because the bhc function outputs its results in a standard R dendrogram object, a graphical representation of the output can be obtained by calling the standard R plot function. A 2D heat-map visualization of the clustering can be generated using the standard R function heatmap.
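A hypothetical end-to-end usage sketch is shown below; the argument names passed to bhc and WriteOutClusterLabels are assumptions (the package documentation gives the actual signatures), while plot and heatmap are the standard R functions referred to above.

# Hypothetical usage sketch; bhc() and WriteOutClusterLabels() arguments are assumed.
library(BHC)
discreteData <- matrix(sample(0:2, 880 * 31, replace = TRUE), nrow = 880)  # toy stand-in for the NASC data
geneNames <- paste0("gene", seq_len(nrow(discreteData)))
hc <- bhc(discreteData, geneNames)                 # assumed call; returns a standard R dendrogram
WriteOutClusterLabels(hc, "clusterLabels.txt")     # assumed call; writes cluster labels to an ASCII file
plot(hc)                                           # dendrogram plot via the standard R plot function
heatmap(discreteData, Rowv = hc)                   # 2D heat-map visualisation of the clustering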
In our model the hyperparameters are the concentration parameter, α, which controls the distribution of the prior weight assigned to each cluster in the DPM, and is directly related to the expected number of clusters, and the hyperparameters, β, of the probabilistic model defining each component of the mixture. The concentration parameter, α, was fixed to a small, positive value (0.001). The hyperparameters for the individual mixture component (Dirichlet) priors β are scaled by a single additional hyperparameter, giving the data model greater flexibility. This additional hyperparameter was determined by optimising the overall model Evidence (marginal likelihood), using a combination of golden section search and successive parabolic interpolation (as implemented in the R function optimize). The unscaled β hyperparameters were set by using the whole data-set as a measure of the relative proportions of each discrete value for each gene.
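The search over this scaling hyperparameter can be sketched as follows; logEvidence is a dummy stand-in for a full BHC run returning the lower bound on the log model Evidence at a given scaling.

# Sketch of the hyperparameter search; logEvidence is a toy surrogate for a
# BHC run returning the log evidence at a given scaling of beta.
logEvidence <- function(logScale) -(logScale - 1)^2      # dummy unimodal stand-in
fit <- optimize(logEvidence, interval = c(-10, 10), maximum = TRUE)
betaScale <- exp(fit$maximum)      # scaling applied to the unscaled beta hyperparameters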
Application to Microarray Data
We illustrate our methods with an application to a published data set of GeneChip expression profiles of A. thaliana subjected to a variety of biotic and abiotic stresses, derived from the AtGenExpress consortium (NASC), identical to that used by Torres-Zabala et al. [26]. The expression profiles were selected, normalised and interpreted by the GC content-adjusted robust multi-array average (GCRMA) algorithm [27] exactly as in the original manuscript. Continuous transcript levels were discretised into three levels (unchanged, under- or over-expressed) by dividing the levels at fixed quantiles for each given gene. This makes our analyses more robust to any experimental systematics, as well as simplifying the algorithm by using multinomial distributions and their conjugate Dirichlet priors. By discretising mRNA levels we model the important qualitative changes in mRNA levels without making strong, unjustifiable assumptions (e.g. Gaussianity) about the form of the noise in microarray experiments. We note that such an approach has also been used by other workers in the field [28]. In order to identify the optimal discretisation thresholds we used the following procedure. The discretisation threshold is parameterised via the quantiles, x, of the data for a given gene, such that the data counts are distributed in the proportions x : (1 - 2x) : x. We can then optimise x jointly over all the genes by running the BHC algorithm for different values of x (and hence different discretisations) and using the lower bound on the overall model Evidence, modified to account for the above parameterisation by dividing the Evidence by the relevant bin width for each data point. Evidence values and the optimal value for the hyperparameter mentioned in the previous section are shown [see Additional files 3 and 4]. These results indicate that the optimal quantiles for the discretisation of this data set are 20/60/20 and 25/50/25 for the experiment and gene clustering, respectively.
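For illustration, the per-gene discretisation at quantiles x : (1 - 2x) : x could be sketched in R as follows; exprData is a toy stand-in for the GCRMA-normalised expression matrix.

# Illustrative per-gene discretisation at quantiles x : (1 - 2x) : x
# (x = 0.25 reproduces the 25/50/25 split used for gene clustering).
exprData <- matrix(rnorm(100 * 20), nrow = 100)    # toy stand-in: rows = genes, columns = arrays
discretiseGene <- function(values, x = 0.25) {
  cuts <- quantile(values, probs = c(x, 1 - x))
  findInterval(values, cuts)                       # 0 = under-expressed, 1 = unchanged, 2 = over-expressed
}
discreteData <- t(apply(exprData, 1, discretiseGene, x = 0.25))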