Suppose
is the set of all genes for a given microarray experiment. Let
_{1},.....,
_{
F
}be *F* functional classes, not necessarily disjoint. One could use software like [25] or SAGE library tools (see, e.g., [26]) and public databases (e.g., Gene Ontology, Entrez Gene, Unigene cluster) to annotate and organize the expression values from a microarray experiment into families related by the biological characteristics of the genes or of their encoded proteins. Note that not all the genes can be functionally annotated and hence the set of all annotated genes
:= ⋃
_{
i
}⊂
.

### Biological stability index

Next we capture the stability of a clustering algorithm by inspecting the consistency of the biological results produced when the expression profile is reduced by one observational unit. This stability measure is unrelated to the one introduced by [3] which compared the clusters without regard to biological relevance.

In a microarray or SAGE study, each gene has an expression profile that can be thought of as a multivariate data value in ℜ^{
p
}, for some *p* > 1. For example, in a time course microarray study, *p* could be the number of time points at which expression readouts were taken. In a two sample comparison, *p* could be the total (pooled) sample size, and so on. For each *i* = 1, 2,..., *p*, repeat the clustering algorithm for each of the *p* data sets in ℜ^{p-1 }obtained by deleting the observations at the *i*th position of the expression profile vectors. For each gene *g*, let
^{
g,i
}, denote the cluster containing gene *g* in the clustering based on the reduced expression profile. Let
^{g,0 }be the cluster containing gene *g* using the full expression profile. For each pair of genes *x* and *y* in a biological class, we compare the statistical clusters containing *x* based on the original and the statistical cluster containing *y* based on the reduced profile. A stable clustering algorithm would produce similar answers, as judged biologically, based on the original and the reduced data. Thus, the clusters using full and reduced data, respectively, containing two functionally similar genes should have substantial overlaps. This is captured by the following stability measure and larger values of this index indicate more consistent answers:

A successful clustering is characterized by high values of both of these indices. The following subsection describes how to attribute a *p*-value to an observed index *I* for a given clustering algorithm by comparing it with random clustering of genes into the same number of clusters.

### Statistical scoring

By comparing with "random clustering", we can compute the observed level of significance or *p*-value for he above measures and a given clustering algorithm. This can be done by the following Monte Carlo steps:

Step 1. Compute a performance measure *I* for the clustering procedure under consideration.

Step 2. Compute the same performance measure *I* = *I*
_{
obs
}corresponding to a random clustering algorithm that ignores the data and assigns genes to clusters randomly and independently. This can easily be done by generating (*p* + 1) independent simple random samples (with replacement) of size *M* out of {1,..., *k*}, where *k* denotes the desired number of clusters, and making the cluster assignments
^{g,0 }and
^{
g,i
}, 1 ≤ *g* ≤ *M*, 1 ≤ *i* ≤ *p* accordingly. Denote the resulting value of the performance measure *I**.

Step 3. Repeat Step 2 a large number of times, say *B*, yielding
.

Step 4. Compute the p-value as the proportion of times the performance measure by random cluster assignments exceeds (or equals to) the value obtained using the clustering algorithm under consideration

This proportion estimates the probability of obtaining a value as high as *I*
_{
obs
}just by chance (i.e., by "random clustering"). A 95% upper limit of the distribution of *I* under random clustering can be estimated by
where
is the *j*th ordered *I**.

### Range of *k*, the number of clusters

In general, the users will have the flexibility of investigating the performance of a clustering algorithm over a range of cluster numbers of their choosing. Some clustering algorithms such as Fanny or Model based clustering use data based selection of total number of hard clusters even if a larger number of clusters are desired by the user. For others, this choice is subjective. Often times, the biologists conducting the microarray experiment will make this call. For our illustration with the yeast data we have selected a range of *k* values around *k* = 7 which was used in the original analysis by [27].

### Human breast cancer progression data

We illustrate our methods using the expression profiles of 258 genes (SAGE tags) that were judged to be significantly differentially expressed at 5% significance level between four normal and seven ductal carcinoma in situ (DCIS) samples [26]. [26] combined various normal and tumor SAGE libraries in the public domain with their own SAGE libraries and used a modified form of *t*-statistics to compute *p-*values. Further details can be obtained from their paper and its supplementary web-site.

For constructing the functional classes, we have used a publicly available web-tool called AmiGO [25]. We were able to annotate 113 SAGE tags into the following eleven functional classes based on their primary biological functions. They were as follows: cell organization and biogenesis (24), transport (7), cell communication (15), cellular metabolism (48), cell cycle (6), cell motility (7), immune response (7), cell death(7), development (5), cell differentiation (5), cell proliferation (5), where the numbers in parentheses were the numbers of SAGE tags in a class. There were 23 genes that belonged to more than one functional class.

### Yeast sporulation data

As a second illustrative data set, we use a well known data set collected by [27]. This data set records expression profiles during the sporulation of budding yeast at seven time points. The original data set was filtered using the same criterion as in [27]. For our illustration, we look at a further subset of 513 genes (ORF's to be correct) that satisfy ∑ log expression ratio > 0, where the sum is over all the time points. Note that a positive value of the log of the expression ratio at a time point implies that the gene is positively expressed at that time and thus, in a sense, this is a collection of genes whose expression values change in a positive direction overall during the course of the experiment.

We use two separate web-based tools both using the GO ontology to annotate these ORF's. The resulting functional classifications were different although they had some common GO terms. We wanted to see whether the end comparison of the clustering algorithms is sensitive to the choice of the biological classes. To this end, we wanted to compare two different sets of functional classes, both based on the biological processes, with the same set of yeast ORF's.

For the first set of functional classes we mined the yeast genome database using the FatiGO webtool [28] at [29]. We have used the default FatiGO "level 3" GO terms. However it resulted in some very broad functional classes such as "cellular process" or "cellular physiological processes". In the end, we took a subset of the resulting terms which we judged to be more specific. This resulted in 295 annotated genes into the following ten overlapping biological classes: reproduction (14), cell communications (8), sex determination (4), metabolism (197), morphogenesis (13), cell differentiation (48), cell growth (7), cell regulation (85), response to stimulus (37) and localization (51).

The next set of functional classes were obtained using the web-based GO mining tool FunCat [30] available at [31] which did not offer a choice of "level" of the GO terms. Overall, 503 of the 513 genes were annotated into the following seventeen functional classes: metabolism (138), energy (27), cell cycle and DNA processing (152), transcription (50), protein synthesis (10), protein fate (72), protein with binding function or cofactor requirement (81), protein activity regulation (16), transport (63), cell communication (12), defense (36), interaction with environment (33), cell fate (17), development (41), biogenesis (77), cell differentiation (82).

### The clustering algorithms

We consider the following well known clustering algorithms representing the vast spectrum of clustering techniques that are available in statistical pattern recognition and machine learning literature. We evaluate these algorithms using the two biological performance measures BHI and BSI. One minus correlation was taken as the dissimilarity measure for the "distance" based algorithms. In addition, for UPGMA, Diana, Fanny, we also considered the standard Euclidean distance between expression vectors as a dissimilarity measure. Thus, overall, ten clustering schemes were subjected to this comparative evaluation.

#### UPGMA

This is perhaps the most commonly used clustering method with microarray data sets. This is an agglomerative hierarchical clustering algorithm [4] yielding a dendrogram that can be cut at a chosen height to produce the desired number of clusters. It uses a dissimilarity matrix in order to decide if two expression profiles are close or not.

#### K-means

K-means [6] is a partitioning method that is not hierarchical in nature. This algorithm uses a minimum "within-class sum of squares from the centers" criterion to select the clusters. The number of clusters needs to be fixed in advance.

#### Diana

This is also a hierarchical algorithm which is divisive in nature [7]. Thus at each level, a bigger cluster is divided into two smaller clusters that are furthest apart.

#### Fanny

This algorithm produces a fuzzy cluster [7]. Thus, a probability vector for each observation is reported that represents the probability of its cluster membership. A hard cluster can be produced by assigning it to the cluster with highest probability.

#### SOM

Clustering by self-organizing maps [8] is a popular method amongst the computational biologists and machine learning researchers. SOM is based on neural networks and can be regarded as a data visualization technique.

#### Model based clustering

Under this scheme [10], a statistical model is fit to the data. The model is a finite mixture of Gaussian distributions. Each mixture component represents a cluster. The Maximum likelihood method (EM algorithm) is used to fit the group membership and the mixture components. A number of Gaussian component models are compared as well. The number of clusters and the Gaussian models are chosen by the minimum BIC criterion.

#### SOTA

Self-organising tree algorithm or SOTA has received a great deal of attention in recent years and was used to cluster microarray gene expression data in [32]. Originally proposed by [9] for phylogenetic reconstruction, SOTA produces a divisive hierarchical binary tree structure using a neural network. It uses a fast algorithm and hence is suitable for clustering a large number of objects.

UPGMA (hclust) and K-Means are available in the base distribution of R. Diana and Fanny are available in the library "cluster". Model based clustering is available in the R-package mclust. For SOM, we have used an R code written by Niels Waller and Janine Illian [33]. For SOTA, we have used the MeV component of the TM4 package [34]. Servers running SOTA are also available ([35, 36]).