We use a Minimal Spanning Tree (MST) based algorithm [13, 14] for clustering along with the Fukuyama-Sugeno clustering measure. Gene selection is done on the basis of the two-sample t-statistic with pooled variance. In the next three subsections we will look in detail at the clustering and feature selection aspects before presenting the formal algorithm.

### Minimal spanning trees

Let $V = \{x_1, x_2, \ldots, x_N\}$ be a set of points with distances $d_{ij} = d(x_i, x_j)$ defined between all $x_i$ and $x_j$. A tree on $V$ is a connected graph with no loops whose vertices are elements of $V$ and whose edge lengths are $d_{ij}$. A *minimal spanning tree* (MST) is a tree that connects all points such that the sum of the edge lengths is a minimum. An MST can be computed efficiently in $O(N^2)$ time (including distance calculations) using either Prim's [13] or Kruskal's [14] algorithm.
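As an illustration of the definition above, the following is a minimal sketch of the array-based form of Prim's algorithm, which runs in $O(N^2)$ time. It assumes Euclidean distance for $d$; the function names `euclidean` and `prim_mst` and the toy points are hypothetical and not taken from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def prim_mst(points, dist=euclidean):
    """Return the MST as a list of (i, j, length) edges using O(N^2) Prim."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known distance from the tree to each vertex
    parent = [-1] * n       # tree vertex that realises that distance
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Update the distances of the remaining vertices to the grown tree.
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
mst = prim_mst(points)
total_length = sum(length for _, _, length in mst)
```

For $N$ points the result always has $N-1$ edges; here the three nearby points are joined by two unit-length edges and the distant point is attached by a single long edge.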

Deleting any edge from an MST yields two disconnected trees. Let the length of the deleted edge be $\delta$ and denote the sets of nodes in the two trees by $V_1$ and $V_2$. Then there is no pair of points $(x_1, x_2)$, with $x_1 \in V_1$ and $x_2 \in V_2$, such that $d(x_1, x_2) < \delta$; otherwise, replacing the deleted edge with the edge joining $x_1$ and $x_2$ would give a spanning tree of smaller total length. Define the *separation* between $V_1$ and $V_2$ as the smallest distance between any two points, one in $V_1$ and the other in $V_2$. The separation is therefore at least $\delta$.

The significance of this result is that deleting an edge of length $\delta$ guarantees a partition whose two clusters have a separation of at least $\delta$. Consequently, if we are interested in binary partitions with large separations between the clusters, it is sufficient to look at partitions obtained by deleting edges of the MST. Instead of examining all possible binary partitions (of which there are $2^{N-1}-1$), our algorithm examines only the $N-1$ partitions obtained by deleting single edges from the MST.
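The edge-deletion step can be sketched as follows. The example assumes one-dimensional points with absolute difference as the distance, so the MST is simply the chain of consecutive points and is written out by hand; `split_by_edge` and `separation` are hypothetical helper names.

```python
def split_by_edge(n, edges, k):
    """Delete edges[k] from a tree on vertices 0..n-1 and return the two vertex sets."""
    adj = {i: [] for i in range(n)}
    for idx, (i, j, _) in enumerate(edges):
        if idx != k:
            adj[i].append(j)
            adj[j].append(i)
    # Depth-first search from one endpoint of the deleted edge.
    start = edges[k][0]
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen, set(range(n)) - seen

def separation(points, side_a, side_b):
    """Smallest distance between a point in side_a and a point in side_b."""
    return min(abs(points[i] - points[j]) for i in side_a for j in side_b)

points = [0.0, 1.0, 2.0, 10.0, 11.0]
# MST of collinear points: consecutive points are joined, given as (i, j, length).
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 8.0), (3, 4, 1.0)]

partitions = [split_by_edge(len(points), edges, k) for k in range(len(edges))]
separations = [separation(points, a, b) for a, b in partitions]
```

Each of the $N-1 = 4$ candidate partitions has a separation no smaller than the length of the edge that was deleted, and deleting the longest edge recovers the two visually obvious groups.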

Minimal spanning trees were first proposed for clustering by Zahn [15]. More recently, Xu *et al.* have used MSTs to cluster gene expression data [16].

### Clustering measure

To compare the partitions obtained by deleting different edges of the MST, we use the Fukuyama-Sugeno clustering measure [17]. Given a partition $S_1, S_2$ of the sample index set $S$, with each $S_k$ containing $N_k$ samples, denote by $\mu_k$ the mean of the samples in $S_k$ and by $\mu$ the global mean of all samples. Also denote by $x_j^k$ the $j$-th sample in cluster $S_k$. Then the Fukuyama-Sugeno (F-S) clustering measure is defined as

$$FS(S) = \sum_{k=1}^{2} \sum_{j=1}^{N_k} \left( \lVert x_j^k - \mu_k \rVert^2 - \lVert \mu_k - \mu \rVert^2 \right)$$

Small values of *FS(S)* are indicative of tight clusters with a large separation between clusters.
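To make the behaviour of the measure concrete, here is a small sketch that assumes the crisp (hard-partition) form of the F-S measure: for every sample, the squared distance to its own cluster mean minus the squared distance between that cluster mean and the global mean, summed over all samples. The function names and the toy data are illustrative assumptions.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def fukuyama_sugeno(samples, clusters):
    """F-S measure for a hard partition; `clusters` is a list of index lists."""
    mu = mean_vector(samples)
    total = 0.0
    for idx in clusters:
        members = [samples[i] for i in idx]
        mu_k = mean_vector(members)
        between = sq_dist(mu_k, mu)          # cluster mean vs. global mean
        total += sum(sq_dist(x, mu_k) - between for x in members)
    return total

samples = [(0.0,), (1.0,), (10.0,), (11.0,)]
fs_good = fukuyama_sugeno(samples, [[0, 1], [2, 3]])   # tight, well separated
fs_bad = fukuyama_sugeno(samples, [[0, 2], [1, 3]])    # mixed clusters
```

The well-separated partition gives a large negative value and the mixed partition a large positive one, which matches the reading that smaller values indicate better partitions.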

We have considered various other clustering measures. The ideal clustering measure should show local minima at each viable partition and have good performance even with a large number of noisy features. We have found the Fukuyama-Sugeno (F-S) measure to give the best performance in these two respects (Supplementary data – Additional file 1).

### Feature selection

For a given partition with two clusters, we can ask whether a particular gene shows sufficient differential expression between samples belonging to the different clusters. A gene that is expressed very differently in samples belonging to different clusters can be said to be relevant to the partition, or to support the partition. There are many ways of measuring a gene's support for a partition. Here we use the two-sample t-statistic with pooled variance. The t-statistic is computed for each gene to compare the mean expression level in the two clusters. Genes with absolute t-statistic greater than a threshold $T_{\mathrm{thresh}}$ are selected. The percentile threshold parameter $P_{\mathrm{thresh}} \in (0, 100)$ is used to compute $T_{\mathrm{thresh}}$. $T_{\mathrm{thresh}}$ is the $P_{\mathrm{thresh}}/2$-th percentile of a random variable distributed according to Student's t-distribution with mean zero and $N-2$ degrees of freedom ($N$ is the number of samples). Here we use the t-statistic as a heuristic measure of the contribution of each gene to the selected partition; no statistical significance is implied.
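The selection rule can be sketched as below. The cutoff is passed in as a plain number rather than derived from a percentile, and the treatment of genes with zero pooled variance is an assumption made only for this sketch; `pooled_t` and `select_genes` are hypothetical names, and the expression matrix is stored with one row per sample and one column per gene.

```python
import math

def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance for two lists of values."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)   # requires n1 + n2 > 2
    if pooled_var == 0.0:
        # Assumption for this sketch: constant genes get 0, perfectly split genes get +/-inf.
        return 0.0 if m1 == m2 else math.copysign(math.inf, m1 - m2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def select_genes(expr, cluster_a, cluster_b, cutoff):
    """Indices of genes whose absolute t-statistic exceeds `cutoff`."""
    selected = []
    for g in range(len(expr[0])):
        a = [expr[s][g] for s in cluster_a]
        b = [expr[s][g] for s in cluster_b]
        if abs(pooled_t(a, b)) > cutoff:
            selected.append(g)
    return selected

# Toy matrix: gene 0 differs between the clusters, gene 1 does not.
expr = [[1.0, 5.0], [2.0, 5.1], [3.0, 4.9],
        [4.0, 5.0], [5.0, 5.2], [6.0, 4.8]]
t_gene0 = pooled_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
selected = select_genes(expr, [0, 1, 2], [3, 4, 5], cutoff=2.0)
```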

The condition for selecting a gene becomes stricter with each iteration. In the first iteration we choose genes with absolute t-statistic greater than $T_{\mathrm{thresh}}/2$. This cutoff increases linearly with the number of iterations until it reaches $T_{\mathrm{thresh}}$. This ensures that we do not lose useful genes by imposing an overly stringent selection criterion before the partition has evolved close to its final form.
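A linear schedule of this kind might look like the sketch below. The text does not state how many iterations the ramp takes, so `ramp_iterations` is a hypothetical parameter introduced only for illustration.

```python
def cutoff_at(iteration, t_thresh, ramp_iterations=10):
    """Cutoff for a 0-based iteration: starts at t_thresh / 2, rises linearly, capped at t_thresh."""
    step = (t_thresh / 2) / ramp_iterations   # assumed step size
    return min(t_thresh, t_thresh / 2 + iteration * step)

schedule = [cutoff_at(i, t_thresh=4.0) for i in (0, 5, 10, 50)]
```

With $T_{\mathrm{thresh}} = 4$ the cutoff starts at 2, reaches 3 halfway through the ramp, and stays at 4 once the ramp is complete.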

### The algorithm

Initially, an MST is created using all the genes; then each binary partition obtained by deleting an edge from the tree is considered as a putative partition. The partition with the minimum value of the F-S clustering measure is selected. The t-statistic is used to select a subset of genes that discriminate between the clusters in this partition. In the next iteration, clustering is done in this set of selected genes. This process continues until the selected gene subset converges (remains the same between two iterations), resulting in a set of genes and the final partition. Having identified a partition and the associated set of genes, these selected genes are removed from the pool of genes. This prevents the algorithm from detecting the same partition the next time. The whole process repeats in the pool of remaining genes to find other partitions.

The inputs to the algorithm are the gene expression matrix $\{x_{s,g}\}$, the maximum number of partitions to be found $\mathit{MaxN}_p$, and the percentile threshold $P_{\mathrm{thresh}}$, which is used to compute $T_{\mathrm{thresh}}$. The outer loop of the algorithm runs as long as the number of discovered partitions is less than $\mathit{MaxN}_p$. At the start of each outer iteration, the set of selected genes $F$ is initialized to *Fset*, the pool of genes not yet assigned to a partition (initially all genes), and the cutoff $t$ is initialized to $T_{\mathrm{thresh}}/2$. In the inner loop, an MST is created using the genes in $F$, and the F-S measure is calculated for every partition obtained by deleting a single edge from this MST. For the partition $P^*$ with the lowest F-S measure, genes are selected from $F$ based on the t-statistic. These selected genes form the new gene set $F_{\mathrm{new}}$. If $F_{\mathrm{new}} \neq F$, the cutoff $t$ is increased and another iteration of the inner loop is performed. If $F_{\mathrm{new}} = F$ (and, as required in Algorithm 1, the cutoff has reached $T_{\mathrm{thresh}}$), the gene set has remained unchanged between two iterations, and the current partition $P^*$ is output along with the current gene set $F$. The number of discovered partitions is then incremented and another iteration of the outer loop is performed.
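The nested loops described above can be put together in a compact sketch. It is not the authors' implementation: it assumes Euclidean distance for the metric, the crisp form of the F-S measure, a cutoff `t_thresh` and a minimum gene count `min_genes` supplied directly by the caller, and a hypothetical `ramp` parameter for the cutoff schedule. All function names are illustrative, and at least two samples are required.

```python
import math

def sq_dist(a, b, genes):
    """Squared Euclidean distance restricted to the given gene indices."""
    return sum((a[g] - b[g]) ** 2 for g in genes)

def mst_edges(x, genes):
    """Prim's algorithm over samples; squared distances yield the same tree."""
    n = len(x)
    in_tree, best, parent = [False] * n, [math.inf] * n, [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = sq_dist(x[u], x[v], genes)
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def side_of(n, edges, skip):
    """Samples still connected to edges[skip][0] after deleting edges[skip]."""
    adj = [[] for _ in range(n)]
    for k, (i, j) in enumerate(edges):
        if k != skip:
            adj[i].append(j)
            adj[j].append(i)
    seen, stack = {edges[skip][0]}, [edges[skip][0]]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def fs_measure(x, genes, groups):
    """Crisp F-S measure: within-cluster scatter minus between-cluster scatter."""
    mu = {g: sum(row[g] for row in x) / len(x) for g in genes}
    total = 0.0
    for grp in groups:
        mu_k = {g: sum(x[s][g] for s in grp) / len(grp) for g in genes}
        between = sq_dist(mu_k, mu, genes)
        total += sum(sq_dist(x[s], mu_k, genes) - between for s in grp)
    return total

def t_stat(x, g, a, b):
    """Two-sample t-statistic with pooled variance for gene g."""
    va, vb = [x[s][g] for s in a], [x[s][g] for s in b]
    ma, mb = sum(va) / len(va), sum(vb) / len(vb)
    ss = sum((v - ma) ** 2 for v in va) + sum((v - mb) ** 2 for v in vb)
    dof = len(va) + len(vb) - 2
    if dof <= 0 or ss == 0.0:
        return 0.0  # sketch-only choice: degenerate genes get no support
    return (ma - mb) / math.sqrt(ss / dof * (1 / len(va) + 1 / len(vb)))

def discover(x, t_thresh, min_genes, max_partitions, ramp=5):
    """Return a list of (cluster_a, cluster_b, supporting_genes) tuples."""
    n, pool, found = len(x), list(range(len(x[0]))), []
    while len(found) < max_partitions:
        genes, t = list(pool), t_thresh / 2
        while True:
            if len(genes) < max(min_genes, 1):
                return found  # too few genes support any partition
            edges = mst_edges(x, genes)
            best = None
            for k in range(len(edges)):
                a = side_of(n, edges, k)
                b = set(range(n)) - a
                score = fs_measure(x, genes, (a, b))
                if best is None or score < best[0]:
                    best = (score, a, b)
            _, a, b = best
            new = [g for g in genes if abs(t_stat(x, g, a, b)) > t]
            if new == genes and t >= t_thresh:
                found.append((sorted(a), sorted(b), genes))
                pool = [g for g in pool if g not in genes]
                break
            genes, t = new, min(t_thresh, t + t_thresh / (2 * ramp))
    return found

# Toy data: rows are samples; genes 0-2 separate samples 0-2 from samples 3-5.
x = [[0.0, 0.1, 0.2, 1.0, 2.0, 1.5], [0.2, 0.0, 0.1, 2.0, 1.0, 1.4],
     [0.1, 0.2, 0.0, 1.5, 1.5, 1.6], [10.0, 10.1, 10.2, 1.2, 1.8, 1.5],
     [10.2, 10.0, 10.1, 1.8, 1.2, 1.4], [10.1, 10.2, 10.0, 1.5, 1.4, 1.6]]
found = discover(x, t_thresh=4.0, min_genes=2, max_partitions=1)
```

On this toy matrix the first pass already finds the split between the first three and last three samples; the gene set shrinks to the three informative genes and then stays fixed while the cutoff ramps up to `t_thresh`.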

Since this is an unsupervised method, the partitions picked might be indicative of biological differences that are relevant, irrelevant (like age or sex of patients) or unknown. We control the detection of chance partitions (*i.e.* generated due to noise and not due to any biological difference) by requiring a minimum of $2M(1 - P_{\mathrm{thresh}}/100)$ genes in support of a partition ($M$ is the total number of genes); the algorithm is terminated if there are fewer.
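As a quick worked example of this stopping rule, the snippet below evaluates the minimum gene count for made-up values of $M$ and $P_{\mathrm{thresh}}$; the function name and the numbers are illustrative only.

```python
def min_supporting_genes(total_genes, p_thresh):
    """Minimum number of supporting genes: 2 * M * (1 - P_thresh / 100)."""
    return 2 * total_genes * (1 - p_thresh / 100)

required = min_supporting_genes(total_genes=5000, p_thresh=95)
```

With 5000 genes and a threshold of 95, a partition would need roughly 500 supporting genes to be accepted.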

$P_{\mathrm{thresh}}$ plays an important part in the kind of partitions that are extracted. A value of $P_{\mathrm{thresh}}$ close to 100 will preferentially extract partitions that are supported by genes with large differential expression between the two clusters. A smaller value of $P_{\mathrm{thresh}}$ will pick up partitions that are supported by a larger number of genes with lower differential expression between the clusters.

$P_{\mathrm{thresh}}$ cannot be interpreted as a measure of the statistical significance of the partitioning, since we are doing both the partitioning and the feature selection on the same set of samples. Here we only use $P_{\mathrm{thresh}}$ as a parameter for selecting genes.

**Algorithm 1:** Algorithm for iterative clustering

**Input** $\mathit{MaxN}_p$, $P_{\mathrm{thresh}}$, $x_{s,g}$;

*Fset* ← $\{1, 2, \ldots, M\}$;

$N_p$ ← 0; /* Number of partitions discovered so far */

**Compute** $T_{\mathrm{thresh}}$;

**While** $N_p < \mathit{MaxN}_p$ **do**

$F$ ← *Fset*;

$t$ ← $T_{\mathrm{thresh}}/2$;

**While** 1 **do**

**If** length of $F$ < $2M(1 - P_{\mathrm{thresh}}/100)$ **then**

/* Not enough genes support a partition */

**exit**;

**end**

Create MST in feature set $F$ with metric $d$;

Delete edges one at a time and calculate the F-S measure for each ensuing binary partition;

Find the partition $P^*$ with the lowest F-S measure;

Compute the t-statistic $t_g$ for all genes $g \in F$ for this partition;

Set $F_{\mathrm{new}}$ to the set of genes $\{g : |t_g| > t\}$;

**If** $F_{\mathrm{new}} = F$ **AND** $t = T_{\mathrm{thresh}}$ **then**

/* Feature set has converged */

output $P^*$ and $F$;

/* Remove genes in $F$ from *Fset* */

*Fset* ← *Fset* \ $F$;

$N_p$ ← $N_p + 1$;

**break**;

**else**

$F$ ← $F_{\mathrm{new}}$;

Increase $t$;

**end**

**end**