Cluster stability scores for microarray data in cancer studies

Background A potential benefit of profiling of tissue samples using microarrays is the generation of molecular fingerprints that will define subtypes of disease. Hierarchical clustering has been the primary analytical tool used to define disease subtypes from microarray experiments in cancer settings. Assessing cluster reliability poses a major complication in analyzing output from clustering procedures. While most work has focused on estimating the number of clusters in a dataset, the question of stability of individual-level clusters has not been addressed. Results We address this problem by developing cluster stability scores using subsampling techniques. These scores exploit the redundancy in biologically discriminatory information on the chip. Our approach is generic and can be used with any clustering method. We propose procedures for calculating cluster stability scores for situations involving both known and unknown numbers of clusters. We also develop cluster-size adjusted stability scores. The method is illustrated by application to data three cancer studies; one involving childhood cancers, the second involving B-cell lymphoma, and the final is from a malignant melanoma study. Availability Code implementing the proposed analytic method can be obtained at the second author's website.


Background
Due to the advent of high-throughput microarray technology, scientists have conducted global molecular profiling studies in cancer [1][2][3]. One of the scientific goals of these experiments is the discovery of disease subtypes defined by the gene expression data that are more predictive of clinical outcomes (disease recurrence, survival, diseasefree survival, etc.) than usual clinical correlates. Development of such a molecular classification system can potentially lead to more tailored therapies for patients as well as better diagnostic procedures.
Hierarchical clustering has been an important tool in the discovery of disease subtypes in microarray data [4]. Such procedures typically output a dendrogram that groups samples. Determining the reliability of clustering procedures poses a major problem in the interpretation and analysis of microarray data.
One important related question is estimating the true number of clusters in a dataset so that clusters which arise due to random chance can be separated from those which represent "true" clusters. The null hypothesis that is being tested here is that of no structure in the data. This is often referred to as a global hypothesis of clustering. Several methods have addressed this issue: these include the proposals of Hartigan [5], Krzanowski and Lai [6], Tibshirani et al. [7], Ben-Hur et al. [8] and Dudoit and Fridlyand [9].
In addition, there have been alternative clustering methodologies developed for microarray data [10,11]. Still more work has been done on assessing the validity of a clustering procedure based on the jackknife [12] and bootstrap methods [13].
A second hypothesis of interest in clustering problems is testing to determine if particular clusters found represent reliable clusters. In contrast to the global test of clustering described in the previous paragraph, inference regarding particular clusters is local in nature. There has been some recent work focused on assessing the local reliability of clusters [14,15]. While the global and local hypotheses involve clustering are different, it is obvious that the particular clusters found depend on the number of clusters one determines to be in the dataset.
In most microarray studies, the number of samples profiled is much smaller than the number of genes and ESTs represented on the chip. Due to the number of elements spotted on the microarray, it is reasonable to assume that there is redundant information available on them. Consequently, if we cluster samples based on a subset of the spots on the microarray, stable clusters should be replicated on average. This statement heuristically describes our approach to assessing the reliability of clustering analyses of microarray data. It involves performing sensitivity analyses using random subspace methods. The approach is relatively generic and can be applied to any clustering algorithm. We will focus primarily on hierarchical clustering since that is the technique used most often in the analysis of microarray data. While we are primarily interested in clustering samples, these methods can be utilized for clustering genes as well. These techniques have been examined for supervised learning problems [16]; their application to clustering techniques appears to be novel. The issue addressed in this paper is separate from estimating the number of clusters in a dataset. However, the two problems are related; in particular, the sensitivity measures we develop depend on the number of clusters. In System and Methods, we describe the data used, outline hierarchical clustering and summarize the procedure of Ben-Hur et al. [8] for estimating the number of clusters. Two approaches are taken in this paper. For the first, we assume that the number of clusters is known; sensitivity measures using random subspace methods are calculated. In the second situation, the number of clusters is unknown. We address this problem by proposing a twostage procedure in which the number of clusters is estimated at the first stage and sensitivity measures are calculated at the second. These techniques are described in Systems and methods and compared with the methods of McShane et al. [14] and Tibshirani et al. [15]. We have programmed our procedures in the R language; in Implementation, we discuss the software. We use these meth-ods to re-analyze three publicly available datasets in the literature: a childhood cancer study [3], a B-lymphoma study [2], and a cutaneous melanoma study [1]. These analyses are summarized in Results. Finally, in Discussion, we make some concluding remarks.

Data and clustering procedures
We will let x 1 , ..., x n denote the p-dimensional vectors of gene expression profiles; n is the number of samples profiled. In what follows, we assume that the data have been preprocessed and normalized. Thus, our procedures work with both oligonucleotide and cDNA microarrays.
Since we will be primarily applying our methods to hierarchical clustering procedures, we briefly summarize the method here.

Hierarchical clustering
To implement the standard method for the analysis of gene expression data from microarray experiments, one first constructs a similarity measure for each pair of objects. Some examples are given in Table 1. Clustering is based on a pairwise distance matrix between objects, where distance is defined to be one minus the association measure.
Hierarchical clustering methods fall into two classes: agglomerative nesting methods and divisive analysis methods [17]. Agglomerative nesting algorithms proceed in the same general manner: begin with n singleton clusters; the closest pair of distinct clusters is found and merged, leaving (n -1) singleton clusters and one cluster with two distinct objects; the dissimilarity matrix is updated to take into account the merging that has occurred; based on the new dissimilarity matrix, the two closest distinct clusters are found and merged; iterate until one cluster consisting of all n objects remains.
The opposite to agglomerative nesting is a divisive analysis approach. Heuristically, the algorithm begins with one cluster of n objects. The object in the cluster that has the greatest dissimilarity to the other elements (the seed) is then separated to form a so-called splinter group and the remaining elements in the original cluster are examined to see whether or not additional elements should be added to the splinter group. Two clusters result. The diameter of each cluster (the largest distance between observations in the same cluster) is then computed to see which one is greater. The steps above are repeated with the cluster that has the greater diameter. Iterate until there are n singleton clusters. The distance for separate clusters can be defined based on average linkage or one of the other methods described above.
The algorithms described are a fraction of the available methods for clustering gene expression data. Other techniques that have been used include self-organizing maps, minimal spanning trees, spectral analysis and k-means clustering. While the methods described in this paper can be used with any of these clustering procedures, we focus on hierarchical clustering due to its popularity and to facilitate comparisons with other proposals.

Estimating number of clusters
In the Algorithm section, we discuss a two-stage procedure for performing sensitivity analysis of clustering output when the number of clusters is not fixed a priori. The method involves estimating the number of clusters at the first stage and then computing random subspace-based sensitivity measures at the second stage. We looked at the literature for the various proposals of estimating the number of clusters. Based on our experience with real datasets, the best performance seemed to be given by the method of Ben-Hur et al. [8]. We now briefly describe their procedure. It should be pointed out that our approach is relatively generic and that any method for estimating the number of clusters can be used in the first stage.
In the approach of Ben-Hur et al. [8], the samples are partitioned into k clusters. We then rerun the clustering algorithm based on the subsampling a fraction of the samples and group the subsamples into k clusters. We then compute a similarity index of the subsamples, the correlation coefficient between the clusters for the resampled data with those for the original data based on the definition given by Fowlkes and Mallows [18]. We repeat this several times to get a histogram of correlation coefficient values. We then vary k and redo the procedure.

Random subspace methods for known number of clusters
In this section, we assume that the number of clusters is known to be some number, say K. Thus, the samples {1,2, ..., n} are partitioned into K sets A 1 , ..., A K . To apply the random subspace, we randomly choose a subset D of the indices {1,2, ..., p}, where the cardinality of D is d; We the choice of d is discussed later. We then create a new dataset , where is the d-dimensional subvector of x i (i = 1, ..., n). We create a new dissimilarity matrix based on the , i = 1, ..., n and rerun the hierarchical clustering procedure. The resulting dendrogram is cut into K clusters, . We then check to see if A i ⊂ for i, j = 1, ..., K. The random subspace selection is repeated B times. For each of the original sets A 1 , ..., A K , we compute the proportion of B samples in which the set appears. This is our sensitivity measure. If the value is close to 1, then this evidence that the cluster is stable. On the other hand, if the proportion is small, then this provides less evidence of the stability of the cluster.
These sensitivity measures will depend on the choice of d. Larger values of d tend to yield larger sensitivity measures while the converse holds for small d. Our experience has been to choose d to be within between .75 and .85 times p.
While we have presented the procedure from a purely algorithmic point of view, there is some theoretical justification for our procedure. Since we are computing proportions based on B random subsets of (1, ..., p), the sensitivity measure can be thought as a probabilistic quantity that is averaged over B models. This provides an analogue of stacking or combining models [19,20] for unsupervised learning. It might be also possible to calculate sensitivity measures that average both over d as well as over subsets of (1, ..., p), but we will not pursue that here.
It is obvious that the criteria A i ⊂ (i, j = 1, ..., K) will favor smaller clusters. We will also calculate a sizeadjusted cluster stability score. If P i represents cluster stability score for the ith cluster (i = 1, ..., K), then the sizeadjusted score is , where C i = 1/(ln|A i | + 1), |A i | is the size of the ith cluster, and ln(x) represents the natural logarithm of x. For two given clusters that have the same unadjusted cluster stability score, the adjusted cluster stability score will be greater for that with the larger number of clusters.

Random subspace methods for unknown number of clusters
Having developed a method for using random subspace techniques in, we can summarize our method when the number of clusters is not known a priori by the following two-stage method. First, we estimate the number of clusters at the first stage using the method of Ben-Hur et al. (2002). Next, conditional on the number of clusters estimated at the first stage, use the random subspace method developed in the previous section for calculating the sensitivity measures of the K* clusters.

Comparisons with other proposals
Two other techniques for assessing the reliability of individual clusters are R-index procedure of McShane et al. [14] and the cluster scoring method of Tibshirani et al. [15]. We now compare and contrast our method with these two works.
A recent paper by McShane et al. [14] describes the application of the R-index [18] for inference regarding the local hypothesis of clustering. Note that we use the R-index for addressing the problem of number of clusters, which is the global hypothesis of clustering. The authors create new datasets based on adding independent normal random errors to the original dataset and then determine the proportion of pairs of specimens within the original cluster for which the members of pairs stay together in the reclustered perturbed dataset. While the method bears some relationships with ours, there are several operational differences. First, their method requires adding independent noise to the original data. By contrast, our method involves subsampling genes from the expression profiles. Second, while they use the overall experimental variance for data perturbation, this choice is relatively ad hoc. Our method requires no specification of added error variance. Furthermore, the added noise in their procedure is independent across genes, which is not a biologically plausible assumption. Our procedure avoids such independence assumptions.
In the method of Tibshirani et al. [15], a hierarchical clustering is performed on the genes. The average gene expression profile in each cluster is associated with a clinical response. A set of winning clusters is then found, and permutation methods are used to assess the reliability of the winning clusters. One fundamental difference between our method and that of Tibshirani et al. [15] is that their method requires a clinical outcome. Their goals is to correlate gene expression patterns with a response, and a byproduct of their procedure is a score associating each cluster with the clinical response. Our procedure, by contrast, does not require a clinical response and can be used on the gene expression data only.

Implementation
We have written macros in R for implementing the methods we have proposed for genes and samples. They are obtainable from the second author's website at the following URL: http://www.sph.umich.edu/~ghoshd/COMP BIO/.

Results
We now discuss the application of the proposed methodology to three microarray datasets: one from a childhood cancer study [3], one from a lymphoma study [1] and the final is from a cutaneous melanoma study [2].
For each dataset, the Ben-Hur et al. [8] algorithm was applied to hierarchical cluster solutions obtained using average and complete-linkage upon standardization of gene expression values. At each iteration of the algorithm, two data subsets were created by randomly selecting 65% of the available samples. Correlations between the cluster designations for the members of each subset pair were calculated using the Jaccard co-efficient. For each cluster number, k, considered, 100 correlations were computed and the distribution of correlation coefficients was mapped. The distributions obtained for various cluster numbers were compared to determine the best estimate of the true number of clusters. In instances for which the true number of clusters was not obvious, both visual inspection of the original dendrogram and examination of the result obtained using the other linkage method for that dataset were considered.
After estimating the true number of clusters, we then calculated cluster stability scores using d = 0.85 p, 0.75 p, 0.5 p and 0.25 p, where p is the number of genes. For each setting, B = 100 cluster trials were performed. Both unadjusted and cluster size-adjusted scores were calculated.
In the Khan dataset, gene expression values were measured for p = 2308 genes on a total of n = 89 subjects. For these data, application of the Ben-Hur et al. [8] algorithm in addition to other methods described above resulted in estimates of five and seven for the number of clusters in the average-and complete-linkage solutions, respectively. To account for the presence of three singletons, the dendrogram for the average-linkage solution was cut at k = 8 to ensure five non-singleton clusters. The results of random gene subset clustering using average-linkage are shown in Table 2. Similarly, results from the analysis of the complete-linkage solution are presented in Table 3. The average linkage clustering method finds a cluster of 66 childhood cancers which contains cancers of different sites of origins, so it is not very meaningful clinically. Similarly, the clustering results from the complete linkage analyses do not suggest the presence of any meaningful clusters, although the cluster of seven samples with a high stability score are from the same tumor type (Ewing sarcoma).
In the Alizadeh dataset, data were available on n = 96 samples for whom gene expression values on p = 4026 different genes were measured. Application of the Ben-Hur et al. [8] methodology to the average-linkage solution suggested the presence of eight true clusters in the data. A similar estimate was assumed for the complete-linkage solution since no conclusive result was obtained. In both instances, the dendrograms were cut at larger values of k to account for the presence of singletons (eight and five for the average and complete-linkage results, respec-tively). Tables 4 and 5 Tables 6 and 7. While the average linkage results suggest that the melanoma cluster of Bittner et al. [2] should be expanded to include two samples (Cluster 1 in Table 7), this cluster is not found using complete linkage. In addition, the stability of the cluster drops off with decreasing numbers of genes.

Discussion
In this paper, we have developed an approach to statistical validation of clustering results based on subsampling methods. One of the advantages of this approach is that it      exploits the fact that in microarray experiments, the number of spots on the chip is greater than the number of samples profiled. By subsampling the spots on the chip, we are able to determine which clusters are relatively stable on average. It is important to note that an assumption being made is that there is sufficient correlation on the spots with respect to discriminating between clustered samples. For example, if only one gene on a 10 K chip discriminates two cancer subtypes, then the approach described here might give misleading results. However, given the fact that cancer is a complex trait, it is highly unlikely that all discriminatory information will be available in one gene.
Based on the cluster stability score method, we revisited several datasets from cancer studies to explore the stability of clustered samples. The main point of the analyses was to demonstrate the ability of our method to provide a measure of stability for the clusters that were found. In certain cases, the analyses helped confirm was what found in the previous analyses, while in other cases, they led to clinically nonmeaningful results. These results demonstrate the potential pitfalls of clustering analyses [21].
In many cancer studies, there are additional clinical covariates (e.g., survival time, PSA recurrence) available. One potential method of more formal biological validation is to combine the clustering methodology with correlation of the subsequent output to these covariates. Such an approach was taken in Tibshirani et al. [15]. Due to the variability in gene expression data, it may be potentially desirable to incorporate clinical knowledge into such clustering analyses.