Volume 6 Supplement 2
Second Annual MidSouth Computational Biology and Bioinformatics Society Conference. Bioinformatics: a systems approach
Reproducible Clusters from Microarray Research: Whither?
 Nikhil R Garge†^{4},
 Grier P Page^{1},
 Alan P Sprague^{2},
 Bernard S Gorman^{3} and
 David B Allison†^{1}
DOI: 10.1186/1471-2105-6-S2-S10
© Garge et al; licensee BioMed Central Ltd. 2006
Published: 15 July 2005
Abstract
Motivation
In cluster analysis, the validity of specific solutions, algorithms, and procedures presents significant challenges because there is no null hypothesis to test and no 'right answer'. It has been noted that a replicable classification is not necessarily a useful one, but a useful one that characterizes some aspect of the population must be replicable. By replicable we mean reproducible across multiple samplings from the same population. Methodologists have suggested that the validity of clustering methods should be based on classifications that yield reproducible findings beyond chance levels. We used this approach to determine the performance of commonly used clustering algorithms and the degree of replicability achieved using several microarray datasets.
Methods
We considered four commonly used iterative partitioning algorithms (Self Organizing Maps (SOM), Kmeans, Clustering LARge Applications (CLARA), and Fuzzy Cmeans) and evaluated their performance on 37 microarray datasets, with sample sizes ranging from 12 to 172. We assessed the reproducibility of each clustering algorithm by measuring the strength of the relationship between clustering outputs of subsamples of the 37 datasets. Cluster stability was quantified using Cramer's v^{2} from a k × k table. Cramer's v^{2} is equivalent to the squared canonical correlation coefficient between two sets of nominal variables. Potential scores range from 0 to 1, with 1 denoting perfect reproducibility.
Results
All four clustering routines show increased stability with larger sample sizes. Kmeans and SOM showed a gradual increase in stability with increasing sample size. CLARA and Fuzzy Cmeans, however, yielded low stability scores until sample sizes approached 30 and then gradually increased thereafter. Average stability never exceeded 0.55 for the four clustering routines, even at a sample size of 50. These findings suggest several plausible scenarios: (1) microarray datasets lack natural clustering structure, thereby producing low stability scores on all four methods; (2) the algorithms studied do not produce reliable results; and/or (3) sample sizes typically used in microarray research may be too small to support derivation of reliable clustering results. Further research should be directed towards evaluating the stability of more clustering algorithms on more datasets, especially datasets with larger sample sizes, and with larger numbers of clusters considered.
Introduction
Cluster analysis is a statistical approach used in microarray research that identifies genes within a cluster that are more similar to each other than genes contained in different clusters. By grouping genes that exhibit similarities in their expression patterns, the function of those genes which were previously unknown may be revealed. There are two groups of clustering methods, hierarchical and nonhierarchical. Nonhierarchical algorithms require the number of clusters (k) be prespecified. Nonhierarchical algorithms can run multiple times with different values of k. The user can then choose the clustering solution that is logical to address the problem of interest.
If we consider each gene as a point in high dimensional space, then "clusters may be described as continuous regions of this space containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points. Clusters described in this way are sometime referred to as natural clusters" [1].
Despite the use of cluster analysis in microarray research, the evaluation of the "validity" of a cluster solution has been challenging. This is due, in part, to the properties of cluster analysis. Cluster analysis has no null hypothesis to test and hence no right answer, which makes the testing of the validity of specific solutions, algorithms, and procedures difficult [2]. A second challenge encountered is that genes may not "naturally" fall into clusters separated by empty areas of the attribute space in genome expression studies. Hence, genomewide collections of expression trajectories may lack a "natural clustering" structure in many cases [1]. Third, the result of gene clustering may be "method sensitive". That is, gene clustering depends on several methodological choices, including the distance metric used, the clustering algorithm, and the stopping rule in the case of iterative partitioning methods. Hence, it is important to evaluate the stability of any specific derived cluster solution and the general performance of clustering approaches.
According to McShane et al., "Clustering algorithms always detect clusters, even in random data and it is imperative to conduct some statistical assessments of the strength of evidence for any clustering and to examine the reproducibility of individual clusters" [3]. Roth et al. defined stability as "the variability of solutions which are computed from different data sets sampled on the same source" [4]. It has been noted that a replicable classification is not necessarily a useful one, but a useful one that characterizes some aspect of the population must be replicable [5]. The concept of a replicable cluster is defined as reproducible across multiple samplings from the same population. Thus, some methodologists have suggested that the validity of clustering methods could be defined as the extent by which they yield classifications that are reproducible beyond chance levels. Most recently, Tseng et al. [6] identified stability of clusters in a sequential manner through an analysis of the tendency of genes to be grouped together under repeated resampling. Famili et al. [7] summarized the related work as follows:
Zhang et al. [8] proposed a parametric bootstrap resampling method (PBR) to incorporate information on variations in gene expression levels to assess the reliability of gene clusters identified from largescale gene expression data... Smolkin et al. [9] assessed the stability of a cluster using their Cluster Stability Score, by which a cluster's stability is calculated through clustering on random subspace of the attribute space... BenHur et al. [10] proposed a stabilitybased resampling method for estimating the number of clusters, where stability is characterized by the distribution of pairwise similarities between clusters obtained from subsamples of the data... Datta et al. [11] formulated 3 other validation measures using the leftoutone condition strategy to evaluate the performances of 6 clustering algorithms... Giurcaneanu et al. [12] introduced a stability index to estimate the quality of clusters for randomly selected subsets of the data.
Clusters that produce classifications with greater replicability would be considered more valid [5]. The objective of this paper is to determine the performance of commonly used nonhierarchical clustering algorithms and the degree of stability achieved using several microarray datasets.
Methods
Data
Real datasets
List of microarray datasets considered for the study. Table 1 lists the datasets in two side-by-side groups of columns. Each dataset is described by its name, source, and sample size (n). Table 1 shows 37 datasets. The first three columns list 19 datasets and the last three columns list 18 datasets.
Name of the dataset  Source  Sample size (n)  Name of the dataset  Source  Sample size (n) 

GDS22  GEO  80  Leukemia dataset  [30]  70 
GDS171  GEO  30  Medulloblastoma Data Set  [31]  34 
GDS184  GEO  30  Prostate Cancer dataset  [32]  100 
GDS232  GEO  46  Gaffney Head and Neck data  [33]  60 
GDS274  GEO  80  Affymetrix Hu133A Latin Square  [34]  42 
GDS285  GEO  20  CNGI design experiment  Unpublished  24 
GDS365  GEO  66  Paired pre and post euglycaemic insulin clamp skeletal muscle biopsies  Unpublished  106 
GDS465  GEO  90  GDS156  GEO  12 
GDS331  GEO  70  GDS254  GEO  16 
GDS534  GEO  74  GDS268  GEO  24 
GDS565  GEO  48  GDS287  GEO  16 
GDS427  GEO  24  GDS288  GEO  16 
GDS402  GEO  12  GDS472  GEO  14 
GDS356  GEO  14  GDS473  GEO  12 
GDS389  GEO  16  GDS511  GEO  12 
GDS388  GEO  18  GDS520  GEO  20 
GDS352  GEO  12  GDS564  GEO  28 
GDS531  GEO  172  GDS540  GEO  18 
GDS535  GEO  12 
Simulated datasets
List of simulated microarray datasets. Table 2 shows the details of the simulated datasets. Each of these datasets has clustering structure k = 6 (six clusters) with correlation ρ set to (0.33)^{1/2}.
Dataset Name  Sample size  Number of genes  Clusters 

Dataset1  20  1200  6 
Dataset2  100  1200  6 
Dataset3  200  1200  6 
Dataset4  500  1200  6 
Dataset5  1000  1200  6 
Dataset6  40  1200  6 
Dataset7  60  1200  6 
Dataset8  80  1200  6 
Preprocessing of data
Microarray datasets may contain unobserved expression levels, termed missing values. The first stage of our preprocessing handled these missing values, and a second stage standardized the variables to mean zero and unit variance, as explained below.
Missing values
If we represent microarray data as a matrix with rows representing genes and columns representing chips or samples, we filtered out all rows which contained at least one null expression or missing value because we do not know the exact source(s) for the missing/null value observation. Missing data can be due to array damage, transcription errors, etc. Conventional algorithms for clustering require complete datasets to run and extending these clustering routines to accommodate missing data was beyond the scope of our inquiry.
Standardization
Variables such as gene expression values measured on different scales can affect cluster analysis [14]. The main purpose of standardization is to convert variables measured on different scales to a unitless standard scale. One might question the reason to standardize genes when a microarray dataset represents expression levels of various genes. However, the level of mRNA (messenger ribonucleic acid) expression responsible for triggering a specific biological activity can differ from gene to gene. Therefore each gene vector (expression values of a gene across samples) may be a measurement made on a different functional scale. To address this issue, we standardized each gene vector and replaced expression values by Z scores before clustering genes. Z scores were computed using the following formula [15–17]:
Z_{ij} = \frac{x_{ij} - \bar{x}_{i}}{s_{i}}
where Z_{ij} = Z score computed for the expression level observed for gene i in sample/subject j, x_{ij} = intensity measured for gene i in sample j, \bar{x}_{i} = mean intensity of gene i across samples, and s_{i} = standard deviation of the expression values of gene i across samples.
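The two preprocessing stages described above can be illustrated with a minimal Python sketch (the authors worked in R; this is not their code). The function names, the use of None to mark a missing value, and the choice of the sample standard deviation (divisor n − 1) are assumptions made for illustration only.

```python
import math


def drop_incomplete_rows(matrix):
    """Keep only gene rows with no missing value (None marks a missing value here)."""
    return [row for row in matrix if all(v is not None for v in row)]


def standardize_gene(values):
    """Return Z scores for one gene vector (expression values across samples)."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample standard deviation with divisor n - 1.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    sd = math.sqrt(variance)
    if sd == 0:
        raise ValueError("a gene with zero variance cannot be standardized")
    return [(v - mean) / sd for v in values]


def preprocess(matrix):
    """Stage 1: filter rows with missing values. Stage 2: standardize each gene."""
    return [standardize_gene(row) for row in drop_incomplete_rows(matrix)]


# Rows are genes, columns are samples; the second gene has a missing value.
raw = [
    [2.0, 4.0, 6.0],
    [1.0, None, 3.0],
    [10.0, 10.5, 11.0],
]
z = preprocess(raw)
```

After this step every retained gene has mean zero and unit variance, so genes expressed on very different scales contribute comparably to distance calculations.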
Clustering methods
Iterative partitioning methods generally proceed through the following steps:
 1. Begin with an initial partitioning of the dataset into a specified number of clusters (k) and thereafter compute the centroids of these clusters.
 2. Allocate each data point to the cluster that has the nearest centroid (except in Fuzzy Cmeans, where data points belong to a cluster to a degree specified by a membership grade).
 3. Compute the new centroids of the clusters. Clusters are not updated until there has been a complete pass through the data.
 4. Alternate steps 2 and 3 until no data points change clusters.
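The four steps above can be sketched in a few lines of Python. This is a generic batch-update illustration under stated assumptions, not the Hartigan and Wong procedure or any R routine used in the study: the initial centroids are k randomly chosen data points, squared Euclidean distance is used, and the names iterative_partition, seed, and max_iter are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def iterative_partition(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1 (assumption): seed the centroids with k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 2: allocate every point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute centroids only after the complete pass.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = centroid(members)
    return labels, centroids


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = iterative_partition(points, k=2)
```

Because the result depends on the initial partition, different seeds can in general give different solutions, which is one reason the stability of a solution needs to be assessed.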
We consider the following four iterative partitioning methods, which are commonly used in the literature. Implementations of all four are freely available in the R statistical package.
Kmeans
In Kmeans clustering, one decides on the number of clusters and randomly assigns each gene to one of the k clusters. If a gene is actually closer to the center of another cluster, as assessed by one of a variety of similarity metrics (e.g., Pearson's correlation or Euclidean distance), the gene is reassigned to the closer cluster. After all genes have been assigned to the closest cluster, the centroids (centers of clusters) are recalculated. After a number of iterations, the cluster centroids no longer change and the algorithm stops. Kmeans clustering is described in detail in [18]. An efficient version of the algorithm is presented by Hartigan and Wong [19] and is implemented in R (publicly available software). This version of Kmeans assumes that it is not practical to require that the solution has minimal sum of squares against all partitions, except when M (number of genes to be clustered) and N (number of chips or samples) are small and k = 2. For details of this algorithm, please refer to [19].
Self Organizing Map (SOM)
Self Organizing Map (SOM) is a clustering algorithm [20] used to map high dimensional microarray data onto a twodimensional surface. It is similar to Kmeans, but instead of allowing centroids to move freely in high dimensional space, they are restricted to a twodimensional grid. The grid maps we considered are 1 × 2, 1 × 3, 1 × 4, 1 × 5, 1 × 6, 1 × 7, 1 × 8, 1 × 9, and 1 × 10 for k = 2 to 10, respectively. We did not assess stability for other grid structures to see whether they yield similar stability scores, because assessing stability on 37 datasets with different sets of grid structures for k = 2 to k = 10 involves impractical computations. The grid structure implies a relationship between neighboring clusters on the grid. The resultant map is organized in such a way that similar genes are mapped onto the same cluster (node) or onto neighboring clusters. Hence, the arrangement of clusters reflects the topological relationships of these clusters.
Clustering LARge Applications (CLARA)
The clustering algorithm PAM (Partition Around Medoids) works effectively for small datasets but does not scale well for large datasets [21]. To deal with large datasets, a samplingbased method called CLARA (Clustering LARge Applications) can be used. CLARA [22] is carried out in two steps. First, it draws a sample of the dataset, applies the PAM algorithm to the sample, and finds k representative objects of the sample. In PAM, one considers possible choices of k representative objects and then constructs the clusters around these representative objects. The set of k representative objects that gives the minimum average dissimilarity is selected. The PAM algorithm is explained in detail in [23].
Second, once the k representative objects are selected, each object not belonging to the sample is assigned to the nearest of the k representative objects. This yields a clustering of the entire dataset, and a measure of the quality of this clustering is obtained by computing the average distance between each object of the dataset and its representative object. After five samples have been drawn and clustered, the clustering with the lowest average distance is selected.
Fuzzy Cmeans
Fuzzy Cmeans is a data clustering technique wherein each gene belongs to a cluster to a degree that is specified by a membership degree. Membership degrees between zero and one are used instead of crisp assignments of the data to clusters. This technique was originally introduced by Bezdek [24]. In our methodology we use crisp assignments of genes to clusters. Hence, in Fuzzy Cmeans we assign every gene to a unique cluster, namely the one showing the maximum degree of membership for that gene. One may question why Kmeans is considered different from Fuzzy Cmeans if we do not assign genes to more than one cluster in Fuzzy Cmeans. In Kmeans [19], an early assignment to a given cluster may preclude a gene from being considered for any other cluster. Crisp assignment (in the Kmeans algorithm) may prematurely force a gene into a cluster. Fuzzy Cmeans, on the other hand, can be considered more "global": a gene is assigned to more than one cluster with some membership degree (0 to 1), and only then do we convert the fuzzy membership into crisp membership by assigning the gene to the cluster showing the maximum degree of membership. The two approaches may produce different clustering solutions, and hence Fuzzy Cmeans without fuzziness is not the same as Kmeans.
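The conversion from fuzzy to crisp membership can be illustrated with a short Python sketch. It assumes the standard Fuzzy Cmeans membership formula with fuzzifier m = 2 and Euclidean distance to fixed centroids; the source does not state the fuzzifier used, and the function names are hypothetical.

```python
import math


def memberships(point, centroids, m=2.0):
    """Fuzzy membership degrees of one point with respect to each centroid."""
    dists = [math.dist(point, c) for c in centroids]
    # A point that coincides with a centroid belongs fully to that cluster.
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    power = 2.0 / (m - 1.0)  # assumption: fuzzifier m = 2 by default
    return [1.0 / sum((d_c / d_j) ** power for d_j in dists) for d_c in dists]


def harden(degrees):
    """Crisp assignment: index of the cluster with the maximum membership degree."""
    return max(range(len(degrees)), key=lambda c: degrees[c])


u = memberships([1.0, 0.0], [[0.0, 0.0], [3.0, 0.0]])
crisp_label = harden(u)
```

The membership degrees of a point sum to one, and hardening discards everything except the largest degree, which is why the crisp labels carry less information than the fuzzy solution.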
Similarity Metric
The similarity metric allows us to compute the distance between two objects to be clustered. Two of the more common similarity metrics are Pearson's correlation coefficient and Euclidean distance. A correlation coefficient evaluates the direction of change between two expression profiles. It is described as a shape measurement, which is insensitive to differences in the magnitude of the variables. The value of the correlation coefficient ranges from −1 to +1, and values of zero indicate a random relationship between profiles [5]. Euclidean distance is a dissimilarity measure, that is, a high distance implies low similarity, and it measures both the magnitude and the direction of change between two expression profiles. It can be shown that correlation and Euclidean distance are equivalent after standardization [16]. For our studies, we use Euclidean distance, which can be calculated as:
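d_{ij} = \sqrt{\sum_{k=1}^{N} (g_{ik} - g_{jk})^{2}}
where d_{ij} is the distance between genes i and j (across N samples), and g_{ik} is the gene expression value of the k^{th} sample/subject for the i^{th} gene.
Pearson's correlation coefficient can be defined as:
r_{ij} = \frac{\sum_{k=1}^{N} (g_{ik} - \bar{g}_{i})(g_{jk} - \bar{g}_{j})}{\sqrt{\sum_{k=1}^{N} (g_{ik} - \bar{g}_{i})^{2} \sum_{k=1}^{N} (g_{jk} - \bar{g}_{j})^{2}}}
where \bar{g}_{i} is the mean expression value of gene i across the N samples.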
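The stated equivalence of the two metrics after standardization can be checked numerically. The sketch below assumes Z scores computed with the sample standard deviation (divisor N − 1), in which case the squared Euclidean distance between two standardized profiles equals 2(N − 1)(1 − r); the example profiles and function names are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson's correlation coefficient between two expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return num / den


def zscores(values):
    """Standardize a profile (assumption: sample standard deviation, divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


gene_i = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_j = [2.0, 1.0, 4.0, 3.0, 6.0]
n = len(gene_i)

r = pearson(gene_i, gene_j)
d = euclidean(zscores(gene_i), zscores(gene_j))
# After standardization the distance is a monotone function of the correlation.
identity_gap = abs(d ** 2 - 2 * (n - 1) * (1 - r))
```

Since the distance decreases monotonically as r increases, ranking gene pairs by either measure gives the same ordering once the genes are standardized.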
Method used to compute cluster stability
We quantify stability/replicability using Cramer's v^{2}. Cramer's v^{2} makes use of the χ^{2} statistic. If we classify data by two systems simultaneously, the result is a twoway contingency table. One can analyze data of this type using the classic χ^{2} test, an inferential test of the null hypothesis that there is no association between the two classification schemes (for details, refer to [25]). One can also compute measures that quantify the degree of association in such tables [26]. One such measure, Cramer's v^{2}, is the squared canonical correlation between two sets of nominal variables that define the rows and columns of the contingency table. It indicates the proportion of variance in one classification scheme that can be explained or predicted by the other classification scheme [25]. It ranges from 0 to 1, with 0 indicating no relationship and 1 indicating perfect reproducibility.
v^{2} = \frac{χ^{2}}{N (k - 1)}
where χ^{2} = the ordinary χ^{2} test statistic for independence in contingency tables [27], N = the number of items cross classified (i.e., total number of genes to be clustered), and k = the smaller of the number of rows or columns in a twoway contingency table; in our case, k is the number of clusters extracted.
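As an illustration, Cramer's v^{2} = χ^{2} / (N (k − 1)) can be computed directly from two vectors of cluster labels for the same genes. The Python sketch below is not the authors' R code; the function name is hypothetical, and k is taken as the smaller of the numbers of distinct labels actually observed in the two labelings, which is an assumption when some clusters are empty.

```python
from collections import Counter


def cramers_v_squared(labels_a, labels_b):
    """Cramer's v^2 between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    row_totals = Counter(labels_a)
    col_totals = Counter(labels_b)
    cells = Counter(zip(labels_a, labels_b))  # the two-way contingency table

    # Ordinary chi-square statistic for independence.
    chi2 = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n
            observed = cells.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals))
    if k < 2:
        raise ValueError("each labeling needs at least two clusters")
    return chi2 / (n * (k - 1))


# The same partition under different label names is perfectly reproducible.
same = cramers_v_squared([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
# Two unrelated partitions show no association.
unrelated = cramers_v_squared([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure does not depend on how the clusters are numbered, which matters here because two independent clustering runs have no reason to use matching labels.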
Algorithms and implementation
We implemented the algorithms explained in this section using R, a computer language designed for statistical data analysis. All four clustering techniques are implemented in R.
Approach to compute cluster stability
Algorithm to split dataset into two halves
Results
We evaluated stability performances on 37 real microarray datasets (Table 1) and 8 simulated datasets (Table 2).
Results on real datasets
Stability results produced on a real dataset of sample size 16. Table 3 shows stability scores produced on a given dataset with a sample size of n = 16. We split the dataset into two halves, each containing 8 subjects. The left dataset is resampled 6 times, producing 6 samples of sample sizes 3 to 8, respectively. Similarly, the right dataset is resampled to produce 6 samples. We measured the strength of the association between the clusters produced on every pair of samples (one sample from the left and the other from the right dataset, both of the same sample size) using Cramer's v^{2}. Columns in the table represent the number of clusters (k) and rows represent sample sizes. The stability score quantified for k = 10 and sample size 8 is 0.3699. This shows there is 37% agreement between the clusters produced (k = 10) on a pair of samples (one sample from the left dataset and the other from the right dataset, both of sample size 8).
Sample size  k = 2  k = 3  k = 4  k = 5  k = 6  k = 7  k = 8  k = 9  k = 10 

3  0.5883  0.47091  0.4503  0.4028  0.3809  0.3600  0.3313  0.3107  0.2992 
4  0.5799  0.48045  0.4244  0.3894  0.365  0.3469  0.3132  0.297  0.2858 
5  0.5738  0.48296  0.4297  0.3982  0.3644  0.3430  0.3195  0.3013  0.2790 
6  0.6433  0.54638  0.5142  0.4727  0.4405  0.4066  0.3817  0.3616  0.3396 
7  0.6534  0.54821  0.5250  0.4826  0.4462  0.4211  0.3915  0.3679  0.348 
8  0.6759  0.58447  0.5520  0.5045  0.4700  0.4592  0.4160  0.3975  0.3699 
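The splitting and resampling scheme described for Table 3 can be sketched as follows. The source does not say how the halves or the subsamples were drawn, so random assignment to halves and sampling without replacement within each half are assumptions, as are the function name and the seed parameter.

```python
import random


def split_half_subsamples(n_samples, min_size=3, seed=0):
    """Return (size, left_indices, right_indices) for each subsample size.

    Sample (column) indices are split into two disjoint halves, and one
    subsample of each size from min_size up to the half size is drawn
    from each half.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)  # assumption: samples are assigned to halves at random
    half = n_samples // 2
    left, right = indices[:half], indices[half:2 * half]

    pairs = []
    for size in range(min_size, half + 1):
        left_sub = sorted(rng.sample(left, size))
        right_sub = sorted(rng.sample(right, size))
        pairs.append((size, left_sub, right_sub))
    return pairs


# A dataset with 16 samples gives halves of 8 and subsample sizes 3 to 8.
pairs = split_half_subsamples(16)
sizes = [size for size, _, _ in pairs]
```

For each returned pair, the genes would be clustered once using only the left columns and once using only the right columns, and the two label vectors compared with Cramer's v^{2} for every k.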
Results on simulated datasets

Different algorithms showed different stability behaviors until sample size reached n = 100. Kmeans showed high stability at smaller sample sizes as compared to the other methods.

Kmeans, Fuzzy Cmeans and SOM showed fluctuation in scores even at large sample sizes, whereas CLARA showed consistent behavior (constant level of scores) at larger sample sizes.

CLARA maintained 100% stability for larger sample sizes (300–500), whereas SOM and Fuzzy Cmeans failed to reach 100% stability, even at large sample sizes. Kmeans showed stability scores between 0.7 and 1.0 most of the time for larger sample sizes.
Figure 4 suggests that Kmeans shows more replicable performance than the other nonhierarchical clustering algorithms considered (SOM, CLARA and Fuzzy Cmeans). Also, CLARA is a good choice for datasets of larger sample sizes.
Discussion
We determined the performance of commonly used nonhierarchical clustering algorithms and the degree of stability achieved using several microarray datasets. We assessed cluster stability as a measure of replicability. We agree that replicability is not the only criterion for measuring cluster stability. However, a useful classification that characterizes some aspect of the population must be replicable [2]. The most critical finding of this research was the low stability achieved by all four clustering algorithms, even at the elevated sample size of n = 50. This suggests that, in general, if the clustering algorithms we studied are applied to sample sizes up to 50, it is highly questionable that the results obtained will be meaningful. The extent to which these results apply to other clustering algorithms remains open to question, but we believe that the "burden of proof" is now on those who use clustering algorithms on microarray data and claim that such analyses produce replicable results.
Figure 3 and Figure 4 suggest that Kmeans shows more replicable performance than the other clustering algorithms considered (SOM, CLARA and Fuzzy Cmeans). Kmeans and SOM showed similar behavior in real datasets because they are closely related to each other. In Kmeans, centroids move freely in multidimensional space, while in SOM they are constrained to a twodimensional grid [28]. In SOM, the distance of each input from all reference vectors is considered, instead of just the closest one, weighted by the neighborhood kernel [29]. Thus, the SOM functions as a conventional clustering algorithm if the width of the neighborhood kernel is zero [29]. The low stability achieved by all four clustering routines may also suggest that microarray datasets, in general, lack natural clustering structure. We do not claim that these results can predict the exact stability of a given dataset of a specific sample size, since they are generalized over a large number and variety of datasets. Nonetheless, researchers should consider performing cluster analysis on large sample sizes to obtain more stable clustering solutions. Our research suggests a statistical criterion for selecting an appropriate number of clusters (k) for a given microarray dataset. This may be accomplished by computing Cramer's v^{2} for various values of k and selecting the value of k that provides the maximum stability score for the dataset.
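The selection rule for k described above amounts to taking the value of k with the highest stability score. A minimal Python sketch follows, using the sample size 8 row of Table 3 as example input; the function and variable names are hypothetical.

```python
def select_k(stability_by_k):
    """Return the number of clusters k with the maximum stability score."""
    if not stability_by_k:
        raise ValueError("at least one stability score is required")
    return max(stability_by_k, key=stability_by_k.get)


# Stability scores (Cramer's v^2) from Table 3 at sample size 8, keyed by k.
table3_n8 = {
    2: 0.6759, 3: 0.58447, 4: 0.5520, 5: 0.5045, 6: 0.4700,
    7: 0.4592, 8: 0.4160, 9: 0.3975, 10: 0.3699,
}
best_k = select_k(table3_n8)
```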
We also evaluated stability performances on simulated datasets. Simulated datasets helped us understand the stability behavior at large sample sizes (300–500). Datasets were structured for 6 clusters with a correlation of (0.33)^{1/2} within clusters. All four clustering algorithms showed similar stability behavior in real and simulated datasets until sample sizes reached n = 50. Kmeans showed greater stability scores than the other methods at smaller sample sizes in both real and simulated datasets, indicating that Kmeans appears to be a better choice for datasets of smaller sample sizes. Kmeans and CLARA maintained 100% stability for large sample sizes (300–500), whereas SOM and Fuzzy Cmeans showed stability scores below 1, even at larger sample sizes (refer to Figure 5).
Our methodology to compute stability used crisp assignments of genes to clusters. Hence, in Fuzzy Cmeans we assigned every gene to the cluster showing the maximum degree of membership. We acknowledge that this process of crisp assignment may affect the stability scores produced by Fuzzy Cmeans, and we therefore expected it to produce low scores. In SOM, we found that the choice of twodimensional grid structure influences the stability scores produced on simulated datasets. For the same number of clusters (k), a twodimensional grid can be created in more than one way. Choosing the right grid structure for a given value of k to produce stable clustering solutions is beyond the scope of this paper, and we will address it in future investigations. Currently we limit the value of k (clusters) to 10; hence, if a real dataset has natural clustering structure for k greater than 10 (say k = 17), this structure is not captured. We will consider measuring stability scores for higher values of k as an extension of this research. In conclusion, our research suggests several plausible scenarios: (1) microarray datasets may lack natural clustering structure, thereby producing low stability scores on all four methods; (2) the algorithms studied may not be well suited to producing reliable results; and/or (3) sample sizes typically used in microarray research may be too small to support derivation of reliable clustering results.
Declarations
Acknowledgements
We thank W. Timothy Garvey for providing the data in human skeletal muscle and biopsies before and after hyperinsulinemic clamp studies. We thank all the members of Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham for giving us some constructive comments and suggestions during the course of our research. This research was supported in part by NIH grant U54CA100949 and NSF grants: 0090286 and 0217651.
References
 1. Bryan J: Problems in gene clustering based on gene expression data. Journal of Multivariate Analysis 2004, 90: 44–66.
 2. Mehta T, Tanik M, Allison DB: Towards sound epistemological foundations of statistical methods for highdimensional biology. Nature Genetics 2004, 36: 943–7.
 3. McShane LM, Radmacher MD, Freidlin B, Yu R, Li MC, Simon R: Methods of assessing reproducibility of clustering patterns observed in analysis of microarray data. Bioinformatics 2002, 18: 1462–1469.
 4. Roth V, Braun ML, Lange T, Buhmann JM: Stabilitybased model order selection in clustering with applications to gene expression data. Lecture Notes in Computer Science 2002, 2415: 607–612.
 5. Blashfield RK, Aldenderfer MS: The Methods and Problems of Cluster Analysis. In Handbook of Multivariate Experimental Psychology. 2nd edition. Edited by: Nesselroade JR, Cattel RB. New York: Plenum; 1988:447–473.
 6. Tseng GC, Wong WH: Tight Clustering: A Resamplingbased Approach for Identifying Stable and Tight Patterns in Data. Biometrics 2005, 61: 10–16.
 7. Famili AF, Liu G, Liu Z: Evaluation and optimization of clustering in gene expression data analysis. Bioinformatics 2004, 10: 1535–1545.
 8. Zhang K, Zhao H: Assessing reliability of gene clusters from gene expression data. Functional & Integrative Genomics 2000, 1: 156–173.
 9. Smolkin M, Ghosh D: Cluster stability scores for microarray data in cancer studies. BMC Bioinformatics 2003, 4: 36.
 10. BenHur A, Elisseeff A, Guyon I: A stability based method for discovering structure in clustered data. Pac Symp Biocomputing 2002, 7: 6–17.
 11. Datta S, Datta S: Comparisons and validation of clustering techniques for microarray gene expression data. Bioinformatics 2003, 4: 459–466.
 12. Giurcaneanu CD, Tabus I, Shmulevich I, Zhang W: Stabilitybased cluster analysis applied to microarray data. Proceedings of the Seventh International Symposium on Signal Processing and its Applications Paris, France 2003, 57–60.
 13. Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 2002, 30: 207–210.
 14. Han J, Kamber M: Cluster Analysis. In Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers; 2001:339.
 15. MollerLevet CS, Cho KH, Wolkenhauer O: Microarray data clustering based on temporal variation: FCV with TSD preclustering. Applied Bioinformatics 2003, 2: 35–45.
 16. Yeung KY, Medvedovic M, Bumgarner RE: From coexpression to coregulation: how many microarray experiments do we need? Genome Biology 2004, 5: R48.
 17. Shannon W, Culverhouse R, Duncan J: Analyzing microarray data using cluster analysis. Pharmacogenomics 2003, 4: 41–51.
 18. Han J, Kamber M: Cluster Analysis. In Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers; 2001:349.
 19. Hartigan JA, Wong MA: A Kmeans clustering algorithm. Applied Statistics 1979, 28: 100–108.
 20. Kohonen T: SelfOrganizing Maps. Information Sciences. 3rd edition. Springer; 2000.
 21. Han J, Kamber M: Cluster Analysis. In Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers; 2001:353.
 22. Kaufman L, Rousseeuw P: Clustering Large Applications (Program CLARA). In Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons; 1990:126–146.
 23. Kaufman L, Rousseeuw P: Clustering Large Applications (Program CLARA). In Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons; 1990:68–123.
 24. Pal NR, Bezdek JC, Hathaway RJ: Sequential Competitive Learning and the Fuzzy cMeans Clustering Algorithms. Neural Networks 1996, 9: 787–796.
 25. Agresti A: Introduction to categorical data analysis. John Wiley and Sons, New York; 1996.
 26. Goodman LA, Kruskal WH: Measures of association for cross classification. Journal of the American Statistical Association 1954, 49: 732–64.
 27. Wickens TD: Multiway Contingency Tables Analysis for Social Sciences. Lawrence Erlbaum Associates Publishers; 1989:17–48.
 28. Knudsen S: Cluster Analysis. In A Biologist's guide to Analysis of DNA Microarray Data. John Wiley & Sons, Inc., New York; 2002:44.
 29. Kaski S: Data exploration using selforganizing maps. PhD thesis. Helsinki University of Technology, Neural Networks Research Centre; 1997.
 30. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh M, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Science 1999, 286: 531–537.
 31. Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA 2004, 101: 4164–4169.
 32. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene Expression Correlates of Clinical Prostate Cancer Behavior. Cancer Cell 2002, 1: 203–209.
 33. Ginos MA, Page GP, Michalowicz BS, Patel KJ, Volker SE, Pambuccian SE, Ondrey FG, Adams GL, Gaffney PM: Identification of a Gene Expression Signature Associated with Recurrent Disease in Squamous Cell Carcinoma of the Head and Neck. Cancer Res 2002, 64: 55–63.
 34. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.