- Open Access
Reproducible Clusters from Microarray Research: Whither?
- Nikhil R Garge†4,
- Grier P Page1,
- Alan P Sprague2,
- Bernard S Gorman3 and
- David B Allison†1
© Garge et al; licensee BioMed Central Ltd. 2006
- Published: 15 July 2005
In cluster analysis, establishing the validity of specific solutions, algorithms, and procedures presents significant challenges because there is no null hypothesis to test and no 'right answer'. It has been noted that a replicable classification is not necessarily a useful one, but a useful one that characterizes some aspect of the population must be replicable. By replicable we mean reproducible across multiple samplings from the same population. Methodologists have suggested that the validity of clustering methods should be based on classifications that yield reproducible findings beyond chance levels. We used this approach to determine the performance of commonly used clustering algorithms and the degree of replicability achieved using several microarray datasets.
We considered four commonly used iterative partitioning algorithms (Self Organizing Maps (SOM), K-means, Clustering LARge Applications (CLARA), and Fuzzy C-means) and evaluated their performance on 37 microarray datasets, with sample sizes ranging from 12 to 172. We assessed the reproducibility of each clustering algorithm by measuring the strength of the relationship between the clustering outputs of subsamples of the 37 datasets. Cluster stability was quantified using Cramer's v² from a k × k table. Cramer's v² is equivalent to the squared canonical correlation coefficient between two sets of nominal variables. Potential scores range from 0 to 1, with 1 denoting perfect reproducibility.
All four clustering routines showed increased stability with larger sample sizes. K-means and SOM showed a gradual increase in stability with increasing sample size. CLARA and Fuzzy C-means, however, yielded low stability scores until sample sizes approached 30 and then gradually increased thereafter. Average stability never exceeded 0.55 for the four clustering routines, even at a sample size of 50. These findings suggest several plausible scenarios: (1) microarray datasets lack natural clustering structure, thereby producing low stability scores on all four methods; (2) the algorithms studied do not produce reliable results; and/or (3) sample sizes typically used in microarray research may be too small to support derivation of reliable clustering results. Further research should be directed towards evaluating the stability of more clustering algorithms on more datasets, especially datasets with larger sample sizes, and with larger numbers of clusters considered.
- Cluster Algorithm
- Real Dataset
- Cluster Solution
- Simulated Dataset
- Microarray Dataset
Cluster analysis is a statistical approach used in microarray research that identifies genes within a cluster that are more similar to each other than genes contained in different clusters. By grouping genes that exhibit similarities in their expression patterns, the function of those genes which were previously unknown may be revealed. There are two groups of clustering methods, hierarchical and non-hierarchical. Non-hierarchical algorithms require the number of clusters (k) be pre-specified. Non-hierarchical algorithms can run multiple times with different values of k. The user can then choose the clustering solution that is logical to address the problem of interest.
If we consider each gene as a point in high dimensional space, then "clusters may be described as continuous regions of this space containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points. Clusters described in this way are sometime referred to as natural clusters" .
Despite the use of cluster analysis in microarray research, the evaluation of the "validity" of a cluster solution has been challenging. This is due, in part, to the properties of cluster analysis. Cluster analysis has no null hypothesis to test and hence no right answer, which makes the testing of the validity of specific solutions, algorithms, and procedures difficult . A second challenge encountered is that genes may not "naturally" fall into clusters separated by empty areas of the attribute space in genome expression studies. Hence, genome-wide collections of expression trajectories may lack a "natural clustering" structure in many cases . Third, the result of gene clustering may be "method sensitive". That is, gene clustering depends on several methodological choices, including the distance metric used, the clustering algorithm, and the stopping rule in the case of iterative partitioning methods. Hence, it is important to evaluate the stability of any specific derived cluster solution and the general performance of clustering approaches.
According to McShane et al., "Clustering algorithms always detect clusters, even in random data and it is imperative to conduct some statistical assessments of the strength of evidence for any clustering and to examine the reproducibility of individual clusters" . Roth et al. defined stability as "the variability of solutions which are computed from different data sets sampled on the same source" . It has been noted that a replicable classification is not necessarily a useful one, but a useful one that characterizes some aspect of the population must be replicable . The concept of a replicable cluster is defined as reproducible across multiple samplings from the same population. Thus, some methodologists have suggested that the validity of clustering methods could be defined as the extent by which they yield classifications that are reproducible beyond chance levels. Most recently, Tseng et al.  identified stability of clusters in a sequential manner through an analysis of the tendency of genes to be grouped together under repeated resampling. Famili et al.  summarized the related work as follows:
Zhang et al. proposed a parametric bootstrap re-sampling method (PBR) to incorporate information on variations in gene expression levels to assess the reliability of gene clusters identified from large-scale gene expression data...Smolkin et al. assessed the stability of a cluster using their Cluster Stability Score, by which a cluster's stability is calculated through clustering on random subspace of the attribute space...Ben-Hur et al. proposed a stability-based re-sampling method for estimating the number of clusters, where stability is characterized by the distribution of pair-wise similarities between clusters obtained from sub-samples of the data...Datta et al. formulated 3 other validation measures using the left-out-one condition strategy to evaluate the performances of 6 clustering algorithms...Giurcaneanu et al. introduced a stability index to estimate the quality of clusters for randomly selected subsets of the data.
Clusters that produce classifications with greater replicability would be considered more valid . The objective of this paper is to determine the performance of commonly used non-hierarchical clustering algorithms and the degree of stability achieved using several microarray datasets.
List of microarray datasets considered for the study. Table 1 presents the datasets in two side-by-side groups of columns. Each dataset is described by its name, source, and sample size (n). Table 1 shows 37 datasets: the first three columns list 19 datasets and the last three columns list 18 datasets.
Table 1 column headings: Name of the dataset; Sample size (n). Datasets named in Table 1 include:
- Medulloblastoma Data Set
- Prostate Cancer dataset
- Gaffney Head and Neck data
- Affymetrix Hu133A Latin Square
- CNGI design experiment
- Paired pre and post euglycaemic insulin clamp skeletal muscle biopsies
List of simulated microarray datasets. Table 2 shows the details of the simulated datasets. Each of these datasets has a clustering structure of k = 6 (six clusters) with correlation ρ set to $(0.33)^{1/2}$.
Number of genes
Preprocessing of data
Microarray datasets may contain unobserved expression levels, termed missing values. The first stage of our preprocessing handled these missing values, and a second stage standardized the variables to mean zero and unit variance, as explained below.
If we represent microarray data as a matrix with rows representing genes and columns representing chips or samples, we filtered out all rows which contained at least one null expression or missing value because we do not know the exact source(s) for the missing/null value observation. Missing data can be due to array damage, transcription errors, etc. Conventional algorithms for clustering require complete datasets to run and extending these clustering routines to accommodate missing data was beyond the scope of our inquiry.
Variables such as gene expression values measured on different scales can affect cluster analysis . The main purpose of standardization is to convert variables measured on different scales to a unitless standard scale. One might question the reason to standardize genes when microarray dataset represents expression levels of various genes. But a level of mRNA (messenger ribonucleic acid) expression (for a given gene) responsible for triggering specific biological activity can be different for different genes. Therefore each gene vector (expression values of a gene across samples) may be a measurement made on a different functional scale. To address this issue, we standardized each gene vector (expression values of a gene across samples) and replaced expression values by Z scores before clustering genes. Z scores were computed using the following formula [15–17]:
$$Z_{ij} = \frac{g_{ij} - \bar{g}_i}{s_i}$$
where $Z_{ij}$ is the Z score computed for the expression level observed for gene i in sample/subject j, $g_{ij}$ is the intensity measured for gene i in sample j, $\bar{g}_i$ is the mean intensity of gene i across samples, and $s_i$ is the standard deviation of the expression values of gene i across samples.
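The two preprocessing stages can be illustrated with a minimal Python sketch (the study itself used R). The function names `drop_incomplete_genes` and `standardize_genes` are hypothetical, and two choices are assumptions that the text does not specify: the standard deviation uses the n - 1 denominator, and genes with zero variance are dropped because they cannot be scaled.

```python
import math

def drop_incomplete_genes(matrix):
    """Keep only gene rows with no missing (None or NaN) expression value."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [row for row in matrix if not any(is_missing(v) for v in row)]

def standardize_genes(matrix):
    """Replace each gene vector by Z scores computed across samples."""
    out = []
    for row in matrix:
        n = len(row)
        mean = sum(row) / n
        # Assumption: sample standard deviation (n - 1 denominator).
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
        if sd == 0:
            continue  # Assumption: constant genes are dropped in this sketch.
        out.append([(v - mean) / sd for v in row])
    return out

# Rows are genes, columns are samples; the second gene has a missing value.
raw = [
    [2.0, 4.0, 6.0],
    [1.0, None, 3.0],
    [10.0, 10.0, 40.0],
]
z = standardize_genes(drop_incomplete_genes(raw))
```

After this step every retained gene vector has mean zero and unit variance, so genes measured on different functional scales contribute comparably to the distance computations used in clustering.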
Iterative partitioning methods proceed as follows:
1. Begin with an initial partitioning of the dataset into a specified number of clusters (k) and thereafter compute the centroids of these clusters.
2. Allocate each data point to the cluster that has the nearest centroid (except in Fuzzy C-means, where data points belong to a cluster to a degree specified by a membership grade).
3. Compute the new centroids of the clusters. Clusters are not updated until there has been a complete pass through the data.
4. Alternate steps 2 and 3 until no data points change clusters.
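The generic steps above can be sketched in Python as a batch procedure with crisp assignments. This is an illustrative sketch, not the R routines used in the study: the function name `iterative_partition` is hypothetical, the initial partition is obtained by using k randomly chosen data points as starting centroids (an assumption), and squared Euclidean distance is used to find the nearest centroid.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def iterative_partition(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: initial centroids (assumption: k distinct data points).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: allocate each point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute centroids only after a complete pass through the data.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = centroid(members)
    return labels, centroids

genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = iterative_partition(genes, k=2)
```

On this toy input the two well-separated groups end up in different clusters. Which group receives label 0 depends on the random starting centroids.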
We consider the following four iterative partitioning methods, which are commonly used in the literature. Implementations of all four are freely available in the R statistical package.
K-means
In K-means clustering, one decides on the number of clusters and randomly assigns each gene to one of the k clusters. If a gene is actually closer to the center of another cluster, as assessed by one of a variety of similarity metrics (e.g., Pearson's correlation or Euclidean distance), the gene is reassigned to the closer cluster. After all genes have been assigned to their closest cluster, the centroids (centers of clusters) are recalculated. After a number of iterations, the cluster centroids no longer change and the algorithm stops. K-means clustering is described in detail in . However, an efficient version of the algorithm is presented by Hartigan and Wong, and this version is implemented in R (publicly available software). This version of K-means assumes that it is not practical to require that the solution have the minimal sum of squares over all partitions, except when M (number of genes to be clustered) and N (number of chips or samples) are small and k = 2. For details of this algorithm, please refer to .
Self Organizing Map (SOM)
Self Organizing Map (SOM) is a clustering algorithm used to map high dimensional microarray data onto a two-dimensional surface. It is similar to K-means, but instead of allowing centroids to move freely in high dimensional space, it restricts them to a two-dimensional grid. The grid maps we considered are 1 × 2, 1 × 3, 1 × 4, 1 × 5, 1 × 6, 1 × 7, 1 × 8, 1 × 9, and 1 × 10 for k = 2 to 10, respectively. We did not assess whether other grid structures yield similar stability scores, because assessing stability on 37 datasets with different sets of grid structures for k = 2 to k = 10 involves impractical computations. The grid structure implies a relationship between neighboring clusters on the grid. The resultant map is organized in such a way that similar genes are mapped onto the same cluster (node) or onto neighboring clusters. Hence, the arrangement of clusters reflects the topological relationships of these clusters.
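To make the role of the 1 × k grid concrete, the following Python sketch trains a small online SOM in which every node is pulled toward each input with a weight that decays with its grid distance from the best-matching node. It is a simplified illustration, not the R implementation used in the study. The function names, the linear initialisation between the per-coordinate minimum and maximum, the Gaussian neighbourhood, and the linearly decaying learning rate and neighbourhood width are all assumptions.

```python
import math
import random

def nearest_node(p, nodes):
    """Index of the grid node whose weight vector is closest to p."""
    return min(range(len(nodes)),
               key=lambda j: sum((x - w) ** 2 for x, w in zip(p, nodes[j])))

def train_som_1xk(points, k, n_epochs=50, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumption: nodes start evenly spaced between the data minimum and maximum.
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    nodes = [[lo[d] + (hi[d] - lo[d]) * j / max(k - 1, 1) for d in range(dim)]
             for j in range(k)]
    total_steps = n_epochs * len(points)
    step = 0
    for _ in range(n_epochs):
        order = list(range(len(points)))
        rng.shuffle(order)
        for idx in order:
            frac = step / total_steps
            lr = 0.5 * (1.0 - frac)                     # learning rate decays
            sigma = max(k / 2.0 * (1.0 - frac), 0.01)   # neighbourhood shrinks
            p = points[idx]
            bmu = nearest_node(p, nodes)
            for j in range(k):
                # Grid distance on a 1 x k map is simply |j - bmu|.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                nodes[j] = [w + lr * h * (x - w) for w, x in zip(nodes[j], p)]
            step += 1
    return nodes

genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
nodes = train_som_1xk(genes, k=2)
labels = [nearest_node(g, nodes) for g in genes]
```

As the neighbourhood width shrinks toward zero, only the best-matching node is updated, and the procedure behaves like an online form of K-means.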
Clustering LARge Applications (CLARA)
The clustering algorithm PAM (Partition Around Medoids) works effectively for small datasets but does not scale well to large datasets. To deal with large datasets, a sampling-based method called CLARA (Clustering LARge Applications) can be used. CLARA is carried out in two steps. First, it draws a sample from the dataset, applies the PAM algorithm to the sample, and finds k representative objects of the sample. In PAM, one considers possible choices of k representative objects and then constructs the clusters around these representative objects. The set of k representative objects that gives the minimum average dissimilarity is selected. The PAM algorithm is explained in detail in .
Once the k representative objects are selected, each object not belonging to the sample is assigned to the nearest of the k representative objects. This yields a clustering of the entire dataset, and a measure of the quality of this clustering is obtained by computing the average distance between each object of the dataset and its representative object. After five samples have been drawn and clustered, the clustering with the lowest average distance is selected.
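The sampling scheme described above can be sketched in Python as follows. This is a simplified illustration rather than the program by Kaufman and Rousseeuw: the greedy `swap_search` is a stand-in for full PAM (it starts from random medoids and accepts any swap that lowers the average dissimilarity), and the function names and the default sample size of 40 + 2k are assumptions.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_dissimilarity(points, medoids):
    """Average distance between each object and its nearest representative."""
    return sum(min(dist(p, m) for m in medoids) for p in points) / len(points)

def swap_search(sample, k, rng):
    """Simplified stand-in for PAM: greedy medoid swaps on the sample."""
    idx = rng.sample(range(len(sample)), k)
    best = average_dissimilarity(sample, [sample[i] for i in idx])
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(len(sample)):
                if cand in idx:
                    continue
                trial = idx[:pos] + [cand] + idx[pos + 1:]
                cost = average_dissimilarity(sample, [sample[i] for i in trial])
                if cost < best - 1e-12:
                    idx, best, improved = trial, cost, True
    return [sample[i] for i in idx]

def clara(points, k, n_draws=5, sample_size=None, seed=0):
    rng = random.Random(seed)
    # Assumption: illustrative default sample size, capped at the dataset size.
    size = min(len(points), sample_size or 40 + 2 * k)
    best = None
    for _ in range(n_draws):
        sample = rng.sample(points, size)
        medoids = swap_search(sample, k, rng)
        # Quality is measured on the entire dataset, not just the sample.
        quality = average_dissimilarity(points, medoids)
        if best is None or quality < best[0]:
            labels = [min(range(k), key=lambda c: dist(p, medoids[c]))
                      for p in points]
            best = (quality, medoids, labels)
    return best

genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
quality, medoids, labels = clara(genes, k=2)
```

Because the representative objects are actual data points, each cluster in the result is summarized by one observed gene profile rather than by an averaged centroid.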
Fuzzy C-means
Fuzzy C-means is a data clustering technique wherein each gene belongs to a cluster to a degree specified by a membership degree. Membership degrees between zero and one are used instead of crisp assignments of the data to clusters. This technique was originally introduced by Bezdek. In our methodology we use crisp assignments of genes to clusters. Hence, in Fuzzy C-means we assign every gene to a unique cluster: the one showing the maximum degree of membership for that gene. One may ask why K-means is considered different from Fuzzy C-means if we do not assign genes to more than one cluster in Fuzzy C-means. In K-means, an early assignment to a given cluster may preclude a gene from being considered for any other cluster. Crisp assignment (in the K-means algorithm) may prematurely force a gene into a cluster. Fuzzy C-means, on the other hand, can be considered more "global": a gene is assigned to more than one cluster with some membership degree (0 to 1), and only afterwards do we convert the fuzzy membership into crisp membership by assigning the gene to the cluster showing the maximum degree of membership. The two approaches may produce different clustering solutions, and hence Fuzzy C-means without fuzziness is not the same as K-means.
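The conversion from fuzzy to crisp membership can be illustrated with a short Python sketch of the standard Fuzzy C-means updates followed by a maximum-membership assignment. The function names are hypothetical, the fuzzifier m = 2 is an assumed value that the text does not state, and the initial centers are supplied by the caller to keep the example deterministic.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def membership_row(p, centers, m):
    """Membership degrees of point p in every cluster; they sum to 1."""
    d2 = [sq_dist(p, c) for c in centers]
    zeros = [i for i, d in enumerate(d2) if d == 0.0]
    if zeros:  # p coincides with one or more centers
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    inv = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in d2]
    total = sum(inv)
    return [v / total for v in inv]

def fuzzy_c_means(points, centers, m=2.0, n_iter=100, tol=1e-9):
    centers = [list(c) for c in centers]
    dim = len(points[0])
    for _ in range(n_iter):
        u = [membership_row(p, centers, m) for p in points]
        new_centers = []
        for c in range(len(centers)):
            w = [row[c] ** m for row in u]
            total = sum(w)
            if total == 0.0:
                new_centers.append(centers[c])  # no support: keep old center
                continue
            new_centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total
                 for d in range(dim)])
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    memberships = [membership_row(p, centers, m) for p in points]
    return memberships, centers

def crisp_assignment(memberships):
    """Assign each gene to the cluster with maximum membership degree."""
    return [max(range(len(row)), key=lambda c: row[c]) for row in memberships]

genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
memberships, centers = fuzzy_c_means(genes, centers=[genes[0], genes[3]])
crisp = crisp_assignment(memberships)
```

Every gene influences every center throughout the iterations, and the crisp labels are derived only at the end, which is why the result can differ from that of K-means.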
The similarity metric allows us to compute the distance between two objects to be clustered. Two of the more common similarity metrics are: Pearson's correlation coefficient and Euclidean distance. A correlation coefficient evaluates the direction of change between two expression profiles. It is described as a shape measurement, which is insensitive to differences in magnitude of the variables. The value of correlation coefficient ranges from -1 to +1, and values of zero indicate a random relationship between profiles . Euclidean distance is a dissimilarity measure, that is, a high distance implies low similarity and measures both magnitude and direction of change between two expression profiles. It can be shown that correlation and Euclidean distance are equivalent after standardization . For our studies, we use Euclidean distance which can be calculated as:
$$d_{ij} = \sqrt{\sum_{k=1}^{N} \left(g_{ik} - g_{jk}\right)^2}$$
where $d_{ij}$ is the distance between genes i and j (across N samples), and $g_{ik}$ is the gene expression value of the kth sample/subject for the ith gene.
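Pearson's correlation coefficient can be defined as:
$$r_{ij} = \frac{\sum_{k=1}^{N} \left(g_{ik} - \bar{g}_i\right)\left(g_{jk} - \bar{g}_j\right)}{\sqrt{\sum_{k=1}^{N} \left(g_{ik} - \bar{g}_i\right)^2 \sum_{k=1}^{N} \left(g_{jk} - \bar{g}_j\right)^2}}$$
where $\bar{g}_i$ is the mean expression value of gene i across the N samples.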
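The equivalence of correlation and Euclidean distance after standardization can be checked numerically. For two gene vectors standardized with the n - 1 standard deviation (an assumption about the convention), the squared Euclidean distance equals 2(N - 1)(1 - r), so ranking gene pairs by distance is the same as ranking them by correlation. The Python sketch below uses made-up expression values and hypothetical function names.

```python
import math

def zscore(v):
    n = len(v)
    mean = sum(v) / n
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / (n - 1))
    return [(x - mean) / sd for x in v]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up expression values for two genes across N = 5 samples.
gene_i = [2.0, 4.0, 7.0, 1.0, 9.0]
gene_j = [1.0, 3.0, 2.0, 5.0, 8.0]
n = len(gene_i)

r = pearson(gene_i, gene_j)
d = euclidean(zscore(gene_i), zscore(gene_j))
# Difference between d^2 and 2(N - 1)(1 - r); zero up to rounding error.
gap = abs(d ** 2 - 2 * (n - 1) * (1 - r))
```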
Method used to compute cluster stability
We quantify stability/replicability using Cramer's v². Cramer's v² makes use of the χ² statistic. If we classify data by two systems simultaneously, the result is a two-way contingency table. One can analyze data of this type using the classic χ² test, an inferential test of the null hypothesis, which states there is no association between the two classification schemes (for details, refer ). One can also compute measures that quantify the degree of association in such tables. One such measure, Cramer's v², is the squared canonical correlation between two sets of nominal variables that define the rows and columns of the contingency table. It indicates the proportion of variance in one classification scheme that can be explained or predicted by the other classification scheme. It ranges from 0 to 1, with 0 indicating no relationship and 1 indicating perfect reproducibility.
Cramer's v² is computed as:
$$v^2 = \frac{\chi^2}{N\,(k - 1)}$$
where χ² is the ordinary χ² test statistic for independence in contingency tables, N is the number of items cross-classified (i.e., the total number of genes to be clustered), and k is the smaller of the number of rows or columns in the two-way contingency table; in our case, k is the number of clusters extracted.
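As an illustration of how this score behaves, the Python sketch below builds the contingency table from two cluster labelings of the same genes and computes the χ² statistic and Cramer's v². The function name `cramers_v2` is hypothetical, and returning 0 when either labeling contains a single cluster is an assumption made to avoid division by zero.

```python
from collections import Counter

def cramers_v2(labels_a, labels_b):
    """Cramer's v^2 between two crisp labelings of the same items."""
    n = len(labels_a)
    row_totals = Counter(labels_a)
    col_totals = Counter(labels_b)
    observed = Counter(zip(labels_a, labels_b))
    chi2 = 0.0
    for r, r_tot in row_totals.items():
        for c, c_tot in col_totals.items():
            expected = r_tot * c_tot / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return 0.0  # Assumption: undefined case treated as no association.
    return chi2 / (n * (k - 1))

# Same partition with the cluster names swapped: perfect reproducibility.
perfect = cramers_v2([0, 0, 1, 1], [1, 1, 0, 0])
# Labelings that carry no information about each other.
unrelated = cramers_v2([0, 0, 1, 1], [0, 1, 0, 1])
```

The score depends only on how the two partitions overlap, not on the arbitrary cluster numbers, which matters because separate clustering runs number their clusters independently.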
We implemented the procedures described in this section in R, a computer language designed for statistical data analysis; all four clustering techniques are available in R.
Approach to compute cluster stability
Algorithm to split dataset into two halves
Results on real datasets
Table showing stability results produced on a real dataset of sample size 16. Table 3 shows stability scores produced on a dataset of sample size n = 16. We split the dataset into two halves, each containing 8 subjects. The left dataset is resampled 6 times, producing 6 samples of sample sizes 3 to 8, respectively. Similarly, the right dataset is resampled to produce 6 samples. We measured the strength of the association between the clusters produced on every pair of samples (one sample from the left and the other from the right dataset, both of the same sample size) using Cramer's v². Columns in the table represent the number of clusters (k) and rows represent sample sizes. The stability score for k = 10 and sample size 8 is 0.3699. That is, there is 37% agreement between the clusters produced (k = 10) on the pair of samples of size 8 (one from the left dataset and the other from the right dataset).
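The split-half procedure described in this caption can be sketched in Python as follows. The sketch is illustrative only and several details are assumptions not stated in the text: the subjects are split at random, each subsample is drawn without replacement, a basic K-means stands in for whichever clustering routine is being assessed, and all function names are hypothetical.

```python
import random
from collections import Counter

def kmeans_labels(rows, k, rng, max_iter=100):
    """Basic batch K-means on gene vectors; returns crisp labels."""
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        new = [min(range(k),
                   key=lambda c: sum((x - y) ** 2
                                     for x, y in zip(r, centroids[c])))
               for r in rows]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cramers_v2(a, b):
    n, rows, cols, obs = len(a), Counter(a), Counter(b), Counter(zip(a, b))
    chi2 = sum((obs[(r, c)] - rt * ct / n) ** 2 / (rt * ct / n)
               for r, rt in rows.items() for c, ct in cols.items())
    k = min(len(rows), len(cols))
    return chi2 / (n * (k - 1)) if k > 1 else 0.0

def split_half_stability(matrix, k, seed=0):
    """matrix: genes in rows, subjects in columns. Returns {sample size: v^2}."""
    rng = random.Random(seed)
    n = len(matrix[0])
    cols = list(range(n))
    rng.shuffle(cols)                      # Assumption: random split of subjects.
    left, right = cols[: n // 2], cols[n // 2:]
    scores = {}
    for size in range(3, min(len(left), len(right)) + 1):
        sub_l = rng.sample(left, size)     # Assumption: drawn without replacement.
        sub_r = rng.sample(right, size)
        lab_l = kmeans_labels([[row[j] for j in sub_l] for row in matrix], k, rng)
        lab_r = kmeans_labels([[row[j] for j in sub_r] for row in matrix], k, rng)
        scores[size] = cramers_v2(lab_l, lab_r)
    return scores

# Toy data: 6 genes x 16 subjects with two clearly separated expression groups.
toy = ([[5 + 0.1 * g + 0.01 * j for j in range(16)] for g in range(3)] +
       [[-5 - 0.1 * g - 0.01 * j for j in range(16)] for g in range(3, 6)])
scores = split_half_stability(toy, k=2)
```

With n = 16 subjects this yields one score for each sample size from 3 to 8, mirroring the layout of one column of Table 3. On real data the scores would be averaged or tabulated across values of k.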
Results on simulated datasets
Different algorithms showed different stability behaviors until sample size reached n = 100. K-means showed high stability at smaller sample sizes as compared to the other methods.
K-means, Fuzzy C-means and SOM showed fluctuation in scores even at large sample sizes, whereas CLARA showed consistent behavior (constant level of scores) at larger sample sizes.
CLARA maintained 100% stability for larger sample sizes (300–500), whereas SOM and Fuzzy C-means failed to reach 100% stability, even at large sample sizes. K-means showed stability scores between 0.7 and 1.0 most of the time for larger sample sizes.
Figure 4 suggests that K-means shows more replicable performance than the other non-hierarchical clustering algorithms considered (SOM, CLARA and Fuzzy C-means). Also, CLARA is a good choice for datasets with larger sample sizes.
We determined the performance of commonly used non-hierarchical clustering algorithms and the degree of stability achieved using several microarray datasets. We assessed cluster stability as a measure of replicability. We agree that replicability is not the only criterion for evaluating a clustering solution. However, a useful classification that characterizes some aspect of the population must be replicable. The most critical finding of this research was the low stability achieved by all four clustering algorithms, even at the elevated sample size of n = 50. This suggests that, in general, for sample sizes up to 50, it is highly questionable whether the results obtained by applying the clustering algorithms we studied will be meaningful. The extent to which these results apply to other clustering algorithms remains open to question, but we believe that the "burden of proof" is now on those who use clustering algorithms on microarray data and claim that such analyses produce replicable results.
Figure 3 and Figure 4 suggest that K-means shows more replicable performance than the other clustering algorithms considered (SOM, CLARA and Fuzzy C-means). K-means and SOM showed similar behavior in real datasets because they are closely related to each other. In K-means, centroids move freely in multidimensional space, while in SOM they are constrained to a two-dimensional grid. In SOM, the distance of each input from all reference vectors is considered, instead of just the closest one, weighted by the neighborhood kernel. Thus, the SOM functions as a conventional clustering algorithm if the width of the neighborhood kernel is zero. The low stability achieved by all four clustering routines may also suggest that microarray datasets, in general, lack natural clustering structure. We do not claim that these results can predict the exact stability of a given dataset of a specific sample size, since they are generalized over a large number and variety of datasets. Nonetheless, researchers should consider performing cluster analysis on large sample sizes to obtain more stable clustering solutions. Our research suggests a statistical criterion for selecting an appropriate number of clusters (k) for a given microarray dataset. This may be accomplished by computing Cramer's v² for various values of k and selecting the value of k that provides the maximum stability score for the dataset.
We also evaluated stability on simulated datasets. Simulated datasets helped us understand stability behavior at large sample sizes (300–500). Datasets were structured for 6 clusters with a correlation of $(0.33)^{1/2}$ within clusters. All four clustering algorithms showed similar stability behavior in real and simulated datasets until sample sizes reached n = 50. K-means showed greater stability scores than the other methods at smaller sample sizes in both real and simulated datasets, indicating that K-means appears to be a better choice for datasets with smaller sample sizes. K-means and CLARA maintained 100% stability for large sample sizes (300–500), whereas SOM and Fuzzy C-means showed stability scores below 1, even at larger sample sizes (see Figure 5).
Our methodology to compute stability used crisp assignments of genes to clusters. Hence, in Fuzzy C-means we assigned every gene to the cluster showing the maximum degree of membership. We acknowledge that this process of crisp assignment may affect the stability scores produced by Fuzzy C-means, and hence we expected it to produce low scores beforehand. In SOM, we found that the choice of two-dimensional grid structure influences the stability scores produced on simulated datasets. For the same number of clusters (k), a two-dimensional grid can be created in more than one way. Choosing the right grid structure for a given value of k to produce stable clustering solutions is beyond the scope of this paper, and we will address it in future investigations. Currently we limit the value of k (clusters) to 10; hence, if a real dataset has natural clustering structure for k greater than 10 (say k = 17), this is not captured. We will consider measuring stability scores for higher values of k as an extension of this research. In conclusion, our research suggests several plausible scenarios: (1) microarray datasets may lack natural clustering structure, thereby producing low stability scores on all four methods; (2) the algorithms studied may not be well suited to producing reliable results; and/or (3) sample sizes typically used in microarray research may be too small to support derivation of reliable clustering results.
We thank W. Timothy Garvey for providing the data in human skeletal muscle and biopsies before and after hyperinsulinemic clamp studies. We thank all the members of Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham for giving us some constructive comments and suggestions during the course of our research. This research was supported in part by NIH grant U54CA100949 and NSF grants: 0090286 and 0217651.
- Bryan J: Problems in gene clustering based on gene expression data. Journal of Multivariate Analysis 2004, 90: 44–66.
- Mehta T, Tanik M, Allison DB: Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nature Genetics 2004, 36: 943–7.
- McShane LM, Radmacher MD, Freidlin B, Yu R, Li MC, Simon R: Methods of assessing reproducibility of clustering patterns observed in analysis of microarray data. Bioinformatics 2002, 18: 1462–1469.
- Roth V, Braun ML, Lange T, Buhmann JM: Stability-based model order selection in clustering with applications to gene expression data. Lecture Notes in Computer Science 2002, 2415: 607–612.
- Blashfield RK, Aldenderfer MS: The Methods and Problems of Cluster Analysis. In Handbook of Multivariate Experimental Psychology. 2nd edition. Edited by: Nesselroade JR, Cattel RB. New York: Plenum; 1988:447–473.
- Tseng GC, Wong WH: Tight Clustering: A Resampling-based Approach for Identifying Stable and Tight Patterns in Data. Biometrics 2005, 61: 10–16.
- Famili AF, Liu G, Liu Z: Evaluation and optimization of clustering in gene expression data analysis. Bioinformatics 2004, 10: 1535–1545.
- Zhang K, Zhao H: Assessing reliability of gene clusters from gene expression data. Functional & Integrative Genomics 2000, 1: 156–173.
- Smolkin M, Ghosh D: Cluster stability scores for microarray data in cancer studies. BMC Bioinformatics 2003, 4: 36.
- Ben-Hur A, Elisseeff A, Guyon I: A stability based method for discovering structure in clustered data. Pac Symp Biocomputing 2002, 7: 6–17.
- Datta S, Datta S: Comparisons and validation of clustering techniques for microarray gene expression data. Bioinformatics 2003, 4: 459–466.
- Giurcaneanu CD, Tabus I, Shmulevich I, Zhang W: Stability-based cluster analysis applied to microarray data. Proceedings of the Seventh International Symposium on Signal Processing and its Applications Paris, France 2003, 57–60.
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 2002, 30: 207–210.
- Han J, Kamber M: Cluster Analysis. In Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers; 2001:339.
- Moller-Levet CS, Cho KH, Wolkenhauer O: Microarray data clustering based on temporal variation: FCV with TSD preclustering. Applied Bioinformatics 2003, 2: 35–45.
- Yeung KY, Medvedovic M, Bumgarner RE: From co-expression to co-regulation: how many microarray experiments do we need? Genome Biology 2004, 5: R48.
- Shannon W, Culverhouse R, Duncan J: Analyzing microarray data using cluster analysis. Pharmacogenomics 2003, 4: 41–51.
- Han J, Kamber M: Cluster Analysis. In Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers; 2001:349.
- Hartigan JA, Wong MA: A K-means clustering algorithm. Applied Statistics 1979, 28: 100–108.
- Kohonen T: Self-Organizing Maps. Information Sciences. 3rd edition. Springer; 2000.
- Han J, Kamber M: Cluster Analysis. In Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers; 2001:353.
- Kaufman L, Rousseeuw P: Clustering Large Applications (Program CLARA). In Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons; 1990:126–146.
- Kaufman L, Rousseeuw P: Clustering Large Applications (Program CLARA). In Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons; 1990:68–123.
- Pal NR, Bezdek JC, Hathaway RJ: Sequential Competitive Learning and the Fuzzy c-Means Clustering Algorithms. Neural Networks 1996, 9: 787–796.
- Agresti A: Introduction to categorical data analysis. John Wiley and Sons, New York; 1996.
- Goodman LA, Kruskal WH: Measures of association for cross classification. Journal of the American Statistical Association 1954, 49: 732–64.
- Wickens TD: Multiway Contingency Tables Analysis for Social Sciences. Lawrence Erlbaum Associates Publishers; 1989:17–48.
- Knudsen S: Cluster Analysis. In A Biologist's guide to Analysis of DNA Microarray Data. John Wiley & Sons, Inc., New York; 2002:44.
- Kaski S: Data exploration using self-organizing maps. PhD thesis. Helsinki University of Technology, Neural Networks Research Centre; 1997.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh M, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Science 1999, 286: 531–537.
- Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA 2004, 101: 4164–4169.
- Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene Expression Correlates of Clinical Prostate Cancer Behavior. Cancer Cell 2002, 1: 203–209.
- Ginos MA, Page GP, Michalowicz BS, Patel KJ, Volker SE, Pambuccian SE, Ondrey FG, Adams GL, Gaffney PM: Identification of a Gene Expression Signature Associated with Recurrent Disease in Squamous Cell Carcinoma of the Head and Neck. Cancer Res 2002, 64: 55–63.
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.