- Methodology article
- Open Access
Visualization methods for statistical analysis of microarray clusters
© Hibbs et al; licensee BioMed Central Ltd. 2005
- Received: 17 December 2004
- Accepted: 12 May 2005
- Published: 12 May 2005
The most common method of identifying groups of functionally related genes in microarray data is to apply a clustering algorithm. However, it is impossible to determine which clustering algorithm is most appropriate to apply, and it is difficult to verify the results of any algorithm due to the lack of a gold-standard. Appropriate data visualization tools can aid this analysis process, but existing visualization methods do not specifically address this issue.
We present several visualization techniques that incorporate meaningful, noise-robust statistics for analyzing the results of clustering algorithms on microarray data. These include a rank-based visualization method that is more robust to noise, a difference display method to aid assessment of cluster quality and detection of outliers, and a projection of high-dimensional data into a three-dimensional space in order to examine relationships between clusters. Our methods are interactive and are dynamically linked together for comprehensive analysis. Further, our approach applies to both protein and gene expression microarrays, and our architecture is scalable for use on both desktop/laptop screens and large-scale display devices. This methodology is implemented in GeneVAnD (Genomic Visual ANalysis of Datasets) and is available at http://function.princeton.edu/GeneVAnD.
Incorporating relevant statistical information into data visualizations is key for analysis of large biological datasets, particularly because of high levels of noise and the lack of a gold-standard for comparisons. We developed several new visualization techniques and demonstrated their effectiveness for evaluating cluster quality and relationships between clusters.
- Cluster Algorithm
- Microarray Data
- Singular Value Decomposition
- Visualization Method
- Cluster Quality
Recent high-throughput and whole-genome experimental methods create new challenges in data analysis and visualization. Gene expression and protein microarrays output hundreds of thousands of data points that can be used for prediction of gene function over the entire genome. However, there are serious and fundamental challenges in the analysis of these data. Microarray data contain substantial experimental noise and, because our knowledge of biology is incomplete, no perfect gold standard exists for verification of microarray analysis methods.
In order to determine gene/protein relationships and functions from microarray data, methods must be robust to noise and must identify groups of genes that may be functionally related. Statistical methods, such as clustering, attempt to identify data patterns and group genes together based on various distance metrics and algorithms. The lack of a true gold standard makes it impossible to verify the absolute accuracy of any clustering method. Several statistical approaches have been presented for assessing cluster quality [1–4], but these are all either internal validation methods or methods that rely on incomplete external standards such as MIPS or Gene Ontology functional protein classifications. Further, these methods do not address the issue of identifying specific problems within clusters of microarray profiles or assessing the relationships between clusters of genes. Well-designed visualization methods can aid in these tasks by helping to bridge the gap between raw data and the analysis of that data. To perform more comprehensive cluster analysis, statistically integrative, dynamic, noise-robust data visualizations are required to complement purely analytical evaluation methods.
Existing visualization tools do not include methods to statistically and dynamically evaluate clusterings of genes. Several tools can display expression data in various static ways suitable for publication or provide useful dynamic views of tabular data, but are not specifically intended for cluster analysis. JavaTreeView and the HierarchicalClusteringExplorer dynamically display hierarchically clustered data for analysis, and VxInsight displays the result of a built-in clustering algorithm in an interactive 3D topology, but none are able to display results of other clustering methods for analysis. TreeMap provides an innovative way to visualize hierarchically clustered data as well as data organized in the context of the GO hierarchy, but is not intended for cluster analysis. New tools such as GeneXplorer provide an interactive method for visualization and analysis of microarray data on websites, but do not focus on the task of cluster analysis. Several tools, including the MultiExperimentViewer and Genesis, provide multiple methods of performing clustering as well as some visualization methods to analyze the resulting clusters. Commercial tools, such as GeneSpring and SpotFire, offer various statistical and visualization tools for general analysis, but neither offers visual methods specific to analyzing the results of clustering algorithms. Therefore, there is a need for a visualization-based methodology designed specifically to statistically and dynamically evaluate clusters produced by the variety of available algorithms and software tools.
Here we present a suite of interactive microarray analysis methods that integrate relevant statistical information into visualizations for the purpose of assessing the quality and relationships of clusters in a noise-robust fashion. Our methodology is general and can be used to analyze the results of most clustering algorithms performed on either protein or gene expression microarray datasets.
Noise robust visualization
Our method performs a rank transform on each gene by sorting the gene's expression levels and ranking its experiments: the experiment with the lowest expression is ranked 0, the next lowest 1, and so on up to the experiment with the highest expression, which is ranked N-1, where N is the number of experiments. Each experiment is then displayed as a grayscale percentage given by rank/(N-1). In this display, the experiment with the lowest expression for each gene is colored black, the experiment with the highest expression is colored white, and the intermediate experiments gradate between them in shades of gray.
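The rank transform described above can be sketched in a few lines of Python. This is only an illustration, not the GeneVAnD implementation: the function names are hypothetical, ties are broken by experiment order (the text does not say how ties are handled), and missing values are not considered.

```python
def rank_transform(profile):
    """Return the 0-based rank of each experiment within one gene's profile.

    The lowest expression gets rank 0 and the highest gets rank N-1.
    Ties are broken by experiment order (an assumption for this sketch).
    """
    order = sorted(range(len(profile)), key=lambda j: profile[j])
    ranks = [0] * len(profile)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks


def gray_levels(profile):
    """Map each experiment to rank/(N-1): 0.0 is black, 1.0 is white."""
    n = len(profile)
    if n < 2:
        return [0.0] * n  # a single experiment has no rank spread
    return [r / (n - 1) for r in rank_transform(profile)]


# Hypothetical expression values for one gene over four experiments.
profile = [0.5, -1.2, 2.0, 0.1]
ranks = rank_transform(profile)   # [2, 0, 3, 1]
levels = gray_levels(profile)
```

Because only the ordering of the values matters, a single extreme measurement changes the display no more than any other highest or lowest value would.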
Assessing cluster quality
While multiple statistical methods have been developed for assessing the quality of clusters produced by different algorithms [1, 3, 4], the most appropriate choice of clustering algorithm depends on the dataset, distance metric, and goal of the analysis. Due to the limitations of these methods, it is important to display clustered data in a manner that allows researchers to examine the variation and consistency of the results of different clustering algorithms. We propose two new visualization techniques that can be used to assess overall cluster quality and to identify individual outliers and other anomalies in the data quickly and efficiently.
First, to analyze the overall cohesion of each cluster, we developed a "difference display" method. For each cluster, we display the cluster average bar to show the general expression of the cluster as a whole. We calculate the vector of the cluster average from the expression profile vectors of the genes in the cluster, for each cluster containing M genes with expressions measured over N experiments, using the standard formula:
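In notation assumed here for illustration (the original symbols are not preserved in this text), let x_ij be the expression of gene i in experiment j. The standard per-experiment mean over the M genes of a cluster would then read:

```latex
\bar{x}_j = \frac{1}{M} \sum_{i=1}^{M} x_{ij}, \qquad j = 1, \dots, N
```

The cluster average vector is the collection of these N values, one per experiment.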
Second, in addition to assessing overall cluster quality and identifying gene outliers, it is important to look at variation of individual experiments within each cluster. We calculate the standard deviation, s, of each experiment, j, within a cluster in the normal manner:
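As a sketch under assumed notation (x_ij is the expression of gene i in experiment j, and \bar{x}_j is the cluster average in experiment j), the usual sample standard deviation over the M genes of a cluster is shown below. The M-1 divisor is an assumption; the text does not preserve whether M or M-1 is used.

```latex
s_j = \sqrt{\frac{1}{M-1} \sum_{i=1}^{M} \left( x_{ij} - \bar{x}_j \right)^2}
```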
Visualizing clusters with this difference display method allows users to see variations in expression level that may be biologically significant but are not visible in traditional visualization methods. For example, the data shown in Fig. 5 is the glycolysis cluster (2E) from . When viewed traditionally, this cluster appears very homogeneous and consistent. However, when viewed as a difference from the cluster average, we can observe that in the region of highly under-expressed experiments some genes are more expressed than the average while others are less expressed than the average (red and green boxes are shown in this area). This suggests that the cluster could be split into two smaller clusters that would be even more homogeneous. In this example, 8 of the 9 genes indicated by the red box, but only 3 of the 8 genes indicated by the green box, are annotated to glycolysis. The genes in the green box are better categorized as more generally related to alcohol metabolism than to glycolysis in particular (see web supplement to Fig. 5 for details, located at http://function.princeton.edu/GeneVAnD). Traditional visualization is unable to show this type of biologically meaningful variation in highly over- or under-expressed regions.
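The quantities behind such a display can be sketched as follows. This is an illustrative reading, not the GeneVAnD code: the function names and data are hypothetical, the difference is taken as gene minus cluster average, and the standard deviation uses the M-1 (sample) divisor as an assumption.

```python
import math


def cluster_average(cluster):
    """Per-experiment mean over all genes (rows) of a cluster."""
    m = len(cluster)
    n = len(cluster[0])
    return [sum(gene[j] for gene in cluster) / m for j in range(n)]


def difference_display(cluster):
    """Each gene's profile minus the cluster average, per experiment."""
    avg = cluster_average(cluster)
    return [[x - a for x, a in zip(gene, avg)] for gene in cluster]


def experiment_stdev(cluster):
    """Sample standard deviation of each experiment within the cluster.

    Requires at least two genes; the M-1 divisor is an assumption.
    """
    m = len(cluster)
    avg = cluster_average(cluster)
    return [
        math.sqrt(sum((gene[j] - avg[j]) ** 2 for gene in cluster) / (m - 1))
        for j in range(len(avg))
    ]


# Hypothetical cluster: three genes, two experiments.
cluster = [[-2.0, 1.0], [-3.0, 1.0], [-1.0, 1.0]]
avg = cluster_average(cluster)        # [-2.0, 1.0]
diffs = difference_display(cluster)   # [[0.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]
stdevs = experiment_stdev(cluster)    # [1.0, 0.0]
```

In this toy cluster all three genes are strongly under-expressed in the first experiment, yet the differences separate a gene below the average from one above it, which is the kind of variation a plain expression display tends to hide.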
Assessing cluster relationships
However, this dendrogram of averages fails to show the relationships between genes in different clusters. It is important to examine gene-to-gene and gene-to-cluster relationships to assess whether or not genes are included in the most appropriate cluster. In order to view the lower-level relationships among genes in clusters, we can project high-dimensional microarray data into a lower-dimensional space such that genes with similar expression profiles are spatially closer to each other than genes with different expression profiles. We use Principal Component Analysis (PCA) to define the axes of a three-dimensional space onto which the genes and clusters are projected. PCA has been used previously in microarray data analysis for dimensionality reduction to facilitate easier analysis and comparisons [4, 20] and to identify patterns of noise. Our method is interactive and navigable, which allows users to examine individual genes and view relationships between clusters as they separate out spatially.
To perform PCA on the microarray datasets, we use Singular Value Decomposition (SVD). SVD decomposes the M × N matrix of the full microarray data, X, into three additional matrices:
X = U Σ V^T
where M is the number of genes and corresponds to the rows of the matrix, and N is the number of experimental conditions and corresponds to the columns of the matrix. We use the eigengenes, or Principal Components (PCs), defined in the rows of V^T as the axes for our PCA visualization. The position of each gene in that space is determined by the corresponding row of U Σ. The squares of the singular values, contained on the diagonal of Σ, correspond to the variance captured by each PC, such that the fraction of variation, p, captured by the k-th PC (a percentage when multiplied by 100) is determined by:
p_k = σ_k^2 / Σ_j σ_j^2
where σ_k is the k-th singular value and the sum runs over all singular values.
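A minimal sketch of two calculations implied here, assuming the SVD factors are already available (GeneVAnD obtains them from JAMA; no SVD is computed below). Projecting a gene's profile onto orthonormal PC axes by dot products gives its coordinates, which equal that gene's entries in U Σ, and the share of variation for each PC is its squared singular value divided by the sum of all squared singular values. The function names and numbers are hypothetical.

```python
def project_gene(profile, axes):
    """Coordinates of one expression profile along each PC axis.

    `axes` holds orthonormal rows of V^T; each coordinate is a dot product.
    """
    return [sum(x * v for x, v in zip(profile, axis)) for axis in axes]


def variance_fractions(singular_values):
    """Fraction of total variation captured by each PC: s_k^2 / sum_j s_j^2."""
    squares = [s * s for s in singular_values]
    total = sum(squares)
    return [sq / total for sq in squares]


# Hypothetical example with N = 2 experiments and two orthonormal axes.
axes = [(0.6, 0.8), (-0.8, 0.6)]
gene = (3.0, 4.0)
coords = project_gene(gene, axes)   # approximately [5.0, 0.0]

# Hypothetical singular values; the squares are 16, 4 and 1.
fractions = variance_fractions([4.0, 2.0, 1.0])   # 16/21, 4/21, 1/21
```

For a three-dimensional view, the first three rows of V^T would serve as `axes`, and the sum of the first three fractions indicates how much of the total variation the projection retains.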
Multiple simultaneous views and scalable architecture
Statistical clustering of microarray data is vital for identifying groups of genes that may be functionally related. However, the high level of noise in microarray data and the lack of a gold standard for comparison deeply complicate the evaluation of clustering algorithms. Here we have presented a set of visualization methods geared specifically toward evaluating clustering of microarray datasets. Our rank-based method allows for more noise-robust visualizations of expression levels, our difference display method facilitates visual assessments of general cluster quality as well as outlier detection, and our PC projection method allows for visual assessments of cluster relationships. Our methodology integrates meaningful statistics into an interactive and noise-robust data visualization package for use in analyzing the results of clustering algorithms. Through several examples we have demonstrated that these methods aid researchers in analyzing the results of clustering algorithms by facilitating noise-robust assessments of cluster quality and cluster relationships. We believe that more statistically integrative and targeted visualization methods can benefit not only cluster analysis, but many other important data analysis problems in genomics.
Our methodology has been implemented in GeneVAnD (Genomic Visual Analysis of Datasets). GeneVAnD is written in Java and is cross-platform for use on Windows, Linux/Unix, and Macintosh operating systems. We use Java3D to display the PC projections and Piccolo to display the expression profiles. The JAva MAtrix Library (JAMA) is used to perform the SVD calculation. The GeneVAnD package is designed in a modular way to allow future extensions and inclusion of additional information and visualizations.
The executables and source code of GeneVAnD can be found at http://function.princeton.edu/GeneVAnD.
This work was funded in part by NSF grants EIA-0101247 and CNS-0406415 and by the Program in Integrative Information, Computer and Application Sciences (PICASso) which is funded by NSF grant DGE-9972930. We wish to thank Chad Myers and Grant Wallace for their help and support of this work. We also thank the Botstein laboratory members for their feedback on the early implementations.
- Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci U S A 2001, 98(16):8961–5.
- Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data. Bioinformatics 2001, 17(4):309–18.
- Mendez MA, Hodar C, Vulpe C, Gonzalez M, Cambiazo V: Discriminant analysis to evaluate clustering of gene expression data. FEBS Lett 2002, 522(1–3):24–8.
- Datta S, Datta S: Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 2003, 19(4):459–66.
- Munich Information Center for Protein Sequences (MIPS) [http://mips.gsf.de/]
- Gene Ontology Consortium [http://www.geneontology.org/]
- Amar R, Stasko J: A knowledge task-based framework for design and evaluation of information visualizations. IEEE Symposium on Information Visualization 2004, 143–150.
- Sharan R, Maron-Katz A, Shamir R: CLICK and EXPANDER: a system for clustering and visualizing gene expression data. Bioinformatics 2003, 19(14):1787–99.
- Johnson JE, Stromvik MV, Silverstein KA, Crow JA, Shoop E, Retzel EF: TableView: portable genomic data visualization. Bioinformatics 2003, 19(10):1292–3.
- Saldanha AJ: Java treeview – extensible visualization of microarray data. Bioinformatics 2004, 20(17):3246–8.
- Seo J, Shneiderman B: Interactively Exploring Hierarchical Clustering Results. IEEE Computer 2002, 35(7):80–86.
- Werner-Washburne M, Wylie B, Boyack K, Fuge E, Galbraith J, Weber J, Davidson G: Comparative Analysis of Multiple Genome-Scale Data Sets. Genome Res 2002, 12(10):1564–73.
- Baehrecke E, Dang N, Babaria K, Shneiderman B: Visualization and analysis of microarray and gene ontology data with treemaps. BMC Bioinformatics 2004, 5(1):84.
- Rees CA, Demeter J, Matese J, Botstein D, Sherlock G: GeneXplorer: an interactive web application for microarray data visualization and analysis. BMC Bioinformatics 2004, 5(1):141.
- Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, Sturn A, Snuffin M, Rezantsev A, Popov D, Ryltsov A, Kostukovich E, Borisovsky I, Liu Z, Vinsavich A, Trush V, Quackenbush J: TM4: a free, open-source system for microarray data management and analysis. Biotechniques 2003, 34(2):374–8.
- Sturn A, Quackenbush J, Trajanoski Z: Genesis: cluster analysis of microarray data. Bioinformatics 2002, 18(1):207–8.
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998, 95(25):14863–8.
- Raychaudhuri S, Stuart JM, Altman RB: Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput 2000, 455–66.
- Alter O, Brown PO, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A 2000, 97(18):10101–6.
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273–97.
- Bederson BB, Grosjean J, Meyer J: Toolkit Design for Interactive Structured Graphics. IEEE Transactions on Software Engineering 2004, 30(8):535–546.
- JAva MAtrix Package (JAMA) [http://math.nist.gov/javanumerics/jama/]
- Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, van de Rijn M, Rosen GD, Perou CM, Whyte RI, Altman RB, Brown PO, Botstein D, Petersen I: Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci U S A 2001, 98(24):13784–9.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.