- Open Access
NeatMap - non-clustering heat map alternatives in R
© Rajaram and Oono; licensee BioMed Central Ltd. 2010
- Received: 19 May 2009
- Accepted: 22 January 2010
- Published: 22 January 2010
The clustered heat map is the most popular means of visualizing genomic data. It compactly displays a large amount of data in an intuitive format that facilitates the detection of hidden structures and relations in the data. However, it is hampered by its use of cluster analysis, which does not always respect the intrinsic relations in the data and often requires non-standardized reordering of rows/columns to be performed post-clustering. This sometimes leads to uninformative and/or misleading conclusions. Often it is more informative to use dimension-reduction algorithms (such as Principal Component Analysis and Multi-Dimensional Scaling) which respect the topology inherent in the data. Yet, despite their proven utility in the analysis of biological data, they are not as widely used. This is at least partially due to the lack of user-friendly visualization methods with the visceral impact of the heat map.
NeatMap is an R package designed to meet this need. NeatMap offers a variety of novel plots (in 2 and 3 dimensions) to be used in conjunction with these dimension-reduction techniques. Like the heat map, but unlike traditional displays of such results, it allows the entire dataset to be displayed while visualizing relations between elements. It also allows superimposition of cluster analysis results for mutual validation. NeatMap is shown to be more informative than the traditional heat map with the help of two well-known microarray datasets.
NeatMap thus preserves many of the strengths of the clustered heat map while addressing some of its deficiencies. It is hoped that NeatMap will spur the adoption of non-clustering dimension-reduction algorithms.
- Principal Component Analysis
- Cluster Result
- Cluster Analysis Result
- Cell Cycle Related Gene
- Average Linkage Hierarchical Cluster
With the advent of high-throughput experiments, whole genome measurements across multiple conditions have become common. Human pattern recognition is still unmatched by computers, making it advantageous to visualize this data. Over the past decade, the clustered heat map has become by far the most popular visualization technique. It has been used in thousands of publications spanning a multitude of organisms and a variety of data types [1–3]; it has even been dubbed a "post genomic visual icon." There are good reasons for the clustered heat map's popularity. It provides a compact, easy-to-grasp depiction of a large amount of data across two variables (e.g., gene and sample), with large contiguous bands of similar colors that encourage the formulation of more general hypotheses relating these variables. Still, the clustered heat map has some glaring flaws. As its name suggests, the rows and columns are ordered using hierarchical clustering algorithms (while there are other clustering schemes, they are typically not used to construct heat maps, so here clustering should be understood to refer to hierarchical clustering). Distances in a clustering result are measured along the tree branches and not by the proximity in branch tip ordering. While these measures are related (especially for very similar elements), they can be very different. Additionally, once objects are assigned to different clusters during clustering, further analysis essentially involves these clusters as a whole, and the relationship between the elements themselves is lost (see analysis of human gene atlas in Results). Consequently, clustering does not provide any natural ordering; the rows and columns may be reordered arbitrarily by 'swinging' the arms of the tree at each bifurcation while preserving the tree structure. The ordering produced by clustering thus does not respect the intrinsic topology (if any) of the data, making it a poor choice for use in a heat map.
This is why 'swinging' based reordering using an independent method is often required, post-clustering, to capture the structure of the data. There are two problems with this reordering. Firstly, unlike the clustering schemes, the reordering algorithms, while complex enough to warrant dedicated software packages, are often not elaborated upon or even stated. This reduces the reproducibility of the result. More seriously, this procedure could potentially place (deliberately or otherwise) objects that are distant along the tree in close proximity in the row/column order. Heat maps are commonly read in this order rather than by their dendrogram structure (if this were not the case, such reordering schemes would not be needed). Effectively a spurious pattern could be created, leading to incorrect results (e.g., see clustered heat map for Spellman data in Results).
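The freedom to 'swing' branches can be checked directly. The following is a minimal base R sketch, not part of NeatMap, on made-up data: the matrix `x` and the gene labels are hypothetical. It flips every branch of an average linkage tree and confirms that the distances measured along the tree (the cophenetic distances) are unchanged while the leaf order, which is what a heat map reader actually sees, is reversed.

```r
# Hypothetical toy data: 8 "genes" measured under 5 conditions.
set.seed(1)
x <- matrix(rnorm(8 * 5), nrow = 8,
            dimnames = list(paste0("g", 1:8), NULL))

# Average linkage clustering on a Pearson-correlation-based distance.
d <- as.dist(1 - cor(t(x)))
dend <- as.dendrogram(hclust(d, method = "average"))

# 'Swing' the arms at every bifurcation by reversing the dendrogram.
swung <- rev(dend)

# Distances along the tree, put in a common label order for comparison.
labs <- rownames(x)
coph_a <- as.matrix(cophenetic(dend))[labs, labs]
coph_b <- as.matrix(cophenetic(swung))[labs, labs]

same_tree <- isTRUE(all.equal(coph_a, coph_b))  # tree structure preserved
order_a <- labels(dend)                         # row order before swinging
order_b <- labels(swung)                        # row order after swinging
```

Both orderings are equally valid representations of the same clustering result, which is why the choice among them has to come from an independent method.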
So far we assumed the clustering results themselves were meaningful. Indeed, when the underlying data is tree-like, or at least some clustering/grouping tendency is present, cluster analysis followed by reordering performs well. However, this is not always the case. As group separation becomes fuzzier, other data-reduction schemes often outperform cluster analysis. Usually, it is considered good practice to test for clustering tendency before performing clustering, or to use bootstrap-like methods to estimate cluster quality post-clustering. Unfortunately, this kind of information is not typically provided in a heat map. Thus validation is only by visual inspection of the color patterns, and this may be misleading.
Biological data often has a low dimensional structure that may be visualized as a spatial pattern, so direct use of a suitable dimension-reducing algorithm could, in many cases, be more natural and better characterize the data than the current combination of structure-destroying clustering followed by a structure-restoring reordering algorithm. There are many such algorithms whose utility in the analysis of biological data has been demonstrated [8, 9]. Multiple packages, in R and otherwise, implement them. Despite this, we believe their use has been limited, at least partially, by the lack of associated visualization methods with the visceral impact of the clustered heat map.
Here we present an R package called NeatMap to meet this need while addressing some of the deficiencies of the clustered heat map. It consists of novel plot types in two and three dimensions intended to be used in conjunction with any dimension-reduction scheme capable of embedding results in low dimensional Euclidean space (e.g., Principal Component Analysis (PCA) and Multi-Dimensional Scaling (MDS)). This places weaker constraints on the data than does (hierarchical) cluster analysis, which requires the data to exist in a tree space. Like the heat map, and unlike typical visualization schemes for these methods, NeatMap displays the entire dataset underlying the result. It also has provisions to superimpose the cluster analysis results, for mutual validation. This feature is not commonly implemented in software packages, and our implementation is more informative about individual points than existing implementations. Also note that unlike the clustered heat map, the layout of the plot is almost entirely determined by the output of the dimension-reduction scheme, thereby respecting the intrinsic structure in the data more than a clustering-based reordering would.
There are a number of alternatives to hierarchical clustering (see, for example, the R package seriation), designed specifically to produce an ordering that reflects the relative relations between elements. NeatMap is a visualization method, and in general it is not intended to compete with these (in fact they can easily be used in conjunction). However, some of these techniques involve ordering by the first component of PCA/MDS. Unless this component captures most of the relevant information, NeatMap, which uses 2D embeddings, is likely to better utilize the dimensional reduction results. On the other hand, we do not consider alternate clustering algorithms such as k-means clustering, tight clustering and various model-based clustering algorithms [15–17]. Although these avoid some of the problems faced by hierarchical clustering as outlined above, and have been shown to perform better, they typically just assign (or give probabilities of assigning) objects to clusters. No relations among objects within a cluster are provided, and typically the relations among clusters are not used either. Thus, they do not naturally support the construction of heat-map-like plots. Self Organizing Maps (SOM) used with a small number of nodes/clusters face a similar problem. However, as the number of clusters increases, they essentially involve mapping objects onto points in a low dimensional space, much like multidimensional scaling. In this case, it should be possible to use SOMs in conjunction with NeatMap, although we have not considered it in this paper. Methods such as model-based clustering do not presently have associated visualization methods, but if their results could somehow be mapped onto points in Euclidean space, they too could be visualized with the help of NeatMap. Note that NeatMap analyzes the rows and columns of the gene expression matrix separately, and is therefore not intended to visualize bi-clustering results.
Principal Component Analysis (PCA) produces a low dimensional representation of the data using the linear combinations of variables that capture the maximum amount of variance. Being a linear scheme, it is very fast, although this may sometimes be at the expense of quality of result.
Non-metric Multi-Dimensional Scaling (nMDS) [20, 21] is a dimensional reduction scheme that attempts to represent factors as points in a low dimensional Euclidean space such that the (relations among) distances between the points in the low dimensional space are consistent with those in the original data. nMDS is a non-linear scheme that is typically found to outperform PCA, but is slower for large data sets.
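As a rough illustration of the two embeddings just described, the sketch below uses only base R and the bundled MASS package rather than NeatMap's own routines. The synthetic cyclic profiles, the noise level and all variable names are assumptions made for the example, and `MASS::isoMDS` (Kruskal's non-metric MDS) is swapped in here as a stand-in for the nMDS algorithm of [21].

```r
library(MASS)  # recommended package shipped with R; provides isoMDS()

# Hypothetical data: 60 "genes" with noisy sinusoidal time courses
# that differ only in phase (18 time points).
set.seed(42)
n_genes <- 60
times <- seq(0, 2 * pi, length.out = 18)
phase <- runif(n_genes, 0, 2 * pi)
profiles <- t(sapply(phase, function(p) sin(times - p))) +
  matrix(rnorm(n_genes * length(times), sd = 0.2), nrow = n_genes)
rownames(profiles) <- paste0("gene", seq_len(n_genes))

# PCA: linear projection onto the two directions of largest variance.
pca <- prcomp(profiles, center = TRUE, scale. = FALSE)
pos_pca <- pca$x[, 1:2]
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Non-metric MDS on an amplitude-neutral distance (1 - Pearson correlation).
d <- as.dist(1 - cor(t(profiles)))
nmds <- isoMDS(d, k = 2, trace = FALSE)
pos_nmds <- nmds$points  # one 2D position per gene
```

Either `pos_pca` or `pos_nmds` is an embedding of the kind the plots described below are meant to consume: one row of low dimensional coordinates per factor.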
heatmap1: This is the traditional heat map, except that a dimension-reduction scheme other than clustering may be used to order the rows and/or columns. NeatMap itself provides a novel way to do this from a 2D embedding method: normalize the data, or use an amplitude-neutral distance measure such as the Pearson correlation. The embedded result produced by PCA, nMDS, etc. is then often annular and can be parameterized, approximately, by a single variable, viz., the angular position (figure 1d). This is a better option than ordering by a single component. The standard cluster dendrogram may be superimposed on the heat map for mutual validation.
circularmap: Similar to heatmap1 except the arrangement is circular (figure 1e) rather than linear, to emphasize the periodicity of the angular positions obtained as above (or using other methods that produce annular results). It is easy to make comparisons across conditions and factors. The factor clustering result may be superimposed on this plot.
lineplot: The 2D dimensionally-reduced factor relationship result is gridded, and the profiles of all the factors within each grid cell are displayed together as line graphs (figure 1c). This provides a global understanding of the nature of the data and its embedding. However, individual factors are harder to pick out, and comparison across conditions is more difficult.
draw.dendrogram3d: Cluster validation of the 2D embedding result for factors (figure 2b) in a 3D environment. The clustering result for both factors and conditions may be superimposed on profileplot3d.
profileplot3d: Addresses the inability of heatmap1 and circularmap to depict radial information by visualizing the profiles in a 3rd dimension using a rotatable 3D environment (figure 3a).
stereo.profileplot3d: A stereo plot where two versions of the same profileplot3d result are shown as viewed from slightly different perspectives to produce the impression of a true 3D view (figure 3b). The plot may be rotated dynamically to provide different views. This plot should also be useful for producing 3D plots for publications where rotation is not possible.
The functions above are dimension-reduction method neutral; dimensionally-reduced results provided by the user are plotted. Convenience wrapper functions make.heatmap1, make.circularmap, make.profileplot3d and make.stereo.profileplot3d are also provided. They take just the raw data as input, perform dimension-reduction using either nMDS or PCA, and finally produce the appropriate plots. All 2D plots were implemented using ggplot2 and 3D plots using rgl. These libraries have numerous functions for additional customization and modification of the plots produced by NeatMap.
The utility of the plots described above is demonstrated with the aid of two different microarray-based datasets. The 2D plots are illustrated with the help of the Spellman et al. dataset identifying cell cycle related genes in yeast, while microarray data from the human gene atlas study, profiling gene expression across multiple tissues, is used for the 3D plots.
Spellman et al. produced genome-wide time course profiles in yeast using microarrays under different synchronization methods. Fourier analysis was then used to identify 800 genes, with the correct periodicity, as cell cycle related. We consider only these 800 cell cycle related genes and study their profiles under alpha synchronization. For an example with a larger number of points without such periodicity see Additional File 1. Since a natural time ordering of the measurements exists, we are only interested in the relationship between genes.
Such a ring-like structure is very common when an amplitude-normalized distance measure such as the Pearson correlation is used. In this situation, it is natural to parameterize the position of a gene by a single angle. This is what heatmap1 does. For each gene, its angular position in the nMDS result (figure 1b), with respect to its center of mass, is determined, and the profiles are placed (figure 1d) in a standard heat map ordered according to this angle. The periodic nature of the profiles is now clear, and it is evident that points are arranged by time of up-regulation; essentially the cell cycle phase in which the gene is expressed. While in this case the angular coordinate was interpretable as the cell cycle phase, this method works even with non-periodic data when such an interpretation is not possible (see, for example, Additional File 1). Note that heatmap1 also accepts orderings produced by other methods. The R package seriation offers a variety of these, and heatmap1 plots using them for the Spellman data set are available as Additional File 3. In general, the NeatMap ordering is superior, except for the case of Rank Two Ellipse. This method, like NeatMap, uses angular ordering based on normalized profiles (the correlation matrix itself in this case). heatmap1 also allows the superimposition of clustering results. Evidently, the local arrangements in nMDS and clustering are consistent. Large scale rearrangement, produced by incorrect 'swinging', however, makes the clustered heat map result seem poor.
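The angular ordering described above can be sketched in a few lines of base R. This is an illustration of the idea only, not NeatMap's source code: the helper name `angular_order` and the synthetic ring of points are hypothetical.

```r
# Order the rows of a 2D embedding by their angle about the centre of mass.
# pos: numeric matrix with one row per gene and two columns (x, y).
angular_order <- function(pos) {
  centre <- colMeans(pos)                      # centre of mass of the embedding
  angle <- atan2(pos[, 2] - centre[2],
                 pos[, 1] - centre[1])         # angle in (-pi, pi]
  order(angle)                                 # row indices sorted by angle
}

# Hypothetical annular embedding: 50 points on a ring, in shuffled order.
set.seed(7)
theta <- sample(seq(-3, 3, length.out = 50))   # made-up "phases"
pos <- cbind(5 + 2 * cos(theta), -1 + 2 * sin(theta))

ord <- angular_order(pos)
# A profile matrix with matching rows would then be drawn as
# profiles[ord, ], e.g. with image(t(profiles[ord, ])).
```

Because the angle is periodic, the first and last rows of the resulting ordering are neighbours on the ring; where the ordering is cut open is a convention of `atan2`, not a property of the data.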
There are some long lines in the gene clustering result in figure 1d spanning the entire length of the heat map. This is a consequence of the periodicity of the angular variable, which results in the two opposite ends of the heat map being almost identical. To avoid artifacts from this periodicity, one may use circularmap (figure 1e). The ordering of profiles is identical to heatmap1, except they are placed along a circle according to their angular positions in figure 1b. One additional advantage of this format is that the non-uniformity in the phase distribution stands out more clearly. It is much harder to gain this type of information from a traditional heat map display.
Figure 1c shows the lineplot based on the nMDS result in figure 1b. As explained earlier, each cell in the grid in figure 1c shows the time course profiles of all the genes in the corresponding cell in figure 1b. The sinusoidal nature of the profiles is much clearer in this plot. It also emerges that the radial coordinate in this case is a measure of 'cyclicity', with the genes close to the centre being less cyclic.
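The gridding step behind such a plot can be approximated as follows. This is a minimal base R sketch under stated assumptions: the helper `grid_cells`, the bin count and the random data are hypothetical, and the binning used by NeatMap's lineplot may differ in detail.

```r
# Assign each point of a 2D embedding to a cell of an n_bins x n_bins grid
# and return the row indices that fall in each non-empty cell.
grid_cells <- function(pos, n_bins = 4) {
  gx <- cut(pos[, 1], breaks = n_bins, labels = FALSE)  # column of the grid
  gy <- cut(pos[, 2], breaks = n_bins, labels = FALSE)  # row of the grid
  split(seq_len(nrow(pos)), list(x = gx, y = gy), drop = TRUE)
}

# Hypothetical embedding and profiles for 100 genes and 10 time points.
set.seed(3)
pos <- matrix(runif(200, -1, 1), ncol = 2)
profiles <- matrix(rnorm(100 * 10), nrow = 100)

cells <- grid_cells(pos, n_bins = 4)

# The profiles of one cell could then be overlaid as line graphs, e.g.
# matplot(t(profiles[cells[[1]], , drop = FALSE]), type = "l")
```

Each element of `cells` is named by its grid position (for example "2.3"), so the panels can be laid out in the same arrangement as the embedding itself.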
Thus, lineplot emphasizes the overall nature and change in profiles with position. However, compared to heatmap1 and circularmap, comparison of expression at a fixed time across genes is more difficult. It is also more difficult to quickly look up a specific gene. On the other hand, heatmap1 and circularmap are intended for essentially one dimensional results. To deal with the more general case we must use 3D rotatable plots.
pos.nMDS <- nMDS(alpha.profiles)$x  # Perform nMDS embedding
To use PCA instead of nMDS, a single parameter specifying this would need to be added to each of these plots.
We illustrate the 3D plots using the gene atlas dataset. Su et al. used microarrays to analyze the expression profiles of genes in a variety of tissues in both human and mouse. There is no natural ordering of the genes or tissues, but the relationships between tissues are more easily understood. We therefore primarily focus on these.
A 2D embedding of the same data using nMDS with Pearson correlation was also performed. The cluster analysis result was superimposed on the 2D nMDS result in a rotatable 3D environment using draw.dendrogram3d (figure 2b). The same three clusters are present, and there is broad agreement between the clustering and nMDS results. Unlike the clustering result, however, the relationship between the brain and nervous system tissues is much clearer. The nervous system genes are also quite similar to the central cluster of tissues in figure 2b. Apparently, cluster analysis assigns them to this cluster, and in doing so their relationship to the proper brain tissues is lost.
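The idea of overlaying a clustering result on a 2D embedding for mutual validation can be tried out without any 3D machinery. The sketch below is base R only and makes several assumptions: the three-group "tissue" data are synthetic, classical MDS (`cmdscale`) stands in for nMDS, and a flat coloured scatter plot stands in for draw.dendrogram3d.

```r
# Hypothetical data: 30 "tissues" in three groups, profiled over 40 "genes".
set.seed(11)
centres <- matrix(rnorm(3 * 40), nrow = 3)
tissues <- centres[rep(1:3, each = 10), ] +
  matrix(rnorm(30 * 40, sd = 0.5), nrow = 30)
rownames(tissues) <- paste0("tissue", 1:30)

# Average linkage hierarchical clustering with a Pearson-based distance.
d <- as.dist(1 - cor(t(tissues)))
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 3)      # cut the tree into three clusters

# 2D embedding of the same distances (classical MDS as a stand-in).
pos <- cmdscale(d, k = 2)

# Superimpose: colour each embedded point by its cluster membership, e.g.
# plot(pos, col = groups, pch = 19)
cluster_sizes <- table(groups)
```

Points that share a colour but sit far apart in the embedding, or points of one colour that sit inside another group, are the cases where the two methods disagree and deserve a closer look.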
Assuming the data is stored in matrix form (with genes along the rows and tissues along the columns) in atlas.profiles, the cluster analysis result for tissues in atlas.cluster, and the color coding of the three groups in atlas.group.colors, the code to produce the plots in figures 2 and 3 is:
The clustered heat map, an immensely popular means to visualize large amounts of data, is encumbered by its dependence on cluster analysis. Many alternative dimension-reduction schemes have the potential to do better, but have so far lacked effective means to visualize whole datasets in the way the heat map can. NeatMap is an R package that addresses this need. Using the well-known Spellman yeast cell-cycle and human gene atlas microarray datasets, we have shown that a dimension-reduction method (nMDS was used in this paper for illustration) in conjunction with NeatMap is more informative than the clustered heat map. It is hoped that this package will increase the popularity of these methods and spur the development of novel visualization schemes.
Project name: NeatMap
Project home page: http://cran.r-project.org/web/packages/NeatMap/index.html
Operating system(s): Platform independent
Programming language: R
Other requirements: R, R packages (ggplot2 and rgl)
The authors are grateful to Tim Gernat for suggesting this package be created, and for feedback about early versions. The authors would also like to thank Lisa Stubbs for introducing them to the atlas dataset. The comments and suggestions of the anonymous reviewers are greatly appreciated; they helped us improve the paper considerably. This research was partly supported by Center of Excellence Grant for Department of Mathematics, Keio University.
- Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 1998, 95(25):14863. doi:10.1073/pnas.95.25.14863
- Brauer M, Yuan J, Bennett B, Lu W, Kimball E, Botstein D, Rabinowitz J: Conservation of the metabolomic response to starvation across two divergent microbes. Proceedings of the National Academy of Sciences 2006, 103(51):19302. doi:10.1073/pnas.0609508103
- Schmid M, Davison T, Henz S, Pape U, Demar M, Vingron M, Scholkopf B, Weigel D, Lohmann J: A gene expression map of Arabidopsis thaliana development. Nature Genetics 2005, 37(5):501–506. doi:10.1038/ng1543
- Weinstein J: BIOCHEMISTRY: A Postgenomic Visual Icon. Science 2008, 319(5871):1772. doi:10.1126/science.1151888
- Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng G: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 2006, 22(19):2405. doi:10.1093/bioinformatics/btl406
- Baum D, Smith S, Donovan S: The tree-thinking challenge. Science (Washington) 2005, 310(5750):979–980. doi:10.1126/science.1117727
- Handl J, Knowles J, Kell D: Computational cluster validation in post-genomic data analysis. Bioinformatics 2005, 21(15):3201–3212. doi:10.1093/bioinformatics/bti517
- Raychaudhuri S, Stuart J, Altman R: Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput 2000, 5: 455–466.
- Taguchi Y, Oono Y: Relational patterns of gene expression via non-metric multidimensional scaling analysis. Bioinformatics 2005, 21(6):730–740. doi:10.1093/bioinformatics/bti067
- R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2008. [ISBN 3-900051-07-0]. [http://www.R-project.org]
- Hibbs M, Dirksen N, Li K, Troyanskaya O: Visualization methods for statistical analysis of microarray clusters. BMC Bioinformatics 2005, 6: 115. doi:10.1186/1471-2105-6-115
- Hahsler M, Buchta C, Hornik K: seriation: Infrastructure for seriation. 2009. [R package version 1.0-0]. [http://CRAN.R-project.org/package=seriation]
- Tavazoie S, Hughes J, Campbell M, Cho R, Church G: Systematic determination of genetic network architecture. Nature Genetics 1999, 22: 281–285. doi:10.1038/10343
- Tseng G, Wong W: Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics 2005, 61: 10–16. doi:10.1111/j.0006-341X.2005.031032.x
- McLachlan G, Bean R, Peel D: A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 2002, 18(3):413–422. doi:10.1093/bioinformatics/18.3.413
- Qin Z: Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics 2006, 22(16):1988. doi:10.1093/bioinformatics/btl284
- Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 2002, 18(9):1194. doi:10.1093/bioinformatics/18.9.1194
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E, Golub T: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences 1999, 96(6):2907–2912. doi:10.1073/pnas.96.6.2907
- Jolliffe I: Principal component analysis. Springer Verlag; 2002.
- Kruskal J: Nonmetric multidimensional scaling: a numerical method. Psychometrika 1964, 29(2):115–129. doi:10.1007/BF02289694
- Taguchi Y, Oono Y: Nonmetric multidimensional scaling as a data-mining tool: new algorithm and new targets. Geometrical Structures of Phase Space Multidimensional Chaos, Special Volume of Adv Chem Phys 2004, 130: 315–351.
- Alter O, Brown PO, Botstein D: Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. PNAS 2003, 100(6):3351–3356. doi:10.1073/pnas.0530258100
- Chen C: Generalized association plots: Information visualization via iteratively generated correlation matrices. Statistica Sinica 2002, 12: 7–30.
- Wickham H: ggplot2: An implementation of the Grammar of Graphics. 2008. [R package version 0.8]. [http://had.co.nz/ggplot2/]
- Adler D, Murdoch D: rgl: 3D visualization device system (OpenGL). 2009. [R package version 0.84]. [http://rgl.neoscientists.org]
- Fink G, Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998, 9(12):3273–3297.
- Su A, Wiltshire T, Batalov S, Lapp H, Ching K, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al.: A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences 2004, 101(16):6062–6067. doi:10.1073/pnas.0400782101
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.