Visualization and analysis of microarray and gene ontology data with treemaps
© Baehrecke et al; licensee BioMed Central Ltd. 2004
Received: 31 January 2004
Accepted: 28 June 2004
Published: 28 June 2004
The increasing complexity of genomic data presents several challenges for biologists. Limited computer monitor views of data complexity and the dynamic nature of data in the midst of discovery increase the challenge of integrating experimental results with information resources. The use of Gene Ontology enables researchers to summarize results of quantitative analyses in this framework, but the limitations of typical browser presentation restrict data access.
Here we describe extensions to the treemap design to visualize and query genome data. Treemaps are a space-filling visualization technique for hierarchical structures that show attributes of leaf nodes by size and color-coding. Treemaps enable users to rapidly compare sizes of nodes and sub-trees, and we use Gene Ontology categories, levels of RNA, and other quantitative attributes of DNA microarray experiments as examples. Our implementation of treemaps, Treemap 4.0, allows user-defined filtering to focus on the data of greatest interest, and these queried files can be exported for secondary analyses. Links to model system web pages from Treemap 4.0 give users access to details about specific genes without leaving the query platform.
Treemaps allow users to view and query the data from an experiment on a single computer monitor screen. Treemap 4.0 can be used to visualize various genome data, and is particularly useful for revealing patterns and details within complex data sets.
Genome sequencing has presented biologists with new challenges in data analysis. This advance, coupled with the advent of methods to empirically analyze whole-genome changes in RNA levels, protein levels, and protein activities [1–5], makes it difficult to visualize summaries of data while still obtaining meaningful details. The use of colored mosaics and hierarchical clustering to query relative RNA levels revolutionized DNA microarray analyses, and has been the prominent mechanism for assessing these data. While continued development of this data analysis platform is useful, these methods limit the ability to visualize multiple data attributes simultaneously, such as qualitative information about gene families or biological function together with quantitative information such as RNA level and p-value.
The Gene Ontology (GO) consortium has established a vocabulary that provides a hierarchical structure for the analysis of genome data [7, 8]. GO provides a classification of gene products into molecular functions, biological processes, and cellular components. Therefore, GO classification is particularly useful for getting overviews of data such as the percentage of genes transcribed within each category or node, and also provides a rapid mechanism for researchers to classify genes that are often given nondescript numerical names during genome annotation. However, the dynamic nature of GO data, which are updated weekly for active genome projects, challenges researchers to be vigilant in the analysis and re-analysis of data. Ideally, researchers would be able to obtain information about both qualitative attributes such as GO category, and quantitative attributes such as RNA level for an entire experiment, and query these data sets for details without losing the overview of the entire data structure.
Several computational approaches have been developed to visualize and query microarray data, including Spotfire and Genespring. While both of these platforms are capable of analyzing both qualitative and quantitative data, neither provides an ideal platform to visualize multiple attributes simultaneously while allowing dynamic queries of data in the context of the GO classification. Further, limited mechanisms exist for merging quantitative attributes such as RNA level with GO categories. Several programs have been developed to edit, browse, and facilitate studies of GO. Among these applications, FatiGO, GoMiner, MAPPFinder, and GoSurfer provide useful platforms for the analysis of microarray data in the context of the GO hierarchy, but their use of typical windows-style browsers and tree diagrams lacking quantitative data limits the ability to rapidly see patterns and obtain details on demand.
Treemaps were developed to facilitate visualization of both hierarchical and quantitative information [15, 16]. This technique has been used to visualize and query several forms of data, including the stock market and electronic product catalogs. While treemaps have been recognized as a strategy that can be used to visualize clusters and GO annotation data, previous use of this strategy has been limited to static approaches of pre-selected data and has failed to take advantage of several important strengths of Treemap 4.0.
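To make the space-filling idea concrete, the following minimal Python sketch implements the slice-and-dice layout from the original treemap work. It is an illustration only: the names Node and slice_and_dice and the toy hierarchy are assumptions made for this example, and the sketch is not the layout code used by Treemap 4.0. Each leaf receives a rectangle whose area is proportional to its size attribute, and the cut direction alternates with depth in the hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    size: float = 0.0  # leaf weight, for example fold change in RNA level
    children: list = field(default_factory=list)

    def total(self):
        """Return the summed size of all leaves below this node."""
        if not self.children:
            return self.size
        return sum(child.total() for child in self.children)


def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Append (name, x, y, w, h) for every leaf, splitting the area by size.

    The cut direction alternates with depth: children are placed side by
    side at even depths and stacked at odd depths.
    """
    if out is None:
        out = []
    if not node.children:
        out.append((node.name, x, y, w, h))
        return out
    total = node.total()
    offset = 0.0
    for child in node.children:
        share = child.total() / total if total else 0.0
        if depth % 2 == 0:
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out


# Toy hierarchy (hypothetical): one category with two genes, one with a single gene.
root = Node("root", children=[
    Node("category A", children=[Node("gene 1", 1.0), Node("gene 2", 2.0)]),
    Node("category B", children=[Node("gene 3", 1.0)]),
])
for name, x, y, w, h in slice_and_dice(root, 0.0, 0.0, 4.0, 3.0):
    print(f"{name}: x={x:.2f} y={y:.2f} w={w:.2f} h={h:.2f}")
```

The result is returned as a list rather than a dictionary keyed by gene name, because the same gene can occur under several GO categories and would otherwise overwrite itself.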
Here we extend Treemap 4.0 to visualize and query microarray data in the GO framework, and use studies of programmed cell death during development as an example. We present the advantages of the use of size and color to represent attributes of the genome. By using code that merges up-to-date GO assignments with various user-defined attributes such as RNA level, p-value, and others, we utilize treemaps to get overviews of data, and to query for details. Treemap 4.0 provides users with rapid responses to queries (usually a fraction of a second), and the results can be saved and exported such that they can be analyzed using other software. A link between Treemap 4.0 and organism-specific web sites enables users to get details about specific genes as needed. Thus, Treemap 4.0 fills a critical void for genome researchers who want to integrate and query GO information with various quantitative data.
Results and Discussion
Treemaps enable visual overviews of complex genome data with details on demand
We have used treemaps to visualize DNA microarray data that examine changes in RNA levels during steroid-triggered programmed cell death in Drosophila. RNA was extracted from salivary glands dissected from animals that were staged 6 and 12 hours following puparium formation, that is, before and after the rise in steroid that triggers cell death. Three independent salivary gland RNA samples were collected from each stage and used to hybridize Drosophila Affymetrix oligonucleotide Genechips. These Genechips contain 13,197 unique gene transcripts, and 2,876 gene transcripts were consistently detected in all 3 samples of either 6- or 12-hour salivary glands.
Treemap 4.0 was used to analyze the 2,876 gene transcripts that were consistently detected in dying salivary glands. The first step in this analysis involved parsing the 2,876 gene transcripts in the GO framework. This was accomplished using a Perl script that assigns each gene transcript a GO path in the molecular function, biological process, and cellular component categories, and then writes an output file for use in Treemap 4.0 (software and demonstration data available at HCIL). An advantage of this parser is that it enables the analysis of data in the context of up-to-date GO classification, but it also has disadvantages because of limitations associated with Perl. Perl is an interpreted language that runs slowly on large data sets, and 5 minutes were required to parse and write the data to a file with 2,876 genes containing 11,638 GO assignments. A positive aspect of the Perl parser is that comments are provided internally so that competent programmers can optimize it. While implementation of this parser in a language such as C++ would be faster and is a future objective, such a change may be an impediment to biologists who use Perl because of its user-friendly attributes. The second limitation of this parser is that users must define their own column headings and types after the treemap file is written, as the column definition and data type must match for Treemap 4.0 to read the file, and this will differ between users and data files. This is one of many possible parser improvements that are future objectives, including accounting for missing values, offering extended error checking with error messages, and providing weighting functions. However, our primary objective was to implement and test the utility of Treemap 4.0 for dynamic queries of microarray data in the context of up-to-date GO classification, and this parser enabled us to accomplish this goal and will allow others to use treemaps to query genome data.
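The merging step performed by the parser can be sketched as follows. This is a minimal Python illustration, not the Perl script distributed with Treemap 4.0: the function names, the single-parent hierarchy (the real GO allows a term to have several parents), the column order, and the entry for "geneX" are all assumptions made for this example. Only the Ark values are taken from the text. The sketch writes one tab-delimited row per gene and GO assignment, so a gene with several assignments appears several times.

```python
import csv
import io


def go_path(term, parent_of):
    """Walk from a GO term up to its root and return the path root-first.

    Assumes each term has at most one parent and that there are no cycles.
    """
    path = [term]
    while term in parent_of:
        term = parent_of[term]
        path.append(term)
    return list(reversed(path))


def merge_rows(expression, gene_terms, parent_of):
    """Yield one row per (gene, GO assignment): attributes, then the GO path."""
    for gene, (fold_change, p_value) in expression.items():
        for term in gene_terms.get(gene, []):
            yield [gene, fold_change, p_value] + go_path(term, parent_of)


def write_tsv(rows, handle):
    """Write rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Simplified toy hierarchy (hypothetical parent links).
parent_of = {
    "cell death": "biological process",
    "peptidase activity": "enzyme",
    "enzyme": "molecular function",
}
# Ark values are quoted in the text; geneX is an invented placeholder.
expression = {"Ark": (50.8333, 0.0036), "geneX": (2.0, 0.5)}
gene_terms = {"Ark": ["cell death"], "geneX": ["cell death", "peptidase activity"]}

buffer = io.StringIO()
write_tsv(merge_rows(expression, gene_terms, parent_of), buffer)
print(buffer.getvalue(), end="")
```

Column headings and types are deliberately left out of the output, since the text notes that users define these themselves after the file is written.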
Tools that facilitate visualization and queries of genome data
Treemap 4.0 enables users to zoom on details by double-clicking on an area of interest, and this results in a rapid update of the area selected (a right mouse click enables the user to zoom out one level). For example, programmed cell death involves the degradation of cell components, and this is known to require the activity of peptidase (protease) enzymes. By double-clicking the molecular function GO category that contains 2,726 gene nodes, and then selecting enzymes, the user can immediately see that enzymes are one of the largest categories within molecular function with 1,186 gene nodes (Figure 3A). Furthermore, by double-clicking and zooming on the peptidase activity category and then selecting cysteine-type peptidase within that category, the user can immediately see in the "Details of selected node" window that 23 gene nodes are present, and that most of the boxes are large and red. Closer examination reveals the presence of Nc (Dronc) and Ice (Drice) (Figure 3B), which encode two cysteine-type peptidases that are commonly known as Caspases, and these are critical regulators of programmed cell death in higher animals including Drosophila and humans. Focused evaluation of the proteases that are present in dying cells, such as Caspases, provides one rapid mechanism to determine if either expected or predicted genes are transcribed in dying cells. The association of non-Caspase proteases that may not have been previously implicated in programmed cell death, however, enables users to test if these genes are involved in cell death by conducting additional biological experiments.
Alternatively, salivary gland cell death may either be regulated at the level of protein, or perhaps cell death genes may not have been classified within this category of the GO because they may not have been identified to function in this process at this time. The Caspase co-factor Ark (Dark/Dapaf-1/Hac-1) is easily recognized upon closer examination of the genes within the cell death category because of its large box (50.8333 fold-change in RNA level) and red color (0.0036 p-value) (Figure 4B). Several other interesting cell death genes including the Caspases Nc (Dronc) and Ice (Drice), the Bcl-2 family member Buffy, the inhibitor of apoptosis th (Thread/Diap1), the CD36 family member crq, and the serine/threonine kinase Akt1 are present with significant red to orange p-values. In contrast, the presence of black insignificant p-values for the known cell death regulators rpr (Reaper), W (Wrinkled/Hid) and Eip93F (E93) is surprising (Figure 4B). By selecting rpr, W, or Eip93F, it was rapidly determined that these insignificant p-values must be due to biological variation in RNA levels in cells or different salivary glands, as all three of these genes are detected at reasonable levels 12 hours after puparium formation (Figures 4C, 4D, 4E). It is not surprising to have such variation in the RNA levels for these 3 genes, as these RNAs are transcribed in a very transient manner in dying salivary glands [24–27].
Genome researchers require rapid access to details about genes such as map position within the genome, nucleotide and protein sequence, and published literature, to name a few examples. Treemap 4.0 was adapted to contain a direct link to organism-specific websites within the main window of the lower right control panel. By enabling this link, users who select a gene node while holding the control key will be directly connected to the FlyBase page for the gene of interest. While we have only implemented this for use with FlyBase at this time, we intend to make this function flexible and enable all GO organism databases to be selected in the future. Finally, Treemap 4.0 is not meant to serve as a stand-alone tool for genome analysis, and any queried file that is selected while holding the control key will be saved to a tab-delimited file that can then be used in other software, such as hierarchical clustering programs.
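The filter-then-export workflow described above can be sketched in a few lines of Python. This is a hedged illustration rather than a description of how Treemap 4.0 is implemented: the function names, the column names, the default thresholds, and the "geneX" and "geneY" rows are hypothetical, while the Ark values are those quoted in the text.

```python
import csv
import io


def filter_rows(rows, max_p=0.05, min_fold=2.0):
    """Keep rows whose p-value and fold change pass both thresholds."""
    return [
        row for row in rows
        if row["p_value"] <= max_p and row["fold_change"] >= min_fold
    ]


def export_tsv(rows, handle, columns=("gene", "fold_change", "p_value")):
    """Write the selected rows, with a header line, as tab-delimited text."""
    writer = csv.DictWriter(
        handle,
        fieldnames=list(columns),
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently drop attributes that are not exported
    )
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"gene": "Ark", "fold_change": 50.8333, "p_value": 0.0036},
    {"gene": "geneX", "fold_change": 1.2, "p_value": 0.01},   # invented: fold change too low
    {"gene": "geneY", "fold_change": 8.0, "p_value": 0.40},   # invented: not significant
]

buffer = io.StringIO()
export_tsv(filter_rows(rows), buffer)
print(buffer.getvalue(), end="")
```

Writing to a file object instead of a fixed path keeps the sketch usable with an in-memory buffer, a file opened by the caller, or standard output.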
Treemaps are a space-filling visualization technique that facilitates exploration of hierarchical structures. While treemaps have been widely used in the evaluation of business performance, the stock market, inventory management, and product catalogs, they have not been widely used in genome research. By presenting attributes of genes by size (RNA level) and color-coding (p-value), we demonstrate how treemaps can facilitate queries of DNA microarray data in the context of Gene Ontology categories. User-defined filtering within Treemap 4.0 allows the selection of data of greatest interest that can then be exported for secondary analyses with other software. For simplicity, we have only used Affymetrix DNA microarray data in the examples provided. Treemaps can be used to visualize all forms of DNA microarray data, however, and this platform should also enable researchers to query for patterns within a variety of complex genome data sets. For example, Treemap can also be used to query other qualitative and quantitative attributes of genomes including map position, degree of conservation between species, size of a gene family, extent of alternative splicing, and protein levels. In addition, genome annotation may be facilitated by treemap presentations, as they allow users to see complex information in a single display. For example, a single gene is represented many times within the GO, and this information may not be apparent in either large tables or genome profiling summary statistics. Comprehensive GO assignments are intended to provide researchers with maximum information, but such results can also be misleading as the same gene may appear many times within a single category (see Figures 3, 4 for examples). Although treemaps allow users to see the number of nodes (genes) in each category, presentation of the complete data allows informed decisions and critical evaluation of GO assignments.
DNA microarray data were taken from previously described experiments aimed at understanding how RNA levels change during steroid-triggered programmed cell death, and the data are available at http://www.umbi.umd.edu/~cbr/baehrecke/data.htm. RNA was extracted from salivary glands dissected from wild-type Canton S Drosophila melanogaster that were staged 6 and 12 hours following puparium formation. Three independent salivary gland RNA samples were collected from each stage, converted into double-stranded cDNA, and used to synthesize biotinylated cRNA. Affymetrix Drosophila Genome Arrays were hybridized, washed, scanned, and analyzed with Affymetrix Micro Array Suite 4.0. Genes were excluded from this study if their hybridization signal was not consistently detected above background (determined by algorithms in Affymetrix Micro Array Suite 4.0) in all 3 replicates of a single time point within an experiment. The fold change values were averaged across the 3 replicates, and p-values were determined by conducting a t-test using average difference values (determined by algorithms in Affymetrix Micro Array Suite 4.0).
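The two summary statistics described above can be illustrated with a short Python sketch. Several points are assumptions rather than details given in the text: the variant of the t-test is not stated, so a two-sample, equal-variance (Student) test is assumed; the function names are hypothetical; and the input numbers are toy values, not measurements from this study. With three replicates per time point the test has 4 degrees of freedom, for which the Student t cumulative distribution has a closed form, so no statistics package is needed.

```python
import math
from statistics import mean, variance


def mean_fold_change(fold_changes):
    """Average the per-replicate fold-change values."""
    return mean(fold_changes)


def t_test_3v3(a, b):
    """Two-sample, equal-variance t-test for two groups of three values.

    Returns (t, two-sided p-value). Three plus three values give 4 degrees
    of freedom, for which the Student t CDF has a closed form.
    """
    if len(a) != 3 or len(b) != 3:
        raise ValueError("this sketch only handles 3 versus 3 replicates")
    pooled = (variance(a) + variance(b)) / 2.0  # pooled variance, equal group sizes
    if pooled == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(b) - mean(a)) / math.sqrt(pooled * (1 / 3 + 1 / 3))
    scaled = 1.0 + t * t / 4.0
    cdf = 0.5 + 0.375 * (t / math.sqrt(scaled)) * (1.0 - (t * t / scaled) / 12.0)
    p = 2.0 * (1.0 - cdf) if t >= 0 else 2.0 * cdf
    return t, p


# Toy average difference values for one gene at the two time points.
six_hour = [1.0, 2.0, 3.0]
twelve_hour = [4.0, 5.0, 6.0]
t_stat, p_value = t_test_3v3(six_hour, twelve_hour)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"mean fold change = {mean_fold_change([2.0, 4.0, 6.0]):.1f}")
```

A gene with a large average fold change can still receive a large p-value when its replicates vary widely, which is the situation described for rpr, W, and Eip93F.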
Treemap 4.0 (run-treemap.bat) was implemented in Java 2 (J2SE v. 1.4.1 or later) and the parser (run-parser.bat) was written in Perl. Treemap and the parser can be run on both Windows and Macintosh computers, but Macintosh requires a recent version of system 10 to utilize a suitable version of Java. This software is freely available for non-commercial use (e.g. academic use, evaluation by a single individual in a company, etc.) and can be licensed for commercial use. All data files are parsed and processed into a single tab-delimited text file. When started, Treemap will first load all input data into main memory and subsequently process any necessary computations. For data with fewer than 20,000 nodes, with each node containing 10 or fewer attributes, the memory requirement is moderate (less than 256 MB on a 32-bit processor at 700 MHz). The most time-consuming step is the initial parsing of the microarray data, the FlyBase gene association flat file, and the file derived from the molecular function, biological process, and cellular component categories of the GO. While the Perl scripts may need at least five minutes to finish parsing all input data and write the tab-delimited file, Treemap 4.0 will need only seconds to display the application.
We thank Catherine Plaisant, Harry Hochheiser, Steve Mount, Jinwook Seo, Emily Clough, Alvaro Godinez, Louisa Wu, and members of the Baehrecke laboratory for helpful comments and discussions. Studies of genome regulation of cell death are supported by NIH grant GM59136 to E.H.B.
- Schena M, et al.: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270: 467–470.
- Link AJ, et al.: Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol 1999, 17: 676–682. doi:10.1038/10890
- Lockhart DJ, et al.: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 1996, 14: 1675–1680.
- MacBeath G, Schreiber SL: Printing proteins as microarrays for high-throughput function determination. Science 2000, 289: 1760–1763.
- Zhu H, et al.: Global analysis of protein activities using proteome chips. Science 2001, 293: 2101–2105. doi:10.1126/science.1062191
- Eisen MB, et al.: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. doi:10.1073/pnas.95.25.14863
- Consortium GO: Creating the gene ontology resource: design and implementation. Genome Res 2001, 11: 1425–1433. doi:10.1101/gr.180801
- Al-Shahrour F, Diaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 2004, 20: 578–580. doi:10.1093/bioinformatics/btg455
- Zeeberg BR, et al.: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 2003, 4: R28. doi:10.1186/gb-2003-4-4-r28
- Doniger SW, et al.: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol 2003, 4: R7. doi:10.1186/gb-2003-4-1-r7
- Shneiderman B: Tree visualization with tree-maps: A 2-dimensional space filling approach. ACM Transactions on Graphics 1992, 11: 92–99. doi:10.1145/102377.115768
- Bederson B, Shneiderman B, Wattenberg M: Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies. ACM Transactions on Graphics 2002, 21: 833–854. doi:10.1145/571647.571649
- McConnell P, Johnson K, Lin S: Applications of Tree-Maps to hierarchical biological data. Bioinformatics 2002, 18: 1278–1279. doi:10.1093/bioinformatics/18.9.1278
- Lee C-Y, et al.: Genome-wide analyses of steroid- and radiation-triggered programmed cell death in Drosophila. Curr Biol 2003, 13: 350–357. doi:10.1016/S0960-9822(03)00085-X
- Shi Y: Mechanisms of Caspase Activation and Inhibition during Apoptosis. Mol Cell 2002, 9: 459–470. doi:10.1016/S1097-2765(02)00482-3
- Baehrecke EH, Thummel CS: The Drosophila E93 gene from the 93F early puff displays stage- and tissue-specific regulation by 20-hydroxyecdysone. Dev Biol 1995, 171: 85–97. doi:10.1006/dbio.1995.1262
- Jiang C, Baehrecke EH, Thummel CS: Steroid regulated programmed cell death during Drosophila metamorphosis. Development 1997, 124: 4673–4683.
- Jiang C, et al.: A steroid-triggered transcriptional hierarchy controls salivary gland cell death during Drosophila metamorphosis. Mol Cell 2000, 5: 445–455. doi:10.1016/S1097-2765(00)80439-6
- Lee C-Y, et al.: E93 directs steroid-triggered programmed cell death in Drosophila. Mol Cell 2000, 6: 433–443. doi:10.1016/S1097-2765(00)00042-3
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.