Improvements in technology have resulted in the ability to perform increasingly complex experimental research, generating large amounts of multi-dimensional data. Research in biology today involves the exploration and analysis of the large, shared databases available to the wider scientific community. This reduces duplication of effort and allows more extensive research to be done, by providing independent data sources for verifying hypotheses formulated.
Conversely, technology has been unable to manage the data it has spawned; data management, analysis and visualisation tools struggle to meet requirements [1]. This is due largely to the constantly evolving information needs of the biologists working with the increasingly large amounts of heterogeneous data generated [2]. Bioinformatics developed to exploit information technology in biological data analysis [3, 4].
This paper looks at the development of visual solutions for intuitive analysis of anatomy ontologies. The primary data source is the Edinburgh Mouse Atlas Project (EMAP), developed at Edinburgh's Medical Research Council's (MRC) Human Genetics Unit (HGU), documenting the developmental stages of the mouse embryo.
Prototypes are being developed for visualisation in two dimensions (2D) and three dimensions (3D), to highlight relationships between and within different anatomy components. This will help uncover knowledge about gene expression, structure and function as cells and tissues evolve into defined organs, to track normal development and evolution.
Visual solutions for effective analysis of biological data
Effective design and development requires at least a basic understanding of the structure of the anatomy ontology data, so that the data can be mapped to spatial representations that capture users' mental models. Visualisations generated should provide biologists with data overviews, followed by the ability to study regions of interest (ROIs) in detail, within the context of the overall data set, to highlight patterns within the data [5, 6]. Support for interactive data exploration should be provided; functionality for browsing, searching and manipulating data structures allows analysis from multiple perspectives.
Visual encoding of textual data exploits humans' highly developed perceptual abilities to decrease the cognitive load associated with complex data analysis [6, 7]. A number of visualisation methods and techniques already exist for complex data analysis, both within and outwith the field of bioinformatics, including 2D and 3D scatter plots, self-organising maps (SOMs), parallel coordinates, 2D and 3D hierarchical graphs, information maps, murals and cubes, perspective walls, virtual landscapes, cityscapes, and physical space metaphors such as rooms, windows and desktops. Hyperbolic or fish-eye views and lenses, magic and semantic lenses, and dynamic query systems aid detailed study of ROIs, especially in large data sets. [3, 6, 8–10]
To ascertain which tools and techniques would provide, individually or in concert, optimal visual data analysis solutions for the study of anatomy ontologies, it is necessary to assess those that exist to determine their applicability to the data sets of interest and the tasks biologists perform. It is important that the target users, who have varying research backgrounds, are able to use the analysis solutions provided to interpret effectively the data required for their work. [11]
The Edinburgh Mouse Atlas Project
The working EMAP browser (see Figure 1) employs a collapsible, indented text index with mappings to corresponding anatomical components in 2D and 3D digital models for the developmental stages of the mouse embryo.
Extensions required to EMAP browsers
More intuitive methods are required for data analysis, especially where comparisons between multiple data sets, such as lineage across stages, are to be made. A major advantage of visual representation of the data over EMAP's text indices is the availability of an overview in addition to the ability to study ROIs in detail.
The ontology data is structured hierarchically from a root, with unique sub-components together forming complete (super) components [12], using part-of relationships. A graphical layout that exploits this structure should aid discovery of relationships within the data [6, 8]. Visualisations can make use of physical attributes such as shape, colour and size to encode properties of and relationships between data elements. The following sections discuss advantages in visual data analysis:
1. Representation of complex relationships
As all the developmental stages for a specified anatomy are combined into the complete, abstract organism, persistence of components across stages may result in non-unique entries, with different paths to the root.
An illustration from the mouse anatomy ontology is the component second polar body, which exists on the first level below the root for the first three stages of development. In the fourth stage, however, the second polar body forms a part of the extraembryonic component. Figure 2 shows the two occurrences of the second polar body in the ontology for the abstract organism, on the first and the second levels below the root.
The second polar body could be regarded as having multiple parentage in the abstract organism, as demonstrated for the component G in the index in Figure 3.
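The multiple parentage described above can be sketched with a small part-of graph. This is a minimal illustration, not EMAP's data model: the root label `mouse`, the `PART_OF` edge list and the helper names are assumptions introduced here; only the relationship between the second polar body and the extraembryonic component is taken from the example in the text.

```python
from collections import defaultdict

# part-of edges as (child, parent) pairs for the abstract organism.
# "mouse" is a placeholder label for the root (assumption).
PART_OF = [
    ("second polar body", "mouse"),                     # stages 1 to 3
    ("extraembryonic component", "mouse"),
    ("second polar body", "extraembryonic component"),  # stage 4
]

def build_parents(edges):
    """Index the edge list as child -> list of parents."""
    parents = defaultdict(list)
    for child, parent in edges:
        parents[child].append(parent)
    return parents

def paths_to_root(node, parents):
    """Return every path from node up to a node that has no parent."""
    if not parents.get(node):
        return [[node]]
    paths = []
    for parent in parents[node]:
        for tail in paths_to_root(parent, parents):
            paths.append([node] + tail)
    return paths

parents = build_parents(PART_OF)
paths = paths_to_root("second polar body", parents)
for path in paths:
    print(" -> ".join(path))
```

A node with more than one path to the root is exactly the kind of non-unique entry that an indented text index must list twice, whereas a graph layout can draw it once with two incoming links.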
2. Grouping
Grouping of anatomical components based on user-specified criteria, to provide different perspectives of the data structure, is a required function for which no graphical support exists in the EMAP browsers. Grouping enables users to focus on relationships within the data using classification methods other than those used to construct the ontologies. A typical example would be the skin that covers the entire organism. Skin is not represented as a super-component in any one stage; however, there may be components in each stage that together make up the skin of the organism at that point in development. A group component skin could be created that links to all components that comprise the skin in a stage, again using the default part-of relationship to form a complete whole, while preserving the uniqueness of data components [12] (see Figure 15). Note that this will result in multiple parentage for unique nodes, similar to that occurring in Figures 2 and 3.
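The creation of a group component can be sketched as the addition of one new node and one extra part-of link per member, leaving the existing nodes untouched. The component names below (`head surface layer`, `limb surface layer`) and the function `add_group` are hypothetical placeholders, not entries or operations from the EMAP ontology.

```python
def add_group(parents, group_name, members, root):
    """Attach group_name under root and add it as an extra part-of
    parent of each member, without duplicating any member node."""
    group_parents = parents.setdefault(group_name, [])
    if root not in group_parents:
        group_parents.append(root)
    for member in members:
        member_parents = parents.setdefault(member, [])
        if group_name not in member_parents:
            member_parents.append(group_name)

# Hypothetical fragment of one stage: child -> list of part-of parents.
parents = {
    "embryo": [],
    "head": ["embryo"],
    "limb": ["embryo"],
    "head surface layer": ["head"],
    "limb surface layer": ["limb"],
}
node_count_before = len(parents)

add_group(
    parents,
    "skin",
    ["head surface layer", "limb surface layer"],
    root="embryo",
)
print(parents["head surface layer"])  # now has two parents
```

Each member keeps its original parent and gains the group as a second one, which is why grouping produces the same multiple parentage seen for components that persist across stages.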
3. Lineage
Lineage across stages is currently displayed using plain text descriptions of a component's ancestors and descendants in consecutive text boxes arranged along a horizontal plane. The main disadvantage associated with this is the need to scroll through up to 28 text boxes to identify all the Theiler stages (TS) of the mouse embryo, for example, through which a specified component persists (see Figure 4).
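One compact alternative to scrolling through consecutive text boxes is to summarise the stages in which a component occurs as a single strip and as a list of stage ranges. The sketch below assumes 28 Theiler stages, as in the text; the function names and the stage set used in the example are illustrative assumptions rather than EMAP data.

```python
N_STAGES = 28  # number of Theiler stages assumed, following the text

def stage_ranges(present):
    """Collapse a set of stage numbers into inclusive (start, end) runs."""
    runs = []
    for stage in sorted(present):
        if runs and stage == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], stage)  # extend the current run
        else:
            runs.append((stage, stage))      # start a new run
    return runs

def stage_strip(present, n_stages=N_STAGES):
    """One character per stage: '#' where the component exists, '.' elsewhere."""
    return "".join(
        "#" if stage in present else "." for stage in range(1, n_stages + 1)
    )

# Illustrative stage set for a component present in stages 1 to 4.
present = {1, 2, 3, 4}
print(stage_strip(present))
print(stage_ranges(present))
```

The strip gives the overview in one line, while the ranges give the detail that would otherwise require reading each text box in turn.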
4. Visual querying
It is necessary to extend the sub-string searching provided in the EMAP browsers to retrieve related information from external data sources. However, the lack of integration between databases and between the different search and query tools provided for these data sources [1, 4] presents problems for information retrieval (IR). Differences in data quality resulting from varied methods for collection and storage, and different data annotation techniques, may result in large semantic and syntactic differences in the information stored, increasing the difficulty of retrieving and using data from different sources [13].
Transparent mappings to external data sources that hide their underlying complexity should aid searching and IR. Visual, dynamic querying, illustrated in Figure 5, employing semantic lenses and dynamic query sliders, encourages data exploration by removing the need for users to learn high-level query languages and syntax. Complex queries can be formed intuitively using an interface that provides immediate, visual feedback and allows simplified modification and/or reversal of actions [14], without increasing cognitive load.
Searching currently makes use of free text fields for input, highlighting search hits within the overview. Modification to use a semantic lens could extract information of interest and use this as input for a refined search. The sub-tree containing the component parts of the heart, for example, could be extracted using a filter that takes heart as input and uses the node property print name to filter out non-relevant data. The results of this search could then serve as input to queries on other data sources, to extract, for example, gene expression for the components of interest.
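The sub-tree filter described above can be sketched as a property match followed by a traversal of the matching node's descendants. The node identifiers, the dictionary layout and the component names other than heart are assumptions made for illustration; `print_name` stands in for the print name property mentioned in the text.

```python
def extract_subtree(nodes, children, prop, value):
    """Return the ids of the first node whose property `prop` equals
    `value`, followed by all of its descendants in depth-first order."""
    start = next(
        (node_id for node_id, props in nodes.items() if props.get(prop) == value),
        None,
    )
    if start is None:
        return []
    result, seen, stack = [], set(), [start]
    while stack:
        node_id = stack.pop()
        if node_id in seen:  # guard against nodes with several parents
            continue
        seen.add(node_id)
        result.append(node_id)
        # Reverse so that children are visited in their listed order.
        stack.extend(reversed(children.get(node_id, [])))
    return result

# Hypothetical fragment of an anatomy ontology.
nodes = {
    "n1": {"print_name": "embryo"},
    "n2": {"print_name": "heart"},
    "n3": {"print_name": "atrium"},
    "n4": {"print_name": "ventricle"},
    "n5": {"print_name": "limb"},
}
children = {"n1": ["n2", "n5"], "n2": ["n3", "n4"]}

hits = extract_subtree(nodes, children, "print_name", "heart")
print([nodes[node_id]["print_name"] for node_id in hits])
```

The returned identifiers could then be passed to a second query, for instance a lookup of gene expression records keyed by component, without the user writing any query syntax.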
5. Simultaneous analysis of multiple anatomy ontologies
Determination of lineage and the comparison of ontologies for different organisms require simultaneous visualisation of multiple data sets. Intuitive comparison between different data sets requires visualisations that highlight mappings between related data elements (see also Figure 14).
Related work in hierarchical data visualisation
Several applications already exist for the visualisation of large, hierarchically structured data sets. It is important to examine specific applications to determine whether users' requirements can be satisfied with existing visualisation solutions. Those applications found to most closely approach the requirements of this project are summarised below, detailing the features they provide and their limitations for the data analysis required.
1. Protégé
A Java-based knowledge modelling tool, Protégé incorporates multiple hierarchical visualisation applications to aid the construction, editing and visualisation of ontologies. These include OntoViz, which makes use of the GraphViz visualisation libraries for graphical representations of hierarchical data. OWLViz, specifically developed to visualise OWL ontologies, also makes use of GraphViz. A major problem with extending GraphViz is that it requires use of the non-standard languages DOT and lefty.
TGViz uses TouchGraph, developed in Java, for dynamic layout of nodes in a connected network graph. TouchGraph encodes node properties using colour, and provides clustering of like data, as well as geometric and hyperbolic zoom. Functionality for searching and for saving graphs as image files is also provided.
SHriMP, providing modular components built using Java Beans, is combined with Protégé to form Jambalaya, a tool that provides fish-eye views that make use of a continuous zoom for overviews of large data sets. Data abstraction, employing nesting and hiding of data, is followed by extraction of sub-sets to separate windows to allow focus on detail in ROIs. Encoding of data nodes using colour and depth cueing in 3D helps to distinguish more important data. [15]
2. Piccolo
Developed using Java2D and Swing, Piccolo provides support for zoom and the use of multiple cameras or viewpoints. Piccolo has been customised for visualisation of network and hierarchical data. GINY, the Graph Interface Library, extends Piccolo to visualise protein and gene interaction and expression in Cytoscape. GINY uses colour coding of gene expression to aid comparison of data sub-sets. Data reduction is achieved by clustering related data into encapsulating, composite nodes.
SpaceTree extends Piccolo to produce rooted, node-link hierarchical graphs that combine (physical and semantic) zoom, panning and folding of sub-graphs to provide maximum screen space to ROIs. The main disadvantages of SpaceTree include the loss of the overview when analysing ROIs. Also, though the provision of (textual) detail in nodes is useful, it reduces the screen real estate available for the overview.
Zoomgraph extends Piccolo's zoomable interface to provide semantic zoom into ROIs. An advantage in Zoomgraph is the ability to define the properties of data nodes and links. [16]
3. Walrus
Using Java3D, Walrus employs a hyperbolic projection for the visualisation of directed graphs in 3D space. Although its hyperbolic layout provides an ideal method for displaying the kind of hierarchical data under study, Walrus is fairly specialised; it uses its own non-standard file format. Further, the structure of graphs once loaded cannot be altered, and only one graph can be loaded at a time. [17]
4. Hypergraph
This Java application provides a hyperbolic layout in 2D that allows interactive repositioning of nodes to provide more magnification to ROIs, with hyperlinks to further detail in external files. [18]
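The magnification gained by repositioning nodes in a 2D hyperbolic layout can be illustrated with the Poincaré disk model, in which moving the focus corresponds to a Möbius transformation of the unit disk. The sketch below rests on that assumption; it does not reproduce Hypergraph's implementation, and the function name `refocus` and the sample coordinates are hypothetical.

```python
def refocus(z, focus):
    """Map point z in the open unit disk so that `focus` lands at the centre.

    Uses the disk-preserving Moebius transformation
    z -> (z - a) / (1 - conj(a) * z), with a = focus.
    """
    z, a = complex(z), complex(focus)
    if abs(z) >= 1 or abs(a) >= 1:
        raise ValueError("points must lie strictly inside the unit disk")
    return (z - a) / (1 - a.conjugate() * z)

# Two nodes close to the rim of the disk, where the layout compresses space.
focus = 0.8 + 0j
neighbour = 0.85 + 0j

before = abs(neighbour - focus)
after = abs(refocus(neighbour, focus) - refocus(focus, focus))
print(before, after)  # the separation grows once the focus is centred
```

Nodes near the rim are drawn close together; dragging one of them to the centre spreads its neighbourhood out, while every other node remains inside the disk and so stays visible as context.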
5. VRMLgraph
Developed using Java, VRMLgraph draws arbitrary node-edge graphs in 3D. Very little functionality is implemented beyond the drawing of nodes and links connecting them; the main benefit of this application is that it can take advantage of built-in navigation cues and capabilities in VRML for 3D perspective and cameras/viewpoints. [19]
Motivation for project
The applications and toolkits described above incorporate visualisation techniques for exploration of overviews and, with the exception of Walrus, detailed analysis in ROIs. Given the proven capability of hyperbolic layouts for navigation and exploration of large data sets, it would be useful to harness the simple but effective implementation of the hyperbolic layout in Hypergraph. Protégé as it stands provides functionality that would still need adaptation to allow creation of groups as required for the anatomy ontology data. Functionality for folding trees, searching, highlighting user paths and encoding data properties requires further development to satisfy user requirements.
Harnessing existing technology that performs effective analysis of complex data, as applied in the tools studied above, will provide some of the functionality required to aid visual data analysis. However, it is still necessary to develop novel techniques for analysis of the anatomy ontology data, building on existing methods that have proven useful for visualisation of complex data. Intuitive comparison of multiple data sets and the tracing of lineage through the anatomy ontologies cannot be obtained using the functionality available in 2D tools; occlusion would be too high to allow useful analysis at any level of detail. 3D tools would reduce the problem of occlusion significantly. However, 3D visualisations typically distribute data throughout the space available, clustering related data nodes around focal points, so that distinguishing individual nodes and the data sets to which they belong is difficult. Drawing physical links between related nodes across data sets may result in crossing of links, reducing the ability to recognise these relationships.
The next section describes the 2D visualisation browser developed, which builds on existing techniques to provide intuitive visual data analysis. This is followed by an evaluation carried out to determine whether the visualisations provided improve on the textual analysis currently performed. Finally, a 3D browser is presented that develops novel techniques for intuitive recognition of relationships across data sets, described in the sub-section Novel techniques for visual analysis of the ontology data under Implementation.