The goal of the enRoute visualization technique is to jointly visualize experimental data and pathways in a way that addresses all five requirements discussed. We identify the topology-attribute coexistence requirement (R V) as the most critical one, as current techniques usually support either topology-based or attribute-based tasks, but not both. Only small multiples and direct on-node mapping are able to address requirement R V; however, neither scales to many experiments (R I), nor does either allow heterogeneous data to be presented simultaneously (R II). Our solution to this problem makes use of an observation we made in discussions with our collaborators: at any given time, they usually reason about and analyze in detail the experimental data associated with a single path, while the rest of the network merely informs them about the context of this path. They of course continuously change the path of interest, but do not need to see detailed data for multiple paths at the same time. This temporal separation of high-level topology-based tasks and low-level attribute-based tasks allowed us to create a solution that meets all five requirements. The enRoute visualization technique, as depicted in Figure 2, is a dual-view approach consisting of the pathway view, showing the pathway map in its original graph layout (meeting R IV), and the enRoute view, where a user-selected path is shown in a linear fashion together with a potentially large amount of experimental data from multiple sources (R I and R II). Due to the linear arrangement of the nodes from top to bottom, it is possible to encode multi-mappings (R III) by giving them more vertical space. enRoute thus makes use of the temporal separation of analysis focus by presenting an overview in one view and the details of a selected path in another. In the following, we discuss the components of our approach and their interplay in more detail.
Pathway view
The pathway view supports two tasks that are an integral part of our approach. First, it is the primary view for conducting topology-based tasks. Second, it is used to interactively select the path that is then shown in the accompanying enRoute view along with the associated experimental data. To facilitate identifying interesting paths, the pathway view also shows averages and variances of the mapped experimental datasets. In this section we provide details about our design of the pathway view and its features.
Selecting and visualizing the path
An integral part of the pathway view is to allow analysts to determine the path that shall be investigated in the context of experimental data using the enRoute view. In this section we describe methods to select and visualize the paths.
The obvious way for visualizing selected paths in pathway maps is to simply highlight the edges along the path, by, for instance, changing their color or width. Instead of highlighting the edges, however, we decided to use the Bubble Sets technique [24] to convey selected paths. Bubble Sets is a method to highlight sets of spatially distributed data points. The elements of each set are wrapped with a continuous iso-contour. We use a slightly modified version of Bubble Sets, as we need to highlight paths instead of sets. Figure 1(a) shows an example of a highlighted path.
Compared to simple edge highlighting, the contour-based Bubble Sets are more salient and can therefore be perceived faster. Furthermore, due to their curved outline, Bubble Sets can be more easily discriminated from the mainly orthogonal structures in the pathway maps [25].
For selecting a path, analysts can choose between two methods, the iterative approach and the start-stop approach, which can be combined at will. In the iterative approach, the analyst directly selects a series of connected nodes that should be part of the path of interest. After selecting an initial node, the analyst can interactively extend the path in both directions by holding the control key while clicking connected nodes. Figure 3(a) shows a selected path in orange, which is extended to include one additional node in Figure 3(b). In the second path selection method, the start-stop approach, analysts pick a start and an end node, between which all possible alternative paths are highlighted. We use a slightly adapted version of the Bellman-Ford algorithm [26] to find the paths between the two user-selected nodes. The shortest path is selected by default, as shown in orange in Figure 3(a); however, analysts can switch to any of the alternative paths by either using the mouse wheel or directly clicking a path representation. Figure 3(c) demonstrates a switch to an alternative path with respect to the path selected in Figure 3(b).
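The start-stop approach can be illustrated with a minimal sketch. It assumes the pathway is available as a directed adjacency list, and it does not reproduce the adapted Bellman-Ford algorithm mentioned above: it swaps in a plain depth-first enumeration of simple paths, sorted by length so that the shortest path comes first. The function name `alternative_paths` and the toy node names are hypothetical.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # hypothetical adjacency list: node -> successors


def alternative_paths(graph: Graph, start: str, end: str) -> List[List[str]]:
    """Enumerate all simple paths from start to end, shortest first."""
    found: List[List[str]] = []

    def extend(path: List[str]) -> None:
        node = path[-1]
        if node == end:
            found.append(list(path))
            return
        for successor in graph.get(node, []):
            if successor not in path:  # keep paths simple (no repeated nodes)
                path.append(successor)
                extend(path)
                path.pop()

    extend([start])
    # sorted() is stable, so paths of equal length keep their discovery order.
    return sorted(found, key=len)


# Toy pathway; the node names are for illustration only.
toy = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "E": ["D"],
    "D": [],
}
paths = alternative_paths(toy, "A", "D")
default_path = paths[0]  # the shortest path would be selected by default
```

Exhaustive enumeration is only reasonable for small pathway graphs; stepping through `paths` by index would correspond to switching between alternatives with the mouse wheel.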
While the iterative approach allows analysts to determine paths that cover various kinds of topological structures like, for instance, cycles, the start-stop approach makes it possible to investigate multiple alternative paths between nodes without the need to find and select the route by hand. Additionally, the start-stop approach is more efficient for selecting longer paths.
However, pathway maps are often very complex and sometimes it is not obvious which choices are available for a path. To address this we provide an interactive preview mode for selecting paths on user request. Starting at the end of the current selection, we highlight possible extensions. For example, in Figure 4(a) all edges and nodes are highlighted which extend the end of the current selection at PDGFR.
In some cases, the information in pathway maps is incomplete or simply outdated. As a consequence, the maps may not reflect the true process, especially not for all experimental conditions. Additionally, pathway databases can contain errors that users are aware of. In order to cope with such incomplete or outdated pathway descriptions, we provide a force mode for selecting paths. This mode enables analysts to add an edge to the path that does not exist in the database. Notice that the second to last edge of the selected path in Figure 4(b) does not exist in the pathway map, neither in the image nor in the underlying graph representation. By using the force mode during path selection, analysts are able to extend the current path by arbitrary nodes within the pathway map.
Visualizing experimental data on pathways
As discussed, directly mapping experimental data on pathway nodes using color-coding does not scale to more than a few experimental values, due to the small size of the nodes in the pathway maps. Despite this limitation, direct on-node mapping is valuable in two scenarios: First, it allows analysts to gain an overview of the main trends in the pathway. Having this overview can be helpful additional information for finding interesting paths. In the second scenario analysts want to investigate a condition (a group of samples) or a single sample in its high-level topological context. This allows analysts to consider experimental data associated with nodes that are not in the currently extracted path. For this purpose, the pathway view can be configured to show only the mapping of selected samples.
To address the overview task where analysts want to get a rough indicator of the mapped experimental values, we calculate the average of all experimental sample values and multi-mappings, if applicable, and color-code the nodes accordingly. If multiple data types are available, the analyst can choose which of them should be mapped. Figure 5(a) shows the Glioma pathway with on-node mappings of mRNA data, while Figure 5(b) shows the same pathway overlaid with copy number data.
For numerical and ordinal data we use a blue-white-red color map. We decided to use white as the neutral base of the color map so that we can intuitively represent data that has a neutral base, as is the case, for example, with copy number data, which has a "normal" status. In addition, the blue-white-red color map avoids the drawbacks of the common red-black-green color map for red-green color blind users. A two-color gray-red color map is used for nominal data with two categories, such as mutation status data. To indicate cases where experimental data is missing, we show a small rectangle in the lower left corner of the node, as can be seen, for example, in the mTOR node in the lower right part of Figure 5(b).
Since the aggregation of all samples and possible multi-mappings into an average value hides all variation, we additionally provide the standard deviation encoded as a green bar below each node, as shown in Figure 5. This indication of variation is very valuable for the overview task. High variation (corresponding to an almost full bar), as can be seen for instance for the PDGFR gene in Figure 5(b), is an indicator of potentially interesting experimental data that is worth investigating in detail using the enRoute view.
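The on-node overview described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the pathway view: the population standard deviation is used, the color map is a linear interpolation toward pure red and pure blue around a neutral value of zero, and the variation bar is normalized by a caller-supplied maximum. All function names and parameters (`limit`, `max_std`) are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def node_summary(values: List[float]) -> Tuple[float, float]:
    """Average and standard deviation over all samples mapped to a node."""
    return mean(values), pstdev(values)  # assumption: population std. dev.


def blue_white_red(value: float, limit: float) -> Tuple[int, int, int]:
    """Map a value in [-limit, limit] to RGB; the neutral value 0 is white."""
    t = max(-1.0, min(1.0, value / limit))  # clamp to the color map range
    fade = round(255 * (1 - abs(t)))        # 255 at neutral, 0 at the extremes
    if t >= 0:
        return (255, fade, fade)            # toward red for positive values
    return (fade, fade, 255)                # toward blue for negative values


def variation_bar_fraction(std: float, max_std: float) -> float:
    """Filled fraction of the bar below a node (assumed normalization)."""
    return min(1.0, std / max_std) if max_std > 0 else 0.0


# Hypothetical values of one node across four samples.
avg, std = node_summary([1.0, -1.0, 1.0, -1.0])
node_color = blue_white_red(avg, limit=2.0)
bar = variation_bar_fraction(std, max_std=1.0)
```

In this toy example the average is neutral while the variation bar is full, which is exactly the situation in which the average alone would hide potentially interesting data.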
enRoute view
Once a path has been selected in the pathway view, it can be analyzed in detail in the context of experimental data in the enRoute view. The path is displayed in a linear, top-down layout, which is ideally suited to show rows of experimental data (data rows) right next to the nodes they are associated with. As a node can have multiple mapped data rows, we adapt the spacing between nodes of the path so that all rows can be shown with a uniform height. Such multi-mappings, or the occurrence of complex nodes (nodes that consist of multiple subnodes) in the path, make it very hard, if not impossible, to determine which data row belongs to which node using their position alone. Therefore, we connect each node with its corresponding data rows using ribbons, as shown in Figure 2. To make the association between data rows and nodes even more obvious, we alternate the shade of gray in the data rows' backgrounds for each node. Figure 11(b) illustrates an example where these alternating shades of gray allow us to disambiguate the mappings of multiple subnodes of a complex node to corresponding data rows.
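The adaptive spacing and alternating backgrounds can be sketched as a simple layout pass. The sketch assumes that each node is centered on its block of data rows and that a node without mapped data still occupies one row slot; neither detail is specified above. The function name `layout_rows` and the shade labels are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, float, str]  # (node, row index, y of row top, shade)


def layout_rows(
    rows_per_node: List[Tuple[str, int]], row_height: float = 20.0
) -> Tuple[List[Row], Dict[str, float]]:
    """Place uniform-height data rows top-down and center each node on them."""
    rows: List[Row] = []
    centers: Dict[str, float] = {}
    y = 0.0
    for i, (node, n_rows) in enumerate(rows_per_node):
        slots = max(1, n_rows)  # assumption: nodes without data keep one slot
        shade = "light" if i % 2 == 0 else "dark"  # alternate per node
        centers[node] = y + slots * row_height / 2
        for r in range(n_rows):
            rows.append((node, r, y + r * row_height, shade))
        y += slots * row_height
    return rows, centers


# Hypothetical path of three nodes; the second has three mapped data rows.
rows, centers = layout_rows([("n1", 1), ("n2", 3), ("n3", 1)], row_height=10.0)
```

A ribbon for a node would then connect its center to the vertical extent of its rows, all of which share one background shade.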
Following the divide-and-conquer visualization strategy [27], we group experimental data in the enRoute view based on a homogeneity criterion. For example, experiments can be grouped by the species they belong to (homogeneity with respect to semantics), or a grouping can be obtained by clustering (homogeneity with respect to statistics). As illustrated in Figure 2, the groups are depicted as columns resulting in an overall tabular layout. We address the heterogeneity requirement (R II) by allowing the individual groups to originate from different datasets. However, all experiments within a group must be from a single dataset.
Visualizing the path
In addition to showing the extracted path top-down in the enRoute view, we also display branches that join or leave the path in order to preserve some of the topological information present in the pathway maps. We indicate a branch by showing its first node relative to the node where the branching occurs in the extracted path. In order to maintain a compact path representation, multiple branches that join or leave a single node of the path are abstracted into expandable nodes, one for all joining and one for all leaving branches, as shown in Figure 6(a). These abstract branch nodes indicate the number of branches they represent and also show labels for them, if sufficient space is available. Abstract branch nodes can be expanded at any time to reveal the individual branch nodes, which display previews of associated experimental data, as shown in Figure 6(b). When expanding a node, its content is rendered on top of the other branches, which are grayed out.
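As an illustration of how the content of the abstract branch nodes could be derived, the following sketch collects, for every node of the extracted path, the first nodes of all joining and leaving branches. It assumes a directed adjacency list and treats every neighbor that is not the adjacent path node as a branch; the function names and the toy graph are hypothetical.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # hypothetical adjacency list: node -> successors


def predecessors(graph: Graph) -> Graph:
    """Invert the adjacency list so that joining branches can be looked up."""
    preds: Graph = {node: [] for node in graph}
    for node, successors in graph.items():
        for successor in successors:
            preds.setdefault(successor, []).append(node)
    return preds


def branch_summary(graph: Graph, path: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """For each path node, list the first nodes of joining and leaving branches."""
    preds = predecessors(graph)
    summary: Dict[str, Dict[str, List[str]]] = {}
    for i, node in enumerate(path):
        previous = path[i - 1] if i > 0 else None
        following = path[i + 1] if i + 1 < len(path) else None
        summary[node] = {
            "joining": [p for p in preds.get(node, []) if p != previous],
            "leaving": [s for s in graph.get(node, []) if s != following],
        }
    return summary


# Toy pathway: X leaves the path at B, Q joins it at C.
toy = {"A": ["B"], "B": ["C", "X"], "C": ["D"], "D": [], "X": [], "Q": ["C"]}
summary = branch_summary(toy, ["A", "B", "C", "D"])
```

The length of each list corresponds to the number of branches an abstract branch node would indicate, and the list entries are the candidates for its labels.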
As illustrated in Figure 6(c), an analyst can interactively switch to a branch by selecting the corresponding branch node. A selected branch replaces all nodes in the extracted path above or below the node where the branching occurs, depending on whether it is a joining or leaving branch. All nodes of the branch are added to the path until either a new branch or a dead end is reached. As the enRoute visualization technique synchronizes all corresponding elements among its components, any changes to the path caused by branch switching are propagated back to the pathway view, thus keeping the highlights of the selected path up-to-date. Also, the synchronization of node highlights facilitates the association of branches shown in the enRoute view with corresponding branches in the pathway maps.
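Branch switching can be sketched as a replacement of the path segment above or below the branching node. The sketch assumes a directed adjacency list, and it assumes that the node at which a new branch occurs is itself still added to the path before extension stops; the text above leaves this detail open. All names are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # hypothetical adjacency list: node -> neighbors


def reverse(graph: Graph) -> Graph:
    """Invert edge directions; used to follow joining branches upstream."""
    rev: Graph = {node: [] for node in graph}
    for node, successors in graph.items():
        for successor in successors:
            rev.setdefault(successor, []).append(node)
    return rev


def follow(graph: Graph, first: str, visited: Set[str]) -> List[str]:
    """Follow a chain from `first` until a new branch or a dead end is reached."""
    chain = [first]
    seen = set(visited) | {first}
    while True:
        nxt = graph.get(chain[-1], [])
        if len(nxt) != 1 or nxt[0] in seen:  # branch, dead end, or cycle
            return chain
        chain.append(nxt[0])
        seen.add(nxt[0])


def switch_to_leaving_branch(graph: Graph, path: List[str], at: str, branch: str) -> List[str]:
    """Replace all path nodes below `at` with the leaving branch."""
    kept = path[: path.index(at) + 1]
    return kept + follow(graph, branch, set(kept))


def switch_to_joining_branch(graph: Graph, path: List[str], at: str, branch: str) -> List[str]:
    """Replace all path nodes above `at` with the joining branch."""
    kept = path[path.index(at):]
    upstream = follow(reverse(graph), branch, set(kept))
    return upstream[::-1] + kept


toy = {
    "A": ["B"], "B": ["C", "X"], "C": ["D"], "D": [],
    "X": ["Y"], "Y": ["Z1", "Z2"], "Z1": [], "Z2": [],
    "P": ["Q"], "Q": ["C"],
}
path = ["A", "B", "C", "D"]
leaving = switch_to_leaving_branch(toy, path, at="B", branch="X")
joining = switch_to_joining_branch(toy, path, at="C", branch="Q")
```

The resulting node list is what would be propagated back to the pathway view to update the highlighted path.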
Visualizing experimental data
Being able to display large amounts of heterogeneous experimental data is an integral part of the enRoute visualization technique (see requirements R I and R II). enRoute supports the visualization of quantitative, ordinal, and binary categorical data. As previously mentioned, we organize experimental data in rows and columns. Each row shows data that maps to a certain node in the path and columns group the data by a homogeneity criterion. Different groups may also have overlapping experiments. The captions of the individual groups are displayed at the top and at the bottom of the corresponding columns. Their background color indicates the dataset they belong to. For example, in Figure 1(b) the background of groups showing mRNA expression data is turquoise, whereas the background of copy number data groups is blue and the background for mutation data is light violet.
In molecular biology, heat maps are the standard way to visualize quantitative and ordinal data. However, it is well known that hue or value are inferior to other encodings with respect to communicating changes in the data. For both quantitative and ordinal data, encodings in position are a better choice and for quantitative data, length encodings are also superior [28]. Recently, Meyer et al. [23] also showed that a mirroring effect in expression data was much more apparent when it was visualized using line plots compared to when using heat maps. Heat maps or any other pixel-based visualization techniques are superior in terms of space efficiency and therefore scalability. enRoute, however, only requires the visualization to be scalable with respect to experiments, since the number of genes is typically small, as it is limited by the number of nodes in the path. Therefore, we prefer bar charts over heat maps for the representation of quantitative data as well as for ordinal data.
In the bar charts used for quantitative data, each bar represents one value of a single experiment, as shown in Figure 7(a). In order to make the borders of adjacent bars apparent without having to waste space on drawing outlines, we color the bars using a gradient from left to right. As shown in Figure 1(b), tooltips are used to show the numerical values of the underlying data. In some cases it might be desirable to see an abstract and more compact visualization of a group of quantitative data. For this purpose, we use one horizontally aligned bar that represents the mean value of a group together with error bars encoding the standard deviation, as shown in Figure 7(b). In contrast to the detailed representations, where the width adapts to the number of experiments in the group and the available display space, the width of abstract group representations is fixed. This constant width and the horizontal alignment of the abstract bars allow analysts to compare values of the same group across rows along the path more easily. However, for tasks that require comparisons across multiple groups, the detailed representation with vertical bars is preferable.
As copy number data commonly occurs either in ordinal or quantitative form, we use an optimized encoding that can deal with both of them. Ordinal copy number data is often categorized into high and low increase of gene copies, a normal copy number, deletion on one allele, and deletion on both alleles. As shown in Figure 7(c), our encoding of this data redundantly uses the length, color, and orientation of bars. For highly increased copy numbers, we show long, dark red bars pointing upwards from a base line. For low increases we use shorter, light red bars. Similarly, deletions are represented by dark and light blue bars pointing downwards. No bar is shown for normal copy numbers. The same encoding can be used for quantitative copy number data. The higher the increase in copies, the longer and darker the red bar is. The same concept applies to deletions. Just like for general quantitative data, we also employ an abstract representation for groups of copy number values. As shown in Figure 7(d), we use a horizontal histogram, which makes use of the same color coding as the detailed copy number representation.
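The redundant encoding of copy number data can be written down as a small lookup, sketched here under several assumptions: bar lengths are fractions of the available half-height, deletion on both alleles is taken to be the darker and longer of the two deletion bars, and quantitative values are assumed to be expressed relative to the normal copy number, so that zero means normal. The category keys and all names are hypothetical.

```python
from typing import NamedTuple, Optional


class Bar(NamedTuple):
    length: float    # fraction of the available half-height, in (0, 1]
    direction: str   # "up" for increases, "down" for deletions
    hue: str         # "red" for increases, "blue" for deletions
    darkness: float  # 0.5 = light, 1.0 = dark (illustrative scale)


# Assumed category keys; "normal" deliberately has no bar.
ORDINAL_ENCODING = {
    "high_increase": Bar(1.0, "up", "red", 1.0),
    "low_increase": Bar(0.5, "up", "red", 0.5),
    "normal": None,
    "deletion_one_allele": Bar(0.5, "down", "blue", 0.5),
    "deletion_both_alleles": Bar(1.0, "down", "blue", 1.0),
}


def encode_ordinal(category: str) -> Optional[Bar]:
    """Look up the bar for an ordinal copy number category."""
    return ORDINAL_ENCODING[category]


def encode_quantitative(value: float, max_abs: float) -> Optional[Bar]:
    """Longer and darker bars for larger deviations from normal (value 0)."""
    if value == 0:
        return None  # no bar is shown for normal copy numbers
    strength = min(1.0, abs(value) / max_abs)
    if value > 0:
        return Bar(strength, "up", "red", strength)
    return Bar(strength, "down", "blue", strength)


gain = encode_ordinal("high_increase")
loss = encode_quantitative(-1.0, max_abs=2.0)
```

Because length, color, and orientation all derive from the same value, each channel alone would suffice to read the data, which is what makes the encoding redundant.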
For binary categorical data, such as data on whether a gene is mutated or not, we use a matrix visualization where each cell corresponds to a sample, as shown in Figure 7(e). For the mutation status example, we color samples that are mutated in red, while non-mutated samples are shown in the background color. While the matrix layout deviates from the convention used for numerical and ordinal data of placing all samples side-by-side, we found it to be considerably more space-efficient than presenting mutation data in line with the bar techniques. Space efficiency is important for mutation data since mutated genes are scarce in many datasets. Also, since only binary information is encoded, the redundant encoding using length and color is unnecessary. For the abstract summary representation we use a histogram similar to the one used for copy number data, as shown in Figure 7(f).
The previously mentioned data previews, shown on-demand for branch nodes, use an encoding similar to the abstract data representations, as can be seen in Figure 6(b). For each group of mRNA data one bar indicating the group's mean value is drawn. For copy number and mutation data, we show one stacked bar per group.
The enRoute visualization technique synchronizes the highlighting of corresponding elements both across its components and within each component. The latter case is especially useful in the experimental data display. By highlighting a set of experiments in one group, we allow analysts to identify these experiments in other groups, even for different data types. For example, in Figure 10(b), all cell lines with an increased copy number are highlighted, which allows analysts to relate the increase in copy number to mRNA expression. As evident in this figure, scattered selections make it difficult to quantify the number of selected experiments. To alleviate this problem, we add tooltips to the groups' captions showing the total number of experiments and the number of currently selected experiments in each group.
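The counts shown in the caption tooltips can be sketched as a per-group intersection with the current selection. The sketch assumes that experiments are identified by shared string identifiers across groups and datasets; the group names, identifiers, and the function name `selection_counts` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def selection_counts(
    groups: Dict[str, List[str]], selected: Set[str]
) -> Dict[str, Tuple[int, int]]:
    """Return (selected, total) experiment counts for every group."""
    return {
        name: (sum(1 for exp in experiments if exp in selected), len(experiments))
        for name, experiments in groups.items()
    }


# Hypothetical groups; groups may overlap and may stem from different datasets.
groups = {
    "mRNA cluster 1": ["s1", "s2", "s3"],
    "mRNA cluster 2": ["s4", "s5"],
    "copy number": ["s1", "s2", "s3", "s4", "s5"],
}
selected = {"s2", "s5"}  # e.g. experiments highlighted in the copy number group
counts = selection_counts(groups, selected)
```

Since the selection is a set of experiment identifiers rather than of visual elements, the same lookup works for groups of any data type.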
Choosing experimental data and groupings for enRoute
Up to this point, we have assumed that decisions on which datasets and which groupings of the datasets to show are already made. However, given a large set of datasets and a variety of alternative groupings to choose from for every dataset, this presumably easy task is in fact not trivial. To support analysts in the task of selecting datasets and groupings and assigning them to views, Caleydo provides a dedicated view, the Data-View Integrator (DVI) [15]. As shown in Figure 8, the DVI view uses a graph representation that shows all loaded datasets at the bottom and all open views at the top. Each dataset is associated with a unique color that is the same as the one used in the background of the dataset labels in the enRoute view. For tabular datasets it is quite common to have groupings, such as clusterings, of both rows and columns available. These alternative groupings are represented in the matrix layout shown when exploring a dataset in detail, where alternative row groupings are shown in the rows and column groupings in the columns. An analyst can assign grouped datasets to enRoute by dragging the blocks onto the view representation at the top. Connection bands between datasets and views help analysts to understand the association of the data to the view. This is in particular helpful for highly heterogeneous configurations with multiple datasets, groups, and views. Users can switch to the DVI view at any time during a pathway analysis in order to refine the mapped experimental data. Additionally, new datasets and groupings can be added at runtime.