Epiviz: a view inside the design of an integrated visual analysis software for genomics

Background
Computational and visual data analysis for genomics has traditionally involved a combination of tools and resources, of which the most ubiquitous are genome browsers, focused mainly on integrative visualization of large numbers of big datasets, and computational environments, focused on data modeling of a small number of moderately sized datasets. Workflows that involve the integration and exploration of multiple heterogeneous data sources, small and large, public and user-specific, have been poorly addressed by these tools. In our previous work, we introduced Epiviz, which bridges the gap between the two types of tools, simplifying these workflows.

Results
In this paper we expand on the design decisions behind Epiviz and introduce a series of new advanced features that further support the type of interactive exploratory workflow we have targeted. We discuss three ways in which Epiviz advances the field of genomic data analysis: 1) it brings code to interactive visualizations at several levels; 2) it takes first steps toward collaborative data analysis by incorporating user plugins from source control providers, as well as by allowing analysis states to be shared among the scientific community; 3) it combines established analysis features that have never before been available simultaneously in a genome browser. In our discussion section, we present security implications of the current design, as well as a series of limitations and future research steps.

Conclusions
Since many of the design choices of Epiviz are novel in genomics data analysis, this paper serves both as a record of our own approaches, with lessons learned, and as a starting point for future efforts in the same direction by the genomics community.


Other visualization tools for genomics
A number of existing tools approach some of the capabilities targeted by Epiviz, but fall short. For example, the Human Epigenome Browser [1] provides powerful visualization and data integration capability, but is not directly integrated with a computational environment. Epiviz not only communicates with computational frameworks like R and Python; through the AnnotationHub resource accessible from Bioconductor, it also integrates curated data sources in a uniform manner, supporting interactive exploratory workflows not offered by the Human Epigenome Browser. Galaxy [2][3][4] is another tool that provides excellent support for pipeline workflows, but offers only a limited set of visualization options. The design of Epiviz facilitates the creation of user-defined visualizations, which can immediately be used alongside the existing ones; the code of existing visualizations can also be customized directly in the UI. A related tool, cBio Portal [5], allows querying and visualizing data in a uniform manner, but is targeted exclusively at cancer genomics. In contrast, Epiviz provides an extensible framework that supports general workflows.
Current tools for the analysis of large genomics datasets usually target one of two audiences: either programmers and data analysts, who interact with data mostly through code and scripting, or biomedical scientists, who usually interact with data through graphical user interfaces. At the same time, there has been a significant push from educational and scientific institutions, including the NIH, to increase the computational literacy of biomedical scientists in preparation for work in a data-intensive field like high-throughput genomics [6,7]. By continuing to target graphical user interfaces as the only way in which biomedical scientists can interact with data, current tools are not in tune with this trend. As a core design component of Epiviz, we added code as an exploration interface to genomics data at several levels, which we describe in detail in this paper.

Data abstraction and standardization
One of the most important design decisions in Epiviz was to introduce a series of data structures built on top of an open-ended standardized data format, used to represent the most common genomic data types. The necessity for this design decision stems from the heterogeneity of genomic data types and the rate at which new ones are released. The uniform data format permits different modules of Epiviz to interact with one another without predefining a protocol specific to just those modules. This simplifies creating new components, plugging them into Epiviz, and extending existing ones. The standardized data format is the key to our tool's customizability and extensibility, yielding benefits at four different levels: a) it allows Epiviz to integrate and aggregate data from different sources; b) it allows representing different measurements in the same visualization, for comparative analysis; c) it allows representing the same measurement in different visualizations simultaneously, in order to explore different aspects of its features; and d) it allows Epiviz to expose an API that defines a plug-and-play, fully featured chart interface that users can build on to create new visualizations. In this subsection we briefly present the data standard and some of the more important structures built on top of it.
The standardized data format draws from the three-table design for genomic data [8] (Sup. Fig. 1). This design is capable of modelling the vast majority of data present in functional genomics experiments. Based on this design, we derived a data structure abstraction called, generically, data source. The data source acts as a table with metadata annotating both rows and columns. Each row in a data source has a genomic coordinate, and each column corresponds to a measurement. Thus, each cell in the table represents the measurement value at a particular coordinate. Data source tables are typically too large to store entirely in memory; for this reason, Epiviz retrieves chunks of these tables as needed. Data sources are treated as single tables in the Epiviz logic; in practice, however, their corresponding data often comes from different physical tables or even different data providers. For this reason, coordinate information, common to all measurements in a data source, is separated from the actual measurements and retrieved independently, so that redundant operations are avoided and no more data than necessary is held in memory at any given time.
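As an illustration, the separation between coordinate information and measurement values can be sketched in plain JavaScript. The field and measurement names here are hypothetical, not the actual Epiviz structures:

```javascript
// A sketch of a data source: coordinate information common to all
// measurements is kept apart from the measurement values, and a cell is
// the value of one measurement at one genomic coordinate.

// Row metadata: one genomic coordinate per row, shared by every measurement.
const coordinates = [
  { chr: 'chr11', start: 100, end: 200 },
  { chr: 'chr11', start: 350, end: 420 },
  { chr: 'chr11', start: 900, end: 980 }
];

// Column data: one array of values per measurement, aligned with the rows.
const measurements = {
  expr_sample1: [2.1, 0.4, 3.7],
  expr_sample2: [1.9, 0.6, 3.2]
};

// A cell joins the two tables by row index and measurement id.
function cell(rowIndex, measurementId) {
  const coord = coordinates[rowIndex];
  return { chr: coord.chr, start: coord.start, end: coord.end,
           value: measurements[measurementId][rowIndex] };
}
```

Because the coordinates table is shared, adding a new measurement only adds a value array; the coordinate information is never duplicated, and the two can be retrieved independently.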
An important substructure of data sources is constituted by genomic arrays. These encode all information corresponding to measurements' data objects. There are two types of genomic arrays in Epiviz: genomic range arrays, used to store coordinate information, and feature value arrays, used to store feature values for individual measurements. Genomic arrays are designed to store fragments of the data sets they represent, so that although not all data is loaded in memory at once, Epiviz modules can treat them as if it were. For this purpose, each row in a data source table has an identifier denoted as global index, which represents the index of that row in the entire data source, ordered by genomic coordinate. Global indices are passed in increasing consecutive order between modules, which simplifies UI operations: searches require only logarithmic time, and merging fragments of data requires constant time.
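The two operations mentioned above can be sketched as follows; the function names and fragment layout are illustrative assumptions, not the Epiviz API:

```javascript
// A fragment of a data source: rows sorted by start coordinate, with
// globalStartIndex giving the global index of the fragment's first row.
function makeFragment(globalStartIndex, starts) {
  return { globalStartIndex: globalStartIndex, starts: starts };
}

// Binary search for the global index of the first row whose start
// coordinate is >= position: logarithmic in the fragment size.
function firstIndexAt(fragment, position) {
  let lo = 0, hi = fragment.starts.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (fragment.starts[mid] < position) lo = mid + 1;
    else hi = mid;
  }
  return fragment.globalStartIndex + lo;
}

// Merging two fragments whose global indices are consecutive needs no
// per-row comparison: a single adjacency check, then concatenation.
function mergeAdjacent(a, b) {
  if (a.globalStartIndex + a.starts.length !== b.globalStartIndex) {
    throw new Error('fragments are not adjacent');
  }
  return makeFragment(a.globalStartIndex, a.starts.concat(b.starts));
}
```

Because rows are globally ordered by genomic coordinate, adjacency of two fragments can be verified from their index ranges alone, which is what makes the merge constant time.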

Overview of performance optimizations
Epiviz implements a number of simple optimizations meant to improve user experience, which we discussed in our previous work, along with benchmarks underlining their effects on the overall user experience. These constitute our preliminary steps in this direction. In the following paragraphs, we expand on the limitations and future research directions associated with this topic. The most important optimization consists of a predictive caching mechanism, also depicted in Sup. Fig. 2. Thanks to this feature, the user only has to wait for the initial request when opening Epiviz; most subsequent requests use the lag between user operations to load data likely to be needed by following actions. The other type of optimization is individual to each visualization and consists of binned aggregation [9], also discussed in our previous work. In Chelaru et al. [10] we presented the extent of our current efforts in performance optimization.
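The caching behavior can be illustrated with a minimal synchronous sketch; the names are hypothetical, and the actual Epiviz data manager prefetches asynchronously and maintains a more elaborate cache:

```javascript
// Sketch of a predictive cache: a request for [start, end) fetches the
// visible range plus a margin on each side, so that small pans and zooms
// inside the prefetched region are served without a new fetch.
function makeCache(fetchFn, margin) {
  let cached = null; // { start, end, data }
  return {
    get: function (start, end) {
      if (cached && cached.start <= start && end <= cached.end) {
        // Cache hit: the operation stays inside the prefetched region.
        return { fromCache: true, data: cached.data };
      }
      // Cache miss: fetch the requested region plus its vicinity.
      const s = start - margin, e = end + margin;
      cached = { start: s, end: e, data: fetchFn(s, e) };
      return { fromCache: false, data: cached.data };
    }
  };
}
```

A pan of less than the margin after the initial load is then fulfilled immediately from memory, which is the effect described above.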

Performance limitations
There are a number of performance limitations that stem from the design choices of Epiviz. Some reflect the target use cases we built the software for. Others, however, are issues we mean to address in future releases of the tool. In this section we underline those we find most important.
Sup. Fig. 1. Epiviz Data Sources. A standardized data format based on the three-table design. The left schematic shows the general structure, while the right one shows an example for microarray gene expression data. The three tables are 1) measurements, 2) feature coordinates, stored as genomic range arrays, and 3) feature values, stored as sets of feature value arrays. Each data point is a measurement of a particular feature from a specific sample. Features occupy a coordinate space and have associated metadata; samples likewise have associated annotations.
First of all, the design of Epiviz restricts all optimization decisions to those that can be made on the client side. This is because our software, apart from a small server component that stores user data and workspace information, runs entirely on the client machine, and is designed to aggregate data from any number of external sources. As we have no control over these data sources, optimization on the server side is not an option. This puts complex data optimization techniques, such as those discussed in [11], outside the scope of our current work.
Choosing JavaScript as the main programming language for the Epiviz framework comes with a number of advantages, but also a few limitations. The main advantage is that JavaScript runs natively in all major web browsers, on all operating systems, requiring no special installation. In addition, being optimized for online applications, it has a number of useful features that make it convenient to access online data sources. The fact that it is an interpreted language that can evaluate strings of text into executable code is also an important feature, of which Epiviz makes extensive use. Finally, over the years, all major browsers have taken steps to isolate JavaScript, so that it is natively prevented from making changes to the local file system. This is essential for Epiviz in its intent to allow users to incorporate third-party code to extend software functionality.
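The string-to-code evaluation mentioned above can be illustrated with JavaScript's Function constructor. This toy example is not the Epiviz plugin mechanism, which routes third-party code through a sanitizer as described in the section on code sanitization; it only shows the language feature the design relies on:

```javascript
// A user-supplied snippet, received as a plain string.
const userCode = 'return values.map(function (v) { return v * scale; });';

// Compile the string into a function with an explicit parameter list,
// so the snippet can only see what is passed in as arguments.
const transform = new Function('values', 'scale', userCode);
```

Calling `transform([1, 2, 3], 10)` then executes the user's string as code. Without sandboxing, such code still sees the page's globals, which is precisely why Epiviz pairs this capability with sanitization.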
The main drawback of using JavaScript for Epiviz is its restriction to a single thread of execution. Although HTML5 introduces web workers [12] as a route to multi-threaded computing, in practice, because they are treated as separate processes and do not share the context of the main program, they are hard to use for optimizations closely tied to the visualizations. As a consequence, operations on relatively large amounts of data tend to impact the performance of Epiviz and degrade the user experience. Workspaces with many visualizations, each showing a large number of data objects, exhibit latency in responsiveness, as all objects in all visualizations are drawn using a single processing thread.
Another design decision that adds to the performance overhead is the use of vector graphics (SVG) for rendering visualizations. This is essential for some of the most important Epiviz features, such as brushing and tooltips. At the same time, unlike raster images, vector displays do not cap the number of on-screen objects at the number of pixels available in the screen resolution. This means that the number of objects rendered on the screen is proportional to the number of data records behind the visualization: even when objects overlap, each of them is still rendered and has a corresponding element in the HTML document. This, combined with the single-threaded nature of JavaScript, leads to performance hits when many visualizations with many objects are in view. In addition, because of the diverse nature of visualizations, it is impossible to predict at the visualization API level which records will correspond to overlapping visual objects, which moves the burden of visual optimization to each individual chart.
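The binned aggregation mentioned earlier directly targets this overhead: records whose x coordinates fall in the same pixel-width bin are collapsed into a single drawn object, bounding the number of SVG elements by the chart width rather than by the record count. A sketch, with hypothetical names and a mean as the aggregate:

```javascript
// Collapse points into widthPx bins over [xMin, xMax), emitting one
// aggregated point (bin center, mean y) per non-empty bin.
function binnedMeans(points, xMin, xMax, widthPx) {
  const binSize = (xMax - xMin) / widthPx;
  const sums = new Array(widthPx).fill(0);
  const counts = new Array(widthPx).fill(0);
  for (const p of points) {
    const bin = Math.floor((p.x - xMin) / binSize);
    if (bin < 0 || bin >= widthPx) continue; // outside the view
    sums[bin] += p.y;
    counts[bin] += 1;
  }
  const out = [];
  for (let i = 0; i < widthPx; i++) {
    if (counts[i] > 0) {
      out.push({ x: xMin + (i + 0.5) * binSize, y: sums[i] / counts[i] });
    }
  }
  return out;
}
```

A chart that renders the output of such a function draws at most one element per horizontal pixel, regardless of how many records the data source holds in the visible region.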
In the Future research directions section, we expand on ways in which we plan to address these performance limitations in future versions of Epiviz.
Sup. Fig. 2. The Epiviz predictive caching mechanism. On load or navigation to a new region in the genome, the data manager makes two requests: first for the data needed to update the visualizations on screen, and second for data in its vicinity. Once this data is loaded, subsequent pan and zoom operations are executed immediately, as all data necessary to fulfill them is already stored in memory. These subsequent operations also trigger a small additional request per operation, to keep the data in the cache consistent.

Software security through code sanitization
The Caja library establishes a connection to a Caja server hosted at http://caja.appspot.com. When users of Epiviz supply custom code through JavaScript dynamic extension, our framework asks Caja to construct a virtual DOM to which objects in the script are confined. Epiviz also supplies a series of defensive objects to Caja. Defensive objects are the only objects within the framework that are accessible to the script, and their implementation assumes that their clients may be malicious. Caja transforms the code to make it safe to run by sending a GET request, wrapping the raw user code, to the Caja server; the server returns the transformed code.
The creators of the library call this process cajoling. Cajoling involves adding inline checks to make sure the code does not break the invariants Caja needs, and ensuring that the code cannot refer to variables in the host page that are not explicitly given to it. It also makes sure that the user code only uses the API published by the Epiviz framework.
From the viewpoint of the user code, it runs with what seems to be a W3C DOM compliant document object and an ECMAScript 5 compliant JavaScript virtual machine. Its document is confined to the boundaries of this virtual DOM, and its JavaScript global types and objects, like Object and Array, are its own and do not affect code outside it. The defensive objects are visible to user code as additional global variables in its top-level JavaScript context.
Using this library, Epiviz is able to restrict third-party scripts from accessing sensitive information, such as the user cookie or the workspace data, and from opening pop-up windows or browser tabs or contacting third-party web servers.
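The idea of a defensive object can be illustrated outside Caja with plain JavaScript: an object exposed to untrusted code is frozen and closes over its private state, so a malicious client can neither mutate it nor reach data it was not given. This is a sketch with hypothetical names, not the actual Epiviz defensive API:

```javascript
// A defensive object assumes its caller may be malicious: it exposes only
// frozen methods that close over private state and hand out copies.
function makeDefensiveDataApi(privateRows) {
  const api = {
    rowCount: function () { return privateRows.length; },
    getRow: function (i) {
      const row = privateRows[i];
      // Return a copy, never the internal object itself.
      return row === undefined ? undefined : Object.assign({}, row);
    }
  };
  // Freezing prevents untrusted code from replacing or adding methods.
  return Object.freeze(api);
}
```

Caja's cajoling adds the complementary guarantees this sketch cannot provide on its own, such as confining the script's global scope and its view of the DOM.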

VCF and VRanges
Through these data types, AnnotationHub and Epivizr integrate a variety of data in a uniform manner, based on community standards, regardless of data source.