DNAism: exploring genomic datasets on the web with Horizon Charts
© Rio Deiros et al. 2016
Received: 20 March 2015
Accepted: 13 January 2016
Published: 27 January 2016
Computational biologists face a daily need to explore massive amounts of genomic data. New visualization techniques can help researchers navigate and understand these large datasets. Horizon Charts are a relatively new visualization method that, under the right circumstances, maximizes data density without losing graphical perception.
Horizon Charts can be an effective visual tool to explore complex and large genomic datasets. Researchers can use our library to leverage these techniques to extract additional insights from their own datasets.
Sharing and communicating the large and intricate datasets produced by high throughput sequencing can be a challenging task. Visual channels are an effective way to explore data. However, the accelerating increase in data quantity is pushing the limits of current approaches for representing these datasets visually without sacrificing accuracy or graphical perception. Overall data volume is growing in both the amount of data per study and the number of subjects. Thus, more effective visualization techniques are needed to understand the most challenging genomic sequencing datasets.
State of the art
Future development and enhancement of genome browsers should see Horizon Charts as one flexible and efficient answer to the challenges faced when displaying large amounts of data. In the appropriate circumstances, this approach will provide significant benefits to browser developers. Greater effectiveness in the display of data will in turn help researchers explore that information more efficiently and conveniently.
Unlike time series data, genomic datasets associate the x-axis variable with chromosomal coordinates instead of timestamps. We have modified Cubism, an existing time series visualization library based on D3, to support genome coordinate data. This makes DNAism a flexible and effective tool to explore multi-sample genomic datasets using Horizon Charts.
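To make the change of x-axis concrete, the following minimal sketch shows one way to map genomic positions in a region onto a fixed number of pixel columns by averaging the values that fall in each column. It is an illustration only and is not taken from DNAism or Cubism; the function name `bin_region` and its parameters are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def bin_region(
    points: Iterable[Tuple[int, float]],
    start: int,
    stop: int,
    width_px: int,
) -> List[Optional[float]]:
    """Average (position, value) points into width_px columns over [start, stop).

    Hypothetical helper: columns with no data are reported as None.
    """
    if stop <= start or width_px <= 0:
        raise ValueError("need stop > start and width_px > 0")
    span = stop - start
    sums = [0.0] * width_px
    counts = [0] * width_px
    for pos, value in points:
        if start <= pos < stop:
            # Integer arithmetic keeps the column index inside 0..width_px-1.
            col = (pos - start) * width_px // span
            sums[col] += value
            counts[col] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

With this sketch, each pixel column plays the role that a time step plays in a time series chart; other summaries (maximum, median) could replace the mean.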
To visualize genomic datasets, we have modified most of the software components of the original Cubism library; the resulting library is available at http://drio.github.io/dnaism. The two major components are ‘context’ and ‘source’. The ‘context’ component performs several functions. Most importantly, it defines the region of the genome we want to explore. This component also specifies, in pixels, how much vertical space we have available for the visualization. The ‘source’ component parses the raw genomic data and generates the data points necessary for visualization. Our library provides two sources: ‘bedfile’ and ‘bedserver’. Once the sources are created, we can use the ‘metric’ component to instantiate metrics associated with specific samples. Finally, the ‘horizon’ component encapsulates the functionality necessary to create the visual elements.
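As an illustration of what a horizon-style encoding computes for each data point, the sketch below folds a value into a fixed number of layered bands, which is the general idea behind Horizon Charts. It is a generic sketch of the technique and not DNAism's actual code; `horizon_bands`, `max_value` and `n_bands` are hypothetical names.

```python
from typing import List, Tuple


def horizon_bands(value: float, max_value: float, n_bands: int) -> Tuple[int, List[float]]:
    """Split a value into per-band fill fractions (0.0 to 1.0), lowest band first.

    Returns (sign, fractions). Negative values are mirrored, so the sign is
    reported separately and would typically select a different color scale.
    """
    if max_value <= 0 or n_bands <= 0:
        raise ValueError("max_value and n_bands must be positive")
    band_height = max_value / n_bands
    magnitude = min(abs(value), max_value)  # clamp values above the extent
    fractions = []
    for band in range(n_bands):
        filled = magnitude - band * band_height
        fractions.append(max(0.0, min(1.0, filled / band_height)))
    sign = -1 if value < 0 else 1
    return sign, fractions
```

For example, with four bands covering the range 0 to 40, a value of 25 fills the first two bands completely and half of the third. Because all bands are drawn on top of each other in progressively darker shades, the chart needs only a quarter of the vertical space of a plain area chart.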
One of the crucial features of DNAism is the ability to efficiently parse and load the genomic data for visualization. We have provided two alternatives via the bedfile and the bedserver sources. A bedfile is a simple solution that loads all the genomic information in memory and returns the relevant data when queried. However, this approach is not adequate for larger datasets, especially those involving multi-sample data. To handle such cases, the bedserver source can be used. A bedserver is a dedicated server that implements a RESTful API. Client code running in the browser can send queries to this server to obtain the data of interest. The server uses pre-indexed data to speed up random access and returns only the information necessary for the visualization back to the client. Hence, this approach is much more scalable, even with large genomic datasets. We have implemented bedserver as a Python package (https://github.com/drio/bedserver), although we expect users will create their own sources and backends to interact with the specific details of their environments.
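The in-memory approach can be sketched as follows. This is a simplified illustration, not the bedfile source shipped with DNAism: the class `InMemoryBed` is hypothetical, it assumes four whitespace-separated columns (chromosome, start, end, numeric value), and for brevity it returns only records whose start coordinate lies inside the queried window rather than every overlapping record.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[int, int, float]  # (start, end, value)


class InMemoryBed:
    """Hypothetical in-memory store: parse everything once, then answer region queries."""

    def __init__(self, bed_text: str) -> None:
        rows = []
        for line in bed_text.splitlines():
            line = line.strip()
            # Skip blank lines, comments and header lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split()
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            value = float(fields[3]) if len(fields) > 3 else 1.0
            rows.append((chrom, start, end, value))
        self._starts: Dict[str, List[int]] = defaultdict(list)
        self._records: Dict[str, List[Record]] = defaultdict(list)
        # Sorting once lets every later query use binary search.
        for chrom, start, end, value in sorted(rows):
            self._starts[chrom].append(start)
            self._records[chrom].append((start, end, value))

    def query(self, chrom: str, start: int, stop: int) -> List[Record]:
        """Return records on chrom whose start lies in [start, stop)."""
        starts = self._starts.get(chrom, [])
        lo = bisect.bisect_left(starts, start)
        hi = bisect.bisect_left(starts, stop)
        return self._records.get(chrom, [])[lo:hi]
```

The whole file must fit in memory before the first query can be answered, which is why a server that reads only the requested slice from an indexed file scales better for multi-sample data.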
The source code of our library (and the original Cubism) has a decoupled interface that facilitates the extension of this library to new data sources. DNAism is data agnostic. As a result, users can create new sources to capture their specific backend peculiarities.
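The kind of decoupling described here can be illustrated with a small sketch in which consumer code depends only on a narrow source interface. The names `Source`, `DictSource`, `CallbackSource` and `peak_value` are hypothetical and are not part of the DNAism API; the sketch only shows the design idea.

```python
from typing import Callable, Dict, List, Optional, Protocol, Tuple

Point = Tuple[int, float]  # (genomic position, value)


class Source(Protocol):
    """Anything that can return data points for a genomic region."""

    def fetch(self, chrom: str, start: int, stop: int) -> List[Point]:
        ...


class DictSource:
    """Toy source backed by an in-memory dictionary keyed by chromosome."""

    def __init__(self, data: Dict[str, List[Point]]) -> None:
        self._data = data

    def fetch(self, chrom: str, start: int, stop: int) -> List[Point]:
        return [(p, v) for p, v in self._data.get(chrom, []) if start <= p < stop]


class CallbackSource:
    """Toy source that delegates to any callable, e.g. a site-specific client."""

    def __init__(self, fetcher: Callable[[str, int, int], List[Point]]) -> None:
        self._fetcher = fetcher

    def fetch(self, chrom: str, start: int, stop: int) -> List[Point]:
        return self._fetcher(chrom, start, stop)


def peak_value(source: Source, chrom: str, start: int, stop: int) -> Optional[float]:
    """Consumer code depends only on the Source interface, not on the backend."""
    values = [v for _, v in source.fetch(chrom, start, stop)]
    return max(values) if values else None
```

Swapping one backend for another then requires no change to the code that consumes the data points.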
We consider in this section the two main aspects of reproducibility: first the ability of the software to generate the same results given the same input sets, and second the requirements for our users to install and use our software, that is the ability of new users to reproduce and exploit the capabilities we intend.
The main goals of our library are to visually encode data points that capture the value of some variable under study for a series of genomic locations and to display those values on a computer screen. This makes validation rather simple. We can inspect a small area of the genome and check the actual data points displayed against our input files. Once interesting patterns and behaviors are discovered in the datasets, the user can proceed to manually confirm the results by looking back at the raw data.
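A spot check of this kind could be automated along the following lines. This is a minimal sketch under the assumption that both the displayed points and the raw input for a small region are available as position-to-value mappings; `find_mismatches` is a hypothetical helper and not part of DNAism.

```python
import math
from typing import Dict, List, Optional, Tuple


def find_mismatches(
    displayed: Dict[int, float],
    raw: Dict[int, float],
    tol: float = 1e-9,
) -> List[Tuple[int, float, Optional[float]]]:
    """Return (position, shown, expected) for displayed points that disagree with raw.

    expected is None when the position is absent from the raw input.
    """
    mismatches = []
    for pos, shown in sorted(displayed.items()):
        expected = raw.get(pos)
        if expected is None or not math.isclose(shown, expected, abs_tol=tol):
            mismatches.append((pos, shown, expected))
    return mismatches
```

An empty result for a handful of randomly chosen regions gives reasonable confidence that the values drawn on screen match the input file.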
Results and discussion
The main function of DNAism is to expose the power of Horizon Charts while abstracting the inner details. Exposing the functionality as a library gives users the flexibility to incorporate these visualization techniques within their own projects. We believe that this technology is ideal for developing visualizations that will help the community to better understand their genomic datasets.
We are not aware of any other tools that use Horizon Charts to explore genomic data.
The library is intended for exploring genomic data. It is ideal for aiding quality control on genomic datasets by visualizing different encoded metrics, typically in BED format.
In the future, we will add new sources to allow users to load data from different types of backend services. We also want to make the library easier to use, especially for users who are not well versed in the web ecosystem.
We introduce a powerful visualization technique previously used in the time series data domain. This visual tool facilitates the identification of similarities or abnormalities in patterns across multi-sample datasets. In addition, this approach helps to explore and visualize high density datasets more effectively, thereby helping researchers to understand their data more easily.
Our library keeps the effective and elegant interface of the original, while allowing users to leverage its power for genomic data. By providing a library, we maintain flexibility regarding how researchers can use these resources. Users can build full applications or use the library within their existing ones.
The companion lightweight server will facilitate the exploration of large genomic datasets without affecting user experience, by using indexed datasets. Alternatively, users can create their own data sources to reflect the details of their own environments.
Availability and requirements
We want to thank Mike Bostock for his remarkable contributions both with D3 and Cubism. The authors also thank Muthuswamy Raveendran, R. Alan Harris and Gloria Fawcett for helpful comments. This work was supported by NIH grant U54-HG003273 to RAG.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Few S. Time on the horizon. Visual Business Intelligence Newsletter. 2008. http://www.perceptualedge.com/articles/visual_business_intelligence/time_on_the_horizon.pdf.
- Saito T, Miyamura HN, Yamamoto M, Saito H, Hoshiya Y, Kaseda T. Two-tone pseudo coloring: Compact visualization for one-dimensional data. In: Proceedings of the 2005 IEEE Symposium on Information Visualization. INFOVIS ’05. Washington, DC, USA: IEEE Computer Society: 2005. p. 23. http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?arnumber=1532144.
- Heer J, Kong N, Agrawala M. Sizing the horizon: the effects of chart size and layering on the graphical perception of time series visualizations. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Association for Computing Machinery: 2009. p. 1303–1312.
- Wang J, Kong L, Gao G, Luo J. A brief introduction to web-based genome browsers. Brief Bioinformatics. 2013; 14(2):131–43.
- Kuhn RM, Haussler D, Kent WJ. The UCSC genome browser and associated tools. Brief Bioinformatics. 2012; 038:bbs038.
- Bostock M, Ogievetsky V, Heer J. D3 data-driven documents. IEEE Trans Vis Comput Graph. 2011; 17(12):2301–9.
- Li H. Tabix: fast retrieval of sequence features from generic tab-delimited files. Bioinformatics. 2011; 27(5):718–9.
- Wang R, Perez-Riverol Y, Hermjakob H, Vizcaíno JA. Open source libraries and frameworks for biological data visualisation: A guide for developers. Proteomics. 2015; 15(8):1356–74.