Over seventy-five years ago, Dr. Kenneth Raper described the remarkable life history of Dictyostelium discoideum [1]. This social amoeba grows vegetatively while subsisting on bacteria in the soil, until it exhausts the food supply. Starvation triggers a coordinated process of chemotaxis, aggregation, multicellular development and differentiation involving tens of thousands of individual cells. Over the decades, Dictyostelium has become a genetic model organism for myriad biological phenomena, including multicellular development, kin recognition, bacterial discrimination and innate immunity [2].
Dictyostelium has also been at the leading edge of genomics-era research. The D. discoideum genome was among the first eukaryotic genomes to be queued for (Sanger) sequencing [3], and the developmental transcriptome was explored in the early days of gene expression microarrays [4]. Since then, next-generation RNA-sequencing (RNA-seq) has vastly increased the ease and resolution of transcriptome studies [5–7]. Researchers now use ChIP-seq to define gene regulatory networks and short-read whole genome sequencing of chemical mutants to dissect genetic pathways [8, 9].
These technological and experimental advances continue to drive the need for new and better approaches to data management and analysis. The sheer volume of next-generation sequencing (NGS) output requires data management that is stable and scalable. Scientific best practices dictate that analyses should be rigorous, reproducible and traceable. Software solutions to these challenges are typically designed for data scientists and computational experts. However, these designs often fail to consider both the needs and the limitations of the many non-computational life scientists who generate and consume the data. To foster the most creative research and an efficient collaborative environment, life scientists should be engaged in the entire process; know where their data resides and how it has been processed; and be empowered to explore their data themselves, to ask questions and test hypotheses as they arise.
In collaboration with the Dictyostelium group at Baylor College of Medicine, the University of Ljubljana developed the original dictyExpress (1.0), a web application designed for exploration of transcriptomics datasets [10]. dictyExpress (1.0) allowed users to select among experiments and specify genes to analyze; visualize the expression time courses of those genes; identify gene clusters; examine pre-processed differential expression datasets; and perform Gene Ontology (GO)-term enrichment analysis.
The distinguishing feature of dictyExpress (1.0) was its interactivity. Each visual analytics module was linked to the others, such that selecting a gene or genes in one module propagated to the others, triggering new analyses where necessary. For example, when the user selected differentially expressed genes in the Volcano Plot, the temporal profiles of these genes appeared in the Time Course module, and GO enrichment terms updated automatically. Gene selection was supported in all visualization modules of dictyExpress, and in this way enabled a variety of workflows and entry points to exploring the data.
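The linked-selection behavior described above can be illustrated with a minimal sketch. This is not the dictyExpress implementation, which is not shown here. It assumes a simple publish/subscribe design, and every name in it (`SelectionBus`, `TimeCourseModule`, `EnrichmentModule`, the gene identifiers and the expression values) is hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

# A listener receives the current gene selection whenever it changes.
Listener = Callable[[FrozenSet[str]], None]


class SelectionBus:
    """Hypothetical shared selection state that notifies subscribed modules."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []
        self._selection: FrozenSet[str] = frozenset()

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def select(self, genes: Iterable[str]) -> None:
        new_selection = frozenset(genes)
        if new_selection == self._selection:
            return  # unchanged selection: no module needs to recompute
        self._selection = new_selection
        for listener in self._listeners:
            listener(new_selection)


class TimeCourseModule:
    """Shows temporal profiles for the selected genes (toy data only)."""

    def __init__(self, profiles: Dict[str, List[float]]) -> None:
        self._profiles = profiles
        self.shown: Dict[str, List[float]] = {}

    def on_selection(self, genes: FrozenSet[str]) -> None:
        # Keep only selected genes for which a profile is available.
        self.shown = {g: self._profiles[g] for g in sorted(genes) if g in self._profiles}


class EnrichmentModule:
    """Stand-in for an analysis that is re-run when the selection changes."""

    def __init__(self) -> None:
        self.runs = 0

    def on_selection(self, genes: FrozenSet[str]) -> None:
        self.runs += 1  # a real module would recompute enrichment here


if __name__ == "__main__":
    bus = SelectionBus()
    time_course = TimeCourseModule({"geneA": [1.0, 2.0, 4.0], "geneB": [3.0, 1.0, 0.5]})
    enrichment = EnrichmentModule()
    bus.subscribe(time_course.on_selection)
    bus.subscribe(enrichment.on_selection)

    # A selection made in one module (e.g. a volcano plot) reaches all the others.
    bus.select(["geneA"])
    print(time_course.shown, enrichment.runs)
```

In this sketch any module can call `select`, so every visualization can act as an entry point, and downstream analyses are triggered only when the selection actually changes.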
The original dictyExpress was developed in Flash (client side) and relied on an ad-hoc Python-based backend for data access. Users could not add new data themselves; doing so required manual changes to the database on the server side. End users also could not develop new pipelines or trace the results of bioinformatics analyses. Further, extending the platform to include other species was complicated by inflexibility on the server side.
In this paper we report dictyExpress (2.0), a reinvention of the original with an entirely new software architecture and extended functionality (Fig. 1). From the original version [10] we retain the name, several data presentation modalities and the concept of interactive visual exploration. Everything else has changed. The new dictyExpress is bundled with GenBoard, a data management and preprocessing web application. The entire suite has been rewritten. The client side uses JavaScript, HTML5 and CSS3. The server side uses a high-level Python web framework (Django, version 1.8.6, https://github.com/django/django, https://www.djangoproject.com), the PostgreSQL (version 9.4.11, https://github.com/postgres/postgres, https://www.postgresql.org) and MongoDB (version 2.4.8, https://github.com/mongodb/mongo, https://www.mongodb.com) databases, and an in-house data flow engine. The user may now upload raw next-generation sequencing data, trigger the computational pipeline for mapping, estimation of transcript abundance and computation of differential gene expression, and then use dictyExpress to explore and share the results. Once published, or at the user’s discretion, results may be marked as public and immediately made available to the general audience.
The new dictyExpress has been adopted as a tool of choice for analyzing gene expression data by many prominent labs in the Dictyostelium community. As of this submission, the web app has been viewed by over 3700 unique visitors and stores the data from over 800 Dictyostelium (and related) experiments. dictyExpress is reciprocally linked to dictyBase, the home page of the central repository for Dictyostelium genome data and experimental resources (http://dictybase.org). Every individual gene details page at dictyBase includes a link to dictyExpress, facilitating access to expression profiles, and each gene selection in dictyExpress is linked to the corresponding page in dictyBase. Below, we provide essential details of our implementation framework and describe the functionality of the new dictyExpress. We pay particular attention to interactive data analysis and to how this feature promotes exploration, discovery and insight generation. We also discuss how the framework could be extended to support other organisms, projects and data types; some of this work is already underway.