Next generation tools for genomic data generation, distribution, and visualization
© Nix et al. 2010
Received: 2 April 2010
Accepted: 9 September 2010
Published: 9 September 2010
With the rapidly falling cost and growing availability of high throughput sequencing and microarray technologies, the bottleneck for effectively using genomic analysis in the laboratory and clinic is shifting to managing, analyzing, and sharing genomic data.
Here we present three open-source, platform independent, software tools for generating, analyzing, distributing, and visualizing genomic data. These include a next generation sequencing/microarray LIMS and analysis project center (GNomEx); an application for annotating and programmatically distributing genomic data using the community vetted DAS/2 data exchange protocol (GenoPub); and a standalone Java Swing application (GWrap) that makes cutting edge command line analysis tools available to those who prefer graphical user interfaces. Both GNomEx and GenoPub use the rich client Flex/Flash web browser interface to interact with Java classes and a relational database on a remote server. Both employ a public-private user-group security model enabling controlled distribution of patient and unpublished data alongside public resources. As such, they function as genomic data repositories that can be accessed manually or programmatically through DAS/2-enabled client applications such as the Integrated Genome Browser.
GNomEx was developed to track samples for experimentation in our microarray and next generation sequencing core facility, associate raw data with biological samples, and link downstream computational analysis with the generated data. It is both a genomic LIMS and analysis project center designed for use by institutional core facilities and large research laboratories. Our installation of GNomEx currently hosts ~7000 experiment requests, ~30,000 raw microarray and next generation sequencing datasets, and ~130 processed genomic analyses.
Adobe's open framework rich client Flex interface is used to provide a front-end graphical interface in one's preferred web browser using the Flash media player.
Particular attention was paid to achieving platform independence for all aspects of the software. The components chosen include a client-side Flex/Flash interface, the Java programming language, an open source object-database mapping tool (Hibernate) that supports most relational databases (e.g. MySQL, Microsoft SQL Server, Oracle), and deployment of the web-based applications on an open access J2EE application server (Orion). These choices allow other groups to install and use these applications within their existing infrastructure.
GNomEx is built around the concept of projects in which individual experiments are grouped. Users are encouraged, through a wizard-like interface, to associate annotations with their projects and experiments. Where appropriate, MGED ontologies have been used to populate these annotation categories to assist in organizing, grouping, and searching projects and experiments.
Experiment annotations, data files, and associated data analysis files are safeguarded by a robust security manager that restricts access to authenticated users. The visibility of an experiment is set to either public, members, or members and collaborators. Following publication, researchers are encouraged to make their raw and analyzed data publicly available by changing their visibility settings. This will allow guest users to browse, search, and download published data.
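The three visibility levels described above can be illustrated with a minimal sketch. This is not GNomEx code: the class names, the fields (`lab`, `labs`, `collaborators`), and the exact rule that lab members always see their lab's experiments are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PUBLIC = "public"
    MEMBERS = "members"
    MEMBERS_AND_COLLABORATORS = "members and collaborators"


@dataclass(frozen=True)
class Experiment:
    name: str
    lab: str                                # owning lab (hypothetical field)
    visibility: Visibility
    collaborators: frozenset = frozenset()  # user names granted access


@dataclass(frozen=True)
class User:
    name: str
    labs: frozenset = frozenset()           # labs the user belongs to


def can_view(user: Optional[User], experiment: Experiment) -> bool:
    """Return True if the user may see the experiment; None means a guest."""
    if experiment.visibility is Visibility.PUBLIC:
        return True                         # published data is open to guests
    if user is None:
        return False                        # everything else needs a login
    if experiment.lab in user.labs:
        return True                         # lab members see both restricted levels
    return (experiment.visibility is Visibility.MEMBERS_AND_COLLABORATORS
            and user.name in experiment.collaborators)
```

Under these assumptions, changing an experiment's visibility to `PUBLIC` after publication is the only step needed to expose it to guest users.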
Experiments are organized within project folders that can be browsed according to experiment platform, submission date, or by name of the researcher or lab. Simple text searches as well as advanced, criteria-based searches can be performed on experiments, protocols, and associated analyses. Text searching relies on the high-performance, open source Apache Lucene text search engine. GNomEx keyword searching uses Lucene indexes, built nightly, that contain all text associated with experiments and downstream analysis, including free-form descriptions, structured annotations, sample names, and protocols. Post-search processing culls the results so that only view-permissible data are returned.
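The search-then-cull pattern described above can be sketched as follows. This toy inverted index is a stand-in for Lucene, not its API, and the function names and the `is_viewable` callback are hypothetical. The point is only that matching happens against the full index and permission filtering is applied afterwards.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query, is_viewable):
    """Return sorted ids matching every query token, after permission culling."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    # Post-search culling: drop anything the caller is not allowed to view.
    return sorted(doc_id for doc_id in hits if is_viewable(doc_id))


# Hypothetical experiment descriptions, for illustration only.
docs = {
    "exp1": "ChIP-seq of CTCF in GM12878 cells",
    "exp2": "RNA-seq time course, unpublished",
    "exp3": "CTCF tiling microarray pilot",
}
index = build_index(docs)
public_ids = {"exp1"}
print(search(index, "ctcf", lambda doc_id: doc_id in public_ids))
```

Because the index holds text from all experiments, it can be rebuilt on a schedule without regard to who will search it; the per-user check runs only on the hits.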
Often the best person to analyze genomic data is the person who submitted the samples to the genomics core facility. They typically have an intimate knowledge of the biology behind the project, have a list of key questions to address, and are aware of potentially confounding issues associated with the experiment. Moreover, when they perform their own genomic analysis, they become aware of the various choices made in generating the processed data that limit and bias its contents. As such, a key goal in our bioinformatics shared resource is to enable users to analyze their own data. For some genomic datasets (e.g. gene expression, SNP genotyping), users can choose from a variety of mature, open access, user friendly, GUI based applications for data processing. For other more recently emerging datasets, such as those derived from tiling microarray and next generation sequencing platforms, sophisticated, well characterized analysis tools do exist but are often challenging to use given their command line interface. This is to be expected. Analysis software evolves from minimalistic command line scripts, to integrated command line packaged tools, to web and stand alone GUI applications. When novel analysis approaches change frequently, designing and updating GUIs is often viewed as unproductive by application developers. On the other hand, many scientists avoid command line programming. To break this impasse, web based wrapper applications such as Galaxy and GenePattern have proven useful. Users upload their data to a remote server, use web forms to execute command line applications, and download their analysis, all within the framework of a web browser. Although effective, this approach can be less than ideal for processing large tiling microarray and next generation sequencing datasets.
The gigabyte size of these datasets makes timely upload and download difficult, strains data storage on a central server, and demands extensive computational resources to process even one dataset, let alone multiple datasets from multiple users. Lastly, from a developer standpoint, creating the web forms for each command line application and keeping them up to date requires effort that is often better spent improving the underlying algorithms.
Another key issue associated with effective use of genomic experiments in laboratories and clinics is the difficulty in efficiently distributing analyzed data. Too often, analyzed data are placed in a supplemental data folder on an author's or journal's web site where annotation of the analysis is non-standard and typically incomplete. Determining which methods were used in generating the data, or even the genome build, is often difficult. Submission of analyzed data to a public repository such as GEO or ArrayExpress is an improvement but is rarely done except when publishing the original unprocessed data. Some bioinformatic groups such as UCSC Genome Bioinformatics [12, 13] will host external datasets provided one can convince them the data are of interest to their users. In all cases, the data cannot be integrated in a subsequent analysis without extensive manual file downloading, filtering, and reformatting. Making a simple visual comparison between different datasets from different data sources in a genome browser requires considerable effort. Hundreds of genomic datasets are currently buried in web archives or customized databases. As such they are effectively inaccessible. Ideally, researchers would distribute their own data on the internet using a common protocol so that other groups could see it and could programmatically download portions of it for subsequent comparison with other datasets.
A solution to this problem exists and has been in development for more than ten years. It makes use of the Distributed Annotation System (DAS) protocol and a DAS server [14–19]. DAS is a communication protocol developed to exchange annotations on genomic and protein sequences between servers and client applications over the internet. Hundreds of DAS/1 servers are in use at bioinformatic data centers such as WormBase, UCSC, Ensembl, FlyBase, TIGR, and UniProt. Unfortunately, the DAS/1 protocol is not well suited to distributing large genomic datasets, given its requirement that datasets be formatted using verbose, text based DAS XML. DAS/2 is a recent extension of the DAS/1 protocol and is optimized for distributing large genomic datasets in both text and binary formats (e.g. bed, gff3, wig, bar, fasta, useq, dasXML, sam, bam). The difference in file size and corresponding download time between gzip compressed DAS XML and a binary format like useq is typically >100 fold (e.g. 85 MB vs 0.6 MB for the ENCODE's wgEncodeBroadChipSeqSignalGm12878Ctcf chIP-seq graph data for chr21). Any dataset that can be associated with a specific genome build and genome coordinates (e.g. gene expression, SNP, CNV, chIP-chip, chIP-seq, RNA-seq, chromosomal rearrangements) can be efficiently shared between DAS/2 servers and DAS/2-enabled clients such as IGB and GBrowse or incorporated into data objects from the Cancer Biomedical Informatics Grid (caBIG).
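As a quick check of the chr21 example, the arithmetic can be worked through in a few lines. The file sizes come from the text; the 10 Mbit/s link speed is purely an assumption chosen for illustration, and real transfer times also depend on latency and protocol overhead.

```python
def fold_reduction(larger_mb, smaller_mb):
    """How many times smaller the second file is than the first."""
    return larger_mb / smaller_mb


def transfer_seconds(size_mb, link_mbit_per_s):
    """Idealized transfer time: megabytes to megabits, divided by link speed."""
    return size_mb * 8 / link_mbit_per_s


DAS_XML_GZ_MB = 85.0    # gzip compressed DAS XML, from the text
USEQ_MB = 0.6           # useq binary, from the text
ASSUMED_LINK = 10.0     # Mbit/s, an assumed value

print(f"fold reduction: {fold_reduction(DAS_XML_GZ_MB, USEQ_MB):.1f}")
print(f"DAS XML: {transfer_seconds(DAS_XML_GZ_MB, ASSUMED_LINK):.2f} s")
print(f"useq:    {transfer_seconds(USEQ_MB, ASSUMED_LINK):.2f} s")
```

The ratio works out to roughly 140 fold, consistent with the ">100 fold" figure, and it carries over directly to the idealized download time at any fixed link speed.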
We have adopted DAS as our genomic data distribution model and have been working with the GenoViz open source project [1, 19, 22] to extend the functionality of the GenoViz Genometry DAS/2 server in three key areas. The first improvement was to implement a user-group public-private security model using HTTP MD5 digest authentication to restrict access to designated genomic datasets to particular users. Researchers need to be able to compare their unpublished data with public datasets. Clinicians working with patient data require controlled access under all situations. If needed, these servers can leverage other internet based security protocols, such as the secure socket layers and virtual private networks used by banks and hospitals for securing internet data exchange.
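For readers unfamiliar with HTTP digest authentication, the client-side response hash defined in RFC 2617 can be sketched as below. This illustrates the general scheme, not the GenoViz server's code; whether that server uses the `qop` variant is not stated in the text, so both forms are shown, and the helper names are hypothetical.

```python
import hashlib


def md5_hex(text):
    """Hex MD5 of a string, as used throughout RFC 2617."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri, nonce,
                    nc=None, cnonce=None, qop=None):
    """Compute the 'response' field of an Authorization: Digest header.

    Covers the no-qop form and qop="auth"; qop="auth-int", which also
    hashes the request body, is not handled here.
    """
    ha1 = md5_hex(f"{username}:{realm}:{password}")   # identifies the user
    ha2 = md5_hex(f"{method}:{uri}")                  # identifies the request
    if qop is None:
        return md5_hex(f"{ha1}:{nonce}:{ha2}")
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")
```

The password itself never crosses the wire: the server issues a nonce, the client returns this hash, and the server recomputes it from its stored credentials. The scheme does not encrypt the data being transferred, which is why it is often paired with a transport-level protocol when confidentiality of the data itself is required.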
A second improvement was to develop a compressed, pre-indexed, binary data format called useq that would support the majority of high throughput genomic text based data formats (e.g. bed, gff, gtf, wig, sgr, gr) in a manner that would require neither indexing upon server start up nor loading of the data into memory. The GenoViz DAS/2 server was built using an in memory data distribution model. This is appropriate for reference annotations and enables a rapid response to DAS/2 requests, but it ties the number of datasets that can be hosted to available memory. The useq data format provides a mechanism for hosting a large number of high-density datasets limited only by disk space. Tools for generating and extracting information from useq archives are distributed with the USeq package. A detailed description of the format is included in the USeq documentation.
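The idea of a compressed, pre-indexed archive can be illustrated with a toy sketch. This is not the actual useq layout (see the USeq documentation for that); the entry naming scheme, the tab-separated text payload, and the half-open coordinate convention are all assumptions. The sketch only shows how naming each compressed slice by its coordinate range lets a range query skip irrelevant slices without a start-up indexing pass.

```python
import io
import zipfile


def write_archive(regions, slice_size=2):
    """Pack (chrom, start, stop, score) rows into a zip of range-named slices."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        by_chrom = {}
        for row in sorted(regions):
            by_chrom.setdefault(row[0], []).append(row)
        for chrom, rows in by_chrom.items():
            for i in range(0, len(rows), slice_size):
                chunk = rows[i:i + slice_size]
                first = chunk[0][1]
                last = max(r[2] for r in chunk)
                body = "\n".join(f"{s}\t{e}\t{v}" for _, s, e, v in chunk)
                # The entry name acts as the index: chromosome, covered
                # range, and a slice offset that keeps names unique.
                archive.writestr(f"{chrom}_{first}_{last}_{i}.txt", body)
    return buffer.getvalue()


def query(archive_bytes, chrom, start, stop):
    """Return rows overlapping [start, stop), reading only relevant slices."""
    results = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            name_chrom, first, last, _ = name[:-4].rsplit("_", 3)
            if name_chrom != chrom or int(last) <= start or int(first) >= stop:
                continue  # slice skipped without being decompressed
            for line in archive.read(name).decode().splitlines():
                s, e, v = line.split("\t")
                if int(s) < stop and int(e) > start:
                    results.append((chrom, int(s), int(e), float(v)))
    return results
```

The archive is held in a `BytesIO` buffer here only to keep the example self-contained; with a file on disk, the zip directory is read on open and slice contents are decompressed only when requested, which is the property that lets the number of hosted datasets scale with disk space rather than memory.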
Presented here are three software applications developed to assist with generating, annotating, analyzing, organizing, distributing, and visualizing genomic data. GNomEx is the first published open source genomic LIMS that supports next generation sequencing and microarray platforms. It is an enterprise level application built for integrating multiple university core facilities and dovetails with the Bio Sample Tracking database in use at the University of Utah and Huntsman Cancer hospitals. Unlike most other LIMS, GNomEx contains an analysis project center where multiple users can upload, annotate, and associate analysis with the raw data archived in GNomEx. This is a critical feature needed to maintain chain-of-custody tracking from patients to samples to raw data to analyzed data. To efficiently distribute these processed data, we developed an easy to use web application called GenoPub. GenoPub associates and distributes metadata with each analyzed dataset through the GenoViz DAS/2 server. Analysis can be organized under multiple views (e.g. by patient, disease, or factor) and restricted to particular users, enabling the controlled distribution of patient and unpublished data alongside public datasets. To obtain analysis, users either manually download it to their local computer or access it programmatically through DAS/2-enabled client applications such as IGB.
These tools provide critical infrastructure for efficiently managing and distributing genomic data for use in the laboratory and the clinic and return the focus of genomic bioinformatics to data analysis. The development of novel analysis methods is accelerating as fast as next generation sequencing costs fall. Unfortunately, making these cutting edge analysis tools accessible to a wide spectrum of users is proving difficult. One solution presented here makes use of a stand alone GUI, GWrap, to convert 120 command line applications found in two widely used next generation sequencing and tiling microarray analysis packages, USeq and TiMAT2, into a user friendly GUI without placing a burden on developers or compromising the command line interface. GWrap can be incorporated into other analysis packages with minimal effort. In summary, we believe these next generation tools are well suited for making the best use of datasets from the post-genomic era.
Project names: GNomEx, GWrap, GenoPub
Operating systems: Platform independent
Programming languages: Java
Other requirements: Java 1.6+, a relational database (e.g. MySQL, Microsoft SQL Server), object/relational database mapping tool Hibernate 3.2+ https://www.hibernate.org, a Java servlet container (e.g. Apache Tomcat, Orion)
Licenses: GPLv3 for GNomEx, BSD for GWrap and USeq, Common Public License for GenoPub
Restrictions: For profit organizations are required to obtain a commercial license before deploying GNomEx in whole or part. No such restrictions are in place for USeq, GWrap, or GenoPub. See the licence.txt document in the individual package downloads for details.
LIMS: Laboratory Information Management System
GUI: Graphical User Interface
IGB: Integrated Genome Browser
DAS: Distributed Annotation System
MGED: Microarray Gene Expression Databases
SNP: single nucleotide polymorphism
The authors would like to acknowledge the tremendous resources provided by both the open source GenoViz project (Gregg Helt, Steve Chervitz, Ed Erwin, Allen Day, Brian O'Connor, Ehsan Tabari, Hiral Vora, Ido M Tamir, Marc RJ Carlson, Nomi Harris) and Ann Loraine's BioViz group (University of North Carolina at Charlotte: Ann Loraine, John Nicol, Steve Blanchard, Hiral Vora, Archana Raja; most of whom are GenoViz developers). The authors also thank the Huntsman Cancer Institute and National Institutes of Health (grant P01CA24014) for funding and releasing GNomEx, GenoPub, and GWrap to the non-profit community as free open source software.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.