eXframe: reusable framework for storage, analysis and visualization of genomics experiments
© Sinha et al; licensee BioMed Central Ltd. 2011
Received: 16 July 2011
Accepted: 21 November 2011
Published: 21 November 2011
Genome-wide experiments are routinely conducted to measure gene expression, DNA-protein interactions and epigenetic status. Structured metadata for these experiments is imperative for a complete understanding of experimental conditions, to enable consistent data processing and to allow retrieval, comparison, and integration of experimental results. Even though several repositories have been developed for genomics data, only a few provide annotation of samples and assays using controlled vocabularies. Moreover, many of them are tailored for a single type of technology or measurement and do not support the integration of multiple data types.
We have developed eXframe - a reusable web-based framework for genomics experiments that provides 1) the ability to publish structured data compliant with accepted standards 2) support for multiple data types including microarrays and next generation sequencing 3) query, analysis and visualization integration tools (enabled by consistent processing of the raw data and annotation of samples) and is available as open-source software. We present two case studies where this software is currently being used to build repositories of genomics experiments - one contains data from hematopoietic stem cells and another from Parkinson's disease patients.
The web-based framework eXframe offers structured annotation of experiments as well as uniform processing and storage of molecular data from microarray and next generation sequencing platforms. The framework allows users to query and integrate information across species, technologies, measurement types and experimental conditions. Our framework is reusable and freely modifiable - other groups or institutions can deploy their own custom web-based repositories based on this software. It is interoperable with the most important data formats in this domain. We hope that other groups will not only use eXframe, but also contribute their own useful modifications.
In the past two decades, numerous repositories have been developed for data management and analysis of genomics studies. The largest and most notable are the public repositories Gene Expression Omnibus and ArrayExpress, which store data from a variety of platforms but allow users to query gene expression only. There are a few efforts to archive the raw data from next generation sequencing runs. However, most genomics repositories are still limited to microarrays - examples include the Stanford Microarray Database, mAdb, Genopolis, MiMiR and several others, which are compared in a useful review by Gardiner-Garden and Littlejohn.
Most of these microarray databases follow the Minimum Information About a Microarray Experiment (MIAME) standard, which specifies the minimum information needed to interpret the results of an experiment. However, they often have heterogeneous sample annotation and use free text rather than a controlled vocabulary, making it difficult to perform integrative meta-analysis across experiments. Several repositories were developed specifically to address this issue, including M2DB - a microarray meta-analysis database of over 10,000 experiments annotated with disease states and organism parts using terms from controlled vocabularies; Oncomine - a web-based data management and mining platform for cancer datasets; GCOD (GeneChip Oncology Database), which has curated human cancer datasets; and Genevestigator, which provides annotation on a variety of biological contexts.
Although structured annotation of samples has allowed researchers to query expression across biological contexts, the actual application of these systems is limited to expression data. To accommodate other types of genomics data (for example from ChIP-Seq or RNA-Seq assays), standardized metadata on experimental design, measurement type and assay technology need to be captured. The ISA software suite (which consists of the ISA-Tab format and supporting tools) was the first successful effort devised to annotate studies with heterogeneous high-throughput assays using standard ontologies [14, 15]. While the ISA infrastructure offers significant improvements in the structured annotation of diverse assays, as a metadata format/store, it does not of course provide tools for processing, analysis or visualization of data.
Further, and very importantly, most databases are not available as open-source software to allow local installation and/or customization. This has led to inefficiencies, duplication of effort and the creation of numerous databases. Swertz et al. reviewed a dozen of these genomics databases for maintainability, extensibility and interoperability. Only a few were found to be configurable, and for most the software wasn't readily available for reuse. In reaction to these findings, MOLGENIS was developed as a local experimental genomics database [16, 17]; however, it isn't designed or optimized for integrative analysis.
We have developed eXframe, a reusable framework that addresses the issues of standardized annotation, multiple data types and analysis tools in a single platform. Our framework allows storage of gene expression, histone modification and transcription factor binding data from both microarrays and next generation sequencing technologies. The samples and assays are annotated with controlled vocabularies/ontologies and all data is processed and stored in a consistent way. This enables queries across species, experimental conditions and assay types, thus allowing the researcher to compare their data with others. The software is currently being used for two repositories, one containing hematopoietic stem cell data and the other Parkinson's disease patients' data.
In this section we describe the implementation of eXframe and its various components.
Web-based systems support ease of distribution, platform independence and scalable architecture. We implemented our system as a web-accessible database built on the LAMP (Linux, Apache, MySQL and PHP) technology stack. All components are available under open-source software licenses. We leverage the added convenience, power, and extensibility of Drupal, a widely used open-source content management system and social networking tool. Drupal is built on the PHP web scripting language, and its persistence store is a MySQL database. Drupal has a large developer community, allows ready customization and is highly scalable.
Several basic modules, such as the user login system, a caching module for fast access of pages, and SOLR-based search are pre-packaged with Drupal. It also has a large number of contributed modules that are easily integrated, thus speeding up the development process. Browsing, searching, and filtering capabilities are provided as part of the general Drupal framework.
Drupal also allows granular permissions and security based on user roles and group memberships. We used the granular permissions capability to implement flexible data publication. Users of our system can choose to publish just the experiment metadata and keep the raw or processed data hidden. The experiment metadata allows users to be aware of an experiment that has been performed by another user of the repository, and can thus foster collaboration while still protecting pre-publication data. The raw or processed data can be made public at a later stage when it has been accepted for publication.
We developed several custom Drupal content types to fully describe the experiment metadata. The basic unit of content in Drupal is called a node; nodes are classified by type, and custom modules define new types. The experiment metadata was designed to support multiple types of biomedical experiments and comprises three primary content types: i) Experiment, which contains one or more ii) Bioassays, which are linked to iii) Biomaterials.
The attributes of an Experiment are title, researcher and the study design details. The Experiment can be linked to publication(s). The Bioassay content type describes the assays performed and has these attributes - type of measurement, technology, platform and the raw output data file produced by the assay. These attributes guide the processing and analysis scripts and help users locate their data of interest. The measurement types can be easily extended as new requirements develop. The framework has been designed from the ground up to incorporate new measurement types (such as DNA methylation measurement) or new technologies (such as high throughput qPCR). We capture the technology (such as microarray) as well as the particular platform (such as Affymetrix HG-U133) used in the Bioassay; this enables us to process the raw data in a standard pipeline specific to the type of assay. Bioassays from the same Experiment can be grouped into specific sample and control groups for comparison. We have also developed an intuitive user interface to group Bioassays into the sample and control groups.
Bioassays are linked to the Biomaterial content type where sample properties are captured in detail. The default configuration allows the specification of the organism, development stage, tissue and cell types of samples using controlled vocabulary terms. The user can enter the data using either drop down forms or type-ahead fields. Genetic modifications, treatment and disease state of the Biomaterial are also captured as structured annotation where applicable.
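The relationship between the three content types can be pictured with a small data model. The following is only an illustrative Python sketch, not the Drupal implementation: the class and field names are assumptions chosen to mirror the attributes listed above, and the `find_bioassays` helper is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Biomaterial:
    # Sample properties; in eXframe these are controlled vocabulary terms.
    organism: str
    development_stage: Optional[str] = None
    tissue: Optional[str] = None
    cell_type: Optional[str] = None
    disease_state: Optional[str] = None


@dataclass
class Bioassay:
    measurement_type: str   # e.g. "gene expression", "histone modification"
    technology: str         # e.g. "microarray", "sequencing"
    platform: str           # e.g. "Affymetrix HG-U133"
    raw_file: str           # raw output data file produced by the assay
    biomaterial: Biomaterial
    group: Optional[str] = None  # "sample" or "control" (hypothetical encoding)


@dataclass
class Experiment:
    title: str
    researcher: str
    design: str
    bioassays: List[Bioassay] = field(default_factory=list)

    def find_bioassays(self, measurement_type: str, cell_type: str) -> List[Bioassay]:
        """Return bioassays of one measurement type taken from one cell type."""
        return [
            b for b in self.bioassays
            if b.measurement_type == measurement_type
            and b.biomaterial.cell_type == cell_type
        ]
```

Because the measurement type and the cell type are both structured fields, a search such as "all histone modification assays on a given cell type" reduces to a simple filter over the linked records.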
We use the Drupal taxonomy system for the controlled vocabulary terms, which are then mapped to various ontologies or taxonomies (Ontologies used and Linked Data generated from experiments will be discussed in a separate paper). The structured annotation of experiments allows enhanced searches - for example a user can find all the data from a particular cell type where histone modification has been measured. Our framework, eXframe, enables a site administrator to customize the set of fields available to the user for annotation. Thus eXframe can be deployed and configured to support new contexts, such as that of clinical data, and important patient characteristics can be acquired in a structured manner.
All the experiment metadata described above can be easily entered into the database using a user-friendly web form (see project website for details). The structured experimental metadata is subsequently processed and made available in several standard formats. This eliminates the need for crafting complex formats by a biologist or curator to generate structured annotation.
To enable query by genomic entities and integrate the data, we designed a set of tables that represent the data associated with genomic features such as genes, transcripts and loci as well as their relationships with each other.
The data produced in an experiment is primarily stored in two types of tables. Data from a microarray experiment is stored in a data table (rtype_data_matrix) and is associated with a bioassay and a probe. This generic data table can also be used for other technologies which have a data point associated with a probe such as qPCR. Sequencing data, on the other hand, is associated with an arbitrary genomic region with a defined start and end and is stored in the rtype_locus table. Computed values such as fold change are stored in the rtype_fc_matrix table. The complete genomics database schema is available as Additional File 1.
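To make the distinction between probe-based and locus-based storage concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Only the table names `rtype_data_matrix` and `rtype_locus` come from the text; the column names, the helper functions and the use of SQLite instead of MySQL are assumptions made for illustration (the actual schema is in Additional File 1).

```python
import sqlite3


def create_schema(conn):
    """Create simplified versions of the two primary data tables."""
    conn.executescript(
        """
        CREATE TABLE rtype_data_matrix (
            bioassay_id INTEGER NOT NULL,
            probe_id    TEXT    NOT NULL,
            value       REAL    NOT NULL
        );
        CREATE TABLE rtype_locus (
            bioassay_id INTEGER NOT NULL,
            chrom       TEXT    NOT NULL,
            start_pos   INTEGER NOT NULL,
            end_pos     INTEGER NOT NULL,
            value       REAL    NOT NULL
        );
        """
    )


def store_probe_value(conn, bioassay_id, probe_id, value):
    # Microarray (or qPCR) data point: one value per bioassay and probe.
    conn.execute(
        "INSERT INTO rtype_data_matrix VALUES (?, ?, ?)",
        (bioassay_id, probe_id, value),
    )


def store_locus_value(conn, bioassay_id, chrom, start_pos, end_pos, value):
    # Sequencing data point: one value per bioassay and genomic interval.
    conn.execute(
        "INSERT INTO rtype_locus VALUES (?, ?, ?, ?, ?)",
        (bioassay_id, chrom, start_pos, end_pos, value),
    )


def values_for_probe(conn, probe_id):
    rows = conn.execute(
        "SELECT bioassay_id, value FROM rtype_data_matrix "
        "WHERE probe_id = ? ORDER BY bioassay_id",
        (probe_id,),
    )
    return rows.fetchall()


def loci_overlapping(conn, chrom, query_start, query_end):
    # Two closed intervals overlap when each starts before the other ends.
    rows = conn.execute(
        "SELECT bioassay_id, start_pos, end_pos, value FROM rtype_locus "
        "WHERE chrom = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY bioassay_id",
        (chrom, query_end, query_start),
    )
    return rows.fetchall()
```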
Genomics data often needs to be described using heterogeneous entities. We designed the database in a flexible manner to accommodate genomics data that is associated with a gene, transcript, probe or genomic interval. For example, each microarray represented in the database has multiple probesets, each probeset may be associated with a transcript, and each transcript is associated with a gene. Affymetrix probesets, gene transcripts, etc. are linked to the gene, which allows users to query based on gene symbol and pull the relevant data from different assays. For sequencing assays, resulting values are linked to genomic features, e.g., peaks from a ChIP-Seq assay are linked to the nearest transcript.
Genes may have multiple symbols and orthologs. Orthologous genes are grouped using information from the NCBI HomoloGene database, and the homolog id is applied as the group identifier (see Additional File 1). Storing the ortholog information and gene aliases allows the user to query by any gene symbol and retrieve results across species.
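The alias and ortholog lookup can be sketched in a few lines of Python. This is a simplified illustration only: the dictionaries stand in for database tables, the group identifier and the alias are made-up placeholders rather than real HomoloGene records, and the function name `genes_for_symbol` is hypothetical.

```python
# Any symbol or alias (upper-cased) -> list of (species, official symbol).
ALIASES = {
    "HOXA9": [("human", "HOXA9"), ("mouse", "Hoxa9")],
    "HOXA9-ALT": [("human", "HOXA9")],  # made-up alias for illustration
}

# (species, official symbol) -> homolog group id (placeholder value).
HOMOLOG_GROUP = {
    ("human", "HOXA9"): 1,
    ("mouse", "Hoxa9"): 1,
}


def genes_for_symbol(symbol):
    """Return every gene, in any species, sharing a homolog group with `symbol`."""
    direct_hits = ALIASES.get(symbol.upper(), [])
    groups = {HOMOLOG_GROUP[g] for g in direct_hits if g in HOMOLOG_GROUP}
    members = {g for g, group_id in HOMOLOG_GROUP.items() if group_id in groups}
    # Keep direct hits even when they have no homolog group assigned.
    return sorted(members | set(direct_hits))
```

Under these assumptions, a query by an alias of the human gene also returns the mouse ortholog, which is what makes the cross-species queries described above possible.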
For microarray data, the user uploads CEL files; the data is background-corrected, normalized and summarized using the GCRMA algorithm. Expression fold change between the case and control groups is computed for each probeset and stored in the database along with associated statistics, including p-value, false discovery rate, t-statistic, lower and upper confidence limits, standard deviation (SD) and case and control means. This information enables users without any programming experience to query for a gene's fold change across all experiments from various species, disease states and cell types using an easy-to-use interface. The query results can be filtered by various attributes of the experiment.
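As a rough illustration of the per-probeset statistics, the sketch below computes group means, a log2 fold change and a t-statistic, and adjusts a list of p-values for the false discovery rate. The text does not state which statistical test or FDR procedure the pipeline uses, so the Welch t-statistic and the Benjamini-Hochberg adjustment here are assumptions, as is the premise that the expression values are already on a log2 scale. The production pipeline is written in R; Python is used here only for readability.

```python
import math
from statistics import mean, variance


def probeset_stats(case, control):
    """Summarize one probeset; inputs are log2 expression values per group."""
    case_mean, control_mean = mean(case), mean(control)
    # On a log2 scale, the fold change is the difference of the group means.
    log2_fc = case_mean - control_mean
    # Welch standard error (assumed test; does not require equal variances).
    se = math.sqrt(variance(case) / len(case) + variance(control) / len(control))
    return {
        "case_mean": case_mean,
        "control_mean": control_mean,
        "log2_fold_change": log2_fc,
        "t_statistic": log2_fc / se,
    }


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```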
Next generation sequencing technologies can be used for measuring RNA expression (RNA-Seq), transcription factor or any protein binding to DNA (ChIP-Seq), histone modification (ChIP-Seq), DNA methylation (RRBS), or protein binding to RNA (RIP-Seq). Users upload FASTQ files for all next generation sequencing assays and the data is consistently processed through the pipeline. The common first step for all next generation sequencing assays is to align the reads to the genome. Subsequent processing and analysis depend on the assay/measurement type.
To quantify the histone modification for a gene locus, reads are first aligned using the bowtie program and then the fragments per kilobase per million fragments mapped (FPKM) abundance measure is calculated for the region of interest. For example, the window used was 1 kb upstream to 1 kb downstream of the transcription start site for H3K4me3 and H3K27me3. For transcription factor binding assays, peak identification is done using the MACS program and then peaks are assigned to the gene in whose promoter region they are located. The peak score for each gene is stored in the rtype_locus table in the database. For RNA-Seq data, reads are aligned using tophat to identify splice junctions and further processed using cufflinks. The FPKM abundance measure for each transcript is stored in the rtype_locus table. The intermediate files - BAM from bowtie/tophat, FPKM from cufflinks and BED/WIGGLE from MACS - are also stored for use with other genome browsing tools.
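The window-based quantification and the peak-to-gene assignment can be sketched as follows. This is a simplified illustration rather than the pipeline code: the function names are hypothetical, peaks are reduced to a single summit coordinate, keeping the highest-scoring peak per gene is an assumed tie-breaking rule, and the promoter intervals are supplied by the caller because the text does not define them for the transcription factor case.

```python
def tss_window(tss, upstream=1000, downstream=1000, strand="+"):
    """Return (start, end) of a window around a transcription start site."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    else:
        # On the minus strand, upstream lies at higher coordinates.
        start, end = tss - downstream, tss + upstream
    return max(start, 0), end


def fpkm(fragment_count, region_length_bp, total_mapped_fragments):
    """Fragments per kilobase of region per million mapped fragments."""
    kilobases = region_length_bp / 1000.0
    millions = total_mapped_fragments / 1_000_000.0
    return fragment_count / (kilobases * millions)


def assign_peaks(peaks, promoters):
    """Map each gene to the best score of the peaks inside its promoter.

    peaks:     list of (chrom, summit, score)
    promoters: dict of gene -> (chrom, start, end)
    """
    scores = {}
    for chrom, summit, score in peaks:
        for gene, (p_chrom, p_start, p_end) in promoters.items():
            if chrom == p_chrom and p_start <= summit <= p_end:
                scores[gene] = max(score, scores.get(gene, score))
    return scores
```

For instance, 200 fragments falling in a 2 kb window in a library of 10 million mapped fragments correspond to an FPKM of 10.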
The advantage of assigning all measurements to a gene is that it allows us to compare features (such as DNA methylation, expression, transcription factor binding in promoter region) across experiments using query and visualization tools described in the next section. Further documentation for the pipeline can be found at our project website. Tools for analyzing DNA-methylation and RIP-Seq assays, as well as for SOLiD sequencing platforms are under development and will be available shortly.
We provide various analysis and visualization tools to probe the genomic data and present an integrated platform for genomic discovery. We provide two different forms of visualization. First, we allow users to query a list of genes and visualize the result as a heatmap illustrating gene expression across all samples (Bioassays) in the Experiment. The second type is a scatter plot of the data - we integrated the iCanPlot tool into eXframe for this purpose. Users can choose the x-axis, y-axis, color and size of the points in the scatter plot from any of the available experiments. Using the scatter plot tool, users can do integrative analysis such as investigating the relationship between histone modifications or transcription factor binding and gene expression.
The experiment information and genomic data can be downloaded in various formats, including the original raw data file, NCBI GEO SOFT and ISA-Tab for the experiment metadata, and GCT files for microarray expression data. In future we will also allow download of the processed files, such as the aligned reads (BAM) or peaks (BED/WIGGLE), through the web interface. If researchers enter their data and annotation on the website, they can easily submit the experiment to GEO using the SOFT format, thus providing an incentive for data entry. We also allow import from SOFT files and thus allow users to upload publicly available data from GEO into the database.
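As an example of one of these export formats, the sketch below serializes an expression matrix in the GCT (version 1.2) layout: a version line, a line with the row and column counts, a header line, then one tab-separated line per probe. The function name and the input structure are assumptions for illustration, and this is not the exporter used by eXframe.

```python
def to_gct(rows, sample_names):
    """Serialize an expression matrix as GCT v1.2 text.

    rows:         list of (name, description, [one value per sample])
    sample_names: list of column labels, one per sample
    """
    for name, _description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"row {name!r} does not match the sample count")
    lines = [
        "#1.2",
        f"{len(rows)}\t{len(sample_names)}",
        "\t".join(["Name", "Description"] + list(sample_names)),
    ]
    for name, description, values in rows:
        lines.append("\t".join([name, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"
```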
We illustrate the features and benefits of eXframe using two different use cases and present various queries and visualization examples.
The repository contains both data generated at HSCI and public data of interest to the community. We downloaded data from the NCBI GEO repository and imported it into the repository using the SOFT format. The data in the repository can be downloaded in various formats including ISA-Tab and SOFT. The format of the resulting ISA-Tab files was independently validated by the ISA-Tab Validator. The repository also makes the data available as a SPARQL endpoint, which will be described in a separate paper.
Researchers are often interested in a family of genes and hence multiple gene queries are also supported. Users can paste a list of genes in a text area and the results are visualized as a heatmap. HOX gene family expression in MEP (Megakaryocyte-Erythroid progenitor cell), GMP (Granulocyte-Macrophage Progenitor), CMP (Common Myeloid Progenitor), L-GMP (GMP-like leukemic cells) and HSC (Hematopoietic Stem Cells) cells is illustrated in Figure 6B. The expression values are quantile normalized for the heatmap visualization.
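Quantile normalization forces every sample to share the same distribution of values, which makes the colors in a heatmap comparable across samples. The following Python sketch shows the basic procedure under simplifying assumptions: every sample has a value for every gene, ties are broken by input order rather than averaged, and the function name is hypothetical. The text does not specify which implementation eXframe uses.

```python
def quantile_normalize(columns):
    """Quantile-normalize a dict of sample name -> list of values.

    All lists must have the same length (one entry per gene, same order).
    """
    names = list(columns)
    n = len(columns[names[0]])
    if any(len(columns[name]) != n for name in names):
        raise ValueError("all samples must have the same number of values")

    # Mean of the k-th smallest value across samples, for each rank k.
    sorted_cols = [sorted(columns[name]) for name in names]
    rank_means = [sum(col[k] for col in sorted_cols) / len(names) for k in range(n)]

    normalized = {}
    for name in names:
        values = columns[name]
        order = sorted(range(n), key=lambda i: values[i])
        out = [0.0] * n
        # Replace each value by the cross-sample mean for its rank.
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized[name] = out
    return normalized
```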
To illustrate the processing of next generation sequencing assays, we chose a publicly available RNA-Seq dataset from the NCBI GEO database (GSE30995). In this study, Gabut et al. investigated the transcriptional effect of alternative splice forms of the FOXP1 transcription factor on the H9 embryonic stem cell line. To study the transcriptional differences of two mutually exclusive splice forms of the FOXP1 gene, they used custom siRNA pools to knock down (KD) exons 18 and 18b of the FOXP1 gene. Control siRNAs were also used, and all three samples were profiled using RNA-Seq on the Illumina Genome Analyzer platform.
We have developed a Drupal-based, reusable, open-source framework - eXframe - that has allowed us to deploy the same software distribution for two widely different use cases and communities. One of them contains transcriptional profiles, histone modifications and transcription factor binding experiments on hematopoietic cells and another on primary tissue derived from Parkinson's disease patients. For both cases, eXframe was used to provide (a) institutional memory of experimental results, (b) cross-dataset comparison, (c) expedited and simplified integration with public databases, and (d) metadata-enabled cross-experiment and cross-laboratory dataset discovery. In the future, other scientific communities or research institutions are encouraged to configure and deploy this highly useful, reusable toolkit for their custom use.
The consistent processing and storage of the experiments enable users to integrate data across labs, species, technologies and measurement types. All data is mapped back to the relevant region of the genome, transcript or gene and thus allows researchers, for example, to investigate the effect of histone modification on the transcription of the gene. It allows cross species or experimental model comparisons. In future, we would like to research document-oriented databases such as MongoDB or implement caching mechanisms to allow scaling for larger data sets.
Structured annotation and use of controlled vocabularies to describe the biological samples, assays and experiment promotes reuse of data. Such an approach allows us to leverage the Semantic Web technologies. Semantic Web produces machine-readable content that allows data reuse and integration with other knowledge resources - eXframe provides the ability to generate Linked Data and SPARQL endpoints for the experimental metadata. The easy to adopt system lowers the barrier of entry and provides the benefits of the Semantic Web, while effectively hiding the complexities of the technology. These features will be fully described in a forthcoming paper.
Open-access data repositories with standardized annotation that allow interoperability and provide analysis-ready data are required for integrative genomics. We believe that use of our framework will encourage data sharing, integration and meta-analysis of genomics data, which will ultimately lead to the understanding of complex biological processes and pathogenesis of diseases. This toolkit supports, we believe, a broader and more comprehensive feature set than any other genomics experiment repository code available for general re-use under open source license. We encourage both use and collaborative extension of eXframe by other researchers and informaticians.
Project Name: eXframe
Project Home page: http://sciencecollaboration.org/exframe
Operating System: Platform independent
Programming Language: PHP & R
Other requirements: LAMP stack
Availability: freely available under a GNU 2.0 license without any restrictions for commercial use. The web application is supported on the following browsers - Firefox 4, Safari 5, Chrome 10, IE 9 or higher.
We would like to acknowledge the Harvard Stem Cell Institute (HSCI) for funding and support. We thank Dr. Daniel Tenen, Dr. David Scadden, Dr. Len Zon, Dr. Stuart Orkin and Dr. Clemens Scherzer for helpful discussions and for providing requirements. Lastly, we thank Siavash Safarizadeh of We Web Workers (http://www.wewebworkers.com/) for all his contributions to Drupal programming.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.