Adaptable data management for systems biology investigations
© Boyle et al; licensee BioMed Central Ltd. 2009
Received: 31 October 2008
Accepted: 06 March 2009
Published: 06 March 2009
Within research each experiment is different, the focus changes and the data is generated from a continually evolving barrage of technologies. There is a continual introduction of new techniques whose usage ranges from in-house protocols through to high-throughput instrumentation. To support these requirements data management systems are needed that can be rapidly built and readily adapted for new usage.
The adaptable data management system discussed is designed to support the seamless mining and analysis of biological experiment data that is commonly used in systems biology (e.g. ChIP-chip, gene expression, proteomics, imaging, flow cytometry). We use different content graphs to represent different views upon the data. These views are designed for different roles: equipment specific views are used to gather instrumentation information; data processing oriented views are provided to enable the rapid development of analysis applications; and research project specific views are used to organize information for individual research experiments. This management system allows for both the rapid introduction of new types of information and the evolution of the knowledge it represents.
Data management is an important aspect of any research enterprise. It is the foundation on which most applications are built, and must be easily extended to serve new functionality for new scientific areas. We have found that adopting a three-tier architecture for data management, built around distributed standardized content repositories, allows us to rapidly develop new applications to support a diverse user community.
To enable the adaptive behaviour that is required when developing software for research an "informal" data management strategy is often needed. By informal we mean there is a need to rapidly develop and adapt software infrastructures to unforeseen and (typically) unspecified requirements. We have found that the use of a distributed data management system (consisting of remote interlinked content repositories) gives us the required flexibility, while still allowing for the development of the level of formalization that is required for robust software development.
Advances in computer science have pushed what can be achieved with data management systems, and conversely these advancements have driven the increase in demands for richer functionality. The computer science research advancements have involved both hardware and software, with faster processor speeds enabling other innovations to become feasible. The way in which data management systems are built, and extended, has also changed. These changes in software engineering and design include: the methodology through which software is constructed (e.g. components leading to frameworks, and frameworks leading to aspects ); the technology used to allow for distributed computing (e.g. object brokers evolving pass-by-value mechanisms, and these being replaced by stateless Web Services); and the ideology that is used to define the process through which software is built (e.g. the "rational" processes being replaced by agile programming). These advances are continuing to occur, and will have an effect on the next generation of data management and distribution tools (e.g. cloud computing becoming mainstream through the use of Google App Engine or similar).
One common characteristic of these "technology first" efforts is that they developed solutions that were designed to work with static "finished" data, not research information. This style of system works well within publishing scenarios, where information is to be made available throughout an enterprise as a static resource. When actually working within a research institution (or life science company), where new technologies and ideas are continually being developed, such static publishing approaches are not appropriate. Instead, a flexible analysis and access system is required that allows for the rapid introduction and integration of many types of data. With the advent of systems biology, the recognized need for integration solutions has reached a high level of urgency.
This article is principally focused on the design of software to support systems biology investigations, rather than a discussion of a specific application. The software that has been developed, and made available to the community, is discussed to illustrate the advantages of the advocated designs. The importance of design can be over looked in scientific computing , due to the complexities with the development of software for research which arise due to the constantly changing requirements and individual project-centric approaches. In this article we discuss how we designed a flexible and maintainable data management solution which can be readily adapted to meet the requirements of a complex research environment. The reasoning behind this non-traditional design is that a research software infrastructure must be highly adaptable to change. The required change could arise from new technologies being introduced, a change of research focus, or a change in resourcing (project funding). How we integrate the data management system with other tools and data sources has been discussed previously .
There are two major requirements on software design within a rapidly changing research environment: the ability to develop and integrate software rapidly, and the assurance that the resulting systems are adaptable to new and unpredictable needs. The methodologies and technologies used within research environments must be able to satisfy these requirements. Methodologies that depend on formalized specifications for processes and data structures are not appropriate for research software infrastructure. An understanding of current informatics technologies, and their suitability, is essential to ensure that research driven software projects can be delivered on schedule. Over the last few years there have been many advances in enterprise computing which means that data management technologies can now effectively be used in rapidly evolving research environments. These advances mean that, through an appropriate choice of technologies, it is now possible to deliver, within a minimal timeframe, a distributed system which is robust, standardized, loosely coupled and interoperable. Enterprise components can now be rapidly configured to deliver a range of rich functionality (e.g. content management, messaging, state based processing, dynamic discovery, high level orchestration). However, inappropriate technology choices can result in an architecture which is not adaptable and difficult to maintain.
Technologies that require a static structure (e.g. schema definitions) are constrained in their usage as they require "top-down" design. This means that the usage of traditional application server based technologies may not be appropriate for many aspects of research, as they are largely designed for tasks where: there exists reasonably stable information which can be structured in a DBMS or similar; business logic is required to operate closely on the data; or integration logic or data transformation is required. Alternatively, less structured data management, such as content repositories, can be suitable within research environments. Content repositories serve a different purpose to application servers, but can be used to serve out similar types of information using the same protocols. Content repository technologies can be more appropriate (than an application server) as they provide a high level of flexibility when dealing with unstructured content (e.g. experiment information).
We are not suggesting that more formalized representations of data do not have their place in research environments. Many data sources can be represented as static snapshots, and when it comes to presenting the research results, the use of more standard software development practices and tools is essential. These static approaches become unsuitable in situations where software is being developed to directly support on-going research projects, with unpredictable life cycles and unforeseen requirements.
Layered content management
The content can be organized into a (typically) hierarchical structure. The nodes in the hierarchy can have an arbitrary complexity (depending upon the requirements) and also have extensible property sets associated with them. The content system also provides for the usual array of horizontal services (e.g. data querying, browsing, management, transactions, concurrency control, security).
We extended the Java Content Repository (JCR) standard (using the Apache Jackrabbit implementation ) to provide a distributed data management system that consists of a series of interlinked content management systems (CMS). Such a federated content repository solution offers three major advantages over a standard database/application server solution:
Easy to adapt. Within a research environment it is generally difficult to hammer down requirements. The requirements change over time, and new functionality is often required at short notice. Content repository systems have a high degree of flexibility, as the content graph can be extended to meet new requirements, and the annotations can be dynamically updated. This means that when new requirements become apparent, the structure can be changed or modified easily, without having to rewrite a schema or change an object layer implementation. It is interesting to note that other disciplines, in particular the design community, have found that such flexibility is invaluable in enabling creativity .
Easy to understand. Any data management solution will involve a high level of complexity, especially in a distributed research environment. This complexity, however, does not mean that the principles of how the system operates cannot be easily understood. Allowing the end-users to build up a good mental model of how the system works removes many obstacles with adoption. By portraying the content management as "an intelligent file system" positive transfer of knowledge can occur, so that the system is natural and intuitive to use.
Easy to access. Within a research environment there is little time to be spared for learning (largely transient) informatics systems. This fact coupled with the low level of formal software training of most researchers (whether computationally inclined or not), means that convenient access to the data is of paramount importance. Content management systems in general, and the system discussed in this paper in particular have little complexity in terms of object models and data access protocols, as the storage structure is simply a hierarchy and a large number of access protocols can be supported (e.g. direct file I/O, RMI, WEBDAV, REST).
Unfortunately, a single content repository system does not offer the level of flexibility that is required. Research groups interact with different facets of research information in diverse ways and with a plethora of goals. This diversity means that the data must be organized in different ways, for example: when capturing the data it may be appropriate to organize information by group and date; when processing the information it can be organized by runs and tool information; and when exploring the information it can be organized by biological significance (e.g. by gene, by strain, by condition).
The main aim of the distributed content repository system is to enable the researchers, and associated research tools, to easily be able to store, manipulate and retrieve the information they want in the form they desire. One repository can span multiple installations, and multiple repositories can reside within one instance. The system we use is designed around a three-layer architecture:
Instrumentation Layer. This layer is used to capture experimental information. The layer models information in a way that makes storage of the information simpler, so that laboratory scientists can easily add and annotate information that is captured from a variety of instruments. Typically each type of experiment has a distinct (and differently structured) repository.
Conceptual Layer. This layer is designed to provide a means to generically interact with the information through the use of high level abstract operations. These operations include the aggregation and retrieval of information, and do not necessitate an understanding of the actual information content. Information which is required for manipulating the data is provided as metadata through properties attached to the data files.
Organizational Layer. This layer provides a project (or researcher) based view on the information, and therefore is designed to have a "biological focus". Typically the content is organized by factors such as disease, organism or molecule. Each different research, or research group, can individually organize and annotate the data to suit their individual requirements.
The instrumentation layer is designed to be the interface point with high throughput equipment and laboratory instrumentation. Data is captured by laboratory scientists via equipment and is stored in a repository that is structured for their convenience. In this way the repository is closely aligned to the experiment and experimenter.
The repositories are mainly used to automatically capture information from high throughput experiments. We are currently using instrument repositories to capture information for high throughput imaging, proteomic and genomic experiments. The instrumentation repositories generally have relatively simple structures as their purpose is to facilitate data collection. For example, when capturing information for genomics, the data is organized in the repository in a fairly flat hierarchy containing a simple name and time stamp mechanism.
We specify a limited set of properties (based upon an ontology with an associated controlled vocabulary) which are used to describe the metadata associated with the experiment. This controlled metadata is used in conjunction with any ad-hoc or free text annotations which may be added at a later time by the experimenter. Where possible the metadata is gathered directly from the instrumentation (e.g. by parsing data files generated by microscopy control equipment such as IPLAB) otherwise, we interface with bespoke sample tracking systems (e.g. for genomics we bridge with the sample tracking SLIMArray  software). The ontologies used are based upon community standards (e.g. OME , MAGE ).
URI based retrieval – all referenceable items within the repository can be retrieved using a HTTP URI. This interface was built both to provide convenient access to the data and also to allow for the construction of semantic web based applications on top of a JCR instance.
Standardized access components for retrieving, searching and publishing information to a JCR instance. We have used these components in a variety of applications, including: within a servlet based web application; within a standalone application(s); and as part of a GenePattern  tool set.
The conceptual layer is designed to aid in tasks that are fundamental in many different research investigations. The layer provides a high level of abstraction for tasks associated with data processing (the "analysis system") and data manipulation (the "aggregation system"). These systems provide common (horizontal) functionality, and are designed to be used by analysis and data processing subsystems.
The genomics normalization pipeline (Figure 6) uses the analysis system directly. As the analysis system uses a standard pattern of nodes (input, output and status hierarchies), any analysis pipeline can make use of it. The genomics normalization pipeline simply retrieves the required input parameters, executes the job (and updates the status), and then writes the results. By using a pattern of nodes, applications can be loosely coupled so that they can be developed independently of each other with few dependencies e.g. a job submission application can write to the input nodes (Figure 6), and a separate processing application can receive notifications that are triggered through changes to the status nodes (Figure 7). The loose coupling provides a high level of reliability and allows for rapid application development. The content system also provides for a means to ensure that we are able to achieve reliable auditing of each analysis run. The basic functionality for reading/writing information is provided through standardized modules.
One of the aims of the computing advances over the last few years has involved the concept of run time aggregation. This approach is epitomized by the semantic web, where people can mash information from a variety of data sources into a single graph. We provide a run time aggregation system specifically for the aggregation of data from different analyses.
The Web Portal (Figure 8) uses the standardized service to retrieve information from a number of repositories. The service can query multiple repositories, so that the portal makes a request to retrieve URIs for specific types of information that match a specific gene id. This information is then displayed in each of the portlets. The aggregation system is used to allow for each part of data files (e.g. a specific row in a matrix that corresponds to a particular gene/probe) to be uniquely identified by URIs.
Repository level operations that are provided through http encoding.
Providing a repository PATH parameter (e.g, /analysis/runs) returns a generic representation of the underlying resources.
Create new resources by appending '/new' and providing parameters for a repository PATH and a resource NAME. Files can be uploaded as part of a multi-part HttpRequest.
Search by query terms by appending '/search' and providing parameters for a repository PATH. Query terms are matched against underlying resource properties (e.g. Organism = Mouse).
HTTP URI based operations are provided through a REST API for operating on individual objects.
Retrieve JSON representation of a resource, and an array of its child resources
Retrieve a file system representation of the resource with directories and files.
Retrieve the domain model representation of the resource.
Retrieve underlying resources matching provided query parameters.
Updates domain model representations, properties, files or contents of a file.
Permanently removes the resource, its model, properties and files
Discussion and conclusion
The adaptive data management system we have discussed has been designed to meet the competing requirements for data management that arise within a research organization. As with many aspects of research informatics, flexibility and rapid development are key issues as requirements change unexpectedly and frequently. We advocate the view that, to be of lasting use to research, any data management system must be able to:
Support the evolution of ideas
The development of integration strategies for systems biology is problematic due to both the nature of science and the organization of scientists. It is typical that the means to which a specific experimental technology will be used, and the methods used to analyze the resulting data, is unknown at software design time. This is because scientific understanding continually evolves. Any scientific data management system has to support such evolution, meaning that traditional approaches to development and design are frequently inappropriate. As neither the data usage nor analysis is known a priori an easily adaptable solution is needed.
Allows for flexible working
There is a fundamental requirement for scientists to easily access, query and manipulate data to suit their needs. Science led investigations require an infrastructure to support ongoing data driven discovery processes. This means that: the data should be accessible through a variety of mechanisms, including multiple computer languages and applications; the data should be detached from the system, so that scientists are not "tied down" to any specific object model or way of working; and flexible navigation through the data using a variety of approaches should be supported.
Delivers the salient information
Many aspects of biomedical research can be thought of as an information centered science . Information must be established in the context in which it has been requested, that is to say, the view on the information changes depending upon the question that was asked. Therefore, the structure of the data management system must change depending upon the requirements. In a research environment, the views on the data typically involve: high throughput instrumentation, where information must be pushed from machines quickly and reliably; computational informatics processing applications, which require auditing and annotations; and research projects, which require domain oriented and flexible views upon the data. This requirement for adaptable data management is encountered in many avenues of research across all of the biomedical sciences.
Rather than build a de novo system, we have extended standardized open source systems. By extending community standards we are able to deliver production systems quickly and are able to overcome boundaries due to resource limitations. We have found that the standardized systems we have adopted are generally not directly suitable for research informatics, and therefore had to be extended to allow for the required flexibility. These extensions are designed to: ensure that the system integrates well within a research enterprise; allow for customization of the system to ensure it has a richer semantics; and provide workflows to ensure the system can work robustly with high throughput instrumentation.
We have found that the three-tier distributed data management system provides the flexibility and adaptability that we required. As we adopted and extended standards, the system was put together with a minimal of FTE effort, while still providing a robust and scalable architecture. Most importantly, as science moves along at a rapid pace, and opinions are always multi-faceted and divided, this system allows us to evolve the structure of the data management to meet new requirements without having to continually rewrite or wrap out of date or inappropriate legacy code.
Availability and requirements
We have made a number of tools available to the community. These tools are built upon the Web Services we have developed that allows for the interoperability and distribution of content across a number of JCR instances. For these tools to function one or more JCR compatible instances must first be installed, and then the REST services must be deployed within tomcat. Further instructions are given below, and more details (including the relevant template files) are given on the main download site.
Project home page: http://www.systemsbiology.org/informatics;
Operating system(s): Platform independent;
Programming language: Java; Other requirements: Java 1.5 or higher, Tomcat 6.0 or higher, working JCR instance 1.4.0 or higher.
Any restrictions to use by non-academics: no restrictions.
Ensure you have Java 1.5 installed and a working Tomcat (6.×) instance
Download the main services file (the addama distribution) from the downloads site.
Install a JCR instance to create a repository (Jackrabbit from Apache is the recommendation). To install Jackrabbit: download the jcr-instance-jar-with-dependencies.jar into an execution working directory (JCR_HOME); edit the jcr-instance.properties file and copy the file into the JCR_HOME.
Start the repository by executing the jar using "java – jar jcr-instance-jar-with-dependencies.jar jcr-instance.xml".
Copy the addama-rest.war file into your tomcat/webapps directory and then fill in the addama-rest.properties file and copy it to your tomcat/libs directory.
Copy the addama-html.war file into your tomcat/webapps directory, make sure the addama-rest war file started up properly first.
In your browser go to http://[host name]/addama-html
The following tools are available from the main download site.
REST/URI Extensions. This is the main set of services that must be installed before any of the additional tools. The services allow for the retrieval of content using a URI.
Access Components: Components are provided to support the development of applications that require JCR retrieving, searching and publishing functionality. This are provided as a separate download and code is provided for R, Matlab, Perl, Python, Java and Ruby.
Data Feeder. A utility for loading a JCR from a sample tracking system. This utility demonstrates how annotations and data items can be mapped into a repository; and how simple monitoring can be used to mirror the contents of an existing sample tracking system in the JCR. This is built against the freely available SlimArray sample tracking software .
File Loader. A utility for loading file data directly into a JCR instance using XPATH information to define the position. This runs as an executable Java jar.
The project described was supported by grant number P50GMO76547 from the National Institute of General Medical Sciences, grant number 1R011CA1374422 from the National Cancer Institute and as part of contract HHSN272200700038C from the National Institute for Allergy and Infectious Diseases. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIMGS, NCI, NIAID or the NIH.
- Kiczales GLJ, Mendhekar A, Maeda C, Videira Lopes C, Loingtier J, Irwin J: Aspect-Oriented Programming. In European Conference on Object-Oriented Programming. Finland: Springer-Verlag; 1997.Google Scholar
- Etzold TUA, Argos P: SRS: information retrieval system for molecular biology data banks. Methods Enzymol 1996, 266: 114–128.View ArticlePubMedGoogle Scholar
- CORBA Services[http://www.omg.org/technology/documents/corbaservices_spec_catalog.htm]
- Haas LSP, Kodali P, Kotlar E, Rice J, Swope W: DiscoveryLink: A system for integrated access to life sciences data sources. IBM systems Journal 2001, 40(2):489–511.View ArticleGoogle Scholar
- Li MCX, Li X, Ma B, Vitányi P: The similarity metric. IEEE Transactions on Information Theory 2004, 50(12):3250–3264.View ArticleGoogle Scholar
- Saltz J, Oster S, Hastings S, Langella S, Kurc T, Sanchez W, Kher M, Manisundaram A, Shanbhag K, Covitz P: caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 2006., 22(15):Google Scholar
- Kelly D: A Software Chasm: Software Engineering and Scientific Computing. IEEE Software 2007, 24(6):119–120.View ArticleGoogle Scholar
- Boyle J, Cavnor C, Killcoyne S, Shmulevich I: Systems Biology Driven Software Design for the Research Enterprise. BMC Bioinformatics 2008., 9(295):Google Scholar
- Plaisant CGJ, Bederson B: SpaceTree: Supporting Exploration in Large Node Link Tree, Design Evolution and Empirical Evaluation. Information Visualization 2002.Google Scholar
- Fischer G: Domain-Oriented Design Environments: Supporting Individual and Social Creativity. Proc CMCD IV 1999, 83–111.Google Scholar
- Bachman C: Summary of current work ANSI/X3/SPARC/study group: database systems. ACM SIGMOD 1974, 6(3):16–39.View ArticleGoogle Scholar
- Marzolf B, Troisch P: SLIMarray: Lightweight software for microarray facility management. Source Code Biol Med 2006, 1: 5.PubMed CentralView ArticlePubMedGoogle Scholar
- Goldberg I, Allan C, Burel J, Creager D, Falconi A, Hochheiser H, Johnston J, Mellen J, Sorger P, Swedlow J: The Open Microscopy Environment (OME) Data Model and XML File: Open Tools for Informatics and Quantitative Analysis in Biological Imaging. Genome Biol 2005, 6(5):R47.PubMed CentralView ArticlePubMedGoogle Scholar
- Brazma Aea: Minimum information about a microarray experiment (MIAME)–toward standards for microarray data. Nature Genetics 2001, 365–371.Google Scholar
- Reich MLT, Gould J, Lerner J, Tamayo P, Mesirov J: GenePattern 2.0. Nat Genet 2006, 38: 500–501.View ArticlePubMedGoogle Scholar
- Hood L, Galas D: The Digital Code of DNA. Nature 2003, 421: 444–448.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.