The data model
The data model underlying the Genopolis Database maps a set of concepts in the experiment annotation to objects that are grouped according to a tree structure (Figure 1).
This arrangement is adequate for most experimental designs and for single-channel arrays. Its regular structure allows functions on the database content, such as consistency checking, analysis and search, to be implemented as simple functions on nodes that are called during a tree traversal.
The objects implementing the experiment description are:
Submitter: the scientist responsible for an experiment.
Experiment: generic information about an experiment. Experiments are associated with Submitters.
Source: the biological source (organism, tissue, cell) under study. An Experiment can have one or more Sources.
Sample: a specific state of a source that is characterized by a time and a set of stimuli affecting this source at this time.
Stimulus: information regarding a stimulus applied to a source in an experiment. This includes the time of application of the stimulus and its duration. When the same stimulus affects more than one sample within an experiment, this object is repeated for each sample. This minor redundancy was accepted in order to keep the objects organized as a tree.
Hybridization: all information regarding the hybridization of a sample. This includes information on the array used (only the GeneChip® microarray technology is supported) and the methods used to extract and label the mRNA. At least one hybridization must be associated with a sample.
Measurement: a set of gene expression values derived from a hybridization. This includes information on the reading (scanning) of the microarray as well as the image analysis and normalization procedures used.
Other objects that are not organized as elements of a tree are used to define Protocols and Arrays.
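The hierarchy listed above can be sketched as a generic tree of nodes. The following is a minimal illustration in Python, not the PHP implementation of the system: the class name Node, its fields and the example values are hypothetical, and the parent/child nesting is inferred from the object descriptions above where the text does not state it explicitly.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One element of the experiment description tree (hypothetical class)."""
    kind: str                      # e.g. "Experiment", "Source", "Sample"
    name: str
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it, so that calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Node"]]:
        """Yield (depth, node) pairs in depth-first pre-order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# A small, made-up experiment; the nesting order is an assumption.
submitter = Node("Submitter", "J. Doe")
experiment = submitter.add(Node("Experiment", "example time course"))
source = experiment.add(Node("Source", "example tissue"))
sample = source.add(Node("Sample", "4 h after stimulation"))
sample.add(Node("Stimulus", "example stimulus"))
hyb = sample.add(Node("Hybridization", "chip 1"))
hyb.add(Node("Measurement", "normalized values"))

# Any per-node function can be applied during a single traversal.
kinds = [node.kind for _, node in submitter.walk()]
for depth, node in submitter.walk():
    print("  " * depth + f"{node.kind}: {node.name}")
```

Because every element exposes the same traversal interface, operations such as search or consistency checking reduce to a function applied to each node visited by walk().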
Each element is characterized by several classes of attributes. Some attributes are simple named text or integer values, such as an animal identifier or an age value for a source. Others take values defined in controlled vocabularies, such as the name of a cell line or of a tissue. Information on the protocols and arrays used is defined in external objects that are referenced within the description elements. Finally, each object accepts an informal natural language description to handle information that is not explicitly supported.
The Genopolis database object model is intended to describe experiments in terms of their building blocks. The system then analyses the structure of its content to derive properties. For instance, by default, different hybridizations relative to the same sample are considered (and presented) as technical replicates, while distinct samples with the same stimuli and attributes (e.g. time) are considered biological replicates.
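The default replicate rules can be illustrated with a short grouping sketch. This is an assumption-laden example in Python rather than the system's own code: the Sample and Hybridization records, their field names and the choice of (time, stimuli) as the comparison key are all hypothetical simplifications of "the same stimuli and attributes".

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Sample:
    sample_id: str
    time_h: float                # time attribute of the sample (assumed unit: hours)
    stimuli: FrozenSet[str]      # names of the stimuli affecting the sample


@dataclass(frozen=True)
class Hybridization:
    hyb_id: str
    sample_id: str               # the sample this hybridization belongs to


def technical_replicates(hybs: List[Hybridization]) -> Dict[str, List[str]]:
    """Group hybridizations by sample; groups larger than one are replicates."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for h in hybs:
        groups[h.sample_id].append(h.hyb_id)
    return {sid: ids for sid, ids in groups.items() if len(ids) > 1}


def biological_replicates(samples: List[Sample]) -> List[List[str]]:
    """Group distinct samples sharing the same time and set of stimuli."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s.time_h, s.stimuli)].append(s.sample_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up data: s1 and s2 share time and stimuli, s3 is an untreated control.
samples = [
    Sample("s1", 4.0, frozenset({"stimulus A"})),
    Sample("s2", 4.0, frozenset({"stimulus A"})),
    Sample("s3", 0.0, frozenset()),
]
hybs = [
    Hybridization("h1", "s1"),
    Hybridization("h2", "s1"),
    Hybridization("h3", "s2"),
    Hybridization("h4", "s3"),
]

print(technical_replicates(hybs))
print(biological_replicates(samples))
```

Here h1 and h2 are reported as technical replicates of sample s1, while s1 and s2 form one group of biological replicates.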
Architecture
The Genopolis database is realized as a relational database managed by a web-based application. The object model on which the database is based is implemented by a set of software objects (business objects) that abstract the underlying relational tables. Hence, the resulting system has an n-tier architecture. The current version of the Genopolis Database makes use of MySQL 4.1, but access to the SQL layer is standard and wrapped by the business objects, so that it would be easy to port it to different systems. The core of the system is a web-based application written in PHP4 and currently deployed on Apache and Linux based web servers.
In order to support the experiment annotation described later, two distinct relational databases are used. One database stores incomplete experiment descriptions while these are being assembled. The other contains the data and descriptions of complete experiments and is available to the user for queries. This separation was made to improve reliability (it provides a clean separation of data, also with regard to unauthorized access and possible code flaws) and to enhance performance, since read-only instances of the database used for queries can easily be distributed on different machines, for instance on the nodes of a cluster.
The objects described above are organized in a tree structure and support recursive propagation of operations over the tree. One example of such an operation is the consistency check of the experiment description. This is declared as an abstract check() method that is implemented by each object. These objects also support rendering of information as HTML code for web forms (used for data submission) and for read-only web pages. To implement this, each object representing an entity in the experiment description contains a list of objects corresponding to description items and implementing description types such as strings, numbers, controlled vocabularies, free text and files. These objects are part of a distinct library called daolib (Data Access Objects), which allows the specification of their behaviour (i.e. accepted values) and appearance (i.e. HTML rendering).
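The recursive consistency check can be sketched as follows. This is a minimal Python illustration of the pattern, not the PHP classes of the system: only the check() method name and the rules "one or more Sources per Experiment" and "at least one hybridization per sample" come from the text, whereas the class layout, the check_tree() helper, the array rule and the example data are assumptions.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class DescriptionNode(ABC):
    """Base class for elements of the experiment tree (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.children: List["DescriptionNode"] = []

    def add(self, child: "DescriptionNode") -> "DescriptionNode":
        self.children.append(child)
        return child

    @abstractmethod
    def check(self) -> List[str]:
        """Return the problems found locally on this node."""

    def check_tree(self) -> List[str]:
        """Run check() on this node, then recurse into the children."""
        problems = [f"{self.name}: {msg}" for msg in self.check()]
        for child in self.children:
            problems.extend(child.check_tree())
        return problems


class Experiment(DescriptionNode):
    def check(self) -> List[str]:
        has_source = any(isinstance(c, Source) for c in self.children)
        return [] if has_source else ["at least one source is required"]


class Source(DescriptionNode):
    def check(self) -> List[str]:
        return []  # no local rule in this sketch


class Sample(DescriptionNode):
    def check(self) -> List[str]:
        has_hyb = any(isinstance(c, Hybridization) for c in self.children)
        return [] if has_hyb else ["at least one hybridization is required"]


class Hybridization(DescriptionNode):
    def __init__(self, name: str, array: Optional[str] = None) -> None:
        super().__init__(name)
        self.array = array

    def check(self) -> List[str]:
        # Assumed rule for illustration: a hybridization must name its array.
        return [] if self.array else ["no array specified"]


experiment = Experiment("E1")
source = experiment.add(Source("example tissue"))
s1 = source.add(Sample("S1"))
s1.add(Hybridization("H1", array="example array"))
source.add(Sample("S2"))  # deliberately left without a hybridization

problems = experiment.check_tree()
print(problems)
```

Each class only states its own local rule; the traversal in check_tree() collects the results, so adding a new element type does not require changes to the traversal code.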
This software engineering based approach eases the maintenance and upgrading of the system. The system keeps CEL files, image files and other attachments in a dedicated directory, and makes them available for download to authorized users. Measurement files are kept as files while the experiment description is being assembled; they are then parsed and stored in a single indexed MySQL table to support queries on expression values.
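The step from measurement file to indexed table can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL; the tab-delimited file layout, the table name expression_value, its columns and the probe set names are all hypothetical and do not reproduce the actual Genopolis schema.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited measurement data, one expression value per row.
measurement_text = (
    "measurement_id\tprobe_set\tvalue\n"
    "1\tprobe_a\t120.5\n"
    "1\tprobe_b\t15.0\n"
    "2\tprobe_a\t340.0\n"
    "2\tprobe_b\t14.2\n"
)

# Parse the file content into typed rows.
reader = csv.DictReader(io.StringIO(measurement_text), delimiter="\t")
rows = [
    (int(r["measurement_id"]), r["probe_set"], float(r["value"]))
    for r in reader
]

# A single table holds all values; an index on the probe set supports
# queries that look up one gene across many measurements.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE expression_value (
        measurement_id INTEGER NOT NULL,
        probe_set      TEXT    NOT NULL,
        value          REAL    NOT NULL,
        PRIMARY KEY (measurement_id, probe_set)
    )
    """
)
conn.execute("CREATE INDEX idx_probe_set ON expression_value (probe_set)")
conn.executemany("INSERT INTO expression_value VALUES (?, ?, ?)", rows)

# Example query: measurements in which probe_a exceeds a threshold.
hits = conn.execute(
    "SELECT measurement_id, value FROM expression_value "
    "WHERE probe_set = ? AND value > ? ORDER BY measurement_id",
    ("probe_a", 100.0),
).fetchall()
print(hits)
```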
Finally, other maintenance functions are implemented outside the client-server paradigm. These include the import of GeneChip® descriptions from Affymetrix MAGE-ML files (implemented in Java), the transfer of data between the two databases, and the export of the database content to ArrayExpress.
Access control
The Genopolis database supports a flexible access scheme in which users are distinguished by group memberships and roles (Figure 2). For instance, a data set may be declared accessible to the members of a given research group, and accessible only with limited rights (e.g. read-only rights) to others. In the current implementation the granularity of the access specification is the experiment: all annotation and data relative to elements that are part of the same experiment tree are assigned as a whole to groups, and users' access rights depend on their role within the group (administrator, protocol editor...). This also supports a distributed annotation process: within a group, some users can be made responsible for the definition of protocols, controlled vocabularies and array annotations, while other users may be responsible for the experiment annotation.
The access system is based on a custom-designed object-oriented API. This consists of three PHP classes: GroupSecurityMgr (manages user groups), UserSecurityMgr (manages users and their association with groups; the permissions associated with roles are defined here) and ObjectSecurityMgr (manages the assignment of experiments to user groups). API abstraction and customization classes (SecurityMgr, LoginManager) provide an easy-to-use access point for the programmer.
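The interplay of groups, roles and experiment-level permissions can be sketched as follows. This Python example is not the PHP API named above and does not reproduce its method signatures: the AccessRegistry class, the permission names, and the roles "annotator" and "reader" are assumptions introduced for illustration, while "administrator" and "protocol editor" are the roles mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Permissions granted by each role (hypothetical permission names).
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "administrator": {"read", "annotate", "edit_protocols", "manage_users"},
    "protocol editor": {"read", "edit_protocols"},
    "annotator": {"read", "annotate"},   # assumed role
    "reader": {"read"},                  # assumed role
}


@dataclass
class AccessRegistry:
    """Tracks user roles within groups and experiment-to-group assignment."""
    memberships: Dict[str, Dict[str, str]] = field(default_factory=dict)
    experiment_groups: Dict[str, Set[str]] = field(default_factory=dict)

    def add_member(self, user: str, group: str, role: str) -> None:
        if role not in ROLE_PERMISSIONS:
            raise ValueError(f"unknown role: {role}")
        self.memberships.setdefault(user, {})[group] = role

    def assign_experiment(self, experiment: str, group: str) -> None:
        # The experiment tree is assigned to a group as a whole.
        self.experiment_groups.setdefault(experiment, set()).add(group)

    def can(self, user: str, experiment: str, permission: str) -> bool:
        """True if any group owning the experiment gives the user a role
        that includes the requested permission."""
        user_roles = self.memberships.get(user, {})
        for group in self.experiment_groups.get(experiment, set()):
            role = user_roles.get(group)
            if role is not None and permission in ROLE_PERMISSIONS[role]:
                return True
        return False


registry = AccessRegistry()
registry.add_member("alice", "lab-a", "annotator")
registry.add_member("bob", "lab-a", "protocol editor")
registry.add_member("carol", "lab-b", "administrator")
registry.assign_experiment("exp1", "lab-a")

print(registry.can("alice", "exp1", "annotate"))       # alice annotates in lab-a
print(registry.can("bob", "exp1", "annotate"))         # bob only edits protocols
print(registry.can("carol", "exp1", "read"))           # carol is not in lab-a
```

Keeping the role-to-permission table separate from group membership mirrors the division of responsibilities described above, where roles are defined once and then applied within each group.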
MAGE-ML and ArrayExpress export
The Genopolis database can export its content in MAGE-ML. This feature has been implemented in order to provide an automated export to the ArrayExpress public repository. The implementation of this functionality is based on Tab2MAGE. This tool, developed by the EBI, accepts the description of a single experiment in a simple tabular format and translates it into the equivalent MAGE-ML file. Producing this kind of tabular file has been straightforward, since our experiment model is similar to the model represented in it. The support for controlled vocabularies has made it possible to map them to terms of ontologies accepted by ArrayExpress, such as the MGED Ontology. Integration of these ontologies within our system is under way.
Deployment
The Genopolis database is currently deployed on a cluster architecture. This is based on the Debian Linux distribution, complemented with the web server load balancing software "Linux Virtual Server" and the high availability tool "Heartbeat".
Web users' requests are transparently distributed to the available service nodes. This distributes the web server load and ensures availability of the system even in case of node failure. Each node has a local copy of the database holding the complete experiment descriptions and data (these copies are read-only and are updated when a new complete experiment description is added). This distributes the load over different SQL engines and optimizes data access.