The main tasks of a centralized organisation of biological data are to avoid errors in merging genotypes and phenotypes, to guarantee information privacy and to handle a huge amount of data in a straightforward fashion. Since any intervention in an existing workflow is critical, the designed application has to be intuitively usable and accepted by the end users. The laboratory receives the delivered minimal requirements for further processing, as illustrated in Figure 4. Minimal requirements are unambiguous subject identifiers, stored in a simple text file, which guarantee a correct mapping to the corresponding DNA material. Ideally, only the internal ID (generated by eCOMPAGT) has to be released as a minimal requirement, but experience shows that additional information (e.g. the gender of the subjects) is usually required to guarantee a correct identification of samples for genotyping or control purposes. Thus, the option for authorized users (e.g. the lab head) to alter the number and types of the minimal requirements is essential. It is worth noting that stringent quality control has to be performed in the laboratory (testing for Hardy-Weinberg equilibrium, 10% quality controls, no-template controls, ...) so that only final and reviewed genotypes are stored in eCOMPAGT.
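Such a minimal-requirements file could, for instance, look as follows (the column layout and identifier format are hypothetical, not the actual eCOMPAGT format):

```text
internal_ID	gender
ECO-000123	f
ECO-000124	m
ECO-000125	f
```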
Graphical User Interface (GUI)
An intuitive user interface was designed with Java Swing [10], extended by Java SwingX components [11] and improved with Substance [12].
The GUI is separated into two parts: the first part encompasses the elements of a typical CRM system. Customers, team members and projects can be added, edited and deleted via dedicated menu items (Figure 5).
Figure 5 also shows the second important part of eCOMPAGT, in which the import and export of genotypes and phenotypes is performed ("Evaluation"). During the import of phenotypes and subject information, an internal identifier ("minimal requirement") has to be assigned to each subject. This can be done automatically or, if a unique identifier is already available, by highlighting the corresponding column in the GUI. The application then uses the highlighted column as the unique identifier, which makes a later combination possible and ensures subject privacy.
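The mapping step described above can be sketched in a few lines of Java. The following is a minimal illustration (class name, file layout and error handling are our own assumptions, not eCOMPAGT code): the column the user highlighted serves as the unique key for each subject's phenotype values, so that genotypes can later be merged against the same key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PhenotypeImportSketch {

    /**
     * Indexes phenotype rows by the column the user highlighted in the GUI.
     * Each subject identifier maps to its remaining phenotype values.
     */
    static Map<String, List<String>> indexByColumn(List<String[]> rows, int idColumn) {
        Map<String, List<String>> byId = new LinkedHashMap<>();
        for (String[] row : rows) {
            List<String> phenotypes = new ArrayList<>();
            for (int i = 0; i < row.length; i++) {
                if (i != idColumn) phenotypes.add(row[i]);
            }
            if (byId.put(row[idColumn], phenotypes) != null) {
                // a duplicate identifier would break the later genotype-phenotype merge
                throw new IllegalArgumentException("Duplicate subject ID: " + row[idColumn]);
            }
        }
        return byId;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"ECO-001", "f", "5.2"},
                new String[]{"ECO-002", "m", "4.8"});
        // column 0 was "highlighted" as the unique identifier
        Map<String, List<String>> byId = indexByColumn(rows, 0);
        System.out.println(byId.get("ECO-001"));
    }
}
```

Because only the opaque identifier leaves the database, the phenotype values themselves never need to travel with personal data, which is the privacy property described above.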
Implementation
The amount of data derived from different genotyping methods is steadily increasing, and high-performance queries based on an efficient data model are required. Entity-relationship modelling was the first step in producing a conceptual data model, and we found that a normalized model best fits our requirements.
Figure 1 shows our final database structure in the form of an Entity-Relationship model (Chen notation).
eCOMPAGT was originally designed for IBM DB2 Version 9, but it can also be used with Oracle Database 10g. With the assistance of Hibernate [13], an object-relational mapping (ORM) tool, Java objects can be stored in a relational database. eCOMPAGT uses this feature to guarantee an abstraction from the underlying database layer. Starting with Version 5, the JDK (Java Development Kit [14]) supports typesafe annotations. The JPA (Java Persistence API [15]) builds on this feature and defines the mapping syntax and semantics, the life cycles of objects and the query facilities, which are accessible through the Hibernate Entity Manager. With JPA, all configuration of eCOMPAGT is held in one file (persistence.xml) and the bindings to the database are included directly in the source code (via annotations [16]). This eliminates the need for XML mapping files and simplifies the configuration.
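A persistence.xml of the kind described above could take roughly the following shape (the unit name, entity class and dialect are placeholders for illustration, not the actual eCOMPAGT configuration):

```xml
<persistence xmlns="http://java.sun.com/xml/ns/persistence" version="1.0">
  <persistence-unit name="ecompagt">
    <provider>org.hibernate.ejb.HibernatePersistence</provider>
    <!-- annotated entity classes; the mapping lives in the source code -->
    <class>org.example.Subject</class>
    <properties>
      <!-- the dialect is what changes when switching between DB2, Oracle and MySQL -->
      <property name="hibernate.dialect" value="org.hibernate.dialect.DB2Dialect"/>
    </properties>
  </persistence-unit>
</persistence>
```

Because the database-specific part is confined to such properties, porting the application reduces to editing this one file plus the schema adjustments mentioned below.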
We tested this portability using different relational database systems (DB2, MySQL). After small adjustments to the configuration file and the schema (syntax of sequences), eCOMPAGT ran on both database systems. Additionally, the application runs on Linux and Windows systems, given the availability of a Java Virtual Machine (part of the Java Runtime Environment).
To guarantee fast import/export of data and optimized queries, we used JDBC (Java Database Connectivity [17]) connections and index optimization.
Performance and scalability
The application was designed for use in the intranet of a research institution and constitutes a rich client that depends on a centralized database. Additional indices were generated to optimize the queries within the database. As a consequence, the upload of phenotype or genotype files requires more time than without additional indexing, but with the benefit of fast query results. We tested the system with 20 different projects, each containing 10,000 samples; for example, querying the 10,000 samples of one project takes less than one second. Stored procedures were additionally used for time-critical queries. As a database server we used two different machines: initially an Intel Pentium 4 server with 2.8 GHz and 786 MB RAM, and later a server with 8 processors (quad-core) at 2.6 GHz and 16 GB RAM.
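An additional index of the kind described above could, for example, look as follows (the table and column names are hypothetical, not the actual eCOMPAGT schema):

```sql
-- speeds up the per-project sample query, at the cost of slower bulk uploads
CREATE INDEX idx_sample_project ON sample (project_id, subject_id);
```

This is the trade-off noted above: every insert must also update the index, so uploads slow down slightly while queries become fast.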
Regarding scalability, two important aspects have to be addressed: time complexity and space complexity. Concerning time complexity, eCOMPAGT is based upon a relational database system such as Oracle or DB2. Hence, time scalability is provided by the database system itself: using B-tree indexing, search, insertion and deletion of data records are performed in O(log_m(n)) time, where m is the order of the tree.
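The logarithmic behaviour can be illustrated in Java with the standard library's TreeMap, a balanced search tree that, analogously to the database's B-tree index, guarantees O(log n) search, insertion and deletion (this is a didactic sketch, not part of eCOMPAGT):

```java
import java.util.TreeMap;

public class TreeLookupDemo {
    public static void main(String[] args) {
        // balanced search tree: put/get/remove all run in O(log n)
        TreeMap<String, String> genotypes = new TreeMap<>();
        for (int i = 0; i < 10_000; i++) {
            genotypes.put(String.format("rs%05d", i), i % 2 == 0 ? "AA" : "AG");
        }
        // a single-record lookup touches only ~log(10,000) tree nodes
        System.out.println(genotypes.get("rs00042"));
        System.out.println(genotypes.size());
    }
}
```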
Concerning space complexity, relational databases offer several options for handling huge amounts of data, such as cluster architectures, in order to avoid limitations of storage capacity.
Comparison to other systems
At the time of publication, we are aware of two other systems that are similar to eCOMPAGT: IGS [7] and SNPLims [18]. In contrast to IGS and SNPLims, which provide a web-based interface and command-line clients, the interface of eCOMPAGT is a user-friendly Java-based client. A major advantage of eCOMPAGT is its platform independence, whereas SNPLims requires a Debian server and IGS was developed for Windows systems. The database system for eCOMPAGT is Oracle, for IGS MS SQL and for SNPLims PostgreSQL. The three systems differ most markedly in the upload of phenotypes: eCOMPAGT offers an easy upload through BIFF files, IGS requires a full definition of phenotypes before the upload, and SNPLims separates phenotypes from demographic data. IGS and SNPLims are designed for handling high-throughput genotyping data derived from platforms like Illumina, MegaBace and Sequenom; eCOMPAGT is a solution for small to medium-sized laboratories which use TaqMan and SNPlex for genotyping. An extension to the import of STRs will be available in the next version of eCOMPAGT. A further advantage of eCOMPAGT is its history function, which enables easy and reliable tracking of data modifications.
Other projects that are similar in concept but do not meet our specific requirements are SNPP and PACLIMS. SNPP [19] is an open-source application based on MS Access or on a MySQL database. The application works well with data files in the Invader Analyzer (Third Wave Technologies, Madison, WI) format, and phenotypes are accepted in Excel format. Neither project management nor a history function for controlling data modifications is available. Furthermore, multiplex genotyping methods are not supported. The laboratory information management system PACLIMS [20] was developed for Unix/Linux systems with a PostgreSQL database and can be used as a model for high-throughput mutational endeavours.
A tool which is not in competition with our system, but could be a very valuable complement to eCOMPAGT, is TIMS (TaqMan Information Management System) [21]. TIMS is a package of Visual Basic programs focused on sample management and on the parsing of TaqMan input and output files; it is thus a great tool for organizing the data flow within the genotyping laboratory before the results are entered into eCOMPAGT.