A knowledge discovery object model API for Java
© Zuyderduyn and Jones; licensee BioMed Central Ltd. 2003
Received: 11 July 2003
Accepted: 28 October 2003
Published: 28 October 2003
Biological data resources have become heterogeneous and derive from multiple sources. This introduces challenges in the management and utilization of this data in software development. Although efforts are underway to create a standard format for the transmission and storage of biological data, this objective has yet to be fully realized.
This work describes an application programming interface (API) that provides a framework for developing an effective biological knowledge ontology for Java-based software projects. The API provides a robust framework for the data acquisition and management needs of an ontology implementation. In addition, the API contains classes to assist in creating GUIs to represent this data visually.
The Knowledge Discovery Object Model (KDOM) API is particularly useful for medium to large applications, or for a number of smaller software projects with common characteristics or objectives. KDOM can be coupled effectively with other biologically relevant APIs and classes. Source code, libraries, documentation and examples are available at http://www.bcgsc.ca/bioinfo/software.
The development of bioinformatics software for effective analysis and interrogation of biological data, and indeed of software in general, must include the creation of a data handling framework. Ideally, this framework should be accurate, robust, extensible, and technically feasible. The successful implementation of this framework has substantial implications for the ultimate success of software development. Further, the ability of such projects to be quickly applied in other areas of biology, improved by multiple developers, or evolved to handle changing requirements is directly affected by the initial choice of a core data model.
Most modern programming languages are well complemented by standard libraries and components that relieve the developer of many necessary low-level computational tasks. For example, arrays, lists, and vectors all provide easily used methods for managing and manipulating sets of data. Since data lists share characteristic operations and common manipulations, the use of standard constructs is an advantage to the developer. Developers who use these standards wherever possible will increase the speed of development and the robustness of the result. Improvements to the implementation are transparent and inheritable, mundane algorithms do not need to be developed repeatedly, and successful use can be repeated in future projects.
The Knowledge Discovery Object Model (KDOM) is an open source API written with Java 1.4 that attempts to embrace this ideal for biological data. Characterizing and standardizing commonalities in the use of biological knowledge can divert more development focus to novel creations. Although scientific literature holds many examples of how to approach the creation of an ontological system [4, 5], KDOM provides a core API to decrease the time and effort needed to deploy such a system, including the means to allow a user to visualize and manipulate the data.
Results and Discussion
What are the commonalities in biological data? Take the example of a "gene". An in silico gene can have a sequence, an annotation, possibly a chromosomal location, functional motifs, or similarities to other genes. Not all of these properties can be assumed to be enduring, and it is certain that new properties will be discovered. However, we are confident that the larger definition of a gene will remain accurate for the foreseeable future. Further, we know that relationships between data carry meaning in and of themselves. In a microarray experiment, an oligonucleotide is spotted onto a slide and washed with labelled cell RNA that will hybridize depending on the level of expression of genes containing that oligo's sequence. The oligonucleotide has a sequence, a position on the slide, and an observed colour when the experiment is performed. It is not the oligonucleotide itself, nor the colour, nor even the corresponding gene that provides the inference of the experiment. It is the combination of the three that offers knowledge. This reality, as it relates to building an effective ontology, has been described previously.
At this basic level, KDOM offers several advantages:
First, data is managed such that once an instance of a unique object is created, it is guaranteed to be the only instance of that object within the system. Unrelated procedures within the application domain can freely create or call instances of objects, knowing that existing instances will be utilized. For example, the developer could define three types of BiologicalData called Chromosome, Gene, and FunctionalDomain. If the user of the application invokes a procedure where all Genes from a single Chromosome are acquired for use, a subsequent procedure acquiring all Genes with a given FunctionalDomain will use the existing Gene objects if appropriate.
Second, the developer need only define the object itself and its relationship to other objects once. If another task (possibly undertaken by a different developer) requires the same object type (or the same relationship between objects) the existing KDOM infrastructure can be utilized.
Third, properties of the data are acquired only when needed, and need only be acquired once. If a property of the Gene is "annotation", the BiologicalBrain will retrieve it when first needed, and the system will automatically reuse it in subsequent procedures.
This approach saves computational energy, physical memory, and development time. These benefits increase as the amount of data in the system increases. Further, unrelated analyses can become meaningfully connected, thus presenting the opportunity for hypothesis discovery.
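The first and third advantages above can be illustrated with a minimal sketch. It is not the KDOM implementation: SketchBrain and SketchGene are hypothetical stand-ins for the BiologicalBrain and a Gene data definition, their method names are assumptions, and the code targets a current Java release rather than Java 1.4. The sketch only shows the general technique of handing out one canonical instance per identifier and caching a property after its first retrieval.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a KDOM data object; not the real BiologicalData class.
final class SketchGene {
    private final String id;
    private final Map<String, Object> properties = new HashMap<>();

    SketchGene(String id) {
        this.id = id;
    }

    String id() {
        return id;
    }

    // Returns the cached value, calling the loader only on the first request.
    Object property(String name, Function<String, Object> loader) {
        return properties.computeIfAbsent(name, key -> loader.apply(id));
    }
}

// Hypothetical stand-in for the BiologicalBrain: one canonical instance per identifier.
final class SketchBrain {
    private final Map<String, SketchGene> genes = new HashMap<>();
    int annotationFetches = 0; // counts simulated trips to the data source

    SketchGene gene(String id) {
        return genes.computeIfAbsent(id, SketchGene::new);
    }

    Object annotation(SketchGene gene) {
        return gene.property("annotation", geneId -> {
            annotationFetches++;
            return "annotation for " + geneId; // placeholder for a real lookup
        });
    }
}

final class CanonicalInstanceDemo {
    public static void main(String[] args) {
        SketchBrain brain = new SketchBrain();
        SketchGene viaChromosome = brain.gene("GENE-1"); // e.g. found through a Chromosome
        SketchGene viaDomain = brain.gene("GENE-1");     // e.g. found through a FunctionalDomain
        System.out.println("same instance: " + (viaChromosome == viaDomain));
        brain.annotation(viaChromosome);
        brain.annotation(viaDomain);
        System.out.println("annotation fetches: " + brain.annotationFetches);
    }
}
```

Both lookups return the same object, so a property fetched on behalf of one procedure is already present when an unrelated procedure asks for it.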
Using KDOM does not preclude the effective use of other Java-based bioinformatics APIs, such as BioJava or BTL. In fact, the functionality of these packages would complement the goal of KDOM. Where KDOM would provide the logical framework for defining and managing data definitions and relationships, other packages can assist in providing methods to obtain or manipulate this information.
Methods used for data acquisition are numerous. Flat files, databases, or the World Wide Web are potential sources of data. This fact necessitated flexibility in the KDOM approach. Therefore, the BiologicalBrain, as the responsible component, utilizes developer-implemented BiologicalNervousSystems to acquire data (Figure 1).
The advantage of this design is that "nervous systems" can be swapped or combined depending on the requirements of the system. This allows a developer to simultaneously utilize information from many different sources and formats using KDOM as a semantic layer. Since this aspect of the developer's KDOM implementation is centralized, it allows relatively easy migration to a different data storage system when and if required, or the inclusion of optimizations within the acquisition procedures that will benefit the entire system.
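A minimal sketch of how such swappable and combinable acquisition components might look is given here. The interface and class names (SketchNervousSystem, MapNervousSystem, CompositeNervousSystem) and the fetch signature are assumptions made for illustration; the actual BiologicalNervousSystem contract in KDOM may differ, and the in-memory map merely stands in for a flat file, database, or web source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical acquisition contract; the real BiologicalNervousSystem signature may differ.
interface SketchNervousSystem {
    Optional<String> fetch(String objectId, String propertyName);
}

// A source backed by an in-memory map, standing in for a flat file or database.
final class MapNervousSystem implements SketchNervousSystem {
    private final Map<String, String> values = new HashMap<>();

    MapNervousSystem put(String objectId, String propertyName, String value) {
        values.put(objectId + "/" + propertyName, value);
        return this;
    }

    @Override
    public Optional<String> fetch(String objectId, String propertyName) {
        return Optional.ofNullable(values.get(objectId + "/" + propertyName));
    }
}

// Combines several sources; the first source that holds the value answers the request.
final class CompositeNervousSystem implements SketchNervousSystem {
    private final List<SketchNervousSystem> sources = new ArrayList<>();

    CompositeNervousSystem add(SketchNervousSystem source) {
        sources.add(source);
        return this;
    }

    @Override
    public Optional<String> fetch(String objectId, String propertyName) {
        for (SketchNervousSystem source : sources) {
            Optional<String> value = source.fetch(objectId, propertyName);
            if (value.isPresent()) {
                return value;
            }
        }
        return Optional.empty();
    }
}

final class NervousSystemDemo {
    public static void main(String[] args) {
        SketchNervousSystem flatFile = new MapNervousSystem().put("GENE-1", "sequence", "ATGC");
        SketchNervousSystem database = new MapNervousSystem().put("GENE-1", "annotation", "kinase");
        SketchNervousSystem combined = new CompositeNervousSystem().add(flatFile).add(database);
        System.out.println(combined.fetch("GENE-1", "sequence").orElse("unknown"));
        System.out.println(combined.fetch("GENE-1", "annotation").orElse("unknown"));
    }
}
```

Because callers depend only on the interface, replacing the flat-file source with a database-backed one would not require changes elsewhere in the application.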
Standardized data description and delivery systems, some of which are ontological in design and already model data using the labelled graph approach, are an area of research and development that could be particularly agreeable to a KDOM implementation. The distributed annotation system (DAS), BioMOBY, and the Resource Description Framework (RDF) are recent examples.
The separation of "what" the information is and "how" it is obtained is a fundamental approach in ontology building. It is also a particular advantage in larger projects, as it makes these two needs individually transferable and manageable.
The relationship between data and the context in which a relationship exists is of fundamental importance. In the simplest system, relationships might be stored as an internal list inside a data object. However, the relationship would be unidirectional and the meaning of the relationship itself would not be explicit.
KDOM utilizes a BiologicalLink class to describe the context of the relationship and provides a bidirectional association. Several BiologicalLink subclasses (RelatedLink, EquivalentLink, HierarchyLink, and SimilarityLink) are provided with the API to define the most common data relationships. These can be further subclassed to provide more specific context, and to define properties specific to the relationship. The API also supports multiple relationship types between the same two classes of data.
For example, a functional domain and a protein share a hierarchical (parent-child) relationship. The relationship itself may be associated with a mathematical score describing the confidence that a particular domain is truly present, and the coordinates of the putative domain in the protein sequence itself (Figure 2).
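The protein and domain example can be sketched as follows. SketchNode and SketchHierarchyLink are hypothetical classes written for this illustration and do not reproduce the BiologicalLink or HierarchyLink API; the score and coordinates are made-up values. The sketch shows only the general idea of a link object that is registered with both ends, so the association can be followed in either direction, and that carries properties belonging to the relationship itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical data object; stands in for a KDOM data definition such as a protein or domain.
final class SketchNode {
    private final String name;
    private final List<SketchHierarchyLink> links = new ArrayList<>();

    SketchNode(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    List<SketchHierarchyLink> links() {
        return Collections.unmodifiableList(links);
    }

    void register(SketchHierarchyLink link) {
        links.add(link);
    }
}

// Hypothetical parent-child link holding properties of the relationship itself.
final class SketchHierarchyLink {
    private final SketchNode parent;
    private final SketchNode child;
    final double score; // confidence that the domain is truly present
    final int start;    // first residue of the putative domain in the protein
    final int end;      // last residue of the putative domain in the protein

    private SketchHierarchyLink(SketchNode parent, SketchNode child, double score, int start, int end) {
        this.parent = parent;
        this.child = child;
        this.score = score;
        this.start = start;
        this.end = end;
    }

    // Creates the link and registers it with both ends, making the association bidirectional.
    static SketchHierarchyLink connect(SketchNode parent, SketchNode child, double score, int start, int end) {
        SketchHierarchyLink link = new SketchHierarchyLink(parent, child, score, start, end);
        parent.register(link);
        child.register(link);
        return link;
    }

    // Returns the node at the opposite end of the link from the given node.
    SketchNode other(SketchNode from) {
        if (from == parent) {
            return child;
        }
        if (from == child) {
            return parent;
        }
        throw new IllegalArgumentException(from.name() + " is not part of this link");
    }
}

final class LinkDemo {
    public static void main(String[] args) {
        SketchNode protein = new SketchNode("protein");
        SketchNode domain = new SketchNode("functional domain");
        SketchHierarchyLink.connect(protein, domain, 0.97, 12, 85); // illustrative values only
        System.out.println(protein.links().get(0).other(protein).name());
        System.out.println(domain.links().get(0).other(domain).name());
    }
}
```

A plain list of children stored inside the protein object could not answer the reverse question (which proteins contain this domain) and would have no place to hold the score or coordinates.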
Graphical user interfaces
The API also features support for displaying individual object properties in a context-sensitive manner. For example, an "annotation" property for a gene would be displayed to the user differently than the "image" property for a chromosome.
This provides a high level of GUI component sharing between separate deployments of the KDOM implementation. This is particularly valuable when a consistent look and feel across many projects is advantageous or desirable.
The API also contains a number of other features too numerous to list in full. Among them: a type-safe container class that stores sets of BiologicalData and includes methods to facilitate threaded batch processing and set operations; internal row and column management, and cell and list renderers, so that sets of data can be quickly displayed and manipulated in tabular or list format; a multi-threaded, internal data request queue, which allows large amounts of data to be retrieved without disrupting user interaction with the application; robust data serialization in XML, which provides a portable and efficient method to store the results of data analyses; a centralized drag-and-drop extension for user-driven manipulation of KDOM objects; data and dataset listeners for creating responsive application components; and custom ClassLoaders that allow data definitions to be found at run-time without specifying a CLASSPATH directive.
A KDOM example
Ongoing development and future enhancements
The API continues to undergo active development. In particular, we intend to develop and integrate a sophisticated memory-management and serialization model so that very large data collections can be utilized. Serialization of information could adopt one or more of the emerging standards (e.g. RDF) to facilitate further code reuse and utilization of information with existing packages.
The next generation of the API will include a mature "action-defining" concept (the ability is loosely implemented in the current version). This improvement would provide a developer with the ability to define task-specific modifications or interrogations of biological data within the data definitions themselves. Thus, the component-sharing philosophy of KDOM would be extended to object-manipulation tasks, as well as data definitions and relationships.
For example, if one has defined a relationship between genes via homology, it may be desirable to generate an algorithm that determines whether a given similarity is important for the current task. Consider a task where an investigator requires a list of genes that have greater than 80% similarity to a particular target, and are found in the mouse or human genome. The action model would define the implementation of this need (via, perhaps, a "numerical cutoff" action), and efficiently apply it to the current KDOM system. Further, these actions could be combined or linked together to create reusable analysis pipelines.
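Since the action model is described here only as a planned feature, the following is a speculative sketch of the idea rather than a preview of the API. SketchHit and SketchActions are hypothetical names, the similarity values are invented, and the standard java.util.function.Predicate interface is used as a stand-in for whatever action type KDOM eventually defines. It shows a "numerical cutoff" action combined with an organism filter to answer the query in the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical similarity record between a target gene and another gene.
final class SketchHit {
    final String gene;
    final String organism;
    final double similarity; // percent similarity to the target

    SketchHit(String gene, String organism, double similarity) {
        this.gene = gene;
        this.organism = organism;
        this.similarity = similarity;
    }
}

// Hypothetical reusable actions, modelled here as standard Java predicates.
final class SketchActions {
    private SketchActions() {
    }

    // "Numerical cutoff" action: keeps hits strictly above the given percentage.
    static Predicate<SketchHit> similarityAbove(double cutoffPercent) {
        return hit -> hit.similarity > cutoffPercent;
    }

    // Keeps hits whose organism is one of the listed names.
    static Predicate<SketchHit> fromOrganisms(String... organisms) {
        List<String> allowed = Arrays.asList(organisms);
        return hit -> allowed.contains(hit.organism);
    }

    // Applies an action to a collection and returns the hits that pass, in order.
    static List<SketchHit> apply(List<SketchHit> hits, Predicate<SketchHit> action) {
        List<SketchHit> kept = new ArrayList<>();
        for (SketchHit hit : hits) {
            if (action.test(hit)) {
                kept.add(hit);
            }
        }
        return kept;
    }
}

final class ActionDemo {
    public static void main(String[] args) {
        List<SketchHit> hits = Arrays.asList(
                new SketchHit("A", "human", 92.5),
                new SketchHit("B", "mouse", 80.0),
                new SketchHit("C", "zebrafish", 95.0),
                new SketchHit("D", "mouse", 81.0));
        // Two simple actions linked into one reusable pipeline step.
        Predicate<SketchHit> task = SketchActions.similarityAbove(80.0)
                .and(SketchActions.fromOrganisms("mouse", "human"));
        for (SketchHit hit : SketchActions.apply(hits, task)) {
            System.out.println(hit.gene + " (" + hit.organism + ", " + hit.similarity + "%)");
        }
    }
}
```

Keeping each action small and composing them with and, or, and negate is one way the "combined or linked together" behaviour could be obtained without writing a new filter for every task.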
A developer repository of KDOM data definitions is under construction as of this writing. This repository would provide structures for common biological concepts. A developer could obtain the required objects relevant to a particular project, and would inherit their functionality, including GUI components for display of particular properties or data relationships. The developer would merely have to implement a BiologicalNervousSystem that conforms to their storage platform or data acquisition system to obtain the defined information. This collection of data types will consist of two general layers: an abstract layer of definitions of common biological concepts, and a further supplementary layer of extended definitions specific to the contents of common biological databases.
Two successful implementations of KDOM
Another application called SAGEsoma (P. Ruzanov et al., in preparation) provides the ability to visualize gene expression data on a karyotype.
The development of SAGEsoma occurred almost completely independently of DISCOVERYspace, yet the two applications were able to share and benefit from development of the same KDOM implementation. Further, integration of SAGEsoma into DISCOVERYspace as a plugin was seamless, whilst allowing both to retain their standalone capabilities.
The Knowledge Discovery Object Model (KDOM) API provides an application framework for bioinformatics software development in Java. The model provides a well-defined system for knowledge management and utilization, and facilitates efficient development of medium to large-scale software projects with multiple developers, or an easily managed system for creating smaller single-developer projects with minimum overlap and overhead. KDOM complements other bioinformatics Java APIs well.
The API provides a foundation for a logical, structured framework for data modelling, and can provide insights that result in novel hypotheses.
The API is open-source, with conditions, and can be obtained from http://www.bcgsc.ca/bioinfo/software. Documentation and code examples are also available.
SZ would like to acknowledge the valuable criticisms and comments of Chris Fjell, Mehrdad Oveisi, Shawn Rusaw, and particularly Neil Robertson. SZ would also like to thank Peter Ruzanov for implementing the API in the SAGEsoma project. This work was supported by funding from Genome Canada and the British Columbia Cancer Foundation.
- Uschold M, Gruninger M: Ontologies: Principles, methods and applications. Knowledge Engineering Review 1996, 11(2).
- Jazayeri M: Component programming – a fresh look at software components. In Proceedings of the 5th European Software Engineering Conference: September 25–28 1995; Sitges, Spain (Edited by: Schafer W, Botella P). 1995, 457–478.
- Sun Microsystems: Java Development Kit 1.4.0. 901 San Antonio Road, Palo Alto, CA 94303 2002. [http://www.java.sun.com/j2se/1.4]
- Stevens R, Goble CA, Bechhofer S: Ontology-based Knowledge Representation for Bioinformatics. Brief Bioinform 2000, 1: 398–414.
- Wiechert W, Joksch B, Wittig R, Hartbrich A, Honer T, Mollney M: Object-oriented programming for the biosciences. Bioinformatics 1995, 11: 517–534.
- Pitt WR, Williams MA, Steven M, Sweeney B, Bleasby AJ, Moss DS: The Bioinformatics Template Library – generic components for biocomputing. Bioinformatics 2001, 17: 729–737. 10.1093/bioinformatics/17.8.729
- Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L: The Distributed Annotation System. BMC Bioinformatics 2001, 2: 7. 10.1186/1471-2105-2-7
- Wilkinson MD, Links M: BioMOBY: an open source biological web services proposal. Brief Bioinform 2002, 3: 331–341.
- Resource Description Framework [http://www.w3.org/RDF]
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–10. 10.1006/jmbi.1990.9999
- Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14: 755–763. 10.1093/bioinformatics/14.9.755
- Pruitt KD, Maglott DR: Refseq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001, 29: 137–140. 10.1093/nar/29.1.137
- Boguski MS, Schuler GD: ESTablishing a Human Transcript Map. Nat Genet 1995, 10: 369–371.
- Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000, 28: 45–48. 10.1093/nar/28.1.45
- Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer ELL: The Pfam protein families database. Nucleic Acids Res 2002, 30: 276–280. 10.1093/nar/30.1.276
- Nakai K, Horton P: PSORT: a program for detecting the sorting signals of proteins and predicting their subcellular localization. Trends Biochem Sci 1999, 24: 34–36. 10.1016/S0968-0004(98)01336-X
- Online Mendelian Inheritance in Man OMIM™ [http://www.ncbi.nlm.nih.gov/omim]
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.