The Ontology Lookup Service, a lightweight cross-platform tool for controlled vocabulary queries
© Côté et al. 2006
Received: 13 December 2005
Accepted: 28 February 2006
Published: 28 February 2006
With the vast amounts of biomedical data being generated by high-throughput analysis methods, controlled vocabularies and ontologies are becoming increasingly important to annotate units of information for ease of search and retrieval. Each scientific community tends to create its own locally available ontology, and the interfaces to query these ontologies vary from group to group. We saw the need for a centralized location to perform controlled vocabulary queries that would offer both a lightweight web-accessible user interface and a consistent, unified SOAP interface for automated queries.
The Ontology Lookup Service (OLS) was created to integrate publicly available biomedical ontologies into a single database. All modified ontologies are updated daily. A list of currently loaded ontologies is available online. The database can be queried to obtain information on a single term or to browse a complete ontology using AJAX. Auto-completion provides a user-friendly search mechanism. An AJAX-based ontology viewer is available to browse a complete ontology or subsets of it. A programmatic interface is available to query the webservice using SOAP. The service is described by a WSDL descriptor file available online. A sample Java client to connect to the webservice using SOAP is available for download from SourceForge. All OLS source code is publicly available under the open source Apache Licence.
The OLS provides a user-friendly single entry point for publicly available ontologies in the Open Biomedical Ontology (OBO) format. It can be accessed interactively or programmatically at http://www.ebi.ac.uk/ontology-lookup/.
Controlled vocabularies and ontologies have evolved into essential tools in large-scale high-throughput scientific data annotation and retrieval. They ensure data consistency and increase the efficiency and accuracy of queries by standardizing the wide variations in terminology that may exist in a particular field of study. Although human readers can usually cope with this variability, it can hamper systematic searches through large volumes of data. Consider, for example, the possible abbreviations, synonyms and acronyms for the yeast two-hybrid experimental technique: Y2H, two-hybrid, 2H, etc.
The Open Biomedical Ontologies project catalogues well-structured controlled vocabularies for shared use across different scientific domains. To date, ontologies exist to describe the anatomy, developmental processes, phenotypes and pathologies of several species, as well as those oriented towards experimental and physical properties. For example, the Gene Ontology (GO), one of the oldest and richest ontologies, provides consistent descriptions of gene products in different databases in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner [3, 4]. The Medical Subject Headings (MeSH) thesaurus is another commonly used ontology produced by the National Library of Medicine and used for indexing, cataloguing, and searching for biomedical and health-related information and documents [5, 6].
While such a plethora of information is available to the scientific community, the tools to make efficient use of it are less forthcoming. Individual projects provide code bases and database schemas that have controlled vocabulary sub-schemas where ontologies can be loaded (the chado schema from the Generic Model Organism Database (GMOD) project or the Genomics Unified Schema (GUS), for example). However, the ontology segment is only one part of a larger and more complex toolkit, so adopting it for ontology queries alone may bring more overhead than required.
Each major ontology tends to have its own online browser (see [6, 9, 10], among many others), yet there has been little effort to integrate these ontologies into a single point of query. One emerging project is the National Center for Biomedical Ontology, which will be responsible for maintaining the OBO library and creating biomedical data repositories and tools for accessing and using the data. The Unified Medical Language System is another initiative providing interactive and programmatic access to vocabularies, classifications and coding systems, though its focus is more oriented towards biomedical and clinical information sources and it requires a licensing agreement and registration.
The second version of the Distributed Annotation System protocol (DAS/2) [13, 14] proposes ontology queries using a standardized URL scheme and XML responses. It will allow DAS clients to retrieve information about ontologies and terms and perform basic queries. However, the DAS/2 specification is still being drafted, and the servers and clients that will implement it are still in development. One such server currently has only 20 ontologies available and requires an understanding of the DAS protocol to use.
The BioMOBY project is an interoperability system focusing on the integration of biological data and defines a protocol to link together distributed webservices to form workflows. It uses internal ontologies to explicitly define the data types and the relationships between them. Services are registered in a central repository that can be queried by users wishing to discover which services are available for specific data types. The BioMOBY ontologies are a means to define tool interoperability rather than being a data source. Ontology query services are provided by third parties who make them available via the MOBY Central registry. However, the currently available services tend to be limited to simple name queries, identifier queries, or queries that return complex data types annotated with a given ontology term identifier. The services available are also restricted to a single ontology at a time (such as GO, EVOC or PO), generally the one being used by the party who provides the service.
There are to our knowledge no programmatic interfaces to allow for automated querying and interactive browsing of all OBO ontologies from a single interface.
Such interfaces would be useful in the creation of graphical user interface (GUI) widgets that could be integrated in the development of new tools and promote the use of ontologies in a simple yet powerful manner. Users would be more inclined to make use of controlled vocabulary terms if such data were available in applications used to generate, annotate or query scientific data.
The database model was inspired by the relevant portion of the BioSQL database schema. Versions of the database schema currently exist for mySQL™ and Oracle™. Ontology loaders feed the database by parsing OBO-formatted flat-files and creating an object map that is persisted to the database using Apache ObjectRelationalBridge (OJB). All relevant information is extracted from the OBO file, including term accessions, names, synonyms, definitions, comments, relationships with other terms and cross-references with other ontologies and databases. The OLS does not do any curation on loaded ontologies, meaning that the data in the source flat-file are loaded faithfully. The OBO project maintains all of its ontologies in a CVS repository, making it easy to keep the database up-to-date. Updated files are obtained on a daily basis and any modified ontology is reloaded into the database. No loss of service is experienced during this process, as the old version of the ontology is kept alive until the new one is fully loaded. Once loaded, the new version is set live and the old one is deleted.
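To make the loading step more concrete, the following is a minimal sketch of how term stanzas might be pulled out of an OBO flat-file. It is written in Python for brevity and is not the OLS loader, which is implemented in Java and persists objects through OJB. The function name `parse_obo_terms`, the `EX:` accessions and the sample file are all hypothetical, and only a small subset of the OBO syntax is handled.

```python
def parse_obo_terms(text):
    """Collect [Term] stanzas from OBO flat-file text, keyed by accession."""
    terms = {}
    current = None  # stanza being filled, or None outside a [Term] stanza
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # New stanza: only [Term] stanzas are kept in this sketch.
            current = {} if line == "[Term]" else None
            continue
        if current is None or not line or line.startswith("!"):
            continue  # header lines, blank lines, comments, other stanzas
        tag, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if tag in ("is_a", "relationship"):
            # Drop the trailing "! human-readable name" comment.
            value = value.split("!", 1)[0].strip()
        if tag == "id":
            current["id"] = value
            terms[value] = current
        else:
            # Tags such as name, synonym or is_a may repeat, so keep lists.
            current.setdefault(tag, []).append(value)
    return terms


# Hypothetical two-term ontology used only for illustration.
SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: root term

[Term]
id: EX:0000002
name: child term
synonym: "kid term" EXACT []
is_a: EX:0000001 ! root term

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(SAMPLE)
print(sorted(terms))
print(terms["EX:0000002"]["is_a"])
```

Values are stored verbatim apart from the stripped relationship comments, which mirrors the no-curation policy described above: nothing in the source file is corrected or normalized on the way in.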
Once the ontology has been persisted, another process creates an Apache Lucene text index that is later used for case-insensitive full text queries. Terms are indexed on the preferred term name as well as on any annotated synonyms. Lucene has several advantages as a text-searching technology platform over RDBMS-based queries: it is very efficient at indexing and searching, it has a powerful search syntax that can be used to limit and refine queries, and it is platform independent, meaning that users do not need to rely on RDBMS-specific technologies to obtain good performance.
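The idea of indexing each term on its preferred name and its synonyms, and querying case-insensitively, can be illustrated with a small in-memory inverted index. This is a plain Python stand-in and not Lucene: it has none of Lucene's analysers, ranking or query syntax. The function names and the `EX:` accessions are hypothetical.

```python
from collections import defaultdict


def build_index(labels_by_accession):
    """Map each lower-cased token to the accessions whose labels contain it.

    `labels_by_accession` maps an accession to its preferred name plus synonyms.
    """
    index = defaultdict(set)
    for accession, labels in labels_by_accession.items():
        for label in labels:
            for token in label.lower().split():
                index[token].add(accession)
    return index


def search(index, query):
    """Return accessions that match every token of the query, ignoring case."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    hits = [index.get(token, set()) for token in tokens]
    return set.intersection(*hits)


# Hypothetical terms: preferred name first, synonyms after.
LABELS = {
    "EX:0000001": ["two hybrid", "Y2H", "yeast two-hybrid"],
    "EX:0000002": ["pull down"],
}

index = build_index(LABELS)
print(search(index, "y2h"))
print(search(index, "Two Hybrid"))
```

Because synonyms are indexed alongside the preferred name, a query for an abbreviation such as "Y2H" resolves to the same accession as the full name.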
Relationships between terms are colour-coded to quickly provide an additional level of information. The three most significant relationships that comprise close to 98% of the relationships loaded in the OLS ("is a", 72%, "part of", 25% and "develops from", less than 1%) have been highlighted. Though several ontologies have defined custom relationship types, their usage is limited overall. To keep the interface simple, these relationships are colour-coded as "others" but hovering the mouse cursor over these terms will display the relationship type in the browser.
Users can also browse a subset of the ontology. This can be done by clicking on the "browse" button from the main page after a term has been selected from the auto-completion selections or by clicking on the "zoom" button from the ontology browser. This will re-root the browser on the selected term.
Although it would have been possible to generate a complete, fully-browsable tree for small ontologies, this would rapidly become cumbersome and inefficient for large ontologies such as GO, which has in excess of 20,000 terms. Using AJAX methodology, the tree is instead built up gradually as the user browses the ontology.
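The incremental approach can be sketched as a tree node that asks for its children only when it is first expanded. The sketch below is Python rather than the browser-side code used by the OLS, and the class name `LazyTreeNode`, the `fetch_children` callback and the sample hierarchy are assumptions made for illustration. In the real viewer the callback would correspond to an asynchronous request to the server.

```python
class LazyTreeNode:
    """Tree node whose children are fetched on first expansion only."""

    def __init__(self, accession, fetch_children):
        self.accession = accession
        self._fetch_children = fetch_children  # callable: accession -> child accessions
        self._children = None  # None means "not requested yet"

    @property
    def loaded(self):
        return self._children is not None

    def expand(self):
        """Return child nodes, querying the backend only the first time."""
        if self._children is None:
            self._children = [
                LazyTreeNode(child, self._fetch_children)
                for child in self._fetch_children(self.accession)
            ]
        return self._children


# Hypothetical hierarchy standing in for the server-side database.
CHILDREN = {
    "EX:0000001": ["EX:0000002", "EX:0000003"],
    "EX:0000002": ["EX:0000004"],
}
requests = []


def fetch(accession):
    requests.append(accession)  # record each simulated round trip
    return CHILDREN.get(accession, [])


root = LazyTreeNode("EX:0000001", fetch)
root.expand()
root.expand()  # second call is served from the node itself
print(requests)
```

Re-rooting the browser on a selected term, as the "zoom" button does, amounts in this sketch to creating a new `LazyTreeNode` for that term and expanding from there.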
Programmatic access to the database is available through a SOAP webservice. The webservice is implemented in Java and deployed using Apache AXIS. Though the service makes internal use of the object model classes, only primitive data types are returned to help in platform interoperability. A server-side caching mechanism is implemented to store commonly accessed terms for increased performance. A sample Java client connection class is made available to download from SourceForge. The methods implemented in the webservice as well as detailed documentation of the webservice WSDL are available online at the OLS website. The OLS core API javadoc is also available online.
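As a rough illustration of what a request to such a service looks like on the wire, the sketch below builds a SOAP 1.1 envelope carrying only string parameters. It performs no network access. The operation name `getTermById`, the parameter names and the namespace `urn:example:ols` are placeholders chosen for this example; the actual operations, parameters and namespace are those defined in the WSDL published on the OLS website.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


def build_soap_request(operation, params, service_ns):
    """Serialize an operation call with primitive parameters as a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        # Only primitive values travel in the message; here everything is a string.
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_soap_call(xml_text):
    """Return (operation name, parameter dict) from an envelope built above."""
    root = ET.fromstring(xml_text)
    call = root.find(f"{{{SOAP_ENV}}}Body")[0]
    operation = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    return operation, {child.tag: child.text for child in call}


# Placeholder operation, parameters and namespace; consult the WSDL for real ones.
request = build_soap_request(
    "getTermById",
    {"termId": "EX:0000001", "ontologyName": "EX"},
    "urn:example:ols",
)
print(parse_soap_call(request))
```

Restricting messages to primitive types in this way is what allows clients in any language to consume the service without knowledge of the server's internal object model.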
To date, 42 ontologies have been loaded into the OLS database, which account for close to 135,000 terms. A complete list of ontologies loaded into the OLS can be found online. Currently, only ontologies available in the OBO flat-file format can be parsed into the OLS data model and persisted to the database. Future work will aim to create parsers for ontologies in the OWL format as well as other controlled vocabularies of biological interest, such as the NEWT taxonomy.
Having a centralized point of query has proven to be useful for multiple projects at the EBI. This work started off as a requirement of the PRIDE project, which makes significant use of controlled vocabularies to annotate proteomic data sets. Using AJAX to perform term auto-completion and definition lookups allows these components to be reused in other web applications. Since the transmitted data volume is quite low, the list of suggestions can be refreshed at a rate that closely matches the typing speed of most users. Work is currently underway to incorporate these widgets into the PRIDE and IntAct web interfaces at the EBI.
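A suggest-as-you-type lookup of this kind can be approximated by a case-insensitive prefix search over a sorted list of term names. The sketch below is a Python illustration under that assumption and does not reflect how the OLS server produces its suggestions; the function names and sample term names are hypothetical.

```python
from bisect import bisect_left


def build_suggester(names):
    """Return a function that lists names starting with a given prefix."""
    # Sort on the lower-cased name so lookups ignore case.
    keys = sorted((name.lower(), name) for name in set(names))

    def suggest(prefix, limit=10):
        wanted = prefix.lower()
        if not wanted:
            return []
        # Binary search for the first key that is >= the prefix.
        start = bisect_left(keys, (wanted,))
        matches = []
        for key, original in keys[start:]:
            if not key.startswith(wanted) or len(matches) == limit:
                break
            matches.append(original)
        return matches

    return suggest


# Hypothetical term names.
suggest = build_suggester(["Mitochondrion", "mitosis", "Membrane", "nucleus"])
print(suggest("MIT"))
```

Capping the number of suggestions with `limit` keeps each response small, which is consistent with the low data volume noted above.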
The programmatic SOAP interface is already being used by the PRIDE project to query the ontologies and obtain constantly updated terms while importing and exporting datasets. Work is also underway to use the SOAP interface in annotation and curation tools to edit and maintain the data in PRIDE.
The Ontology Lookup Service provides interactive and programmatic access to multiple ontologies, using lightweight and consistent interfaces. Users can perform simple queries using an interactive suggest-as-you-type form and browse ontologies in a clear tree-like browser. More sophisticated queries can be performed programmatically using a platform-independent SOAP interface. The service currently holds 42 ontologies covering fields such as anatomy, pathology, development, genomics, proteomics and experimental methods, among others. It is our hope that by providing generic, reusable code components, other projects in the bioinformatics community will make use of the Ontology Lookup Service. Future work aims to increase the number of ontologies available to the general public and to enrich the SOAP interface in response to user feedback. Users are encouraged to contact the authors to discuss feature requests for the interface. The data model contains more information than was needed for the initial release, and this additional information could be made available if requested. Finally, many biomedical ontologies are available in OWL format and we hope to have OWL loaders for the next major release of the OLS.
Project name: Ontology Lookup Service
Project home page: http://www.ebi.ac.uk/ontology-lookup/
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.4, Tomcat 5.0, mySQL or Oracle
License: Apache License 2.0
Any restrictions to use by non-academics: none
CVS: Concurrent Versioning System
DAS: Distributed Annotation System
GMOD: Generic Model Organism Database
GUI: Graphical User Interface
GUS: Genomics Unified Schema
MeSH: Medical Subject Headings
OBO: Open Biomedical Ontologies
OJB: Object Relational Bridge
OLS: Ontology Lookup Service
ORM: Object Relational Mapping
OWL: Web Ontology Language
RDBMS: Relational Database Management System
RFC: Request for Comments
SOAP: Simple Object Access Protocol
XML: Extensible Markup Language
OLS, as a subproject of the PRIDE project, is supported through BBSRC iSPIDER. RC would like to thank KG for making him move to Cambridge.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.