ONTO-ToolKit: enabling bio-ontology engineering via Galaxy
© Antezana et al. 2010
Published: 21 December 2010
The biosciences increasingly face the challenge of integrating a wide variety of available data, information and knowledge in order to gain an understanding of biological systems. Data integration is supported by a diverse series of tools, but the lack of a consistent terminology to label these data still presents significant hurdles. As a consequence, much of the available biological data remains disconnected or worse: becomes misconnected. The need to address this terminology problem has spawned the building of a large number of bio-ontologies. OBOF, RDF and OWL are among the most used ontology formats to capture terms and relationships in the Life Sciences, opening the potential to use the Semantic Web to support data integration and further exploitation of integrated resources via automated retrieval and reasoning procedures.
We extended the Perl suite ONTO-PERL and functionally integrated it into the Galaxy platform. The resulting ONTO-ToolKit supports the analysis and handling of OBO-formatted ontologies via the Galaxy interface, and we demonstrated its functionality in different use cases that illustrate the flexibility to obtain sets of ontology terms that match specific search criteria.
ONTO-ToolKit is available as a tool suite for Galaxy. Galaxy not only provides a user-friendly interface allowing the interested biologist to manipulate OBO ontologies, it also opens up the possibility to perform further biological (and ontological) analyses by using other tools available within the Galaxy environment. Moreover, ONTO-ToolKit provides tools to translate OBO-formatted ontologies into Semantic Web formats such as RDF and OWL.
ONTO-ToolKit reaches out to researchers in the biosciences, by providing a user-friendly way to analyse and manipulate ontologies. This type of functionality will become increasingly important given the wealth of information that is becoming available based on ontologies.
Bio-ontologies are artefacts used to represent, build, store, and share knowledge about a biological domain by capturing the domain entities and their interrelationships. Bio-ontologies have become an important asset for the life sciences. They not only provide a controlled, standard terminology (to support annotations, for instance); a variety of tools are also available to exploit these ontologies, making them one of the cornerstones of biological data analysis. The Gene Ontology (GO) is probably the best known bio-ontology. One of the most common uses of the GO is to perform term enrichment [2, 3] on a gene set. The GO website lists over fifty such tools. In addition, the life sciences community began to utilise other available ontologies (such as the Plant Ontology) as well as to develop their own bio-ontologies to support other biology or technology domains. A recent example is the Ontology of Biomedical Investigations (OBI), a community effort to build an ontology describing the different elements of a biomedical investigation (e.g. protocols, instruments, reagents, experimentalists). The Open Biomedical Ontologies (OBO) foundry suggests a set of principles to guide the development of ontologies, for instance the ‘orthogonality principle’ designed to prevent overlapping ontologies. Most of the bio-ontologies gathered by the OBO foundry are represented in the OBO format, which has become the lingua franca for building bio-ontologies. An increasing number of bio-ontologies are being developed in the more expressive Web Ontology Language (OWL), which allows for advanced automated reasoning [9, 10]. Automated reasoning, performed on OWL-formatted ontologies by so-called reasoners (such as HermiT), allows bio-ontologists to perform various tasks such as classification (also known as subsumption), which makes explicit the relations that were hidden (i.e. implicitly captured), and in general helps to ensure the consistency of an ontology.
Several open source tools are available to deliver native support for bio-ontology manipulation (BioPerl, ONTO-PERL, BioRuby, BioPython). We have previously published ONTO-PERL, a suite of Perl tools supporting the management of ontologies represented in OBO format (OBOF). ONTO-PERL is a full-blown API to manipulate bio-ontologies in OBOF. It offers a set of scripts supporting the typical ontology manipulation tasks, which can be used from the command line. Useful as this API may be for bioinformaticians or expert ontologists, biologists may find it intimidating to use. To make working with ontologies easier for them, ontology portals have been set up [16, 17]. These applications can be directly linked to knowledge systems that store information in local infrastructures, thus taking advantage of the ontological scaffold (generally, hierarchical and partonomical relationships) through mappings between the ontology components (terms and relationships) and actual data. The linking of ontologies and biological data is proving to be a successful stepping stone towards ontology-based knowledge discovery platforms. Those platforms may eventually become important tools in the quest for new hypotheses that can drive experimental design.
To further improve the repertoire of tools available to biologists to handle and analyze the knowledge available through ontologies, we have turned to Galaxy, a web-based environment that integrates various types of tools to handle biological data. Galaxy’s development is strongly targeted towards end-users who have limited computational skills (including many molecular biologists), so that they may easily perform analyses or have their favourite command line tool integrated. Tracking of the analysis history, support for building workflows, and data sharing are among Galaxy’s most appealing features.
We used Galaxy to construct ONTO-ToolKit, which is an extension of the ONTO-PERL software that we developed previously. ONTO-PERL consists of a collection of Perl modules that enable the handling of OBO-formatted ontologies (like the Gene Ontology). With these modules a user can, for instance, manipulate ontology elements such as a Term, a Relationship and so forth, or employ scripts to carry out various typical tasks, such as format conversions between OBO and OWL (obo2owl, owl2obo).
ONTO-ToolKit makes the ONTO-PERL functionality available within the Galaxy environment. Galaxy not only provides a user-friendly interface to manipulate OBOF ontologies, it also offers the possibility to perform further biological (and ontological) analyses by using other tools provided within the Galaxy platform. In addition, ONTO-ToolKit provides tools to translate OBOF ontologies into Semantic Web formats such as RDF (Resource Description Framework) and OWL.
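To give a feel for what translating an OBOF term into RDF involves, the following is a minimal Python sketch. It is not the ONTO-PERL translator and does not reproduce its output: the base URI, the function names and the decision to map `name` to `rdfs:label` and `is_a` to `rdfs:subClassOf` are assumptions made for illustration, and the stanza uses invented placeholder IDs.

```python
# Sketch only: the base URI and the predicate choices below are assumptions
# for illustration, not the output of obo2owl or any ONTO-PERL translator.
EXAMPLE_BASE = "http://example.org/obo/"  # hypothetical namespace
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def term_uri(term_id):
    """Map an OBO-style ID such as 'EX:0000002' to a URI under EXAMPLE_BASE."""
    return EXAMPLE_BASE + term_id.replace(":", "_")


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def stanza_to_ntriples(stanza):
    """Translate the id, name and is_a tags of one [Term] stanza."""
    term_id, name, parents = None, None, []
    for raw in stanza.splitlines():
        line = raw.strip()
        if line.startswith("id: "):
            term_id = line[len("id: "):].strip()
        elif line.startswith("name: "):
            name = line[len("name: "):].strip()
        elif line.startswith("is_a: "):
            # Drop the trailing '! parent name' comment, keep only the ID.
            parents.append(line[len("is_a: "):].split("!")[0].strip())
    if term_id is None:
        raise ValueError("stanza has no id tag")
    subject = "<" + term_uri(term_id) + ">"
    triples = []
    if name is not None:
        triples.append('%s <%slabel> "%s" .' % (subject, RDFS, escape_literal(name)))
    for parent in parents:
        triples.append("%s <%ssubClassOf> <%s> ." % (subject, RDFS, term_uri(parent)))
    return triples


# Invented stanza with placeholder IDs (not taken from a real ontology).
example_stanza = """[Term]
id: EX:0000002
name: example child term
is_a: EX:0000001 ! example root term
"""

for triple in stanza_to_ntriples(example_stanza):
    print(triple)
```

A real translator also has to handle definitions, synonyms, cross-references and the other relationship types, which this sketch leaves out.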
Examples of ONTO-PERL functionalities
Collects the ancestor terms (list of IDs) from a given term (existing ID) in the given OBO ontology.
Collects the child terms (list of term IDs and their names) from a given term (existing ID) in the given OBO ontology.
Collects the descendent terms (list of IDs) from a given term (existing ID) in the given OBO ontology.
Extracts a sub-ontology (in OBO format) of a given ontology having the given term ID as the root.
Finds all the obsolete terms in a given ontology.
Collects the parent terms (list of term IDs and their names) from a given term (existing ID) in the given OBO ontology.
Finds all the relationship types in a given ontology.
Finds all the root terms in a given ontology.
Finds all the synonyms of a given term name in an ontology.
Finds all the terms in a given ontology.
Finds all the terms in a given ontology that have a given string in their names.
OBO to OWL translator.
OBO to RDF translator.
Trims a given branch of an OBO ontology.
Converts an ontology into another one which could be integrated into CCO.
OBOF to RDF translator. The resulting file has the (full) transitive closure.
OBO to XML translator (CCO scheme).
Gene Ontology (in OBO) to OWL translator.
Generates a simple RDF graph from a given GOA file.
OWL to OBO translator.
Obsolete terms vs. their definitions
Obsolete terms vs. their names
Gets the term IDs and term definitions of a given ontology.
Gets the term IDs and term names of a given ontology.
Gets the term IDs and their namespaces in a given ontology.
Collects common OBO terms from a given set of lists containing OBO terms.
Provides an intersection of the given ontologies (in OBO format).
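Several of the functions listed above (ancestor, descendant and root term retrieval) amount to graph traversals over the term hierarchy. The Python sketch below illustrates the idea on a tiny invented ontology. It is not the ONTO-PERL code: the function names are hypothetical, only `is_a` links between `[Term]` stanzas are followed, and the IDs are placeholders.

```python
from collections import defaultdict


def parse_is_a(obo_text):
    """Return {term_id: set(parent_ids)} from the [Term] stanzas of OBO text."""
    parents = {}
    current = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[len("id: "):].strip()
            parents.setdefault(current, set())
        elif in_term and current and line.startswith("is_a: "):
            parents[current].add(line[len("is_a: "):].split("!")[0].strip())
    return parents


def _reachable(start, neighbours):
    """Collect every node reachable from start through the neighbours map."""
    seen = set()
    stack = list(neighbours.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbours.get(node, ()))
    return seen


def ancestor_terms(parents, term_id):
    """All terms reachable by following is_a links upwards."""
    return _reachable(term_id, parents)


def descendant_terms(parents, term_id):
    """All terms that reach term_id by following is_a links upwards."""
    children = defaultdict(set)
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(child)
    return _reachable(term_id, children)


def root_terms(parents):
    """Terms without any is_a parent."""
    return sorted(t for t, term_parents in parents.items() if not term_parents)


# Invented ontology with placeholder IDs, for illustration only.
EXAMPLE_OBO = """[Term]
id: EX:0000001
name: root

[Term]
id: EX:0000002
name: child
is_a: EX:0000001 ! root

[Term]
id: EX:0000003
name: grandchild
is_a: EX:0000002 ! child

[Term]
id: EX:0000004
name: other child
is_a: EX:0000001 ! root
"""

hierarchy = parse_is_a(EXAMPLE_OBO)
print(sorted(ancestor_terms(hierarchy, "EX:0000003")))
print(sorted(descendant_terms(hierarchy, "EX:0000001")))
print(root_terms(hierarchy))
```

Extracting a sub-ontology rooted at a term can be built on the same traversal: keep the chosen term plus its descendants and drop every other stanza.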
We illustrate the use of ONTO-ToolKit through three ontology-analysis use cases. In use case I we analysed the relationship between terms from the Cell Cycle Ontology (CCO), an application ontology that we described previously. In use case II we carried out an analysis combining ONTO-ToolKit functionality with other tools available in Galaxy, and in use case III we demonstrated how a workflow was built to analyse gene sets with GO and S. pombe annotations.
Use case II illustrates how ONTO-ToolKit can be used in combination with other functionalities available in Galaxy. A user might be interested in identifying the functional relatedness of two proteins, as described by their GO annotations. To assess this, the two lists of GO terms associated with the two proteins need to be retrieved and then matched to determine their intersection. The example uses the H. sapiens proteins JUN (UniProt ID: P05412) and FOS (UniProt ID: P01100). Their UniProt IDs were used to query the BioMart central server from within Galaxy to retrieve lists of JUN and FOS GO terms and annotations (see Additional file 1). In the second step, the ONTO-ToolKit function get_list_intersection_from was used to obtain all the annotations shared between JUN and FOS (see Additional file 1). The results show the four GO terms (GO:0010843, GO:0070412, GO:0060395, GO:0007179) common to these two transcription factors.
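The intersection step of this use case can be sketched in a few lines of Python. This is an illustration of the idea, not the get_list_intersection_from implementation; the function name `list_intersection` is hypothetical. The four shared GO IDs are those reported above, whereas the `PLACEHOLDER` entries merely stand in for the protein-specific annotations, which are not reproduced here.

```python
def list_intersection(*term_lists):
    """Return the IDs present in every list, in the order of the first list."""
    if not term_lists:
        return []
    common = set(term_lists[0]).intersection(*term_lists[1:])
    result = []
    seen = set()
    for term in term_lists[0]:
        if term in common and term not in seen:  # skip duplicates
            seen.add(term)
            result.append(term)
    return result


# The shared IDs come from the text; the PLACEHOLDER entries are invented
# stand-ins for annotations that only one of the two proteins carries.
jun_terms = ["GO:0010843", "GO:0070412", "GO:PLACEHOLDER_JUN",
             "GO:0060395", "GO:0007179"]
fos_terms = ["GO:0007179", "GO:PLACEHOLDER_FOS", "GO:0060395",
             "GO:0070412", "GO:0010843"]

print(list_intersection(jun_terms, fos_terms))
```

Using sets for the membership test keeps the comparison linear in the size of the lists, while iterating over the first list preserves a predictable output order.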
This workflow starts by fetching an ontology and a set of gene associations, in this case the Gene Ontology and the S. pombe annotations. The next step is to use the get_descendant_terms function (the converse of the get_ancestor_terms function described above) to extract a subset of the ontology (in this case, it is configured to extract all descendants of the term “cell cycle”). An annotation mapping function is then used to get all annotations corresponding to this sub-ontology. This cell cycle specific annotation file is fed into the GO TermFinder enrichment tool, along with a user-supplied gene set. This workflow can be reused multiple times (for example, to re-check results with the latest ontology and annotations), and can be shared between Galaxy users.
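The last two steps of this workflow (restricting annotations to the sub-ontology, then testing a gene set for enrichment) can be sketched as follows. This is not the code of GO TermFinder or of the Galaxy tools: the function names are hypothetical, the hypergeometric test is used as a common choice for term enrichment, and the genes and terms are invented placeholders.

```python
from math import comb  # requires Python 3.8 or later


def filter_annotations(annotations, subset_terms):
    """Keep the (gene, term) pairs whose term belongs to the ontology subset."""
    return [(gene, term) for gene, term in annotations if term in subset_terms]


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def term_enrichment(annotations, gene_set, population):
    """Return {term: p-value} for every term annotated in the population."""
    population = set(population)
    study = set(gene_set) & population
    genes_by_term = {}
    for gene, term in annotations:
        if gene in population:
            genes_by_term.setdefault(term, set()).add(gene)
    return {
        term: hypergeom_pvalue(len(genes & study), len(study),
                               len(genes), len(population))
        for term, genes in genes_by_term.items()
    }


# Invented data: ten genes, one term inside the subset and one outside it.
population = ["g%d" % i for i in range(1, 11)]
annotations = [("g1", "T_IN"), ("g2", "T_IN"), ("g3", "T_IN"),
               ("g1", "T_OUT"), ("g5", "T_OUT")]
subset_terms = {"T_IN"}  # e.g. the descendants of a chosen term
gene_set = ["g1", "g2", "g4"]

subset_annotations = filter_annotations(annotations, subset_terms)
print(term_enrichment(subset_annotations, gene_set, population))
```

Because the annotations are filtered before the test, terms outside the chosen branch never enter the enrichment calculation.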
A coherent integration of public, online information resources is still a major bottleneck in the post-genomic era. Bioinformatics databases are especially difficult to integrate because they are often complex, highly heterogeneous, dispersed and incessantly evolving [22–24]. Moreover, consensus naming conventions and uniform data standards are often lacking. Nevertheless, the need for efficient procedures to integrate data is only increasing, due to the growing popularity of integrative biology and systems biology: approaches that need a variety of data from multiple sources to build computational models in order to understand biological systems behaviour.
Bio-ontologies can greatly facilitate this integration process because they provide a scaffold that allows computers to automate parts or the whole of the integration process. Setting up an integrative platform that can support an advanced data analysis based on bio-ontologies typically requires the establishment of an environment that enables access both to the many public biological databases that contain curated information, and to the various bio-ontologies. Moreover, such an integrative environment must enable the sharing of the information at any time with all contributors to the data curation process. In addition to curated databases, vast amounts of literature-independent data are being generated by high-throughput genome-wide analyses and accumulated in various databases. These databases represent another resource of context to infer biological function and to assess relations between biological entities. To obtain a powerful structuring and synthesis of all available biological knowledge it is essential to build an efficient information retrieval and management system. This system requires an extensive combination of data extraction methods, data format conversions, ontology-based analysis support and a variety of information sources. Ultimately, such an integrated and structured knowledge base may facilitate the use of computational reasoning for analysis of biological systems, an approach that we have named Semantic Systems Biology.
ONTO-ToolKit offers functionality that allows a biologist to exploit the increasingly abundant information supported by ontologies. The Gene Ontology Consortium is participating in the development of ONTO-ToolKit as an integration platform for performing many GO based workflows, replacing existing functionality in AmiGO and expanding the range of tools to be used. For example, it is possible to extract all experimental annotations for the clade Mammalia, generate a slim (subset) from this set, or to fetch all annotations belonging to a pre-defined ontology subset. Annotations extracted in this way can also be used in term enrichment analyses using GO TermFinder. Term enrichment analysis on ontology subsets reduces the number of terms that are considered for the overrepresentation analysis, making the analysis more sensitive.
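The gain in sensitivity can be illustrated with a simple multiple-testing argument. The worked example below assumes a Bonferroni correction and uses made-up numbers; the text does not state which correction is applied in practice, so this is an illustration of the principle rather than a description of the tool.

```latex
% Bonferroni-adjusted p-value for one term when m terms are tested
p_{\mathrm{adj}} = \min(1,\; m \cdot p)

% Illustrative numbers: the same raw p-value p = 10^{-4}
% tested against a whole ontology (m = 5000) or a subset (m = 100)
m = 5000:\quad p_{\mathrm{adj}} = 5000 \times 10^{-4} = 0.5
m = 100:\quad  p_{\mathrm{adj}} = 100 \times 10^{-4} = 0.01
```

At a significance threshold of 0.05, the term would be reported only in the second case, because fewer terms are tested.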
Platforms such as Galaxy aim to overcome the barriers in global data processing, and their flexibility offers ample opportunity to identify and implement new ways to fill the gaps in data visualisation and analysis. We have explored Galaxy’s use to implement data analysis techniques based on bio-ontologies. Bioinformatics data resources are constantly updated, i.e. by automated, software-mediated annotation or manual curation processes that depend on human intervention. Ontologies provide a means of improving the annotation process and of semantically representing the knowledge contained in biological databases in an unambiguous way. ONTO-ToolKit builds on this trend by enabling the manipulation of bio-ontologies within an integrative platform, which in turn allows analysis results to become the entry-point for further biological data analysis.
We presented several use cases to illustrate how the functionality of ONTO-PERL can be combined with the functionality of other tools in Galaxy. We have shown how the functionality of ONTO-PERL can be used to identify all the ancestor terms of a pair of ontology terms, or to simply retrieve all the terms shared by two proteins in order to assess their potential biological relatedness. We have extended and used ONTO-ToolKit to build a workflow to dynamically extract a subset of GO, map annotations to this subset, and then perform term enrichment analysis. With this we have shown that ONTO-ToolKit constitutes a useful extension to the functionalities available in Galaxy, by adding a variety of ontology-based analysis approaches that can improve the depth of the overall analysis because it builds on an increasing wealth of annotation and curation results.
ONTO-ToolKit can be obtained from its project page or from the Galaxy Tool Shed. ONTO-ToolKit is distributed under an Open Source License: GNU General Public License. ONTO-ToolKit provides access to the latest obo2owl conversion code that implements the new proposed OBO Foundry mapping to OWL. Once the ontology is converted to OWL, there are a number of OWL processing tools available, including Pellet, and ontology processing via the Thea library. ONTO-ToolKit, including the workflow example mentioned in use case III, is also available on-line.
OBO: Open Biomedical Ontologies
OBOF: Open Biomedical Ontologies Format
OBI: Ontology of Biomedical Investigations
CCO: Cell Cycle Ontology
RDF: Resource Description Framework
OWL: Web Ontology Language
XML: eXtensible Markup Language
We thank the Galaxy developers for the support they provided while implementing ONTO-ToolKit, and two anonymous reviewers who greatly helped to improve our manuscript.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 12, 2010: Proceedings of the 11th Annual Bioinformatics Open Source Conference (BOSC) 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S12.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.