GORouter: an RDF model for providing semantic query and inference services for Gene Ontology and its associations
© Xu et al. 2008
Published: 13 February 2008
Skip to main content
© Xu et al. 2008
Published: 13 February 2008
The most renowned biological ontology, Gene Ontology (GO) is widely used for annotations of genes and gene products of different organisms. However, there are shortcomings in the Resource Description Framework (RDF) data file provided by the GO consortium: 1) Lack of sufficient semantic relationships between pairs of terms coming from the three independent GO sub-ontologies, that limit the power to provide complex semantic queries and inference services based on it. 2) The term-centric view of GO annotation data and the fact that all information is stored in a single file. This makes attempts to retrieve GO annotations based on big volume datasets unmanageable. 3) No support of GOSlim.
We propose a RDF model, GORouter, which encodes heterogeneous original data in a uniform RDF format, creates additional ontology mappings between GO terms, and introduces a set of inference rulebases. Furthermore, we use the Oracle Network Data Model (NDM) as the native RDF data repository and the table function RDF_MATCH to seamlessly combine the result of RDF queries with traditional relational data. As a result, the scale of GORouter is minimized; information not directly involved in semantic inference is put into relational tables.
Our work demonstrates how to use multiple semantic web tools and techniques to provide a mixture of semantic query and inference solutions of GO and its associations. GORouter is licensed under Apache License Version 2.0, and is accessible via the website: http://www.scbit.org/gorouter/.
The currently preferred tool for uniform data presentation in systems biology, the syntactic and document orientated eXtensible Markup Language (XML), cannot satisfy the requirements of highly dynamic and integrated bioinformatics applications. However, Semantic Web http://www.w3.org/2001/sw provides a universal mechanism for information exchange by describing, in a machine-interpretable way, the content of resources on the Web. The growing need for integration of diverse and heterogeneous data sets from distinct communities of scientists in separate biological research fields has thus been the major driving force to migrate from traditional XML to Semantic Web .
Gene Ontology  (GO, http://www.geneontology.org) is by far the most widely used bio-ontology. As of August 2007, it contains approximately 23,700 terms, linked to a database of more than 16 million annotations of genes and gene products, originating from about 20 organisms. As a Semantic Web application domain, Gene Ontology Consortium provides a RDF-XML data file http://archive.geneontology.org/full/2007-08-01/go_200708-assocdb.rdf-xml.gz. It is an export of the database, containing both the GO vocabulary and associations between GO terms and gene products. However, this file has drawbacks, making it unsuitable for providing complex semantic query and inference services.
The first drawback is the lack of relationships between concepts among different GO subontologies, limiting the power of inference based on them. GO has three independent subontologies, Cellular Component, Biological Process and Molecular Function. The terms in the subontologies are structured as Directed Acyclic Graphs (DAG), and may have one or more parents with two types of relationships: 'is-a' is a simple class-subclass relationship, while 'part-of' represents a complex part-whole relationship. However, neither of them reflects the biological relationships among various subontologies. Several approaches, Lexical [4–8] and non-lexical [9–12], have been used to tackle this issue.
Lexical approaches are based on the fact that GO terms and definitions are themselves a type of semi-structured natural language. About 65% of all GO terms contain another GO term as a proper substring . For example, the MF mannosyltransferase activity (GO:0000030) shares a substring with the CC mannosyltransferase complex (GO:0031501). The Obol project proposed a formal language to provide computable definitions that serve to differentiate a term from other similar terms . Furthermore, Bada et al. designed 31 patterns to match term substrings to concepts and predicted an initial set of over 4000 associations . Lexical methods mainly focus on the analysis of the compositional nature of Ontology terms, which leads to an increase in the number of relationships. The same ideas also could be applied to identify the dependence among various domains of biological knowledge, such as the Open Biological Ontology (OBO) family, chemical entity (ChEBI), BRENDA Tissue ontologies, and so on .
Statistical approaches based on the assumption that since some pairs of terms coming from different GO subontologies are annotated to the same gene or gene product, the relationships should reflect an actual interdependence between them. By analyzing the statistics of co-occurrence of GO terms in the model organism annotation databases of the Gene Ontology Annotation (GOA), Bada et al. developed the Gene Ontology Annotation Tool (GOAT) . GOAT assists the Gene Ontology Next Generation (GONG) project  to convert GO Terms into a description-logic-based ontology (DAML+OIL). Similarly, Kumar mined the TIGR database to establish the corresponding patterns of association between terms in GO . Other non-lexical methods, such as computing similarity in vector space, association rule mining, ontologies analysis, have also be introduced to address this problem .
The second drawback is that the RDF-XML data file is organized with a term-centric view of GO annotation data. All information is stored in a single file. The loading, querying and visualizing of massive amounts of RDF datasets are the main bottleneck of semantic web prototype applications . Several semantic web tools, Sesame , Kowari , Jena2 , 3Store and RDFStore, have been developed and made available. Unfortunately, these repositories are not suitable for work with large amounts of data http://simile.mit.edu/reports/stores/.
On the other hand, the scale of semantic web datasets of the life sciences increases dramatically. Many communities, such as GO, UniProt, UMLS, OMIM, KEGG and MGED, have provided download services for data encoded in RDF or Web Ontology Language (OWL) format. Correspondingly, semantic web prototype tools have been developed to address life science and health care requirements. For example, BioDASH  provides a Drug Development Dashboard that associates disease, compounds, drug progression stages, molecular biology, and pathways for a group of users. The YeastHub  and Bio2RDF  projects explore how the needs for data integration can be addressed by the semantic web and how a life sciences data warehouse can be built. However, most of the semantic web prototype applications create an RDF repository using the computers' main memory to speed up performance. This solution poses a high demand on the application server and is unable to satisfy the need for rapid growth of semantic web applications.
The third drawback is the lack of support for GOSlim. GOSlims are cut-down versions of the GO ontologies containing a subset of all terms in GO. They are particularly useful for giving a summary of the results of GO annotations of genomes, microarrays or cDNA collections [20, 21]. However, GOSlim properties are not considered in RDF-XML data files.
In this paper, we present a RDF model GORouter, which mainly demonstrates how to use multiple semantic web tools and techniques to integrate heterogeneous resources and to create additional semantic relationships between different RDF datasets.
By introducing GLUE system  to create ontology mappings between pairs of terms coming from the three independent GO sub-ontologies, introducing a set of inference rulebases, and using the Oracle Network Data Model (NDM)  as the native RDF data repository, we believe that GORouter has the capability to allow complex semantic queries and inference services for GO and its associations.
GORouter is licensed under Apache License Version 2.0 and available for free download from the SourceForge website http://sourceforge.net/projects/gorouter. Based on GORouter, we provide an application http://www.scbit.org/gorouter/ for searching and browsing GO and its associations, and which also delivers additional functions such as semantic inference services.
In this section, we discuss some shortcomings of current algorithms for ontology mapping.
Firstly, finding associations using non-lexical and lexical approaches has little overlap . Myhre et al. attempt multiple strategies to bridge this gap . The GLUE system supports multiple learning strategies to generate join probability distribution. However, our project currently only employs an annotation statistics strategy. Integrating lexical learning strategies into the project will be the main focus of the next development phase.
Secondly, the GLUE system can currently not handle more sophisticated mappings (i.e. non one-to-one mapping) between GO terms. As an extended version of the GLUE system, CGLUE  can be used to exploit complex mappings.
Thirdly, the GLUE system only focuses on finding correspondences among the taxonomies of two given ontologies. Ontology specifies a conceptualization of a domain in terms of concepts, attributes and relations. The concepts provide model entities of interest in the domain, and they are typically organized into a taxonomy tree. Despite taxonomies being central components of ontologies, attributes and relations also need to be considered during the process of exploit mapping.
OWL builds on RDF and adds more vocabulary along with formal computational definitions for reasoning. Compared with RDF, OWL facilitates greater machine interpretability of Web content. The OWL format is becoming the next generation of bio-ontology representation [26–29]. Several ontology editors, such as OBO-Edit , Protégé-OWL  and COBrA , can be used to perform the translation and provide Description Logic reasoning.
We currently use Oracle 10gR2 NDM as RDF repository, which does not incorporate native OWL support. The next generation, Oracle Spatial 11g, will support both RDF and OWL data management . It is another important task for us to migrate GORouter from RDF to OWL format.
The GO project is a collaborative effort to address the need for consistent descriptions of gene products in various databases. However, some molecular functions, biological processes and cellular components are not common to all life forms. GO uses the designator sensu, 'in the sense of', to name those species-specific terms. For instance, BP invasive growth (sensu Saccharomyces) (GO:0001403) represents the invasive growth process of Saccharomyces cell, which can only be used to annotate genes and gene products of the Saccharomyces Genome Database (SGD). These species-specific terms violate the species-independent principle of the GO vocabulary.
From another point of view, one could call this phenomenon a semantically-weak problem: the GO vocabulary has no control over the semantic context of term names. We will address this problem by introducing the NCBI organism classification (TAXON) into GORouter. By separating species-specific terms from the GO vocabulary, we plan to create a set of special GO subsets, which can be applied to the specified class of organism. Furthermore, the TAXON vocabulary can also be used to identify the species encoding gene products. By introducing TAXON, we can create richer relations across various GOs and their annotations.
Similarly, we also plan to introduce Sequence Ontology  (SO), a sister project of GO, to describe features and attributes of gene sequences and gene products. In recent years, the development of bio-ontologies has been very rapid [35, 36]. As an essential part of OBO collection, GO development principles have been extended to many other biological domains and give an opportunity to introduce more ontology and annotations into GORouter to enrich the content of semantic relationships.
Gene Ontology is itself dynamic . The development of GO terms and annotations reflects the current status of biological knowledge. For instance, the GO consortium has partially completed the subsumption hierarchy (a set of high-level terms) for the cellular component ontology, and the project is expected to be completed in 2007. The Plant-Associated Microbe Gene Ontology (PAMGO, http://pamgo.vbi.vt.edu/) Interest Group introduced a new set of terms representing pathogenic and symbiotic processes. Alongside the continuous improvement of GO ontology content, increasing model organism databases and genome annotation groups contribute annotation sets using GO terms.
In summary, all these changes indicate that the content of GORouter needs to be correspondingly augmented, refined and reorganized. These requirements provide two challenges: one is to improve model flexibility and the other is to adapt performance to the continual increase in size. By using multiple semantic web technologies and tools, we believe that GORouter can overcome these problems.
Most of the original files come from the Gene Ontology Consortium, including MySQL relational data, the OBO format data of GOSlim, tab-delimited annotation files, and RDF XML format data with or without annotation. We encoded these heterogeneous resources in uniform RDF format, and created a set of RDF datasets (Reference YeastHub project). Each dataset consists of two RDF files, metadata and data.
(1) Symbol: is a standard gene product symbol.
(2) Synonyms: a RDF sequence container for storing the synonyms of genes and gene products.
(3) GOA: is a RDF omitting blank node with two sub-property elements: go and evidence, which indicates the GO Annotation. A gene product may have more than one annotation.
(4) GO: is a LSID, which refers to an accession number of GO term.
(5) Evidence: is a RDF omitting blank node with two sub-property elements: ec and reference, which refers to the evidence supporting the annotation. For a given annotation, more than one evidence may be associated with it. In GORouter, we only focus on credible evidence, such as Inferred by Curator (IC), Inferred from Direct Assay (IDA), Traceable Author Statement (TAS), and so on.
(6) EC: indicates the evidence code for the annotation.
(7) Reference: is a reference cited to support the annotation.
Each metadata RDF file has a data RDF file (Figure 1B) associated with it. We assign only one unique Life Science Identifier  (LSID) to each URL of data RDF files. Currently, only few databases provide LSIDs for their data. Therefore, we decided to assign these identifiers ourselves. Each LSID consists of up to five parts (URN:LSID:Authority:Namespace:Object: [Revision-ID]), in which URN:LSID is a mandatory prefix; Authority is the Internet domain of the organization which assigns the LSID to the resource; Namespace constrains the scope of the object; Object is an alpha-numeric describing the object; Revision-ID is the optional version of the object. For an example, there is a CGD (Candida Genome Database) gene whose database accession number is 'CAL0000849'. Thus, the LSID will be written as: 'urn:lsid:lifecenter.scbit.org:cgd:CAL0000849:1' or as a simpler style: 'urn:lsid:lifecenter.scbit.org:cgd:CAL0000849'.
Given two ontologies O 1 and O 2, for each term A (A ∈ O 1), the ontology mapping algorithms attempt to find the most similar term B (B ∈ O 2). We describe this mapping as "A mapping-to B". Nowadays, there are over 23,700 GO terms, including approximately 7,800 Molecular Function terms, 2,000 Cellular Component terms and 13,900 Biological Process terms. Manual GO subontology mapping is not reliable, and it is therefore crucial to use algorithms and computational tools to assist experts to generate these mappings.
Similarly, we can estimate the value of P(A, ) and P( , B), and calculate the Jaccard Similarity between the term A and B. The output of the Similarity Estimator module is a similarity matrix of any pair of terms in the two taxonomies.
For quality consideration, those annotations without credible evidence, such as Inferred from Electronic Annotation (IEA), Non-traceable Author Statement (NAS), No biological Data available (ND), and Not Recorded (NR), are not be included. 512,721 annotations were used for the eventual construction of the similarity matrix.
To improve the match accuracy, the GLUE system uses a Relaxation Labeler, which searches for the match configuration that best satisfies the given domain constraints and heuristic knowledge. The key idea behind this approach is that the label of a node is typically influenced by the features of the node's neighborhood in the graph. For instance, if there are mappings between all children nodes of MF telomerase activity (GO:0003720) and CC telomerase catalytic core complex (GO:0000333), then the chance of "MF telomerase activity mapping-to CC telomerase catalytic core complex " will be increased. Two domain constraints were introduced into our project. One is that "If term A matches term B, then A also matches all parents of B" and the other is that "If all children of term A match term B then A also matches B". Based on the GLUE report, when the relaxation labeler was applied, the accuracy typically improved substantially in the first few iterations, and then gradually dropped. Because of this, we stopped the Relaxation Labeler operation after the first two iterations and generated a set of match candidates.
The distribution of one-to-one mappings.
By introducing a set of inference rulebases, the GORouter will be able to provide semantic inference services. In addition to the two internal RDF and RDFS rulebases, the Oracle NDM also supports user-defined rulebases and uses them in specialized inferences across various RDF datasets.
In the rulebases, each rule consists of three parts: an IF side pattern as the antecedents; an optional filter condition that further restricts the subgraphs matched by the IF side pattern; and a THEN side pattern for the consequents. To simplify the expression, we use the "→" character to separate the IF side pattern from the THEN side pattern, while optional filter conditions are omitted.
Compared with the single term-centric XML-RDF data file, the RDF datasets are organized with a three-tier framework: 1) Core Tier: consists of 3 GO Subontology Datasets and 5 GOSlim Datasets (including GOSlim_Generic, GOSlim_GOA, GOSlim_Plant, GOSlim_Prokaryotic and GOSlim_Yeast). 2) Mapping Tier: consists of the 6 Ontology Mapping Datasets generated by the GLUE system. 3) Annotation Tier: consists of 17 GO Annotation Datasets; filtering of the annotation files is provided by GO collaborating groups.
Refining the set of mapping types simplifies the search statements. In the GORouter, we normalized the definition of relationships between the RDF datasets. Furthermore, when creating mappings, we used more restricted domain constraints. Hence, these mappings enrich the relationships and have the ability to provide complex semantic query and inference services.
A variety of applications that provide visualization and query capabilities for the GO are available. For example, the AmiGO http://www.godatabase.org/cgi-bin/amigo/go.cgi, GoFish  and EP http://ep.ebi.ac.uk/EP/GO/ browsers all use web interfaces to implement searching and displaying the ontology, term definitions and associated annotated gene products for the entire spectrum of contributing GO collaborating databases. Apart from the basic functions, however, there are profound differences between the various applications. For instance, GoFish provides Boolean queries of combinations of GO attributes, and the EP GO Browser provides clustering, analysis and visualization services. Unfortunately, although many applications use the GO subontologies or the gene associations, as well as similar development architectures, so far their integration has been problematic .
Stein et al., have suggested using two technologies, ontology and globally unique identifiers for the integration of biological databases. In constructing GORouter, we have followed this suggestion. We believe that this RDF model can partially overcome the problems described above, thus promoting information sharing and exchange among different research domains. Based on GORouter, we developed a prototype application to provide semantic query and inference services.
The loading and querying performance analysis of three semantic web prototype applications.
Dual processors of 1.66 GHz, 2 GB RAM
Oracle 10g NDM
Disk + Memory
Dual processors of 1.8 GHz, 16 GB RAM
Dual processors of 2 GHz, 2 GB RAM
At present, the GORouter is about 210 MB (~5.5 million triple statements), including the essential annotations and their relationships. In comparison, the size of traditional relational data, such as GO term definition, gene product sequence, not creditable annotations, etc is over 4 GB. It took about 10 hours to convert and load these data into the Oracle database, most of which was spent in the initial loading of RDF datasets into the Oracle NDM repository.
We used a web server, running Red Hat Enterprise Linux AS release 3 (Taroon Update 2) with dual 1.66 GHz processors and 2 GB main memory. In order to attain better performance times, we created a set of indexes for RDF triples and in particular function-based indexes for RDF rulebases, adjusted the Java Virtual Memory heap size and Oracle SGA size, extended the size of temporary tablespace, and used the DBMS_STATS package to gather statistics about the physical storage characteristics of tables and indexes. As a result, the speed of semantic queries and inferences performed either on par with or slightly better than traditional relational queries.
Our example queries demonstrate how to use two types of inference rulebases to provide semantic query and inference services. In the following use cases, we attempt to show some improvement over the traditional GO query tools. To simplify RDF_MATCH search pattern across multiple RDF datasets, RDF rulebases and relational tables, we developed a set of APIs to translate user input from web form into rich SQL statement.
Molecular Function Subontology.
Biological Process Subontology.
Cellular Component Subontology.
The mapping dataset directed from MF to BP which relation is "be-involved-in".
The mapping dataset directed from BP to MF which relation with "involves".
The mapping dataset directed from MF to CC which relation is "be-performed-in".
The mapping dataset directed from CC to MF which relation is "performs".
The mapping dataset directed from BP to CC which relation is "takes-on".
The mapping dataset directed from CC to BP which relation is "undertakes".
We would like to thank H. Yu, K. Tu, and W. Wang for encouragement and helpful discussions. This work was supported by the '863' National High-Tech Programs (2006AA02Z343, 2006AA02A312), and the '973' National Basic Research Programs (2003CB715901).
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 1, 2008: Asia Pacific Bioinformatics Network (APBioNet) Sixth International Conference on Bioinformatics (InCoB2007). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.