The conversion of relational database contents into the Semantic Web
The Semantic Web [1] is gaining momentum as a framework for the development of next-generation bioinformatics data integration tools, since its standards and technologies now seem mature enough to be considered a viable solution for data integration challenges. Semantic Web-based approaches to biomedical data integration have been proposed a number of times in recent years [2–6]. More recently, an approach known as Linked Data, or the Web of Data, has been explored.
The vision of the Semantic Web is to evolve the Web into a distributed knowledge base. This vision relies on the evolution from the current Web of Documents, where each node of the network is an unstructured document, into a Web of Data, where each node represents machine-processable information. In this context, access to information is achieved through portals and search engines whose behavior is supported by semantic features. A good introduction to the Web of Data can be found in [7].
A relevant contribution to this evolution of the Web may come from the conversion of data stored in Relational Databases (RDB) into a viable representation such as the Resource Description Framework (RDF) [8], the basic technology for representing information in the Semantic Web. RDF is based on the composition of simple statements ("triples") made of three elements: "Subject", "Predicate" (or "Property"), and "Object". Here, semantics can easily be associated with property definitions, while subjects are usually well-identified entities and objects may represent either related entities or values.
Much research has therefore focused on either the static conversion or the dynamic mapping of data from RDB to RDF, leading to the implementation of both mapping tools and domain-specific applications. Some mappings are automatically generated via a simple association in which the name of a relational table is mapped to an RDF class node and the names of its columns are used as RDF predicates. As a consequence, cell values are mapped to instances or data values. In this case, entities and relations, as well as their meaning, reflect the RDB schema, and knowledge of the schema is needed to understand the exported information.
In other mappings, the relations and entities of the original database are converted to a representation based instead on a shared conceptualization, which can differ, even significantly, from the schema of the database. Differences may relate to properties, relationships, and even entity values (e.g., different coding, split/merged values). In this case, automatic mappings can serve as a starting point for quickly creating customized, domain-specific mappings.
Relational-to-RDF mapping software exists both as independent tools (e.g., D2RQ and Triplify) and as parts of larger suites (e.g., AllegroGraph, Sesame, OWLIM, Virtuoso). In general, these are components of a wider range of software solutions that can expose RDF entities and relations in structured information resources. A list of these tools is available on-line [9].
In the biomedical domain, an exemplar resource is Bio2RDF [10], a system that allows integrated access to a vast number of biomedical databases through Semantic Web technologies, i.e., RDF for data representation and SPARQL (SPARQL Protocol and RDF Query Language) [11] for queries. To this aim, many databases have been converted to RDF by special scripts, called RDFizers, while some information systems that already offered a viable format and interface were directly linked to the system.
This conversion was based on a unified ontology that takes into account the properties used in information resources already available in RDF. Moreover, the system provided a unified URI schema, overcoming the heterogeneity of URIs provided by other systems. All major genomics, proteomics, network and pathway, and nomenclature databases were included in the system, as well as some clinical ones, e.g., Online Mendelian Inheritance in Man (OMIM), bibliographic ones, e.g., PubMed, and the Gene Ontology.
The Linked Open Data (LOD) initiative, a Community Project at World Wide Web Consortium (W3C), aims at extending "the Web with a data commons by publishing various open data sets as RDF on the Web and by setting RDF links between data items from different data sources" [12]. In this context, many biomedical databases have already been made available (a Linked Open Data cloud diagram is available on-line [13]). Many of these datasets derive from Bio2RDF, but there are also some that were independently built, e.g. Diseasome, a dataset extracted from OMIM that includes information on disorders and disease-related genes linked by known associations.
Human variation data and the Semantic Web
In the last decade, with the advent of high-throughput technologies, sequencing has become faster and less expensive. As a consequence, a great wealth of data is being produced to identify variation data, i.e., specific, individual and sub-population related information. One of the best-known projects of this kind is "1,000 Genomes", an international collaboration that recently ended its pilot phase [14, 15]. The goal of the pilot phase was the identification, by means of Next-Generation Sequencing technologies, of at least 95% of the variants present in at least 1% of individuals in three distinct populations. This led to the production of ca. 4.9 Tbases (about 3 Gbases/individual) and to the determination of 15 million mutations, 1 million deletions/insertions, and 20,000 variants of greater size.
Such information constitutes the basis on which genomics may meet clinical information, correlation analyses between genotypes and phenotypes may be carried out, and the perspectives of genomic or personalized medicine may be realized [16, 17].
Although several databases on human gene mutation and variation exist, their semantic annotation is very limited and their formats are heterogeneous. Overall, only a little information on human variation is included in the Web of Data or is available on-line through implementations based on Semantic Web technologies. This is the case, e.g., for the data on the impact of protein mutations on protein function that was extracted from the scientific literature using a specialized text-mining pipeline by Laurila et al. [18] (in this case, data is available on-line through a SPARQL endpoint, but access is restricted to authorized users only).
Lists of Locus Specific Data Bases (LSDB) and other databases related to human variation, like those related to Disease Centered Mutations, SNPs (Single Nucleotide Polymorphisms), National and Ethnic Mutations, Mitochondrial Mutations, and Chromosomal Variation, are available on-line at the site of the Human Genome Variation Society (HGVS) [19, 20], although many of these lists are not up-to-date. Indeed, the best human variation information is available in curated databases, many of which are managed by means of the Leiden Open Variation Database (LOVD) [21] schema and system. Many other databases are managed by proprietary systems. The Human Variome Project (HVP) [22] has produced recommendations for nomenclatures of variations and for contents of mutation databases.
The issue of integrating variation data with molecular biology databases is, however, well known. Conditions for the integration of LSDBs with other biological databases have been outlined by den Dunnen et al. in [23]. In that paper, a distinction is made between information that should be shared and information that could be shared. The former set defines only some reference data, including contact information for the database, identifiers of the gene in various databases, a unique reference to the sequence, and the description of the mutation at the DNA level. The latter set, which includes data on the original bibliography, changes at the protein and RNA levels, and associated pathogenicity, also raises issues related to data ownership and quality.
Shared property definitions for human variation data
Integrating data on the Semantic Web is mainly a matter of shared, reusable property definitions and unique data identifiers. Some mutation-related ontologies exist, including the Variation Ontology (VariO) [24] and the Mutation Impact Ontology (MIO) [25].
VariO is still in a development phase and has not been officially released for annotation or analysis purposes. It aims at providing standardized, systematic descriptions of the effects and consequences of position-specific variations, which can be described at different levels (DNA, RNA, protein). VariO reuses some terms and definitions from the Sequence Ontology (SO), the Gene Ontology (GO), and other ontologies.
MIO was developed to support semantic extraction and grounding of mutation impact data from literature. The ontology has a strong use case in the publication of text mining results through semantic Web Services [18] in the framework of the Semantic Automated Discovery and Integration (SADI) [26, 27] infrastructure.
Other biological ontologies making reference to mutations also exist, such as the Sequence Ontology (for a recent assessment of the state of, and issues in, incorporating mutation information in SO, see [28]). However, a specific ontology able to represent, or support the representation of, gene variation data is not yet available.
Even more relevant, a shared framework for identifying variations is missing. The HGVS nomenclature defines mutations relative to a specific version of RefSeq, which makes reconciling mutations described against different RefSeq versions problematic. In a LOD framework this is a key issue, since having common URIs for the same mutations is essential for the integration of different datasets.
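The version-dependence problem can be made concrete with a small sketch of URI minting from HGVS-style descriptions; the base URI and accession values are hypothetical, and URI-escaping of special characters such as ">" is ignored for brevity:

```python
# Sketch of minting mutation URIs from HGVS-style descriptions.
# Because HGVS descriptions are relative to a specific RefSeq version,
# the same biological mutation yields different URIs under different
# reference versions. Base URI and accessions are hypothetical;
# URI-escaping of characters like ">" is omitted for brevity.
BASE = "http://example.org/mutation/"

def mutation_uri(refseq_accession, refseq_version, hgvs_description):
    return f"{BASE}{refseq_accession}.{refseq_version}:{hgvs_description}"

# The same cDNA change described against two RefSeq versions:
uri_v1 = mutation_uri("NM_000546", 5, "c.215C>G")
uri_v2 = mutation_uri("NM_000546", 6, "c.215C>G")
# The URIs differ, so a LOD consumer cannot see they may denote
# the same mutation without extra reconciliation.
```

Under this naive scheme, datasets built against different RefSeq versions would never share mutation URIs, which is exactly the integration obstacle discussed above.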
Furthermore, the definition of equivalent mutations relies on an abstraction based on sequence similarity. As such, it is not easily derived by the common inference mechanisms of Semantic Web technologies and tools (e.g., a cluster of sequences may de facto define a class characterized by the related consensus sequence).
Solutions that incorporate services in the LOD, e.g., SADI, may provide a unified framework in which ontology languages and sequence alignment services could be used to compute the equivalence of mutations.
Aim of this work
In this paper, we deal with issues related to the integration of mutation data into the Linked Open Data infrastructure. We present the development of a mapping from a relational version of the IARC TP53 Mutation database (IARCDB) to RDF that takes into account HGVS recommendations as well as existing ontologies for the representation of this domain knowledge. We also present a first implementation of servers publishing these data in RDF, aimed at studying issues related to the integration of mutation data into the Linked Open Data cloud.