Transparent mediation-based access to multiple yeast data sources using an ontology driven interface
© Briache et al. 2011
Published: 25 January 2012
Skip to main content
© Briache et al. 2011
Published: 25 January 2012
Saccharomyces cerevisiae is recognized as a model system representing a simple eukaryote whose genome can be easily manipulated. Information solicited by scientists on its biological entities (Proteins, Genes, RNAs...) is scattered within several data sources like SGD, Yeastract, CYGD-MIPS, BioGrid, PhosphoGrid, etc. Because of the heterogeneity of these sources, querying them separately and then manually combining the returned results is a complex and time-consuming task for biologists most of whom are not bioinformatics expert. It also reduces and limits the use that can be made on the available data.
To provide transparent and simultaneous access to yeast sources, we have developed YeastMed: an XML and mediator-based system. In this paper, we present our approach in developing this system which takes advantage of SB-KOM to perform the query transformation needed and a set of Data Services to reach the integrated data sources. The system is composed of a set of modules that depend heavily on XML and Semantic Web technologies. User queries are expressed in terms of a domain ontology through a simple form-based web interface.
YeastMed is the first mediation-based system specific for integrating yeast data sources. It was conceived mainly to help biologists to find simultaneously relevant data from multiple data sources. It has a biologist-friendly interface easy to use. The system is available at http://www.khaos.uma.es/yeastmed/.
The yeast Saccharomyces cerevisiae, known popularly as bakers' or brewers' yeast, has been used extensively in aging research. It is a unicellular organism whose DNA is packaged into chromosomes that are localized in a subcellular structure called the nucleus. Since 1990, it has emerged as an important model organism for the dissection of the biological aging process at the genetic and molecular levels. Saccharomyces cerevisiae was the first eukaryotic genome that was completely sequenced .
Nowadays, the word yeast is widely given to the species Saccharomyces cerevisiae because of the place it occupies in biological research. Large amounts of data related to it are genereted by Life Science and stored in multiple databases. Biologists are brought systematically to query these sources in order to analyse the results of their experiments. They usually perform the following tasks during query formulation and execution: (i) look for appropriate sources where it is possible to find helpful data and specify their location, (ii) identify the focus of each source, (iii) query each convenient source independently using its specific access method and query language, (iv) navigate through the sources to obtain complementary data, and (vi) manuallymerge the results obtained from different sources. This is a tedious and time-consuming task for biologists, most of whom are not bioinformatics experts, and reduces the advantage that can be took of the available information.
The challenges of modern bioinformatics research is not only storing data in repositories, but also processing and integrating them. Multiple solutions to biological data integration have been developed. Researchers have come up with some approaches that integrate diverse biological data sources. Two common approaches have being used to interoperate biological databases: data warehousing (or materialised) approach  and federated/mediator-based (or virtual) approach .
The data warehousing approach is adopted by numerous biological integration systems like GUS , Atlas , BioSQL , BioMart , BioWarehouse , and chado . This approach uses a data warehouse repository that provides a single access point to a collection of data, obtained from a set of distributed, heterogeneous sources. Data from the remote heterogeneous databases are copied on a local server and the user will use a unique interface within the system to allow multi-database queries to be issued to this single interface. Data warehousing requires the use of Extraction, Transformation and Load (ETL)  tools to load data, and map it to a materialised global schema. In fact, warehousing requires that all the data loaded from the sources be converted through data mapping to standard unique format before it is physically stored locally. Relying less on the network to access the data clearly helps to eliminate various problems such as network bottlenecks, low response times, and the occasional unavailability of sources. Furthermore, using materialised warehouses allows for an improved efficiency of query optimisation as it can be performed locally [11, 12]. Another benefit in the data warehouse integration approach is that it allows the system to filter, validate, modify, and annotate the data obtained from the heterogenous sources and this has been noted as a very attractive property for bioinformatics. This approach however has an important and costly drawback in terms of reliability of results and overall system maintenance caused by the possibility of returning outdated results. Warehouse integration must indeed regularly check all the underlying sources for new or updated data and then reflect those modifications on the local copy of the data .
Virtual integration (mainly mediator-based systems) concentrates on query rewriting: It rewrites the user query, into queries that are understood by the integrated sources. The mediator uses the relationships between sources and a global schema to translate queries on the mediator schema to the data source schemata. The two main approaches for establishing the mappings between each source schema and the integration schema are global-as-view (GAV) and local-as-view (LAV) [11, 13]. In the GAV approach the mediator relations are directly written in terms of the source relations. The GAV approach greatly facilitates query reformulation as it simply becomes a view unfolding process. In LAV approach every source relation is defined over the relations and the schema of the mediator. It is therefore up to the individual sources to provide a description of their schema in terms of the global schema, making it very simple to add or remove sources but also complicating the query reformultaion and processing role of mediator.
The mediator-based approach has several strengths compared to data warehouse. It does not have the updating problem as the query goes directly to the original source. Mediators can be seen as a cheaper and more effective approach since they use schema or view integration, rather than having to have huge storage capacity to store copied data from all the involved data sources.
This paper presents a mediator-based system called YeastMed  that aims to provide transparent access to disparate biological databases of yeast. It provides a unique interface between the user who submits a query, and a set of five data sources accessible via web protocols. YeastMed relies on SB-KOM  to perform the query transformation needed to reach the integrated data sources. These sources are: SGD , Yeastract , CYGD-MIPS , BioGrid  and PhosphoGrid . They provide complementary data on biological entities (cellular interaction, metabolic pathways, transcription factors, annotation data...). With YeastMed, we aim to help biologists to get relevant data to understand and explain the biological processes of interest by using an integrative system.
This paper is organised as follows: an overview on some biological data integration systems is given in the next section. Then, a general overview of the system and the resources used in YeastMed are given before to describe the integration process components along with some explanatory schemas. A detailed use case is then sketched describing how YeastMed proceeds when a user query is submitted. At the end, we discuss some advantages and limitations of the current version of YeastMed before to conclude the paper.
Works specific to the integration of yeast data sources are not abundant. However a variety of data integration systems especially tailored to cater for bioinformatics applications have been developed. These systems can broadly be classified as: data warehouses, federated/mediator-based systems and XML-based systems.
Several attempts have been made to create integrated environments for storing and analysing biological data. For the sake of brevity, we sketch here two of them due to their relation to yeast.
Cell Cycle Database  is an integrated data warehouse for systems biology modelling and cell cycle analysis based on yeast and mammalian organisms. The system integrates information about genes and proteins involved in the cell cycle process. It stores complete models of interaction networks and allows the mathematical simulation over time of the quantitative behaviour of each component. The database integration system consists of a series of programs used to retrieve the data from several different external databases, transform and load them into the warehouse data model.
YeastHub  is a prototype application in which a data warehouse has been constructed in order to store and query different types of yeast genome data provided by different resources in different formats including the tabular and RDF formats. Once the data are loaded into the data warehouse, RDF-based queries can be formulated to retrieve and query the data in an integrated fashion. YeastHub is implemented using Sesame 1.1 . The tabular-to-RDF conversion is written using Java.
These two systems present some limitations: they store the extracted data locally in a data warehouse or database which render the updating process a tedious task. YeastHub presents another problem where Sesame does not have a way to identify the source of the triples (statements) once they are loaded into the repository. In contrast, YeastMed accesses and interrogates data in its original data sources, and provides the user with the possibility to choose which data source entry to return. If the user doesn't make a choice, the system explicitly gives the provenance of the result entries.
Other alternative solutions have been proposed in biological data integration adopting a virtual approach. Among them we can cite:
Kleisli  is as a mediator system encompassing a nested relational data model, a high-level query language, and a powerful query optimiser. It runs on top of a large number of light-weight wrappers for accessing various data sources. The Kleisli system is highly extensible. It can be used to support several high-level query languages by replacing its high-level query language module. Kleisli supports the Collection Programming Language (CPL)  and a nested relational version of SQL. However Kleisli does not use any global schema or ontology over which a user can formulate queries. A query attribute is bound to a matched attribute in single source, so there is no integration across different sources.
DiscoveryLink  is a wrapper-oriented bioinformatics integration system built on the Garlic project technology . It serves as a middleware between the applications and a set of wrappers. Applications connect to DiscoveryLink and submit an SQL query on its global schema. The wrappers provide source-specific information about query capabilities that help the optimiser to determine which parts of a query can be submitted to each source. The query optimiser considers the speed of various sources, their network connections, and the size of their data to predicate the costs of different plans. DiscoveryLink, however, cannot deal with complex source data such as nested data. Most biological data, unfortunately, are highly nested. Therefore, there is a significant amount of mismatch between most data sources and DiscoveryLink. Furthermore, it is hard to add new data sources or analysis tools to DiscoveryLink. In addition, DiscoveryLink requires SQL as its query language, which is not easy for biologists to write.
TAMBIS  is a mediator-based and ontology-driven integration system, it has three layers: the conceptual model, the mapping model and the physical model. In TAMBIS, the formulation of queries is done through a graphical interface where user needs to browse through the different concepts defined in the global schema and select the suitable ones for particular query. As the first step, the system expresses the graphical query in GRAIL . Then, the query is translated into a Query Internal Form (QIF), which is in turn translated into a source-dependent query execution plan in CPL . The global ontology is a unified conceptual-level representation of its registered component resources. It provides a global schema as well as an abstract framework for relating, reconciling, and coordinating the concepts in the sources. The mapping model converts a query phrased in terms of the conceptual layer into executable plans in terms of each source. The physical model submits the executable plans to different sources and retrieves the results. Although TAMBIS is more of an upper level solution than other systems, but its graphical interface is very complicated and requires that a user understands the query language. BioMediator  is a federated data integration system based on XML. It uses a mediated schema which allows for more flexible data modelling. The central component of BioMediator system is its source knowledge base, which consists of descriptions of the various data sources, mappings from the source to the mediated schema, and the mediated schema itself. The system include also wrappers that conduct syntactic translations by translating the returned data results into an XML document, a metawrapper that conducts semantic translations by mapping the returned XML document onto the mediated schema, and a query processor that queries (using XQuery Language) against the mediated schema. BioMediator is thus dedicated to users who know the XQuery language and is not willing to be used by external research groups.
Compared to these systems, YeastMed is the first data integration system which adopts the mediation approach to integrate yeast-specific data sources. It has a domain ontology which plays the role of the global schema and supports the user queries. Unlike the systems cited above, YeastMed has an easy-to-use ontology driven interface where users express their requests in simple natural language. Users do not need to know a specific query language to use it. In addition, due to its modular design, YeastMed furnishes the possibility to add easily new data sources or analysis tools.
Despite the possibility to use standard approaches for data integration , specific approaches based on the employment of XML in Bioinformatics have been proposed:
Automed  is a heterogeneous data transformation and integration system which offers the capability of handling virtual, materialised and hybrid data transformation/integration across multiple data model. AutoMed uses XML DataSource Schema (XMLDSS) as a common representation language and schema type supporting the annotations for each source by suitable ontologies. An XMLDSS schema can be automatically extracted from an XML document or automatically derived from an accompanying DTD/XML schema if one is available.
The system approach is based on: (i) XML as a common representation format; (ii) XMLDSS as the schema type for the XML documents input to and output by services; (iii) Correspondence to available ontologies; i.e. the services inputs/outputs are annotated with correspondences between the XMLDSS schema and some existing ontologies; and (iv) AutoMed toolkit to automatically transform the XMLDSS schema to output of a given service to the XMLDSS schema of the input of another service.
SWAMI  defines a rich middleware architecture that integrate different databases, formats and computational resources. Its architecture design includes a Presentation layer that receives user requests, passes them to the Core workbench Application, and returns application results to user by the same route. The Core Application consists of four major components: The user module which receives data and instructions from the Presentation Layer. The Broker module which interacts with the others modules via APIs and serves as coordinator using a registry service that maintains information about all available services and databases. Then The Tool and Data modules, which are conceptually identical, abstract respectively applications and databases, and perform their functions by orchestrating a series of services. XML is used for the declarative specification of services.
YeastMed is a mediator-based system that consists of several components contributing to the data integration process in different ways. In this section we talk in detail about the process for creating the system by giving descriptions of its components and the role of each of them.
▪ A quick search form where scientists can quickly submit their requests based on some keywords (Gene or Protein names, GO terms or any other words that can appear in the search fields of the interrogated data sources). This type of search does not make use of the mediator. It exploits the YeastMed web services to look for information in the integrated data sources.
▪ An ontology-driven search form which allows biologists to express their requests in terms of the YeastMed Ontology. These terms are presented in natural language to ease the query formulation process for biologists most of whom are not familiar with knowledge representation and query languages.
YeastMed relies on SB-KOM  to perform query transformation at execution time. Once the user submits a request from the web interface, YeastMed generates a conjunctive query. SB-KOM decomposes this query into suitable sub-queries to individual sources based on a set of mapping rules. These sub-queries are expressed in XQuery, because the sources are accessed through web services using this query language.
YeastMed have a set of web services (Data Services for us): one for each integrated source. These components receive XQueries from SB-KOM and return XML documents. The role of the web services is to allow YeastMed to use wrapper functionalities to find and extract solicited information from data sources through their web pages or FTP mechanisms. Answers, materialised by XML documents, to XQueries are sent to the mediator which combines them into a YeastMed ontology instance expressed in RDF. The final result is provided for the user in HTML format. Data sources are also an important component in the YeastMed architecture because they are the providers of the biological information.
In its current version, YeastMed integrates five Yeast databases. They have been selected for having the most appropriate properties for studying Saccharomyces cerevisiae, because they provide complementary data concerning genome, proteome, metabolome and reactome. These sources are:
▪ SGD Database : It contains the sequences of yeast genes and proteins, descriptions and classifications of their biological roles, molecular functions, subcellular localisations, links to literature information and tools for analysis and comparison of sequences.
▪ YEASTRACT Database : It is a repository of regulatory associations between transcription factors and target genes, based on experimental evidence which was spread throughout bibliographic references. Each regulation has been annotated manually, after examination of the relevant references. The database also contains the description of specific DNA binding sites for a sub-group of transcription factors.
▪ MIPS-CYGD : aims in general to present information on the molecular structure and functional network of Saccharomyces cerevisiae. In addition, the data of various projects on related yeasts are also used for comparative analysis.
▪ BioGRID : It is an online interaction repository with data compiled through comprehensive curation efforts. All interaction data are freely provided through the search index and available via download in a wide variety of standardised formats.
▪ PhosphoGRID : records the positions of specific phosphorylated residues on gene products. Where available for specific sites, PhosphoGRID has also noted the relevant protein kinases and/or phosphatases, the specific condition(s) under which phosphorylation occurs, and the effect(s) that phosphorylation has on protein function.
The YeastMed web site also gives the possibility for users to use a quick search form to interrogate the five integrated databases without using the mediator. Users simply enter their keywords in an input field, select the databases to be looked up and the system takes advantage of the YeastMed Data Services to access and extract data from the underlying sources.
YeastMed has a set of modules that depend heavily on XML and semantic web technologies to integrate syntactically and semantically biological data. In what follows, we give detailed information on these components.
YeastMed uses a set of web services (called in our case Data Services) to access data sources. We have developed one Data Service for each integrated yeast source. These components hide technical and data model details of the data source from the mediator. They receive XQueries from SB-KOM and return XML documents in addition to other metadata. The role of YeastMed Data Services is twofold:
▪ Allowing YeastMed to use the wrapper functionalities to find and extract solicited information from data sources using HTML protocols or FTP mechanisms. This means providing the ability to solve XQueries and return answers in XML format.
▪ Exporting semantic information about data schemas and data provenance. This allows mainly YeastMed to keep track of the returned information when combining them and also which source is being interrogated.
In addition to the wrapper's query service, the web services encapsulate an Application Programming Interface (API). It is the access point for SB-KOM to the functionality of the web service. This API publishes three methods: getQuery(Q) that passes to the wrapper the XQuery Q and returns its answer in an XML format. The XML structure of this answer must satisfy the constraints of the source schema. The other two methods, getSchema() and getProvenance(), provide access to the metadata that the web service stores. The former returns the XML data schema and the latter provides information on the underlying data source. In order to use these methods correctly, SB-KOM finds all the necessary information about them in a WSDL (Web Service Description Language) document.
The Data Services have been implemented in Java. They receive XQueries from SB-KOM via the getQuery() method of the API which passes it to the wrapper. This is materialised by a set of java classes that define several methods. The incoming XQuery is analysed to identify precisely what information is solicited from the underlying data source. The wrapper then generates a source-adapted query following the query capabilities of the source already specified for each Data Source. Then it establishes a connection to the data source via HTML or FTP protocols. In the case of HTML protocol, the data source is interrogated through its web interface using its query engine. The answer is one or several web pages which are parsed on the fly to extract the solicited data. In the case of FTP protocol, the data source is interrogated through its available flat files which are also parsed on the fly. A set of methods are defined to extract data from source answers and organize them as an instance of the XML source schema before to send it to the SB-KOM in the form of an XML document.
YeastMed is able to reflect data provenance by calling the method getProvenance() which returns information about a source or through the XML document returned by Data Services: it contains by default a description of the interrogated data source. Thus, instances with the integrated data can be annotated with the data provenance of each piece of information. In this way, the user interface could show users the provenance of each part of the results.
As mentioned before, the goal of the YeastMed System is to help scientists to get information from multiple Yeast data sources by providing a single access point. To that end, we have equipped YeastMed with a domain ontology. The primary purpose of this Ontology is to support the user queries. Queries are phrased in terms of the ontology and YeastMed converts these to XQuery requests to the appropriate sources via Data Services. The YeastMed ontology has been constructed from scratch by reconciliating the different data source schemas into a single, coherent ontology.
▪ The purely biological concepts category, which is a union of all the classes modelling biological entities found in the integrated data sources. As an example of this category, we cite ChromosomalFeature concept. It is the superclass of 38 classes representing different types of chromosomal features (genes, CDS, intron, repeat regions, etc).
▪ The source related concepts category is represented by concepts referring to sources. For example the concept Source represents the five integrated data sources and the concept Entry refers to entries in data sources. Adding this category to the ontology has as the objective to permit scientists, when using YeastMed, to express their preferences on data sources. So, giving the possibility to determine which source entry they want YeastMed to return if a result is found rather than the system making its own choice.
To convey additional semantic information about the concepts, the ontology defines two types of properties. The first one is defined by a set of object properties that model the relationships that can hold between two individuals belonging to one or two different classes of the ontology. The second type concerns data properties: these are relationships linking an individual to a literal data.
▪ SWI4 and TOP1 are two Genes having the systematic names YER111c and YOL006c;
▪ SWI4 and TOP1 code respectively for a Transcription Factor and an Enzyme;
▪ SWI4 regulates the transcription activity of TOP1;
▪ Both SWI4 and TOP1 code for proteins (having respectively the same standard names as their Genes).
These assertions let one define:
▪ Four concepts: Gene, Protein, TranscriptionFactor, and Enzyme;
▪ Four object properties: codesFor and its inverse property codedBy linking Gene to Protein, in addition to the property regulates and its inverse regulatedBy linking TranscriptionFactor to Gene;
▪ Two datatype properties: hasSystematicName and hasStandardName linking TranscriptionFactor and Enzyme to literal values of type String (SWI4 and TOP1 for the former and YER111c and YOL006c for the latter);
▪ Enzyme and TranscriptionFactor as child concepts of Protein.
In YeastMed, we have chosen OWL  as a standard Ontology language to represent the ontology. OWL is, like RDF , taking advantage of the syntactic universality of XML. Based on the RDF/XML syntax, OWL provides a way to write web ontologies. It is different from the couple RDF/RDFS in the sense that is just a language of ontologies: If RDF and RDFS bring the user the ability to describe classes (with constructors) and properties, OWL incorporates, in addition, comparison tools for properties and classes: identity, equivalence, contrary, cardinality, symmetry, transitivity, disjunction, etc. Thus, OWL offers for machines a greater capacity of interpretation of the web content than RDF and RDFS , with a wider vocabulary and a real formal semantics. To be more precise, we have contented ourselves with using OWL-Lite (which is a sub-language of OWL) because we have envisaged from the beginning to equip YeastMed with a simple domain ontology showing a simple concepts hierarchy and simple constraints.
Having a domain ontology facilitates the formulation of queries to the system. The users simply pose queries in terms of the ontology rather than directly in terms of the Source Schemas. Although this is very practical and effective in terms of the system transparency to the user, it brings the problem of mapping the query in the mediated schema to one or more queries in the schemas of the data sources. In YeastMed, this problem is solved using the functionality of SB-KOM. So in addition to modelling the ontology and the sources, we needed to establish associations between the concepts in the ontology and the appropriate elements representing the information in the sources. These associations are materialised in YeastMed by the mapping rules.
SB-KOM is designed to decompose queries based on GAV approach-based mappings. That means each concept (also property in our case) in the ontology is a view defined in terms of the source schemas' elements. This view specifies how to obtain instances of the mediated schema elements from sources. In this context, the mapping rules we have used are defined as pairs (P,Q). P is one or a couple of path expressions on a source schema expressed in XPath, and Q a conjunctive query expressed in terms of the Ontology terms. Three kinds of mappings have been defined:
▪ Class Mapping: it maps ontology classes to source schemas. It has the following form:
XPath-Element-Location, Ontology-Class-Name, correspondence-index
Where XPath-Element-Location is the location of an element in the source schema, expressed in XPath; Ontology-Class-Name is the name of the corresponding class in the Ontology and correspondence-index is an integer value that informs on the correctness of the mapping instance. In YeastMed, this index is always 100 since all the mappings are done manually and not automatically. An example which maps the Protein class to the SGD schema is as follows:
▪ Datatype Property Mapping: it maps ontology datatype properties to source schemas. It has the following form:
XPath-Domain-Location; XPath-value-Location, Ontology-Domain-Name; Property-Name, correspondence-index
XPath-Domain-Location is the Path to the element in the source schema which is mapped to the domain of the datatype property; XPath-value-Location is the Path to the element where the property takes the value of its range and Ontology-Domain-Name and Property-Name are respectively the domain and the name of the property. The following example concerns the datatype property hasName:
Result/Entries/Entry/Protein; Result/Entries/Entry/Protein/SysName, TranscriptionFactor;has Name,100
▪ Object Property Mapping: it maps ontology object properties to source schemas. It has the following form:
XPath-Domain-Location; XPath-Range-Location, Ontology-Domain-Name; Ontology-Range-Name; Property-Name, correspondence-index
XPath-Range-Location is the Path to the element in the source schema which is mapped to the range of the object property. Ontology-Range-Name is the range name of the property. The following example shows how the object property hasBibRef is mapped to the source schema:
YeastMed relies on SB-KOM , to perform query transformations at execution time. KOMF is a generic infrastructure to register and manage ontologies, their relationships and also information relating to the resources. This infrastructure is based on a resource directory, called Semantic Directory , with information about web resource semantics. KOMF has been successfully instantiated in the context of molecular biology for integrating biological data sources [41–43]. SB-KOM mediator is composed of three main components: the Controller, the Query planner and the Evaluator/Integrator.
The Controller component receives requests coming from the YeastMed web interface and evaluates them to obtain a result for the requests. The controller creates different threads for different user requests, and assumes the role of the middleware between the mediator components. Queries are expressed as conjunctive predicates , with three main types of predicate: classes in terms of YeastMed ontology which is registered in the Semantic Directory, datatype properties that link individuals to data values, and object properties that link individuals to individuals. The results of these queries are instances of the YeastMed ontology which the query was expressed in.
The Query planner component is by far one of the most fundamental pillars in elaborating one or several query plans to solve the query from different data sources. Plans generated by this component specify the data sources from which the information can be retrieved and in which order they must be accessed. The evaluation of these queries depends on the query plans themselves.
According to the query (a conjunctive query), there will be different types of mapping in the Semantic Directory. Classes will be connected to the XPath of one or several XML Schema resource elements. On the other hand, datatype properties will be connected to those two expressions: the first one corresponds to the class and the second to the property. The object properties will be related to the active XPath classes in the property.
The Query Planner runs following a simple algorithm that receives as entry a conjunctive query expressed in terms of the YeastMed ontology (conjunction of concepts and properties) and returns a set of possible plan trees. The algorithm steps are enumerated below (for a use case see the following section):
1. Get all the query predicates (concepts and properties) and distribute them in two groups based on the number of the arguments: Ɠ 1 will contain predicates having one argument (concepts) and Ɠ 2 will contain predicates having two arguments (properties).
2. Construct a set Ƈ s of combinations between the two groups based on common arguments, add all the elements of Ɠ 1 and Ɠ 2 to it and eliminate the repeated ones.
3. Eliminate from Ƈ s the elements that do not have a representation in the mapping rules registered in the Semantic Directory.
4. For each instantiated variable in the predicate arguments, elaborate a plan tree:
a. The instantiated variable will construct a root node.
b. The elements that contain a predicate specifying a value for the instantiated variable and the elements that contain only the instantiated variable (without other variables) will be passed to the current node and eliminated from Ƈ s .
c. The elements that contain, in addition to the instantiated variable, another variable will constitute the edges leaving the current node to new nodes and eliminated from Ƈ s . The newly created nodes will be represented by the other variables which will be the instantiated variables.
d. if there are still more elements in Ƈ s and for each new instantiated variable we continue from the step 4.b.
The Elvaluator/Integrator is the third component of SB-KOM mediator. It analyses the query plan (QP), and performs the corresponding calls to the Data Services involved in the sub-queries (SQ1, ..., SQn) of the query plan. To answer YeastMed query, this component first executes the Data Services in the order specified by the query plan. Then, it obtains the instances from the Data Service results. These instances are not interconnected because they have been produced by different Data Services. In order to retrieve a set of interrelated instances we need to establish relationships between them. This can be achieved by the object properties defined in the ontology that are used as relationships between services in the query plan. Finally, these interrelated instances are filtered in order to eliminate the information not required.
In this section, we show how a user query is solved by YeastMed, and how its different components take part in this process. Let us take the case of a biologist who is using YeastMed to find information about two kinds of proteins. The first one is represented by DNA Topoisomerase III, and the second one is indicated by some transcription factors regulating the expression of the first kind. The biologist is interested in the phosphorylation sites that are found in the sequences of the transcription factors of DNA Topoisomerase III, especially the one (or ones if they exist) whose gene is located on the Chromosome 16. In addition, the biologist also aims to get all the literature on DNA Topoisomerase III. As stated previously, YeastMed provides a web interface that allows biologists to express this kind of requests in terms of the ontology. The user can formulate its request in the YeastMed interface by selecting fields' items as follows:
"Protein", "having Description", "DNA Topoisomerase III";
"Protein", "having Bibliographic Reference", "Bibliographic Reference";
"Protein", "Regulated By", "Transcription Factor";
"Transcription Factor", "Belongs To", "Chromosome";
"Chromosome", "having Name", "16";
"Transcription Factor", "having Phosphorylation Site", "Phosphorylation Site".
To specify to the system what to return, the user should add checkmarks by clicking on the boxes above the fields where "Bibliographic Reference" and "Phosphorylation Site" were chosen before to submit its query.
Ans(BR,Ph):= Protein(P),hasDescription(P,"DNA Topoisomerase III"), BibRef(BR),hasBibRef(P,BR), hasSystematicName(P,SN), regulatedBy(P,TF),hasName(TF,Nt),TranscriptionFactor(TF),Chromosome(C),hasName(C,"16"), BelongsTo(TF,C),PhosphoSite(Ph), hasPhosphoSite(TF,Ph);"
This conjunctive query includes as predicates five ontology classes (Protein, BibRef, TranscriptionFactor, Chromosome and PhosphoSite), three datatype properties (hasDescription, hasSystematicName and hasName) and four object properties (hasBibRef, regulatedBy, belongsTo and hasPhosphoSite). This query will return instances of PhosphoSite and BibRef that satisfy its constraints.
The groups used to form the plan tree.
Protein(P),hasDescription(P,"DNA Topoisomerase III")
Protein(P), hasSystematicName(P, SN)
Protein(P), RegulatedBy(P, TF)
TranscriptionFactor(TF), hasName(TF, Nt)
TranscriptionFactor(TF), hasPhosphorylationSite(TF, Ph)
SGD; Yeastract; PhosphoGrid
The YeastMed Data Services are executed by the Evaluator/Integrator following the plan, after optimisation, generated by the Planner. In our case, SGD Data Service receives the first sub-query, because the object property hasDescription is mapped to the SGD Schema. TOP3 is returned as an answer of this sub-query and then is used by the sub-query RegulatedBy to find instances of TranscriptionFactor. The Yeastract Data Service is invoked this time because the property is mapped to the Yeastract Schema. Three instances of the type TranscriptionFactor are returned: Fhl1p, Hsf1p and Swi4p. For each of these instances, the Yeastract Data Service is called again. It receives this time the sub-query represented by the property belongsTo that contains the two arguments instantiated: the first one is one of the three instances returned by the previous query, and the second argument is instantiated by the name of the chromosome 16. This sub-query checks whether the Transcription factor has its coding gene on the chromosome 16. Only the instance Fhl1p is maintained. Finally the sub-query hasPhosphoSite is executed on the PhosphoGrid Data Service that returns all the PhosphoSite instances of the Transcription Factor Fhl1p. At each execution, the Evaluator/Integrator receives results in XML format from the target Data Services.
These results are instances of the XML schemas of the underlying sources. Based on the mapping between the elements of the source schemas and the elements of the ontology, these XML schema instances are translated into ontology instances which are not interconnected because they have been produced by different Data Services. To associate them, the Evaluator/Integrator uses just the instances of the domain and range classes of the object properties. The final result is an ontology instance that includes all the data extracted from the interrogated data sources. That is all the instances of the concepts BibRef of the protein TOP3 and all the PhosphoSite objects of the Transcription Factor Fhl1p.
We have conducted a usability assessment in order to grade how well biologists can learn and use YeastMed to achieve their goals and how satisfied they are with the system. We have also conducted a performance study of the system to reveal how run times behave towards the increase of the number of implied data sources in queries. In this section we present results obtained from these two studies.
A variety of methods have been reported in the literature for assessing the perceived usability of interactive systems. We can particularly cite QUIS , SUS , CSUQ  and Microsoft's Product Reaction Cards . Tullis and Stetson  reported a study that compared these methods and showed that the accuracy of the analysis increases as the number of participants gets larger (for a sample of 6 to 14) and that the accuracy of SUS increases quicker than the others. For that, we have used SUS method in our study. The SUS questionnaire consists of 10 items to which participants rate their level of agreement. Odd-numbered items are positively worded and even-numbered items are negatively worded. A 5-point scale of agreements numbered from 1 (anchored with "Strongly disagree") to 5 (anchored with "Strongly agree") is used for each. Each item's score contribution will range from 0 to 4. For odd-numbered items, the score contribution is the scale position minus 1. For even-numbered items, the score contribution is 5 minus the scale position. To get the overall SUS score, which is the indicator of usability, the sum of the item score contributions is multiplied by 2.5. SUS scores ranges from 0 to 100, with 100 representing a perfect score.
The usability study we conducted had two objectives: (1) having a general indicator on the usability of YeastMed, and (2) assessing the evolution of the system usability with the level of familiarisation to biological databases. This is represented, in our study, by the frequency of using biological databases of the participants. These objectives will let us (1) to grade how well biologists, in general, can learn and use the system and (2) how well we have succeeded to furnish an easy-to-use system for biologists who are familiar with HTML forms of biological databases.
To illustrate the performance of YeastMed, we present, in this section, the result of a study conducted on the run times of the three main stages of the YeastMed query processing: Planification, Execution and Integration. The study concerned 50 queries distributed on 5 groups following the number of data sources participating in the query answer (from 1 to 5). All queries were run on a dual-processor 2.33 GHz Pentium 4 processor machine with 4 GB of memory. The objective of this study is not to provide a thorough performance analysis, but simply to show how run times behave towards the increase of the number of implied data sources. Each query was executed in three instances before to calculate its mean run times for the three stages. The data sources implied in the multi-sources queries are called exactly one time for each query. This had the objective to give certain uniformities to the study.
Dynamic integration is a very important issue for traditional mediator-based systems. They are usually developed as monolithic systems and their architecture based on wrappers involves a high degree of coupling among the system components. They usually do not provide scalable and reusable solutions. By the modular design and the uncoupling of all the components of YeastMed, we have sought to break out of traditional mediation architecture and provide a flexible platform for integrating Yeast data sources. The modular structure of the system reduces the costs of the system maintenance. The system can be easily extended to cover other sources. It is not required to rebuild the system from scratch. The new source components are built independently and then integrated in the system, i.e. publishing a web service underlying that source, adding semantic views on the source to the ontology, and publishing its mapping rules in the Semantic Directory. The rest of the system components are not touched. On the other hand, the fact that the system adopts a mediation approach avoids the updating problem when a change is made in a source at the level of data, because the system does not have a local copy of data. But when the change touches the structure of the flat files or the HTML pages from which YeastMed extracts data, the system will need to reflect this on its components, but only on the modified-source components: the modified-source schema, the mapping rules implying that source, the source-related entities in the ontology and also the web service of the source. The other source components are not modified and the system is not rebuilt from scratch.
Relying on Data Services rather than classical wrappers offers the possibility of reusing them by other mediators or any other data accessing application. This is the case of the quick search service we are proposing in the web site of YeastMed. This service is an added value of the system architecture. It makes use independently of the Data Services to look for entries and direct access to the integrated data sources without passing through the system mediator.
While some mediator-based systems require a specific query language or propose a complicated graphical user interface [25, 27, 29], the YeastMed mediator receives conjunctive queries expressed in terms of the ontology. Even though we estimate that these are not very complicated for biologists to express their requests, we have proposed a simple interface where requests are expressed using natural language through simple forms. All the required translations to conjunctive queries are hidden from users.
It is known that biologists have their own preferences toward databases . In YeastMed we take this into account by giving users the possibility to specify from which database they prefer to get answers. The YeastMed ontology includes some source-related concepts which permit the user to express preferences on data sources. For example, a user can specify SGD as a source from which to get entries by selecting SGDEntry in the query form. Specifying a data source does not mean reducing the constraints to be only applied on the data of that source. Users can specify a source from which to get entries and apply constraints on related data from other sources. For example a user can ask for entries from SGD describing a chromosomal feature regulated by a transcription factor having the standard name Rtg1. This is translated into the following conjunctive query:
Ans(E) := SGDEntry(E),describes(E,F), ChromosomalFeature(F), RegulatedBy(F,R),TranscriptionFactor(R),hasName(R, "Rtg1");
In this conjunctive query, SGD entries are solicited, but all the constraints are made on data residing in Yeastract (data related to Transcription Factor). If the SGD entry has not been specified, the result entries will be returned by default from Yeastract. YeastMed is able to find the equivalent of such entries (if any) in SGD.
We have described YeastMed: an XML and mediator-based system that Integrates five Yeast databases which have the most appropriate properties for studying Saccharomyces cerevisiae.
Data Services play an important role in the integration process of this system, where they are considered as an interface which receives queries, accesses to a data source, extracts data and translates them into a common data model used by SB-KOM. In YeastMed, Data Services extract data mainly from flat files because most of the integrated data sources are accessible via ftp mechanisms and provide data in tabular or XML format. This reduces the costs of the maintainability of the system because flat files structures are not frequently target to changes.
In our system, the schema integrator is an ontology and the results are ontology instances. The use of the ontology and instances enables basic reasoning processes (class-subclass inference) to be later included. This will permit YeastMed to infer new relationships between the instances of the ontology when solving a user query and thus, discover new knowledge for the query answers. The final result is an ontology instance that includes all the data extracted from the integrated data sources. It is converted to an HTML Format before to be presented to users.
The objectives expected from the YeastMed system are not yet all met. The system is still in its natal phase and additional work is undertaken to improve it. The system does not yet make all its ontology available to users when formulating queries. This is because it is not yet able to answer queries expressed in terms of some part of the ontology. In addition, the fact that the system answers only conjunctive queries limits the user requests expression; i.e. it is not able to answer queries using disjunction quantification (or union operator which is denoted as ∪).
Application Programming Interface
Collection Programming Language
Computer System Usability Questionnaire
the Comprehensive Yeast Genome Database
Document Type Definition
Extract, Transform, Load
File Transfer Protocol
Global as view
Genomics Unified Schema
Hypertext Markup Language
Local as view
Munich Information center for Protein Sequences
Web Ontology Language
Query Internal Form
Questionnaire For User Interaction Satisfaction
Resource Description Framework
Resource Description Framework Schema
System Biology Khaos Ontology-based Mediator
Saccharomyces Genome Database
Structured Query Language
System Usability Scale
Transparent Access to Multiple Bioinformatics Information Sources
XML Query Language
Extensible Markup Language
XML DataSource Schema: WSDL: Web Service Description Language.
This work has been funded by the Spanish Ministry of Innovation, Science and Technology (TIN2011-25840) and the Innovation, Science and Enterprise Ministry of the regional government of the Junta de Andalucía (P11-TIC-7529).
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 1, 2012: Semantic Web Applications and Tools for Life Sciences (SWAT4LS) 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.