An XML transfer schema for exchange of genomic and genetic mapping data: implementation as a web service in a Taverna workflow
© Paterson and Law; licensee BioMed Central Ltd. 2009
Received: 9 February 2009
Accepted: 14 August 2009
Published: 14 August 2009
Genomic analysis, particularly for less well-characterized organisms, is greatly assisted by performing comparative analyses between different types of genome maps and across species boundaries. Various providers publish a plethora of on-line resources collating genome mapping data from a multitude of species. Datasources range in scale and scope from small bespoke resources for particular organisms, through larger web-resources containing data from multiple species, to large-scale bioinformatics resources providing access to data derived from genome projects for model and non-model organisms. The heterogeneity of information held in these resources reflects both the technologies used to generate the data and the target users of each resource. Currently there is no common information exchange standard or protocol to enable access and integration of these disparate resources. Consequently data integration and comparison must be performed in an ad hoc manner.
We have developed a simple generic XML schema (GenomicMappingData.xsd – GMD) to allow export and exchange of mapping data in a common lightweight XML document format. This schema represents the various types of data objects commonly described across mapping datasources and provides a mechanism for recording relationships between data objects. The schema is sufficiently generic to allow representation of any map type (for example genetic linkage maps, radiation hybrid maps, sequence maps and physical maps). It also provides mechanisms for recording data provenance and for cross referencing external datasources (including for example ENSEMBL, PubMed and Genbank.). The schema is extensible via the inclusion of additional datatypes, which can be achieved by importing further schemas, e.g. a schema defining relationship types. We have built demonstration web services that export data from our ArkDB database according to the GMD schema, facilitating the integration of data retrieval into Taverna workflows.
The data exchange standard we present here provides a useful generic format for transfer and integration of genomic and genetic mapping data. The extensibility of our schema allows for inclusion of additional data and provides a mechanism for typing mapping objects via third party standards. Web services retrieving GMD-compliant mapping data demonstrate that use of this exchange standard provides a practical mechanism for achieving data integration, by facilitating syntactically and semantically-controlled access to the data.
Data integration across disparate datasources is a common problem, not limited to the sphere of bioinformatics, and has been addressed in many ways [1, 2]. Cross indexing and hypertext linking of resources has historically been used for accessing bioinformatics data (e.g. SRS ) but is curation intensive and does not adequately address different datatypes found in the distributed datasources. However, cross referencing is widely used and forms the basis of the highly successful Entrez data retrieval system . A common approach for integrating bioinformatics data is to warehouse data according to a shared data schema (e.g. ENSEMBL Biomart ). Data in the individual data sources must be mapped to the shared schema, capturing both the syntax of the data structure and the semantics of what the data objects represent. The warehousing approach therefore incurs heavy data curation costs in creating and maintaining the warehouse and requires extensive computing resources; the alternative approach of federating the warehouse introduces further problems in distributing queries across the underlying distributed resources.
An alternative recent approach involves attempting to integrate data semantically through the use of defined ontologies describing a particular data domain. In addition to the widely used Gene Ontology  which provides a controlled vocabulary for annotating gene and gene function attributes that can then be used in index based retrieval systems, other formal ontologies are actively being developed to support bioinformatics, particularly under the OBO umbrella (Open Biological Ontologies ). Well-defined, detailed ontologies have been created representing common concepts used across the bioinformatics domain, e.g. for sequence features (the Sequence Ontology ) and common concept relationships (the OBO Relation Ontology ) and also more specific concepts for particular domains, e.g. the Amphibian Gross Anatomy . These ontologies provide excellent controlled terminologies for data description and annotation, but do not provide data transport protocols and provide neither the structure nor semantics required for data integration. Consequently, whilst ontology based descriptions of data objects can be compared across data sets – two objects with similar annotations may share similar properties – these annotations do not capture or assert actual relationships between the data objects.
Technologies developed to support the 'Semantic Web' such as OWL, the 'Web Ontology Language' provide an approach for exposing and integrating semantically consistent datasources (i.e. where multiple resources are represented in a common formal ontology) and the potential for Semantic Web integration of bioinformatics resources is being actively explored (e.g. Yeasthub ). Recently we have been involved in the ComparaGRID project which aims to integrate genomic data by mapping data sources to a shared OWL ontology for genomic information, thus facilitating semantic integration and query . However, such an approach is computationally expensive and as yet difficult to implement using available semantic ontology tools.
A more lightweight and arguably more practical approach for data exchange is to encourage the curators of data resources to provide data export facilities in a common simple exchange format, which both captures the data structure and provides a degree of semantic clarity to the exported data, without providing the exacting constraints of a formal ontology. XML has long been recognized as a useful data structure with which to implement formal exchange formats . Such an approach allows application developers to develop their integration systems against the common data format, where the structure of the data is unambiguous, and the semantics are sufficiently defined for usability.
Here we present our approach of using a simple structured XML schema (XSD) to define the data exchange structure for genetic mapping information, an approach successfully used for integrating data in other bioinformatics domains (e.g. Taxonomy ) and also for specifying standards defining the key information to include when reporting experimental results (e.g. those hosted at MIBBI ). Use of common exchange schemas to define data structures allows users (including consuming services or applications) to parse the data syntax into semantically described data objects. Our Genomic Mapping Data (GMD) schema allows common information types for our domain (and the relationships between these objects) to be represented in a syntactically defined structure, thus enabling GMD conformant data documents to be programatically parsed into meaningful genetic information. Furthermore, the GMD schema is extendable to handle additional datatypes. Extension may be achieved not only by evolution of the schema, but simply by using datatypes defined elsewhere, for example by importing additional schemas to the data document, or referring to controlled vocabularies or external ontologies.
To illustrate how the schema can be used we have built web services which export data from the ArkDB genomic mapping resource using our exchange format. We also illustrate how additional defined terms can be imported from a second schema which defines common data relationships relevant to the genomic mapping domain. We demonstrate how the Taverna Workbench can consume these web services into workflows to harvest and use mapping data for bioinformatics tasks. This is possible because Taverna can parse the XML data structure of the data documents provided as web service responses in the context of the defined data format (referenced as the GMD XSD schema).
Results and discussion
One approach for developing a shared schema for data exchange in a given information domain is to initiate a consultation and prototyping cycle amongst interested parties (data providers and users). However, this 'design by committee' approach is frequently protracted and contentious. In order to 'kick start' development of a usable standard for genomic mapping data we have developed a candidate exchange schema that seeks to capture the important core data concepts and relationships, whilst being potentially expandable to incorporate additional requirements if necessary for wider use. To encourage adoption, a generic, unambiguous and semantically 'neutral' schema is desirable. Indeed many of the common data objects in the genomic mapping domain cannot be defined unambiguously across user groups (for example the concepts of 'Gene', 'Map', and 'Marker'). Therefore the schema avoids precise semantic definition of concepts and leaves interpretation to individual data providers and applications. However, the structured syntax of the schema allows any important 'relationships' between concepts to be fully represented. Furthermore, because XML data documents can reference multiple schemas, users can further define their information by reference to additional external schemas. The generic nature of our mapping schema allows any type of map to be represented, allowing integration of the many types of data found in genetic resources. For example the representation of genetic linkage maps and sequence maps is conceptually and syntactically the same: the positioning of mapped objects on a map object using a coordinate system.
The prototype schema GenomicMappingData.xsd, (GMD: provided as additional file 1: GenomicMappingData.xsd and with further documentation at ), includes in-line descriptive annotations. Example data exchange documents conforming to the schema are provided as additional files 2 and 3 (demo1result1.xml, demo2result.xml), and afford examples of the data structures decribed below. We are also hosting an interactive WIKI site  to provide further guidance on best practice for schema usage, and as a forum for discussion of schema refinement and evolution by interested data providers and users.
The structure of our GMD schema aims to be clear and logical, but is primarily designed for unambiguous data storage and programmatic parsing rather than 'human readability'. The root Element of the document is <DataSet>, which is itself defined as a 'complex type', thus allowing this 'type' (gmd:DataSet) to be referenced in other XML documents that import the GMD schema. For example, this allows a web service description document (WSDL) to specify a return type of 'gmd:DataSet' for a particular web service, thus enabling parsing of returned result documents by a web service client (such as a Taverna web service workflow) according to the defined schema structure (see below).
The children of the <DataSet> root Element are multiple, single-copy containers for the data objects (e.g. <SpeciesContainer>, <Chromosomes>, <Markers>) and for binary data relationships(<Relationships>), assertions about data (<Assertions>), third-party datasources such as PubMed or ENSEMBL(<DataSources>), references to these (<ExternalReferences>) and a <Metadata> container for information about the provenance of the DataSet document.
All the main data Elements in the schema have an 'id' Attribute, which might typically be the source database ID. An 'id' Attribute should be unique within a document, but this is not enforced in the schema. These data objects represented by Elements with an 'id' Attribute can be reused by 'foreign key' reference in the document by various 'reference type' Elements. These have an 'idref' Attribute, which should reference an extant 'id' Attribute of another Element (again this is not schema-enforced to prevent programmatic parsing from failing due to validation errors caused by minor inconsistencies in a DataSet). For example, a <Species> Element with 'id = x', can be referenced by a <SpeciesREF> Element with 'idref = x', in this case efficiently allowing multiple objects to declare that they pertain to a particular organism. Similarly, various mapping objects may reference a particular <Analysis> (experimental set) through an <AnalysisREF>, providing important contextual information. In addition, the top level containers are allowed document unique 'id' Attributes of type 'xsd:ID'. In practice, this greatly assists programmatic parsing of the document as such Elements can be obtained from a Document Model by their 'id' value.
The information about each data object is held in subsidiary Elements. In designing the schema, Elements were preferred over Attributes for data storage in order to allow the built-in Taverna parser ('XMLSplitter') to access data values. In order not to overly constrain schema users, most data Elements are 'optional' apart from where required for structural association. For example, a <Chromosome> Element optionally contains <ChromosomeName>, <DBIdentifier>, <DataAccessMethod>, <SpeciesREF> and <ExternalReferenceREFS> whilst a <Relationship> Element logically must contain <ObjectREF>, <SubjectREF> and <RelationshipType> Elements (see below). 'Terminal' data Elements generally have content of type 'xsd:string' or 'xsd:double', whilst those of REFType contain only an 'idref' Attribute (see above).
Several Elements such as <Units>, <QTLDetails>, <OtherData> and the <'Object'Type> child Elements (discussed further below) are allowed content defined as being of 'xsd:anyType', allowing any data structure (e.g. a simple string, an ad hoc XML data structure or an external datatype included by import from an external schema. This allows, for example, <Units> to be recorded as a free text string (e.g. "base pair", "centiMorgans", "flpter"), or perhaps as a type from the OBO Units of Measurement Ontology , by using the OWL Class URL (e.g. http://purl.org/obo/owl/UO#UO_0000244 for 'base pair'). This external type could be represented by a single Element of type xsd:string or xsd:anyURI, a type structure defined or not by schema import, or might be of gmd:REFType and point at an ExternalReference Element in the document itself (see above). Allowing <Units> to be defined as any type permits a generic representation of any map type data in schema-compliant DataSet documents. Of course inclusion of xsd:anyType extensions only allows defined client parsing if the extended types are referenced by import of an external XSD schema.
Structural nesting of Elements is used where logical, for example the <Mappings> that lie on a given <Map> are nested under that <Map>, and a sequence associated with a particular <Marker> or <Clone> may be nested as a <SequenceREF> reference under that object. However, any data relationship may alternatively be stored as a binary relationship (see below). Rather than provide multiple subsumption-typed versions of data objects such as Map, Sequence or Marker, which would cause gross schema inflation when accounting for all users' requirements, such Elements contain optional child Elements to record a specific 'ObjectType' for the object (such as <MapType>, or <SequenceType> for example) and are also allowed content of 'xsd:anyType', which, as described above, could hold a string value such as "Linkage Map", "Radiation Hybrid Map", "EST", "genomic". In order to promote data consistency within a user group, permitted string values might derive from enumerated (but expandable) lists, or from more controlled defined vocabularies, or indeed types could be imported from external sources (as above). This semantically 'loose' mechanism enables a 'Marker' object to be interpreted generically as an 'undefined genetic marker' or, as a specific type of marker (e.g. a "SNP", a "Gene") if such ObjectTypes are provided and documented by the data provider and are relevant to the consumer application.
An important feature of the DataSet schema is the provision of full data provenance and data access protocols; only by providing this information can reliable data integration be achieved. To this end an optional Element <DBIdentifier> is allowed for most data Elements which can be used to record the identifier or accession number for that object in the datasource (as recorded in the DataSet Metadata). In addition, the optional <DataAccessMethod> Element can detail how to actually access a record of this object from the source datatabase (as a service URL or by a parameterized Web Service call or other means). Providing this information enables programmatic expansion of datasets, for example where one dataset provides details of markers found on a particular map, and each marker provides its access method, a client program could fetch the full record for each marker, which might detail every map that this marker is associated with, its relationships to other markers or sequences and any external datasource links. In order to facilitate integration across datasources most data Elements in the schema may include external references to third party data repositories. This is provided by <External Reference> Elements which detail an external datasource name/location and a specific object ID therein. These are referenced internally in the document by optional <ExternalReferenceREF> child Elements for a given data Element, which point to a <External Reference> Element. For example this could allow a Sequence in the DataSet to be explicitly linked to a Sequence in GenBank by its GenBank accession. The client can then use this GenBank ID to retrieve the original record for this Sequence. This allows integrative workflows to be composed, for example using the Taverna Workbench (see below).
Representing Data Relationships
A part of the knowledge domain that is typically hard to capture is represented by <Relationship> Elements, which capture (potentially directional) relationships between two data objects (captured as a <SubjectREF> and an <ObjectREF>, which reference a pair of data Elements in the document). In order to capture the nature of the Relationship each <Relationship> must be 'typed' via a <RelationshipType> child Element (again allowed content of 'xsd:anyType', as described above). Ultimately, use and interpretation of relationships will be dependant on a shared interpretation by data users and providers. Again data interpretation and consistency might be assisted by provision of a controlled 'vocabulary' of possible types (e.g. "homologous to", "orthologous to", "associated with", "derived from") or defined terms or types could be imported from external sources (e.g. the OBO Relation Ontology  or an extension thereof with relationships of greater relevance to our genome mapping domain). However, precise semantic definition of relationship types across the whole potential user domain is perhaps overly ambitious, and controlled vocabularies for particular sub-domains or 'user groups' may prove more pragmatic.
An important feature of <Relationship> Elements, in common with other data Elements, is that they can record the provenance of the assertion through <ExternalReference> Elements, allowing users to validate and accept or reject these relationships.
ArkDB web services
ArkDB Web Services
Single access point for all Methods
Individual services take a number of query parameters (as defined in the WSDLs, as named strings wrapped in XML) such as: a species or chromosome name or a database identifier. and return the response information as a DataSet document.
The services are implemented using the ArkDB web application's Java Model and API, and serialize Ark Objects to XML. Each individual service returns an internally consistent document filled in with appropriate data Element values for that query. For example, the SpeciesService provides a single getAllSpecies method that returns a document with limited details of all the Species represented in the ArkDB, i.e. only the Metadata and SpeciesContainer Elements contain data. On the other hand, the KaryotypeService provides a method to getAllSpeciesKaryotypes which additionally fills in Chromosome data in the Chromosomes container, where each Chromosome cross references the source Species. Further methods allow karyotypes to be retrieved by species name or ArkDB accession number. A more complicated service such as MapService provides more extensive information to the returned GMD document. For example, the getMapsForChromosome takes speciesName and chromosomeName parameters, and return data about Species, Chromosomes, Maps and Analyses. The actual Mappings on a Map are returned by a method in the MappingsService, which returns Mappings data as well as data about the mapped Markers, Clones, Sequences or other objects. These returned XML documents necessarily expand in size, but the XFire SOAP API efficiently handles data compression and transfer.
The service designer controls the level of detail provided in each request, with the ability to 'drill down' provided by chaining data parsed from the results of one request as input parameters to the next. For example, full details on an individual marker discovered on a MappingService request can be obtained by querying the MarkerService which will return not only mapping data about that marker, but also any Relationships held about that marker in the database, which for example might include known associations with other markers, genes or sequences and cross references to third party data resources.
Taverna workflows using the ArkDB web services
The structure of the ArkDB web services described above allows requests to be chained together in a workflow to drill down and query through the data held in the database. Both the GMD schema and the web services were designed to be compatible with the Taverna Workbench application . Taverna is an open-source workflow tool that allows users to construct analysis workflows from components located on both local and remote machines, and run these workflows incorporating their own data or query parameters. We have used methods from the ArkDB web services as external components in workflows and used the Taverna 'XMLSplitter' processor and java XPath 'widget' to parse through the structure of the returned documents to extract the data fields desired. In this manner it is possible to chain web services together, by extracting the required value(s) from one web service response and using it as the input of a subsequent query.
We have developed a data exchange standard for genomic mapping data that allows us to define the structure and general semantics of data exported via web services from our in house genomic mapping resources. Data exported in our data exchange standard can then be meaningfully integrated into workflows created using the Taverna Workbench. Adoption of this exchange format (or an evolution of it) by further data providers will allow integration of mapping data from distributed data resources, either by incorporation of WSDL described web services into Taverna workflows, or by any bespoke web service client application.
Our exchange format provides several important features to facilitate its generic use by a wide community of data providers and consumers. (1) We have aimed for a relaxed semantic definition of the data objects found in the mapping world. This should simplify the data mapping process, both that performed by the data provider (from data source to schema) and between data resources when data users integrate data from multiple sources. (2) We provide extensibility within the schema, both to allow additional information to be included in GMD conformant documents, and to include specific DataTypes from third party sources such as ontologies or other data schemas. (3) We provide an External Reference mechanism to carry references to third party data resources; obvious uses for this are for PubMed citations, Genbank accessions or ENSEMBL gene IDs. (4) We provide a structure to record any type of relationship between two data objects, thus not constraining the document format to the common mapping relationships with which we are familiar. (5) We provide mechanisms to record the provenance of data: by detailing the datasource (as Metadata for the whole document, and by recording source identifiers and access methods for individual data Elements) and by allowing reference to external datasources both for individual data Elements and any for individual data relationships and assertions. (6) We provide a mechanism for recording ad hoc assertions about data objects which should allow flexibility in representing nuanced observations about the data, for example third party comments that a particular piece of data is obsolete or incorrect.
The authors would like to thank members of the ComparaGRID project  for useful experience in exploring the issues of semantically controlled data integration. This work was funded by the BBSRC grant number BBS/B05478.
- Philippi S, Kohler J: Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet 2006, 7(6):482–488. 10.1038/nrg1872View ArticlePubMedGoogle Scholar
- Stein LD: Integrating biological databases. Nat Rev Genet 2003, 4(5):337–345. 10.1038/nrg1065View ArticlePubMedGoogle Scholar
- Zdobnov EM, Lopez R, Apweiler R, Etzold T: The ebi srs server – recent developments. Bioinformatics 2002, 18(2):368–373. 10.1093/bioinformatics/18.2.368View ArticlePubMedGoogle Scholar
- Geer RC, Sayers EW: Entrez: making use of its power. Brief Bioinform 2003, 4(2):179–184. 10.1093/bib/4.2.179View ArticlePubMedGoogle Scholar
- ENSEMBL Genome Browser[http://www.ensembl.org]
- The Gene Ontology[http://www.geneontology.org]
- The Open Biomedical Ontologies[http://www.obofoundry.org]
- The Sequence Ontology Project[http://www.sequenceontology.org]
- The Open Biomedical Ontologies: OBO relationship types[http://obofoundry.org/cgi-bin/detail.cgi?id=relationship]
- Amphibian Anatomical Ontology[http://www.amphibanat.org]
- Cheung K-H, Yip KY, Smith A, Deknikker R, Masiar A, Gerstein M: Yeasthub: a semantic web use case for integrating data in the life sciences domain. Bioinformatics 2005, 21(Suppl 1):i85-i96. 10.1093/bioinformatics/bti1026View ArticlePubMedGoogle Scholar
- The ComparaGRID Project[http://www.comparagrid.org]
- Achard F, Vaysseix G, Barillot E: Xml, bioinformatics and data integration. Bioinformatics 2001, 17(2):115–125. 10.1093/bioinformatics/17.2.115View ArticlePubMedGoogle Scholar
- Kennedy J, Hyam R, Kukla R, Paterson T: A Standard Data Model Representation for Taxonomic Information. OMICS: A Journal of Integrative Biology 2006, 102: 220–230. 10.1089/omi.2006.10.220View ArticleGoogle Scholar
- Minimal Information for Biological and Biomedical Investigations[http://mibbi.sourceforge.net]
- Genomic Mapping Data Schema Documentation[http://www.thearkdb.org/ws/schema/v1.0/documentation/GMDDocumentation.html]
- Genomic Mapping Data Transfer Schema WIKI[https://www.wiki.ed.ac.uk/display/GMDTSW]
- The Open Biomedical Ontologies: Units of measurement[http://obofoundry.org/cgi-bin/detail.cgi?id=unit]
- Genomic Data Relationships Schema Documentation[http://www.thearkdb.org/ws/schema/v1.0/documentation/GDRDocumentation.html]
- GFF (General Feature Format)[http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml]
- Codehaus Xfire[http://xfire.codehaus.org]
- Taverna Project Website[http://taverna.sourceforge.net]
- DNA Data Bank of Japan (DDBJ) Web API for Bioinformatics[http://xml.nig.ac.jp/index.html]
- EBI WebServices[http://www.ebi.ac.uk/Tools/webservices]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.