Volume 10 Supplement 14
bioGUID: resolving, discovering, and minting identifiers for biodiversity informatics
© Page; licensee BioMed Central Ltd. 2009
Published: 10 November 2009
Linking together the data of interest to biodiversity researchers (including specimen records, images, taxonomic names, and DNA sequences) requires services that can mint, resolve, and discover globally unique identifiers (including, but not limited to, DOIs, HTTP URIs, and LSIDs).
bioGUID implements a range of services, the core ones being an OpenURL resolver for bibliographic resources, and a LSID resolver. The LSID resolver supports Linked Data-friendly resolution using HTTP 303 redirects and content negotiation. Additional services include journal ISSN look-up, author name matching, and a tool to monitor the status of biodiversity data providers.
One vision of biodiversity informatics is that of a cloud of digital records representing objects and events such as images, specimens, macro-molecular sequences, phenotypes, observations, publications, and taxonomic names. Each digital record would carry a globally unique identifier that would both identify that object and, given appropriate technology, be used to retrieve what we know about that object, including how it is linked to other objects. Implementing this vision, which is essentially that of Linked Data, requires services that can mint, resolve, and discover identifiers.
A minting service creates identifiers, and ensures their uniqueness. Given an identifier we need a resolution service that can retrieve the object identified (or a digital representation of that object). This service may return information in multiple formats, such as binary data (e.g., an image or a PDF document), or metadata about the object. Lastly, if we don't have an identifier for an object it should be straightforward to discover if one has already been minted.
CrossRef also provides an OpenURL resolver http://www.crossref.org/openurl that takes user-supplied metadata (such as article title, journal title, volume, and pagination) and returns a DOI, if it exists. Publishers can use this service to find DOIs for articles in the "literature cited" section of an author's manuscript, and hence when the manuscript is published online it will contain electronic links to the literature cited in that manuscript.
Ensuring an identifier is unique needs some care. Typically identifiers are unique within some scope, such as a local database, or a particular discipline. However, once one moves outside that scope, we can have unintended collisions between identifiers. As an example, the paper by Mesibov contains strings that match existing GenBank accession numbers, such as DQ402119 (a human herpesvirus sequence). However, in the context of this paper, DQ402119 is a UTM grid reference for a locality in Tasmania with the co-ordinates 41° 26' 31" S, 146° 17' 02" E. Clearly, within the context of that paper, DQ402119 is not intended to be interpreted as a GenBank accession number.
Generating identifiers from metadata
One approach to minting identifiers is to generate them based on metadata for the object, which has the advantage that, in theory, two different people acting independently of each other will generate the same identifier, obviating the need for a central agency to mint the identifier. This also greatly simplifies identifier discovery - the identifier can be generated from the object at hand.
Another approach to generating identifiers is to use unique strings such as UUIDs, which can be generated completely independently, with a very low probability that the same UUID will be generated more than once. Such a system is attractive if identifiers need to be coined independently of any central agency (for example, if one is in the field without network access and needs to generate GUIDs on the fly). Uniqueness may also be an issue for projects that aggregate information from a range of sources, each of which may mint its own GUIDs. Using UUIDs should ensure that the GUIDs are, in fact, unique. The Catalogue of Life adopted UUIDs for its LSIDs for this reason (although the UUIDs in the 2008 release were generated centrally).
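As a minimal sketch of minting without a central agency, the following Python fragment wraps a randomly generated (version 4) UUID in the LSID syntax. The authority and namespace values are placeholders chosen for illustration, and the function name is hypothetical; this is not the Catalogue of Life's procedure.

```python
import uuid


def mint_uuid_lsid(authority: str, namespace: str) -> str:
    """Return an LSID whose object identifier is a random (version 4) UUID."""
    # uuid4() needs neither network access nor a central registry.
    object_id = uuid.uuid4()
    return f"urn:lsid:{authority}:{namespace}:{object_id}"
```

Calling mint_uuid_lsid("example.org", "specimen") twice should yield two different LSIDs, each carrying no information about the object it identifies.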
UUIDs are opaque identifiers, that is, they contain no information about the objects they identify. In this sense a UUID is the antithesis of an identifier such as a SICI or JACC, which embeds detailed bibliographic metadata.
Identifiers by themselves have limited utility unless they can be resolved, that is, given an identifier we should be able to retrieve information about the object the identifier refers to. In practice, resolution means that we can retrieve information about the object from the Web. For identifiers such as HTTP URIs, this is straightforward (simply enter the URI in a web browser), but for other identifiers we need a resolution mechanism.
LSIDs are the identifier recommended by the Biodiversity Information Standards (TDWG) organisation. For the biodiversity informatics community the attractions of LSIDs include the distributed nature of the identifier (no central authority is required for registering or resolving identifiers), the low cost, and the convention that resolving a LSID returns metadata in RDF. The latter facilitates integrating information from multiple sources using tools being developed for the Semantic Web.
Despite being specifically developed to provide globally unique identifiers for objects in biological databases, within mainstream bioinformatics relatively few "early adopters" have deployed LSIDs. In part this may be because of the complexity of the resolution mechanism. A LSID client resolves a LSID in four steps:
1. find the location of the LSID resolution service by querying the DNS service (SRV) records to find the hostname and TCP/IP service port for the LSID authority
2. retrieve from the LSID authority the WSDL that defines the LSID resolution service
3. retrieve a second WSDL file (the service WSDL) that specifies how the metadata and/or data corresponding to the LSID can be retrieved
4. retrieve the metadata (or data), typically using HTTP GET
Not only is resolving a LSID more complicated than resolving a HTTP URI, setting up a LSID resolution service is non-trivial.
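The first of these steps can be sketched in Python as follows. The fragment splits a LSID into its components and builds the name a client would use for the DNS SRV query; it performs no network access, and the later WSDL steps are omitted. The class and function names are hypothetical, and the "_lsid._tcp" prefix is my reading of the LSID specification rather than anything taken from bioGUID's code.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(lsid: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = lsid.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:5])
    )
    if not well_formed:
        raise ValueError(f"not a well-formed LSID: {lsid!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


def srv_query_name(lsid: LSID) -> str:
    """Name to look up in DNS (SRV record) to locate the authority's resolver."""
    return f"_lsid._tcp.{lsid.authority}"
```

For urn:lsid:indexfungorum.org:name:371153 the SRV query name would be _lsid._tcp.indexfungorum.org; the answer to that query (host and port) is what the remaining three steps build on.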
Furthermore, although LSIDs are conceptually rooted in the Semantic Web in the sense that the default metadata format is RDF, current approaches to realising the Semantic Web (such as Linked Data) have settled on using HTTP URIs as the identifier. Using HTTP URIs to identify both real world objects and web pages has the potential to cause ambiguity - if I use "http://dbpedia.org/resource/Glasgow" as an identifier, am I talking about the city in Scotland, or the web page with that URL? The Linked Data community has adopted the use of HTTP 303 redirects and content negotiation to distinguish between a resource and a document that describes that resource. A client resolving a URI http://dbpedia.org/resource/Glasgow will receive a HTTP 303 ("see other") redirect, which tells the client that http://dbpedia.org/resource/Glasgow identifies a non-information resource (i.e., a real-world object or concept), and it will also receive a location for a document that describes the resource (for example a web page or a RDF document). Enabling LSIDs to comply with Linked Data approaches requires a resolver that supports this mechanism.
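The server side of this mechanism can be sketched in a few lines of Python. Given the URI of a non-information resource and the client's Accept header, the function returns a 303 status and the location of a describing document. The mapping from media type to document suffix, and the names used, are assumptions made for illustration; real services (including DBpedia and bioGUID) may lay out their documents differently, and the Accept parsing here is deliberately simplified.

```python
from typing import Dict, Tuple

# Hypothetical mapping from media type to the suffix of the describing document.
REPRESENTATIONS: Dict[str, str] = {
    "application/rdf+xml": ".rdf",
    "text/html": ".html",
}


def choose_media_type(accept_header: str, default: str = "text/html") -> str:
    """Return the supported media type the client prefers (highest q-value first)."""
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in item.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        candidates.append((-quality, position, media_type))
    for negative_quality, _, media_type in sorted(candidates):
        if negative_quality >= 0:  # q=0 means "not acceptable"
            continue
        if media_type == "*/*":
            return default
        if media_type in REPRESENTATIONS:
            return media_type
    return default


def see_other(resource_uri: str, accept_header: str) -> Tuple[int, Dict[str, str]]:
    """Build a 303 response pointing at a document that describes the resource."""
    media_type = choose_media_type(accept_header)
    location = resource_uri + REPRESENTATIONS[media_type]
    return 303, {"Location": location, "Vary": "Accept"}
```

A Semantic Web client sending "Accept: application/rdf+xml" would be redirected to the RDF document, whereas a web browser would be sent to the HTML page; in neither case is the resource URI itself served as a document.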
The activities of minting and resolving identifiers tend to receive more attention than discovering existing identifiers. However, if a major goal of biodiversity informatics is to integrate biodiversity resources then data providers need to re-use shared global identifiers wherever possible, rather than simply mint new identifiers. Having multiple identifiers for the same object is potentially a major obstacle to integrating data, hence we need services that can discover whether an identifier already exists for an object. Perhaps the most obvious domain where this is relevant is literature databases, where publishers, digital repositories (e.g., JSTOR), institutional archives (e.g., the Smithsonian Digital Repository), indexing services (e.g., PubMed), and domain-specific databases may all have assigned one or more identifiers to a scientific publication. One method the digital library community has developed to retrieve information about a bibliographic item is OpenURL.
One reason for the standard's complexity is its attempt to be highly generic, and thus applicable outside the library community. For example, Chute and Van de Sompel  use OpenURL to request regions (or metadata about regions) from a JPEG 2000 image. We could also use OpenURL to request metadata for a specimen, or indeed any other object of interest.
Some of the complexity in the OpenURL standard reflects an emphasis in the digital library community on providing the "appropriate copy", for example, a copy that the user (say a member of a library) has the right to access. Most OpenURL resolvers are local in scope (e.g., they know about the contents of a particular physical library), and will return web pages telling the user if an item exists in the library (either digitally or physically). While such a service may be locally useful, if we assume that a user simply wants access to the resource (or information about the resource), and doesn't care about where it resides (i.e., "local" has no relevance when everything is "global"), then the practical utility of many OpenURL resolvers is somewhat limited.
Discovering identifiers from metadata can become challenging if the items of metadata associated with the identifier in a database differ from the metadata actually available. For example, many taxonomic citations are not to articles or books (the typical unit stored in a bibliographic database) but to an individual page. This mismatch in granularity can frustrate efforts to link identifiers for taxonomic names to identifiers for literature. Hence, it would be desirable to have a service that can return the containing document for a given page.
To illustrate, the Index Fungorum database record for Hyaloperonospora galligena (S. Blumer) Göker, Riethm., Voglmayr, Weiss & Oberw. 2004 http://www.indexfungorum.org/Names/SynSpecies.asp?RecordID=371153 gives the bibliographic source as "Mycol. Progr. 3(2): 89 (2004)". There is no article in volume 3 of Mycological Progress that starts on page 89, so a standard search for a DOI using CrossRef's OpenURL resolver will fail. However, if we repeat the search, each time decreasing the start page by one, we will retrieve a document (Göker et al.), which starts on page 83 and ends on page 94. This article contains page 89, and so we can now link the identifier for the name Hyaloperonospora galligena (S. Blumer) Göker, Riethm., Voglmayr, Weiss & Oberw. 2004 (urn:lsid:indexfungorum.org:name:371153) to the identifier for the publication (doi:10.1007/s11557-006-0079-7).
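The search just described can be sketched as a short Python function. The lookup argument stands in for whatever service maps (journal, volume, starting page) to article metadata, such as a call to an OpenURL resolver; here it is replaced by a small in-memory stub so that the sketch is self-contained. The function names, the dictionary keys "spage" and "epage", and the limit of 50 pages are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Optional

Article = Dict[str, Any]
Lookup = Callable[[str, int, int], Optional[Article]]


def find_containing_article(
    lookup: Lookup, journal: str, volume: int, cited_page: int, max_steps: int = 50
) -> Optional[Article]:
    """Step the start page backwards until an article containing cited_page is found."""
    first_page_to_try = max(cited_page - max_steps, 1)
    for start_page in range(cited_page, first_page_to_try - 1, -1):
        article = lookup(journal, volume, start_page)
        if article is None:
            continue
        if article["spage"] <= cited_page <= article["epage"]:
            return article
        # The nearest preceding article ends before the cited page, so give up.
        return None
    return None


def demo_lookup(journal: str, volume: int, start_page: int) -> Optional[Article]:
    """In-memory stand-in for a real resolver; holds only the example from the text."""
    known = {
        ("Mycological Progress", 3, 83): {
            "spage": 83,
            "epage": 94,
            "doi": "10.1007/s11557-006-0079-7",
        }
    }
    return known.get((journal, volume, start_page))
```

With the stub above, find_containing_article(demo_lookup, "Mycological Progress", 3, 89) tries pages 89, 88, and so on down to 83, where it finds the article spanning pages 83 to 94. Against a remote resolver each step would cost one request, which is one reason to cache article metadata locally.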
This paper provides a brief description of bioGUID http://bioguid.info/, which implements a set of services for resolving, discovering, and minting identifiers. The initial design and development of this site is described on the bioGUID blog; however, the version of bioGUID described in this paper differs in several ways, including support for OpenURL 1.0, and Linked Data-compliant resolution of LSIDs.
bioGUID is written in the PHP programming language, and the source code is available from http://code.google.com/p/bioguid/. The LSID resolution code comes from the LSID Tester project. Other third-party libraries used include the ADOdb database abstraction library and the PEAR Net_DNS module.
DOI resolution uses CrossRef's OpenURL resolver. Article metadata is cached locally in a MySQL database to minimise requests to external services, and to facilitate locating articles based on an individual page. The MySQL database also contains metadata for articles that don't have DOIs but which are available online, such as those in DSpace repositories that support OAI-PMH harvesting.
Results and discussion
bioGUID http://bioguid.info/ implements a range of services, the core ones being an OpenURL resolver, and a LSID resolver. Additional services include journal ISSN look-up, author name matching, and monitoring the status of biodiversity data providers.
For journal articles the bioGUID OpenURL resolver will generate a JACC for an article, provided that sufficient metadata (journal ISSN, volume, and starting page) are available. This provides a globally unique identifier for an article, and in the absence of an existing DOI, PubMed number, or URL, it may be the only available GUID for that article.
bioGUID can act as a resolver for several different identifiers by appending the identifier (and its namespace) to the base URL http://bioguid.info/. For example, the JACC 1175-5326:1671@3 becomes http://bioguid.info/jacc:1175-5326:1671@3, and the DOI 10.1093/bib/bbn022 becomes http://bioguid.info/doi:10.1093/bib/bbn022. For these identifiers bioGUID simply uses the Apache Web server's mod_rewrite to rewrite the URLs to OpenURLs.
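The two conventions just described, building a JACC from article metadata and turning an identifier into a bioGUID URL, can be sketched in Python as follows. The JACC layout (ISSN, colon, volume, "@", starting page) is inferred from the single example given above rather than from the JACC specification, and the function names are hypothetical; the sketch does not reproduce bioGUID's PHP code or its mod_rewrite rules.

```python
import re

BASE_URL = "http://bioguid.info/"
ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dXx]$")


def make_jacc(issn: str, volume: int, start_page: int) -> str:
    """Build a JACC-style identifier from journal ISSN, volume, and starting page."""
    if not ISSN_PATTERN.match(issn):
        raise ValueError(f"not a well-formed ISSN: {issn!r}")
    return f"{issn.upper()}:{volume}@{start_page}"


def bioguid_url(namespace: str, identifier: str) -> str:
    """Append a namespaced identifier to the bioGUID base URL."""
    return f"{BASE_URL}{namespace}:{identifier}"
```

Because the JACC is derived entirely from the metadata, two people holding the same citation (ISSN 1175-5326, volume 1671, page 3) would independently arrive at the same identifier, and hence at the same URL.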
In addition to the core services listed above, bioGUID provides additional (sometimes experimental) services.
Journal ISSN lookup
bioGUID has a local database of journal names and ISSNs. A user can look up an ISSN for a journal name by appending the journal name (or abbreviation) to the URL http://bioguid.info/services/journalsuggest?title=. This service returns a list of titles that match the request, together with their ISSNs, in JSON format. The bioGUID OpenURL resolver web page uses this service to find the ISSN of a journal the user has entered.
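A client might construct the request URL as in the following Python sketch. Only the URL construction is shown: the structure of the JSON response is not specified here, so no attempt is made to parse it, and the service itself may no longer be online. The function name and the choice of form-style percent-encoding for the title are assumptions.

```python
from urllib.parse import quote_plus

JOURNAL_SUGGEST_URL = "http://bioguid.info/services/journalsuggest?title="


def journal_suggest_url(title: str) -> str:
    """Return the look-up URL for a journal name or abbreviation."""
    # quote_plus encodes spaces as "+" and escapes characters unsafe in a query.
    return JOURNAL_SUGGEST_URL + quote_plus(title.strip())
```

For the abbreviation "Mycol. Progr." this gives http://bioguid.info/services/journalsuggest?title=Mycol.+Progr., which could then be fetched with any HTTP client.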
Author name matching
bioGUID's article cache includes author names. Unfortunately, for any given author there may be more than one way their name has been recorded in various bibliographic databases. For example, my name may be stored as "Roderic D. M. Page" or "R. D. M. Page". As a first step towards normalising author names bioGUID implements Feitelson's weighted clique algorithm for finding equivalent names. This service takes a set of forenames (and initials) and returns a set of names that can be regarded as equivalent. Names can be entered in a web form at http://bioguid.info/services/equivalent.php, or the service can be called directly by sending a HTTP POST request to the URL http://bioguid.info/services/equivalent.php with a parameter names whose value is a list of author names (separated by the end-of-line character), and an optional parameter format with the value html or json.
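The direct call can be sketched in Python using only the standard library. The function below builds the POST request described above but does not send it, since the service may no longer be available and the layout of its response is not documented here. The function name is hypothetical, and I assume that names are joined with a newline character and sent as ordinary form-encoded data.

```python
from typing import Iterable
from urllib.parse import urlencode
from urllib.request import Request

EQUIVALENT_URL = "http://bioguid.info/services/equivalent.php"


def build_equivalence_request(names: Iterable[str], fmt: str = "json") -> Request:
    """Build (but do not send) a POST request for the name-equivalence service."""
    if fmt not in ("html", "json"):
        raise ValueError("format must be 'html' or 'json'")
    # One name per line, as the service expects end-of-line separated names.
    body = urlencode({"names": "\n".join(names), "format": fmt}).encode("utf-8")
    return Request(
        EQUIVALENT_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

The resulting object could be passed to urllib.request.urlopen to perform the call; for example, build_equivalence_request(["Roderic D. M. Page", "R. D. M. Page"]) asks whether the two forms of my name are equivalent.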
The lack of standard URIs for biodiversity data objects reflects a broader lack of agreement on this issue. It is likely that HTTP URIs will become broadly adopted, at least outside biodiversity informatics, and they are at the heart of Linked Data. However, HTTP URIs have their own problems. We are currently faced by either a great dearth of URIs or, ironically, an overabundance of them. If an entity is shared across multiple domains, then there may be multiple, competing URIs for that entity. For example, there are numerous web sites that make statements about individual books, often using URIs that embody an ISBN. In such cases there often is not an obvious reason to choose between any of the URIs. In the same way, we have multiple identifiers for articles (such as DOIs and PubMed numbers). In such cases, tools such as OpenURL may have a role, in that the OpenURL Context Object can contain an identifier as one of its key-value pairs. Hence, we could use the Context Object to encode this information, but delegate the choice of resolver to the client.
In cases where there is an obvious canonical source for information about an object, and that source issues HTTP URIs, it would make sense to use those URIs. Museum specimens would seem to be an obvious case (the host institution being the canonical source). However, there are few such URIs available. I regard the lack of URIs for individual specimens as one of the greatest obstacles to progress in data integration in biodiversity informatics. Again, in the absence of a recognised identifier one could adopt the OpenURL approach of encoding sufficient metadata to enable some services to retrieve a digital record about the specimen, if and when it becomes available.
bioGUID is being developed to address some of these issues, in that it supports OpenURL for literature (and experimentally for specimens), and can resolve non-HTTP URI identifiers (such as DOIs and LSIDs) following Linked Data guidelines. These services can be accessed with a web browser, or programmatically. For example, the basis of my entry in the Elsevier Grand Challenge was a database populated by harvesting data on literature, specimens, and GenBank using bioGUID's OpenURL resolver. Having tools such as bioGUID may help mobilise biodiversity data that is currently digitised but not easily accessible, and thus bring the goal of a linked web of biodiversity data a little closer to being realised.
Availability and requirements
Project Name: bioGUID
Operating System: The bioGUID web site is usable with any modern web browser. The source code can be easily installed on a Mac OS X or Linux server. It has not been tested on a Windows machine.
Programming Language: PHP
Other Requirements: Web server
License: GNU General Public License version 2
Any restrictions to use by non-academics: None
List of abbreviations used
DOI: Digital Object Identifier
DNS: Domain Name Service
GBIF: Global Biodiversity Information Facility
GUID: Globally Unique IDentifier
ISBN: International Standard Book Number
ISSN: International Standard Serial Number
LSID: Life Science Identifier
RDF: Resource Description Framework
SICI: Serial Item and Contribution Identifier
TDWG: Taxonomic Databases Working Group
URI: Uniform Resource IDentifier
URN: Uniform Resource Name
UUID: Universally Unique Identifier
WSDL: Web Services Description Language
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 14, 2009: Biodiversity Informatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S14.
- Linked Data - Connect Distributed Data across the Web [http://www.linkeddata.org]
- Page RDM: Biodiversity informatics: the challenge of linking data and the role of shared identifiers. Briefings in Bioinformatics 2008, 9(5):345–354. 10.1093/bib/bbn022
- Cameron RD: Scholar-Friendly DOI Suffixes with JACC: Journal Article Citation Convention. Tech. Rep. CMPT TR 1998–08, School of Computing Science, Simon Fraser University; 1998. [http://elib.cs.sfu.ca/USIN/JACC.html]
- Mesibov R: The millipede genus Lissodesmus Chamberlin, 1920 (Diplopoda: Polydesmida: Dalodesmidae) from Tasmania and Victoria, with descriptions of a new genus and 24 new species. Memoirs of Museum Victoria 2005, 62: 103–146.
- Life Science Record Name (LSRN) [http://lsrn.org/]
- "info" URI Scheme [http://info-uri.info/]
- Clark T, Martin S, Liefeld T: Globally distributed object identification for biological knowledgebases. Briefings in Bioinformatics 2004, 5(1):59–70. 10.1093/bib/5.1.59
- NISO: Serial Item and Contribution Identifier (SICI) ANSI/NISO Z39.56–1996 (Version 2). Bethesda, Maryland: National Information Standards Organization Press; 1996. [Approved August 14, 1996].
- Leach P, Mealling M, Salz R: RFC 4122: A Universally Unique Identifier (UUID) URN Namespace. 2005. [ftp://ftp.rfc-editor.org/in-notes/rfc4122.txt]
- Catalogue of Life [http://www.catalogueoflife.org/]
- Ewen R, Orme ACJ, White RJ: LSID Deployment in the Catalogue of Life. BNCOD 2008 Workshop: "Biodiversity Informatics: challenges in modelling and managing biodiversity knowledge" 2008. [http://biodiversity.cs.cf.ac.uk/bncod/OrmeJonesAndWhite.pdf]
- Biodiversity Information Standards (TDWG) [http://www.tdwg.org]
- Page RDM: Taxonomic names, metadata, and the Semantic Web. Biodiversity Informatics 2006, 3. [http://jbi.nhm.ku.edu/index.php/jbi/article/view/25]
- Martin S, Hohman MM, Liefeld T: The impact of Life Science Identifier on informatics data. Drug Discovery Today 2005, 10: 1566–1572. 10.1016/S1359-6446(05)03651-2
- Cool URIs for the Semantic Web [http://www.w3.org/TR/cooluris/]
- Smithsonian Digital Repository [http://si-pddr.si.edu/dspace/]
- Apps A, MacIntyre R: Why OpenURL? D-Lib Magazine 2006, 12: 5. 10.1045/may2006-apps
- The OpenURL Framework for Context-Sensitive Services. ANSI/NISO Standard Z39.88–2004. 2005.
- Chute R, de Sompel HV: Introducing djatoka: A Reuse Friendly, Open Source JPEG 2000 Image Server. D-Lib Magazine 2008, 14. 10.1045/september2008-chute
- Beit-Arie O, Blake M, Caplan P, Flecker D, Ingoldsby T, Lannom LW, Mischo WH, Pentz E, Rogers S, Sompel HVD: Linking to the Appropriate Copy. D-Lib Magazine 2001, 7. 10.1045/september2001-caplan
- OpenURL ContextObject in SPAN (COinS) [http://ocoins.info/]
- OpenURL Referrer [https://addons.mozilla.org/en-US/firefox/addon/4150]
- Göker M, Riethmüller A, Voglmayr H, Weiss M, Oberwinkler F: Phylogeny of Hyaloperonospora based on nuclear ribosomal internal transcribed spacer sequences. Mycological Progress 2004, 3: 83–94. 10.1007/s11557-006-0079-7
- bioGUID blog [http://bioguid.blogspot.com/]
- Page RDM: LSID Tester, a tool for testing Life Science Identifier resolution services. Source Code for Biology and Medicine 2008, 3: 2. [http://www.scfbm.org/content/3/1/2] 10.1186/1751-0473-3-2
- ADOdb Database Abstraction Library for PHP (and Python) [http://adodb.sourceforge.net/]
- PEAR Net_DNS [http://pear.php.net/package/Net_DNS]
- Smith M, Barton M, Branschofsky M, Mcclellan G, Walker JH, Bass M, Stuve D, Tansley R: DSpace: An Open Source Dynamic Digital Repository. D-Lib Magazine 2003, 9. 10.1045/january2003-smith
- Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [http://www.openarchives.org/pmh/]
- Pyle RL, Earle JL, Greene BD: Five new species of the damselfish genus Chromis (Perciformes: Labroidei: Pomacentridae) from deep coral reefs in the tropical western Pacific. Zootaxa 2008, 1671: 3–31.
- Feitelson DG: On identifying name equivalences in digital libraries. Information Research 2004, 9. [http://informationr.net/ir/9-4/paper192.html]
- The Big Dig [http://bigdig.ecoforge.net/]
- Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: Creatures of the Deep (Web). Briefings in Bioinformatics 2009, bbn051. 10.1093/bib/bbn051
- Page RDM: Visualising a scientific article. 2008. [Available from Nature Precedings]. 10.1038/npre.2008.2579.1
- The Elsevier Grand Challenge: Knowledge Enhancement in the Life Sciences [http://www.elseviergrandchallenge.com/]
- Vapour, a Linked Data validator [http://validator.linkeddata.org/vapour]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.