Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library
© Page; licensee BioMed Central Ltd. 2011
Received: 21 September 2010
Accepted: 23 May 2011
Published: 23 May 2011
The Biodiversity Heritage Library (BHL) is a large digital archive of legacy biological literature, comprising over 31 million pages scanned from books, monographs, and journals. During the digitisation process basic metadata about the scanned items is recorded, but not article-level metadata. Given that the article is the standard unit of citation, this makes it difficult to locate cited literature in BHL. Adding the ability to easily find articles in BHL would greatly enhance the value of the archive.
A service was developed to locate articles in BHL based on matching article metadata to BHL metadata using approximate string matching, regular expressions, and string alignment. This article locating service is exposed as a standard OpenURL resolver on the BioStor web site http://biostor.org/openurl/. This resolver can be used on the web, or called by bibliographic tools that support OpenURL.
BioStor provides tools for extracting, annotating, and visualising articles from the Biodiversity Heritage Library. BioStor is available from http://biostor.org/.
In July 2010 Lambert et al. published a paper in Nature that described an extinct sperm whale possessing the biggest bite of any tetrapod known. They named this formidable predator Leviathan melvillei, the genus name Leviathan being derived from the Hebrew 'Livyatan', the species name honouring Herman Melville (author of Moby Dick). As appropriate as this name was, it quickly ran foul of the rules of zoological nomenclature because Leviathan had been used 169 years previously for an extinct species of mammoth. Although the name Leviathan Koch had lapsed into obscurity (as a synonym of Mammut Blumenbach) its existence meant the newly discovered whale had to be renamed, which it duly was a month after the original publication.
The fate of Lambert et al.'s Leviathan illustrates a significant challenge facing researchers finding and naming new species - the discoverability of existing names. In the absence of a global register of all taxonomic names that have ever been published, a researcher about to publish a new name may struggle to establish that it has not already been used. Zoological nomenclature dates from 1758, botanical nomenclature from 1753, hence a comprehensive list of taxonomic names must survey some 250 years of literature, much of which is obscure and may not exist in digital form. Digitising this legacy literature is the goal of the Biodiversity Heritage Library (BHL) [7, 8], a consortium of natural history museum libraries, botanic libraries, and research institutions. The bulk of this digitisation is carried out by the Internet Archive, which scans books (broadly defined to include bound issues of journals), creating a set of electronic files for each scanned item, which includes images of individual pages, and text extracted from those pages using Optical Character Recognition (OCR). BHL takes these files (together with the output from the scanning projects of individual BHL members), indexes them by bibliographic metadata and taxonomic names, and makes the content available on its web site (both as web pages and web services). Although the bulk of BHL's scanning activities focus on pre-1923 content that is out of copyright, it also holds a considerable amount of post-1923 content contributed by its member institutions, notably publications by various natural history museums.
The inability to easily locate articles in BHL is a substantial obstacle to integrating this legacy biodiversity literature into mainstream scientific publishing. The goal of BioStor is to provide tools to locate and extract articles from the BHL archive. BioStor differs from search engines such as PubMed and Google Scholar, which support free-form queries such as "what articles have been published on this topic?", or "what papers has this author published?" BioStor addresses a different question, namely "does this article exist in the BHL archive?" It is a tool to find out whether a specific article exists in the archive, as opposed to finding what articles exist on a particular topic.
Locating articles in BHL
For most modern articles the triple of journal name, volume, and starting page is sufficient to uniquely identify an article, and tools such as CrossRef's OpenURL resolver can take this triple and discover whether a Digital Object Identifier (DOI) exists for that article. Publishers make use of this tool to map the literature cited in a manuscript to the corresponding DOI. In an ideal world the BHL model of (title, item, page) (Figure 1) would map exactly to (journal, volume, page), such that an individual journal would correspond to a title in BHL, and each volume of that journal would be a separate item. Given that BHL stores page numbers for each scanned page, locating articles would then be trivial and linking to BHL content could be readily integrated into existing publication processes, as well as bibliographic management tools that make use of CrossRef's services to augment user-provided metadata (e.g., Mendeley).
Unfortunately, the actual mapping between articles and BHL content is often rather more complicated. Large articles (e.g., monographs) may be treated as separate "titles" (effectively as if they were books), rather than parts of the same title. A contributing library may have bound several volumes of a journal together, such that a single "item" may comprise multiple volumes. Volume numbers themselves may not be unique within a journal. The Annals and Magazine of Natural History (ISSN 0374-5481), published from 1828 until 1967 (being succeeded by the Journal of Natural History, ISSN 0022-2933), is divided into 13 "series", each series numbering its volumes from one onwards. Hence, "volume 1" of Annals and Magazine of Natural History may refer to any one of 13 volumes spanning 138 years. Journals also differ in whether pagination is unique within a volume, or within parts of a volume. For example, in the journal Arkiv för Zoologi (ISSN 0004-2110) each article starts on page 1, so that the triple (Arkiv för Zoologi, 13, 1) may refer to [17, 18], or any of 23 other articles in volume 13 of that journal.
Discovering articles also assumes that the pagination in BHL is complete and correct, and that one side of a sheet of paper corresponds to a "page". BHL records the page number of regular pages, but not pages that are classified as special in some way, such as title pages, or tables of contents. For example, page 1 in Lynch et al. is recorded in BHL as being the title page without any number, which will frustrate efforts to locate this article by starting page alone.
Given that locating articles in an archive of legacy literature such as BHL is a non-trivial task, it is worth considering why such an undertaking is worthwhile, beyond integrating BHL with existing citation practices. Indeed, one could argue that, given that the OCR text for BHL content has been indexed by taxonomic name, the need for indexing by article has been greatly reduced - the user could simply search by taxonomic name and find the content they require. This would be sufficient for many users, especially if we were confident that BHL had correctly indexed all the taxonomic names contained in the pages it has scanned. However, OCR errors mean that a significant fraction of names will be missed. An obvious approach to discovering these missing names would be to take existing databases of taxonomic names and publications and search for those publications in BHL.
Construction and content
Locating an article
Step 1 - Finding the journal
The first step is to determine whether BHL includes the journal containing the article. BioStor uses a service provided by bioGUID [27, 28] to find the ISSN for the journal. If the bioGUID service returns an ISSN, the algorithm looks up the ISSN in the Title Identifier table (Figure 1) and retrieves the corresponding BHL TitleID. If the bioGUID service does not return an ISSN, the algorithm attempts to find the journal title in the ShortTitle field in the Title table using approximate string matching. If it fails to find the title it then searches the VolumeInfo field in the Item table - for some journals (e.g., Fieldiana Zoology, ISSN 0015-0754) the journal title is stored in that field. If the journal still cannot be found at this point, the algorithm exits.
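The lookup order described in Step 1 can be illustrated with a minimal Python sketch. It is not BioStor's code (BioStor is written in PHP): the in-memory tables, the TitleID and ItemID values, the similarity threshold, and the use of difflib for approximate matching are all assumptions made for illustration, and the VolumeInfo fallback is reduced to a simple substring test.

```python
from difflib import SequenceMatcher

# Toy stand-ins for the BHL tables; all TitleID and ItemID values are made up.
TITLE_IDENTIFIER = {"0374-5481": 1}                      # ISSN -> TitleID
TITLE = {1: "Annals and Magazine of Natural History",    # TitleID -> ShortTitle
         2: "Arkiv för Zoologi"}
ITEM = {10: (3, "Fieldiana Zoology v. 12")}              # ItemID -> (TitleID, VolumeInfo)


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_title_id(journal, issn=None, threshold=0.8):
    """Return a TitleID for the journal, or None if it cannot be found."""
    # 1. Exact lookup by ISSN, when one is available.
    if issn in TITLE_IDENTIFIER:
        return TITLE_IDENTIFIER[issn]
    # 2. Approximate match of the journal name against ShortTitle.
    best_id, best_title = max(TITLE.items(),
                              key=lambda kv: similarity(journal, kv[1]))
    if similarity(journal, best_title) >= threshold:
        return best_id
    # 3. Fall back to VolumeInfo, where some journal titles are stored.
    for title_id, volume_info in ITEM.values():
        if journal.lower() in volume_info.lower():
            return title_id
    return None  # journal not found: give up
```

With these toy tables, "Arkiv for Zoologi" (without the umlaut) is still resolved through approximate matching, whereas "Fieldiana Zoology" is only found through the VolumeInfo fallback.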
Step 2 - Finding scanned items for the journal
Ideally each journal corresponds to a single BHL title, but in some cases the same journal may be represented by more than one BHL title, and hence have more than one TitleID. Step 2 uses a hard-coded table of such cases to ensure that all items for a given journal are considered by Step 3.
Step 3 - Finding the volume and page
Ideally the VolumeInfo field in the Item table would contain just the volume number; in practice, all manner of free-form text may be found there. The volume may be recorded as simple numbers or as strings, sometimes indicating volume, page or date ranges, notes on completeness of the volume, or other comments (e.g., "Index"). Metadata may also be in a variety of languages, such that the field may refer to "Volume", "Band", or "Tome". Nor is metadata always recorded consistently within a journal. For example, the VolumeInfo field for scanned items belonging to the journal Proceedings of the Zoological Society of London contains strings such as:
Part 1- Part 4 (1833-38)
1901, v. 1 (Jan.-Apr.)
1912 v. 2
1923, pt. 1-2 (pp. 1-481)
BioStor uses a set of ad-hoc regular expressions to extract the volume from the VolumeInfo field, along with other information where present (such as series, issue, and date). If no match to the target volume is found the algorithm exits.
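As an illustration of this approach, the following Python sketch parses the four Proceedings of the Zoological Society of London strings listed above. The patterns and the field names (year, volume, part_from, and so on) are assumptions written for this example; they are not the expressions BioStor itself uses.

```python
import re

# Patterns are tried in order; each covers one style seen in the examples.
PATTERNS = [
    # "1901, v. 1 (Jan.-Apr.)" or "1912 v. 2": year followed by a volume
    re.compile(r"^(?P<year>\d{4}),?\s+v\.\s*(?P<volume>\d+)"),
    # "1923, pt. 1-2 (pp. 1-481)": year followed by a range of parts
    re.compile(r"^(?P<year>\d{4}),?\s+pt\.\s*(?P<part_from>\d+)-(?P<part_to>\d+)"),
    # "Part 1- Part 4 (1833-38)": range of parts followed by a range of years
    re.compile(r"^Part\s+(?P<part_from>\d+)\s*-\s*Part\s+(?P<part_to>\d+)\s*"
               r"\((?P<year>\d{4})-(?P<year_to>\d{2,4})\)"),
]


def parse_volume_info(text):
    """Return the fields found in a VolumeInfo string, or None if no pattern fits."""
    for pattern in PATTERNS:
        match = pattern.search(text.strip())
        if match:
            # Abbreviated end years such as "38" are returned as written.
            return {name: int(value) for name, value in match.groupdict().items()}
    return None
```

A string such as "Index" matches none of the patterns and returns None, which corresponds to the case where the algorithm exits.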
Step 4 - Checking the match
Utility and Discussion
The BioStor database is available at http://biostor.org/. It features an OpenURL resolver, and can display individual articles, lists of publications by author, by taxonomic name, and by journal. At the time of writing the database contains 26,784 articles extracted from BHL.
Cutting and pasting bibliographic details into web forms is tedious, so the web interface to the OpenURL resolver is intended for casual use only. Instead, it is envisaged that users will interact with the OpenURL resolver using one of the bibliographic tools that supports the protocol, such as EndNote and Zotero, or a web browser that supports OpenURL ContextObject in SPAN (COinS), such as Firefox with the OpenURL Referrer add-on. For example, the following OpenURL corresponds to the web form shown in Figure 8a (with line breaks added for clarity):
&atitle=On the Arachnida taken in the Transvaal and in Nyasaland by Mr W. L. Distant and Dr Percy
&title=Ann. Mag. nat. Hist.
&volume=1
&spage=308
&epage=321
&date=1898
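A query of this kind can also be assembled programmatically. The Python sketch below builds the query string from the same keys (atitle, title, volume, spage, epage, date) and the resolver address given earlier; the function name is hypothetical, the article title is taken from the reference list, and any further parameters the resolver may accept are not shown.

```python
from urllib.parse import urlencode

RESOLVER = "http://biostor.org/openurl/"  # resolver address as given in the text


def build_openurl(**fields):
    """Build an OpenURL query, skipping fields that are empty or missing."""
    query = urlencode({key: value for key, value in fields.items()
                       if value not in (None, "")})
    return f"{RESOLVER}?{query}"


EXAMPLE = build_openurl(
    atitle="On the Arachnida taken in the Transvaal and in Nyasaland "
           "by Mr W. L. Distant and Dr Percy Rendall",
    title="Ann. Mag. nat. Hist.",
    volume="1",
    spage="308",
    epage="321",
    date="1898",
)
```

urlencode percent-encodes the values, so spaces and punctuation in titles do not need to be escaped by hand.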
The ability of BioStor to find articles in BHL depends on several factors. An obvious reason BioStor may fail to find an article is that it simply has not been scanned by BHL. Alternatively, it may have been scanned by BHL but not yet added to the local copy of BHL used by BioStor. Even if an article exists in BHL, BioStor may fail to find it if the metadata describing the item that contains the article doesn't conform to one of the regular expressions BioStor uses to interpret the VolumeInfo field in the Item table. Because BioStor evaluates the quality of a match by comparing the title of the target article with the OCR text (Figure 6), OCR errors may result in the match being deemed too poor to be correct. If the metadata for the target article contains significant errors, such as incorrect pagination, then BioStor may also fail to find an article.
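To make the effect of OCR errors on match quality concrete, the sketch below scores an article title against OCR text using Smith-Waterman local alignment. The scoring values and the normalisation (dividing by the score of a perfect match to the title) are assumptions chosen for illustration, and may differ from the score BioStor computes.

```python
def local_alignment_score(title, ocr_text, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score of title against ocr_text, scaled to [0, 1]."""
    a, b = title.lower(), ocr_text.lower()
    if not a or not b:
        return 0.0
    best = 0
    previous = [0] * (len(b) + 1)          # previous row of the scoring matrix
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(0,                      # start a new local alignment
                             previous[j - 1] + step, # match or mismatch
                             previous[j] + gap,      # gap in the OCR text
                             current[j - 1] + gap)   # gap in the title
            best = max(best, current[j])
        previous = current
    # A title found verbatim in the OCR text scores match * len(title).
    return best / (match * len(a))
```

Under these settings a title that occurs verbatim in the OCR text scores 1.0, and a single OCR error such as "rn" in place of "m" lowers the score only slightly, so a threshold well below 1.0 tolerates a moderate amount of OCR noise.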
Retrieval of articles in the journal Tijdschrift voor Entomologie
To provide a benchmark for BioStor's performance I used an EndNote database of 2330 articles from the journal Tijdschrift voor Entomologie spanning the years 1858 to 1999, inclusive, assembled by E. J. van Nieukerken as part of a complete index of the journal. Almost all volumes of Tijdschrift voor Entomologie for this period have been scanned by BHL, so ideally BioStor should recover most, if not all, of these articles. This database was chosen because of the quality of its bibliographic metadata, and because it spans some 150 years, during which time the typeface and layout of the journal changed significantly.
The EndNote file for Tijdschrift voor Entomologie was converted into a Research Information Systems (RIS) format file, which was then parsed by a script which extracted each article, constructed an OpenURL query, and forwarded it to BioStor, which returned a response in JSON format. The script recorded whether a match for each article was found, ignoring matches with an alignment score of less than 0.5. As part of the output the script created web pages displaying details of each putative match including a thumbnail image of the first page of the article, making it possible to quickly evaluate whether the match was correct. The database, scripts, and HTML output are available from http://biostor.org/ms/.
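The first stage of such a script, turning RIS records into OpenURL parameters, might look like the Python sketch below. It is an illustration rather than the script used for the benchmark: the tag mapping is an assumption based on common RIS tags, the sample record is the Pocock article from the reference list rather than a Tijdschrift voor Entomologie record, and the HTTP request and the handling of the JSON response are omitted because the response format is not described here.

```python
import re

# A RIS line is a two-character tag, two spaces, a hyphen, then the value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

# Assumed mapping from common RIS tags to the OpenURL keys used above.
RIS_TO_OPENURL = {"T1": "atitle", "TI": "atitle", "JO": "title", "JF": "title",
                  "VL": "volume", "SP": "spage", "EP": "epage",
                  "PY": "date", "Y1": "date"}

SAMPLE = """TY  - JOUR
AU  - Pocock, R. I.
T1  - On the Arachnida taken in the Transvaal and in Nyasaland by Mr W. L. Distant and Dr Percy Rendall
JO  - Ann Mag nat Hist
VL  - 1
SP  - 308
EP  - 321
PY  - 1898
ER  -
"""


def parse_ris(text):
    """Yield one dict of OpenURL parameters for each RIS record in text."""
    record = {}
    for line in text.splitlines():
        found = RIS_LINE.match(line.rstrip())
        if not found:
            continue
        tag, value = found.groups()
        if tag == "ER":                      # end of record
            if record:
                yield record
            record = {}
        elif tag in RIS_TO_OPENURL:
            key = RIS_TO_OPENURL[tag]
            if key == "date":                # keep the year from values like "1898///"
                value = value.split("/")[0]
            record.setdefault(key, value.strip())
```

Each dictionary produced by parse_ris can then be URL-encoded and appended to the resolver address to form the query for one article.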
Tijdschrift voor Entomologie is just one of the journals scanned by BHL, and it would be desirable to evaluate BioStor's performance across a range of journals. However, at present evaluation is hampered by the lack of freely available, comprehensive bibliographic databases for taxonomic journals.
The metadata (such as title, authors, journal name, etc.) can all be edited by the user. These edits will be saved if the user passes a reCAPTCHA test. The metadata can be retrieved in standard formats such as Reference Manager (RIS), EndNote XML, and BibTeX. The web page also contains bibliographic metadata embedded using the Context Object in Span (COinS) technique, and <meta> tags using the Dublin Core and Google Scholar vocabularies. The article itself can also be downloaded as a PDF file, with bibliographic metadata embedded using Adobe's Extensible Metadata Platform (XMP). Desktop bibliographic software that can read XMP, such as Mendeley [15, 43] and Papers, can extract this metadata so that the user need not manually re-enter bibliographic details for the paper.
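For readers unfamiliar with COinS, the sketch below shows how a journal article can be encoded as a COinS span in Python. The helper name and the choice of fields are assumptions for illustration; the markup BioStor actually emits may include other fields.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(atitle, jtitle, volume, spage, epage, date):
    """Return an empty HTML span carrying an OpenURL ContextObject for an article."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", atitle),
        ("rft.jtitle", jtitle),
        ("rft.volume", volume),
        ("rft.spage", spage),
        ("rft.epage", epage),
        ("rft.date", date),
    ]
    # The query string is HTML-escaped because it sits inside an attribute.
    return f'<span class="Z3988" title="{escape(urlencode(fields))}"></span>'
```

Tools that understand COinS look for spans with the class Z3988 and read the ContextObject from the title attribute.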
The article page also displays the taxonomic and, where possible, geographic scope of the article. Taxonomic scope is represented by a tag cloud of the taxonomic names that BHL has found in the OCR text for the article, and by a taxonomic classification of those names based on the 2008 edition of the Catalogue of Life. When an article is added to the BioStor database the OCR text is searched for strings that represent latitude and longitude values for point locations. Any points found are displayed on a Google Map.
BioStor displays a summary page for each author in the database. To mitigate the problem of an author having more than one spelling of their name, BioStor clusters names using a web service provided by bioGUID, which implements Feitelson's weighted clique algorithm for finding equivalent names. The summary page aggregates publications and coauthorships across this set of names. The page uses Exhibit to create a faceted browser, enabling the user to browse an author's publications by date, journal, and coauthors.
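The coordinate search can be sketched as follows in Python. The pattern is an assumption that handles only one common notation (degrees and minutes followed by a hemisphere letter, latitude first); the patterns BioStor uses are not described here, and the coordinates in the usage note are invented for the example.

```python
import re

# Degrees and minutes with hemisphere letters, e.g. 18°30'S, 47°15'E
COORD = re.compile(
    r"(?P<lat_d>\d{1,2})°\s*(?P<lat_m>\d{1,2})['′]\s*(?P<lat_h>[NS])"
    r"[,;\s]+"
    r"(?P<lon_d>\d{1,3})°\s*(?P<lon_m>\d{1,2})['′]\s*(?P<lon_h>[EW])"
)


def find_points(text):
    """Return (latitude, longitude) pairs in decimal degrees found in text."""
    points = []
    for m in COORD.finditer(text):
        lat = int(m["lat_d"]) + int(m["lat_m"]) / 60
        lon = int(m["lon_d"]) + int(m["lon_m"]) / 60
        if m["lat_h"] == "S":
            lat = -lat
        if m["lon_h"] == "W":
            lon = -lon
        if abs(lat) <= 90 and abs(lon) <= 180:   # discard impossible values
            points.append((lat, lon))
    return points
```

For instance, the text "collected at 18°30'S, 47°15'E near the river" yields the single point (-18.5, 47.25). OCR text often garbles the degree sign, so a production pattern would need to be considerably more forgiving.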
Displaying taxonomic names
If the user clicks on a name in the taxonomic tag cloud (Figure 10), or appends a taxonomic name (or uBio NameBankID) to the URL http://bioguid.org/name/ for a name that has been taxonomically indexed by BHL, BioStor displays a web page listing the articles in BioStor that contain that name. The page also displays a sparkline showing the distribution of that name over time in the local copy of BHL, and lists taxonomic synonyms of the name according to the 2008 edition of the Catalogue of Life.
Searching and browsing
BioStor locates articles by matching existing bibliographies to BHL content, hence it relies on external sources of metadata to find articles. Typically these are bibliographies assembled by individual taxonomists for particular taxonomic groups, or lists of articles published in a single journal. An alternative approach would be to extract articles directly from the archive. Lu et al. used feature extraction and a mixture of rule-based and machine-learning techniques to extract metadata from BHL OCR text, recovering between 66% and 94% of articles in a selection of three journals. The set of articles in BioStor could be used as a training data set to help further develop these methods. Another approach to article extraction is crowd sourcing, where the task of identifying articles would be devolved to users. Ultimately, crowd sourcing could become important in cleaning metadata, but it may prove challenging to engage users in creating metadata from scratch.
The BHL archive has extracted taxonomic names from the OCR text, and BioStor looks for geographic localities encoded as latitude and longitude pairs. We could make more extensive use of the OCR text, for example by using autonomous citation indexing to extract citations from the literature cited section of each article. These citations could in turn be fed into the BioStor OpenURL resolver to attempt to locate them in BHL. The combination of variable citation styles and OCR errors means that the same reference may be represented by several different citations, requiring tools for cleaning and merging citation data.
BioStor is built as a service on top of a copy of data from BHL, and creates a local bibliographic database of articles. One future direction would be to integrate this data with BHL itself. BHL has an OpenURL resolver http://www.biodiversitylibrary.org/openurlhelp.aspx that primarily supports books rather than articles. Adding metadata from BioStor could enhance the BHL OpenURL service, and provide the biodiversity community with a single source for BHL-derived content. BioStor content could also be added to other bibliographic databases, in particular Mendeley [15, 43]. Mendeley is developing an API for storing and retrieving documents and associated metadata, hence it might be possible to devolve the storing of basic bibliographic metadata to Mendeley, BioStor then becoming simply an OpenURL resolver.
The 31 million scanned pages made available by the Biodiversity Heritage Library (BHL) represent a substantial resource of biological literature. BioStor provides an OpenURL resolver to locate articles in this archive. Each article extracted from BHL is given a unique URL, corresponding to a web page that displays the article pages, and information about the taxonomic names and geographic localities mentioned in the article. BioStor is available at http://biostor.org/.
Availability and requirements
Project Name: BioStor
Project Home Page: http://biostor.org/. Source code is available from http://code.google.com/p/bioguid/source/browse/#svn/trunk/biostor.
Operating System: The BioStor web site is usable with any modern web browser. The source code can be easily installed on a Mac OS X or Linux server. It has not been tested on a Windows machine.
Programming Language: PHP
Other Requirements: Web server
License: GNU General Public License version 2
Any restrictions to use by non-academics: None
API: Application Programming Interface
BHL: Biodiversity Heritage Library
DOI: Digital Object Identifier
ISSN: International Standard Serial Number
OCR: Optical Character Recognition
URL: Uniform Resource Locator.
The core data for BioStor comes from the Biodiversity Heritage Library. Chris Freeland, Phil Cryer, and Mike Lichtenberg provided data dumps from BHL, and answered queries regarding the BHL database schema. E. J. van Nieukerken kindly provided the EndNote database for Tijdschrift voor Entomologie. I thank the anonymous referees for their comments.
- Lambert O, Bianucci G, Post K, de Muizon C, Salas-Gismondi R, Urbina M, Reumer J: The giant bite of a new raptorial sperm whale from the Miocene epoch of Peru. Nature 2010, 466(7302):105–108. 10.1038/nature09067
- Melville H: Moby-Dick. Richard Bentley, London; 1851.
- International Commission on Zoological Nomenclature: International code of zoological nomenclature. International Trust for Zoological Nomenclature. 4th edition. 1999.
- Koch AC: Description of the Missourium, or Missouri Leviathan: together with its supposed habits and Indian traditions concerning the location from whence it was exhumed; also, comparisons of the whale, crocodile and missourium with the leviathan, as described in 41st chapter of the book of Job. 2nd edition. Prentice and Weissinger; 1841. [http://www.biodiversitylibrary.org/item/81522]
- Lambert O, Bianucci G, Post K, de Muizon C, Salas-Gismondi R, Urbina M, Reumer J: The giant bite of a new raptorial sperm whale from the Miocene epoch of Peru. Nature 2010, 466(7310):1134. 10.1038/nature09381
- Anonymous: The legacy of Linnaeus. Nature 2007, 446: 231–232.
- Biodiversity Heritage Library [http://biodiversitylibrary.org]
- Pilsk S, Person M, Deveer J, Furfey J, Kalfatovic M: The Biodiversity Heritage Library: Advancing Metadata Practices in a Collaborative Digital Library. Journal of Library Metadata 2010, 10(2):136–155. 10.1080/19386389.2010.506400
- Internet Archive [http://www.archive.org/]
- Google Scholar [http://scholar.google.com/]
- Cameron RD: Scholar-Friendly DOI Suffixes with JACC: Journal Article Citation Convention. Tech. Rep. CMPT TR 1998–08, School of Computing Science, Simon Fraser University 1998.
- CrossRef OpenURL [http://www.crossref.org/openurl]
- The Digital Object Identifier System [http://www.doi.org/]
- Evenhuis NL: Publication and dating of the journals forming the Annals and Magazine of Natural History and the Journal of Natural History. Zootaxa 2003, 385: 1–68.
- Alexander CP: The crane-flies collected by the Swedish expedition (1895–1896) to southern Chile and Tierra del Fuego (Tipulidae, Diptera). Arkiv för Zoologi 1920, 13(6):1–32. [http://biostor.org/reference/13820]
- Michaelsen W: Neue und wenig bekannte Oligochäten aus skandinavischen Sammlungen. Arkiv för Zoologi 1921, 13(19):1–25. [http://biostor.org/reference/14784]
- Lynch JD, Ruíz-Carranza PM, Ardila-Robayo MC: The identities of the Colombian frogs confused with Eleutherodactylus latidiscus (Boulenger) (Amphibia: Anura: Leptodactylidae). Occasional Papers of the Museum of Natural History University of Kansas 1994, 170: 1–42. [http://biostor.org/reference/228]
- Wei Q, Heidorn PB, Freeland C: Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL). iConference 2010 Proceedings 2010, 284–288. [http://hdl.handle.net/2142/14919]
- Encyclopedia of Life [http://www.eol.org/]
- Holthuis LB: The Scientific Name of the Sperm Whale. Marine Mammal Science 1987, 3: 87–89. 10.1111/j.1748-7692.1987.tb00154.x
- Schevill WE: Mr. Schevill replies. Marine Mammal Science 1987, 3: 89–90. 10.1111/j.1748-7692.1987.tb00155.x
- Schevill WE: The International Code of Zoological Nomenclature and a paradigm: the name Physeter catodon Linnaeus 1758. Marine Mammal Science 1986, 2(2):153–157. 10.1111/j.1748-7692.1986.tb00036.x
- Page RDM: Wikipedia as an encyclopaedia of life. Organisms Diversity and Evolution 2010, 10(4):343–349. 10.1007/s13127-010-0028-9
- de Sompel HV, Beit-Arie O: Open Linking in the Scholarly Information Environment Using the OpenURL Framework. D-Lib Magazine 2001, 7(3). 10.1045/march2001-vandesompel
- Page RDM: bioGUID: resolving, discovering, and minting identifiers for biodiversity informatics. BMC Bioinformatics 2009, 10(Suppl 14):S5. 10.1186/1471-2105-10-S14-S5
- ISSN International Centre [http://www.issn.org]
- Smith TF, Waterman MS: Identification of common molecular subsequences. Journal of Molecular Biology 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5
- Holt EWL, Tattersall WM: Preliminary notice of the Schizopoda collected by H.M.S. Discovery in the Antarctic region. Ann Mag Nat Hist 1906, 17: 1–11. [http://biostor.org/reference/50163]
- von Ahn L, Maurer B, McMillen C, Abraham D, Blum M: reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science 2008, 321(5895):1465–1468. 10.1126/science.1160379
- OpenURL ContextObject in SPAN (COinS) [http://ocoins.info/]
- OpenURL Referrer [https://addons.mozilla.org/en-US/firefox/addon/4150]
- van Nieukerken EJ: Tijdschrift voor Entomologie 150 volumes: one and a half century of Systematic Entomology in a changing world. Tijdschrift voor Entomologie 2007, 1(2):245–261. [http://www.repository.naturalis.nl/document/93299]
- Raselimanana AP, Raxworthy CJ, Nussbaum RA: A revision of the dwarf Zonosaurus Boulenger (Reptilia: Squamata: Cordylidae) from Madagascar, including descriptions of three new species. Scientific Papers Natural History Museum University of Kansas 2000, 18: 1–16. [http://biostor.org/reference/50335]
- Dublin Core Metadata Initiative [http://dublincore.org/]
- Adobe XMP [http://www.adobe.com/products/xmp/index.html]
- Henning V, Reichelt J: Mendeley - A Last.fm For Research? eScience '08. IEEE Fourth International Conference on eScience, 2008 2008, 327–328.
- The Species 2000 and ITIS Catalogue of Life [http://www.catalogueoflife.org]
- Feitelson DG: On identifying name equivalences in digital libraries. Information Research 2004, 9. [http://informationr.net/ir/9-4/paper192.html]
- Exhibit: Publishing Framework for Data-Rich Interactive Web Pages [http://www.simile-widgets.org/exhibit/]
- WorldCat.org: The World's Largest Library Catalog [http://www.worldcat.org/]
- Universal Biological Indexer and Organizer (uBio) [http://www.ubio.org/]
- Lu X, Kahle B, Wang JZ, Giles CL: A metadata generation system for scanned scientific volumes. Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries 2008, 167–179. 10.1145/1378889.1378918
- Lawrence S, Giles CL, Bollacker K: Digital libraries and autonomous citation indexing. IEEE COMPUTER 1999, 32(6):67–71. 10.1109/2.769447
- Councill IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles CL: Learning metadata from the evidence in an on-line citation matching scheme. In JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries. New York, NY, USA: ACM; 2006:276–285. 10.1145/1141753.1141817
- Pocock RI: On the Arachnida taken in the Transvaal and in Nyasaland by Mr W. L. Distant and Dr Percy Rendall. Ann Mag nat Hist 1898, 1: 308–321. [http://biostor.org/reference/52084]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.