bioNerDS: exploring bioinformatics’ database and software use through literature mining
© Duck et al.; licensee BioMed Central Ltd. 2013
Received: 15 October 2012
Accepted: 11 June 2013
Published: 15 June 2013
Biology-focused databases and software define bioinformatics and their use is central to computational biology. In such a complex and dynamic field, it is of interest to understand what resources are available, which are used, how much they are used, and for what they are used. While scholarly literature surveys can provide some insights, large-scale computer-based approaches to identify mentions of bioinformatics databases and software from primary literature would automate systematic cataloguing, facilitate the monitoring of usage, and provide the foundations for the recovery of computational methods for analysing biological data, with the long-term aim of identifying best/common practice in different areas of biology.
We have developed bioNerDS, a named entity recogniser for the recovery of bioinformatics databases and software from primary literature. We identify such entities with an F-measure ranging from 63% to 91% at the mention level and 63-78% at the document level, depending on corpus. Not attaining a higher F-measure is mostly due to high ambiguity in resource naming, which is compounded by the on-going introduction of new resources. To demonstrate the software, we applied bioNerDS to full-text articles from BMC Bioinformatics and Genome Biology. General mention patterns reflect the remit of these journals, highlighting BMC Bioinformatics’s emphasis on new tools and Genome Biology’s greater emphasis on data analysis. The data also illustrate some shifts in resource usage: for example, the past decade has seen R and the Gene Ontology join BLAST and GenBank as the main components in bioinformatics processing.
Conclusions: We demonstrate the feasibility of automatically identifying resource names on a large scale from the scientific literature and show that the generated data can be used for exploration of bioinformatics database and software usage. For example, our results help to investigate the rate of change in resource usage and corroborate the suspicion that a vast majority of resources are created, but rarely (if ever) used thereafter. bioNerDS is available at http://bionerds.sourceforge.net/.
The fields of bioinformatics and computational biology are established as ones of rapid change with a continued expansion of the available “resourceome”, which includes numerous databases and software [1, 2]. Such resources facilitate research in biology, and many have become “household names” (e.g., BLAST, ClustalW, etc.). Still, the huge resourceome also creates problems for the choice of appropriate methods for performing a particular task, and poses a challenge of identifying “best practice”: a well-known, popular tool may not be the “best” tool currently available. To help with method choice, we first need to determine what software and data resources are available and used in computational analyses. Several inventories and repositories already exist that list available database and software resources. For example, the 2011 special issues of Nucleic Acids Research’s Databases and the Bioinformatics Links Directory list over 1,330 databases and over 1,250 web services respectively. However, many of these inventories and repositories are incomplete and require labour-intensive manual curation. Similarly, “manual” literature surveys of published tools and databases are time-consuming and often out-of-date by the time they are published. Therefore, large-scale automated ways for extraction of database and software use patterns are needed. As well as helping with maintenance of resource catalogues, such systematic processing could offer insights into the dynamics of software and data resource usage, particularly as many resources are infrequently used. This is not only of interest to users of these resources, who wish to know what is current and most used, but also to any potential new users and resource developers.
In our previous work we used the literature to explore and evaluate methods used in phylogenetics [5, 8]. We implemented a named-entity recognition (NER) system that utilised a controlled vocabulary of terms as specified by a comprehensive software resource dictionary. We also used a semantic-based approach to identify and profile existing and new resources using keyword association. We then attempted to capture phylogenetic methods based on a predefined abstract representation of four stages within phylogenetics (sequence alignment; tree inference; statistical testing and data re-sampling; tree visualisation and annotation). This approach could be applicable to other fields within bioinformatics, but it first requires an extensive resource repository and the ability to identify mentions of tools and databases in text. This task is far from trivial: in our previous work we have demonstrated the high level of ambiguity and variability of database and software names in the bioinformatics literature.
In this paper we introduce and evaluate bioNerDS, a bioinformatics named-entity recognition system for database and software names, which is used to identify mentions of such entities in the literature. It makes use of a range of both sentence and document-level clues to learn database and software names, while propagating mentions up to the article level. To illustrate its potential, we use bioNerDS to survey software and data resource usage in two journals from computational biology and bioinformatics.
Several other approaches to automated extraction of bioinformatics resources from primary literature have been suggested. For example, OReFiL (Online Resource Finder in Life sciences) and BIRI (BioInformatics Resource Inventory) aim to harvest resource names and fill their repositories in order to enable resource discovery. OReFiL uses URLs as a “proxy” to identify mentions of resources by custom regular expressions and by extracting <url>..</url> tags from BioMed Central (BMC) papers. BIRI utilises keywords and sentence structure to identify relevant terms through custom patterns translated into “transition networks”, which match associated regular expressions for resource names, functions and classifications. bioNerDS, on the other hand, builds on established approaches to NER by using a generally applicable method for identification of software and database mentions. Furthermore, while OReFiL focuses on the abstract and “availability” or “implementation” sections, and BIRI solely looks into abstracts and titles, bioNerDS can detect resource name mentions throughout full-text articles.
We note that throughout this paper we will mention numerous databases and tools by name as examples. A full list of references and web-links to these can be found on our website. Note, we only cite the first mention of the resource within this paper.
bioNerDS is designed and developed as an NER tool that aims to recognise database and software mentions in literature, and to provide a document-level “list” of resources mentioned in a given article. We identify resource names that represent databases, ontologies, classifications, software, programs, tools, web-services or packages, and exclude names of files and file formats, methods, algorithms, identifiers, operating systems and programming languages.
While previous steps focus on single mentions, we also collect supporting “weak” evidence across different mentions of the same candidate (within the document) and use it to update/adjust the mention-level scores. Finally, all candidate mentions with a score above a given threshold are propagated through to the whole document and extracted so a document-level list of mentioned resources can be generated. We discuss these steps in the following subsections.
Table 1: Applied scores for local clues
Matches title pattern
Is part of a known resource enumeration
Is part of a Hearst pattern
Associated with a positive head term
Followed by a version number
Followed by a reference
Followed by a hyper-link or URL
Is MiXeD CaSe
Is UPPER CASE
Matches Bioconductor dictionary
Is an English dictionary word
Is a known bio-acronym
Associated with a negative head term
Is lower case
Is only a part of a word
Term fires multiple positive clues
Associated with a weak identifier
We consider several situations where new/unknown database/tool names can be identified. In our previous research, we have shown that such names are typically contiguous sequences of nouns. We find that including other part of speech tags in this definition is unhelpful, as names do not commonly include other token types (e.g., modifiers). Therefore, we focus on noun phrases that contain only nouns.
Each software or database candidate noun phrase mention is then assigned a score, which integrates scores for several clues that are spotted in the neighbourhood of the mention. Table 1 shows the individual scores assigned to particular patterns. Our approach is similar to other rule-based scoring approaches (e.g., species name identification). Initial scores were generated by ranking the rules according to their extraction potential and assigning numeric values, with the most powerful given the highest score. We took these numbers as weights and adjusted them empirically based on results from the training data. The scores for all patterns that apply to a given mention are then summed up to provide a score for the given mention.
Title mentions: Many articles that introduce a new database or program place the name in their paper’s title, often following a standard format where the name begins the title, followed by some punctuation (colon, dash, etc.), and finally with a short description (or the expanded acronym) that typically includes a specific keyword. We have compiled a list of such keywords that indicate the presence of a software/database mention (e.g., database, ontology, web service, etc. [10, 11]; a full list is available from our website).
Enumerations and Hearst patterns: We have followed a standard approach to identify enumerations of names, primarily using Hearst patterns, e.g., “tool such as MUMmer or Vmatch” (PMCID: PMC2753849). Any list of noun phrases is also considered, even if not part of a Hearst pattern, if at least one of the members has been recognised as a candidate name from our dictionary.
“Good” head nouns: The list of keywords is re-used in combination with the Stanford dependency parser to identify noun phrases associated with the keyword heads (e.g., “The PolyFreq program”, PMCID: PMC1239908) in order to “recover” potential database and software names that precede the keyword.
Version mention: Strings that appear to represent a version number (e.g., 2.0) following a candidate noun phrase are considered.
References and URLs: Such mentions are also good indicators of a possible preceding database or program name.
Positive orthographic clues: Words in ALL CAPS and MiXeD CaSe gain a small positive boost in their score.
In case a mention matches several clues, they are all combined. Additionally, we take the number of different positive clues fired for the given mention, and multiply this by the Compound Factor, which is then added to the candidate’s score.
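As an illustration of two of the positive clues described above (the title pattern and the version-number clue), the following is a minimal sketch using plain regular expressions. It is not the JAPE grammar used by bioNerDS; the keyword tuple is only the subset quoted in the text, and the function names and exact patterns are assumptions made for this example.

```python
import re

# Subset of title keywords quoted in the text; the full list is on the
# authors' website and is not reproduced here.
KEYWORDS = ("database", "ontology", "web service")

# Assumed title shape: NAME, then a colon or hyphen, then a description.
TITLE_RE = re.compile(r"^(?P<name>[^\s:]+)\s*[:\-]\s+(?P<desc>.+)$")

# Assumed version shape: an optional "v"/"version" prefix and dotted digits.
VERSION_RE = re.compile(r"^\s*(?:v(?:ersion)?\.?\s*)?\d+(?:\.\d+)+")


def title_candidate(title):
    """Return the leading name if the title looks like 'NAME: ... keyword ...'."""
    match = TITLE_RE.match(title)
    if match and any(k in match.group("desc").lower() for k in KEYWORDS):
        return match.group("name")
    return None


def followed_by_version(text_after_candidate):
    """True if the text right after a candidate starts with a version string."""
    return bool(VERSION_RE.match(text_after_candidate))
```

For example, `title_candidate` applied to this article's own title would return `bioNerDS`, because the description after the colon contains the keyword "database".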
Common English words and acronyms: Our primary method of filtering names is through term comparison to a common-English dictionary and an acronym list. If a predicted candidate term is either a known English word or known acronym, then it takes a score reduction (see Table 1). The English dictionary is derived from a publicly available list and the acronym dictionary is derived from ADAM, consisting of 86,308 and 1,933 terms, respectively.
“Negative” head nouns: This approach is similar to the identification of keyword heads that characterise positive mentions, but instead uses a set of “blacklisted” terms with the primary aim of restricting the matches to only those within the scope of the definitions. For example, these heads help filter file formats, programming languages, methods, algorithms and so on (the full list is available on our website).
Negative orthographic clues: lower case words receive a slight decrease in their final score.
A partial word match: This helps filter out some situations of incorrect tokenization, in particular, for database identifiers. For example, this will help filter out the “GO” in GO:001234.
All rules are designed in JAPE (compound regular expressions) and are matched using GATE.
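The way local clue scores are combined can be sketched as follows. This is a minimal illustration, not the bioNerDS implementation: the numeric weights and the value of the Compound Factor are placeholders invented for the example (the actual Table 1 values are not reproduced here), and the clue identifiers are hypothetical names.

```python
# Placeholder weights: hypothetical values, NOT the scores from Table 1.
CLUE_SCORES = {
    "title_pattern": 4.0,
    "hearst_pattern": 3.0,
    "positive_head": 3.0,
    "version_number": 2.0,
    "reference": 1.5,
    "url": 1.5,
    "mixed_case": 0.5,
    "upper_case": 0.5,
    "english_word": -3.0,
    "bio_acronym": -3.0,
    "negative_head": -4.0,
    "lower_case": -0.5,
    "partial_word": -5.0,
}

# Hypothetical value for the bonus applied per distinct positive clue.
COMPOUND_FACTOR = 0.5


def local_score(clues):
    """Sum the scores of all fired clues, then add the compound bonus.

    The bonus is the number of distinct positive clues fired for the
    mention multiplied by COMPOUND_FACTOR, as described in the text.
    """
    fired = set(clues)
    total = sum(CLUE_SCORES[clue] for clue in fired)
    positives = sum(1 for clue in fired if CLUE_SCORES[clue] > 0)
    return total + positives * COMPOUND_FACTOR
```

With these placeholder weights, a mention with a positive head term, a version number and upper-case spelling would score 3.0 + 2.0 + 0.5 plus a compound bonus of 3 × 0.5, whereas a lower-case English dictionary word would end up with a negative score.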
In some cases, mentions of tools/databases do not have any of the “strong” clues mentioned above, but are rather used with specific verbs (e.g., record, alignment, develop, ran, use, interface, platform; see our website for the full list) or appear with some indicative, but ambiguous head (e.g., interface, platform). While these clues on their own are insufficient to suggest a resource name mention, when combined as weak clues across several mentions of the same candidate, they can be an indication of a resource. In this step we therefore calculate this combined score as the number of weak clues for that name throughout the entire document, counted over all candidate terms within the document that map to the same putative resource (e.g., all mentions of BLAST). This value is multiplied by the score of the weak indicator (+0.50, Table 1) and then added to the individual score of each candidate term for that particular resource.
For each candidate mention whose total score is above a given threshold, all lexically equivalent mentions elsewhere in the document are also tagged. In the experiments reported in this paper, the minimum threshold a candidate term needed to exceed was +5.00. Such names are all added to an internal dictionary for that specific document, and LINNAEUS is then passed this “personalised” dictionary file to match against. In this way we “propagate” names that have been spotted with high confidence to mentions that do not have enough local clues on their own (mention-level propagation to document level). This is of particular importance in papers which have a heavy focus on one or two specific tools and these are then mentioned numerous times, often without other clues (e.g., not with a reference, or version number, or URL). Without this propagation, the chance of bioNerDS matching all these mentions is small, leading to substantially lower recall (see Results for validation).
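The document-wide aggregation of weak clues can be sketched as below. The +0.50 weight comes from the text; the mention representation (dictionaries with `name`, `score` and `weak_clues` keys) and the function name are assumptions made for this example, and mentions are grouped by exact name rather than by any name normalisation.

```python
from collections import Counter

WEAK_CLUE_SCORE = 0.5  # weight of a weak indicator, as stated in the text


def add_weak_evidence(mentions):
    """Add document-wide weak evidence to every mention of the same name.

    `mentions` is a list of dicts with keys 'name' (candidate resource),
    'score' (local score) and 'weak_clues' (weak clues seen at that mention).
    Returns new dicts; the input list is left unchanged.
    """
    weak_totals = Counter()
    for mention in mentions:
        weak_totals[mention["name"]] += mention["weak_clues"]
    return [
        {**m, "score": m["score"] + WEAK_CLUE_SCORE * weak_totals[m["name"]]}
        for m in mentions
    ]
```

In this sketch every mention of a candidate receives the same bonus, so a mention with no local weak clue still benefits from weak clues attached to other mentions of the same name in the document.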
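The propagation step can be illustrated with the following minimal sketch. The +5.00 threshold is taken from the text, but the matching is a stand-in: bioNerDS passes a per-document dictionary to LINNAEUS, whereas this example uses a simple case-sensitive regular expression, and the function name and input format are assumptions.

```python
import re

THRESHOLD = 5.0  # minimum score a candidate must exceed, as stated in the text


def propagate(text, scored_mentions):
    """Tag every occurrence of names that scored above THRESHOLD somewhere.

    `scored_mentions` is an iterable of (name, score) pairs for one document.
    Returns a list of (character offset, name) pairs for all matches in `text`.
    """
    confident = {name for name, score in scored_mentions if score > THRESHOLD}
    if not confident:
        return []
    # Longest names first so that a longer name wins over a shared prefix.
    alternatives = "|".join(
        re.escape(name) for name in sorted(confident, key=len, reverse=True)
    )
    pattern = re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternatives)
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]
```

Here a name needs to pass the threshold only once in the document; all other lexically identical occurrences are then tagged even if they carry no local clues of their own.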
Corpora summary statistics (columns: # of mentions, # of documents)
Evaluation scores for bioNerDS (mention level scores and document level scores)
Tools with ambiguous names are a source of lower precision, much like in other related NER tasks [22, 26]. The context that disambiguates databases and software from other bioinformatics concepts proved hard to determine automatically. This is especially true of Bioconductor package names, which are often all in lower case, and use the same name as the corresponding approach, method or data that they are trying to provide (e.g., aCGH, affy, graph, and ROC). This problem also includes other resources such as the database tRNA listed in the BMC Databases catalog, and the ambiguously named analysis and Network programs. Although bioNerDS contains rules to help address these (through a score reduction), they can still be scored high enough to pass the threshold based on other presented evidence.
Fuzzy distinctions between tools, methods and algorithms, and between file formats and programming languages, alongside naming inconsistencies between authors, also caused some false positive and false negative results in our evaluation.
bioNerDS evaluation scores with some clues disabled (including without score propagation)
Few literature surveys of bioinformatics resource usage currently exist. One such example by Southan and Cameron surveyed the mentions of databases in the literature, but focused only on European nations and articles from the previous 10 years with “database” in their titles. For our survey of software and database usage, we applied bioNerDS to the entire collection of BMC Bioinformatics and Genome Biology full-text journal articles (up to 2011) as downloaded from PubMed Central (PMC). We selected these two open-access journals as BMC Bioinformatics aims to provide a venue for publishing about resources for bioinformatics research, whereas Genome Biology’s remit is to apply bioinformatics tools to gain biological insight, which seemed a good contrast for comparison. Each journal’s associated scope emphasises this assessment [35, 36]. The experiment was performed on a total of 6,267 full-text open access documents, with 3,746 from BMC Bioinformatics and 2,521 from Genome Biology. A small subset of these were excluded (84 from BMC Bioinformatics and 55 from Genome Biology) due to full-text files that were unavailable at the time or because preprocessing text mining tools were unable to process them.
Before this experiment, we updated the primary dictionary used in bioNerDS by both including all the terms annotated in the gold standard sets, and by updating all of the dictionary file lists to 28th February, 2012 from 12th April, 2011. This resulted in roughly 1,400 additional entries being added to the dictionary (if this updated dictionary is applied to the evaluation set, an F-measure of 72.5% is achieved; 88% recall).
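As a worked reading of the figures just quoted, and assuming that the F-measure is the balanced F1 score (the harmonic mean of precision P and recall R) and that both reported values are rounded, the precision implied by an F-measure of 72.5% at 88% recall can be derived as follows. The resulting value is an inference from those two numbers, not a figure reported in the text.

```latex
F = \frac{2PR}{P + R}
\quad\Longrightarrow\quad
P = \frac{F\,R}{2R - F}
  = \frac{0.725 \times 0.88}{2 \times 0.88 - 0.725}
  = \frac{0.638}{1.035}
  \approx 0.616
```

Under these assumptions, the updated dictionary would correspond to a precision of roughly 62%, i.e. the gain in recall is accompanied by a noticeably lower precision.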
Top 10 resources and the mean number of documents to include them
Mean number of mentions within a document of a top 10 resource
Mention level counts are, as expected, higher than document level counts, and this is likely to be due to one of a few reasons. First, a resource could be mentioned in the Background, Methods, and Results and discussion sections, giving an average of 4-5 times per document. Alternatively, the same database could be used to annotate multiple data entries within a single document, or used for more than one entry/record from that database (e.g., with GO, PubMed, GenBank and Ensembl). R also has very high mention level counts; this could be because it can be used for multiple different analyses (in combination with Bioconductor, for example) and because it is picked up as both a resource, and as a programming language (which is a false-positive by our definition). Conversely, a resource might have relatively low mention level counts compared to document level counts if it is a programming resource, and thus only relevant to a particular part of the article’s underlying method (e.g., MySQL). Finally, multiple mentions of a single resource within a paper can be due to a comparison between two (or more) resources made within that paper (for example, comparing BLAST, ClustalW and MUSCLE multiple-alignment tools).
The large overall usage of R and Gene Ontology within both journals is of interest. This suggests that R is now being accepted as the standard statistical analysis resource for biology and bioinformatics, and that the Gene Ontology is quickly becoming the primary shared vocabulary in the field and is an important resource for bioinformatics and computational biology. The observed relative decline in usage of some “top” resources can, perhaps, be explained by the continued increase in the number of resources becoming available over time: as new ones are developed, older ones are slowly phased out. An example is the increase in the use of MUSCLE as an alternative to ClustalW. Similarly, Swiss-Prot fails to feature in the top 50 list for Genome Biology post 2008, possibly as people started to cite UniProt (a combination of Swiss-Prot and TrEMBL).
Other interesting findings include, for example, that, for both journals, MUSCLE has only appeared in the top 50 in recent years with a steady increase in ranking. GEO has seen a rise in usage in Genome Biology, while BMC Bioinformatics has MeSH, PubMed and MEDLINE featured much higher than in Genome Biology, suggesting a greater focus on more general biological computational methods and analysis, and a wider bioinformatics scope that, for example, includes text-mining and semantic indexing. Conversely, Genome Biology features Galaxy within its top 50 (mention level), whereas BMC Bioinformatics does not (ranked 1369). This suggests that Genome Biology articles are more focused on data analysis using established techniques rather than on introducing novel ones.
With Genome Biology’s greater emphasis on biological insight and biological data over method development, it is no surprise that it gets higher absolute usage results for GenBank (both per journal and per article), and higher normalised counts for GenBank, BLAST and Gene Ontology. This helps suggest a potential common methodological pattern of “computational biology”: get sequence (GenBank and Gene Ontology), characterise and analyse (Gene Ontology) and compare it (BLAST). On the other hand, BMC Bioinformatics “favours” R usage, which could be down to an inherent use of R in Genome Biology (to do statistics, generate ROC curves, etc.), compared to the general use of R as a programming platform in BMC Bioinformatics for method development. Genome Biology also has a wider usage of Gene Ontology (high Σ Δ). This could be because Genome Biology can (as with R) apply the result of a Gene Ontology/R analysis once wrapped up as another tool or database without needing the direct reference, e.g., during an over-expression analysis. Finally, BMC Bioinformatics has high variation in the use of BLAST, which is often used as a comparison for new tools, whereas it would tend to form part of a primary analysis pipeline within Genome Biology articles.
The data also show that 25.3% of BMC Bioinformatics papers potentially mention a new resource in the title, as opposed to only 4.3% of Genome Biology papers, confirming again that BMC Bioinformatics has a much greater focus on resource creation than Genome Biology.
We additionally calculated the resource name union and intersection between the two journals. The intersection covers 34% of the resource mentions in the Genome Biology corpus and 14% of the BMC Bioinformatics corpus. Only 11% of all resource names collected are contained within both journals. These names, however, accounted for 57% of the total mentions extracted. This further highlights how a relatively small number of resources are mentioned very frequently within (and across) the literature. Conversely, 53% of the total number of unique resource names extracted across both journals were only mentioned once (at the mention level).
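The union and intersection statistics described above can be computed along the following lines. This is a minimal sketch with an assumed input format (one flat list of extracted resource names per journal) and a hypothetical function name; it does not reproduce the name aggregation applied in the survey.

```python
from collections import Counter


def overlap_stats(mentions_a, mentions_b):
    """Compare two journals given one flat list of resource mentions each.

    Returns the fraction of unique names found in both journals and the
    fraction of all mentions that belong to those shared names.
    """
    counts_a, counts_b = Counter(mentions_a), Counter(mentions_b)
    shared = set(counts_a) & set(counts_b)
    union = set(counts_a) | set(counts_b)
    total_mentions = sum(counts_a.values()) + sum(counts_b.values())
    shared_mentions = sum(counts_a[name] + counts_b[name] for name in shared)
    return {
        "shared_name_fraction": len(shared) / len(union),
        "shared_mention_fraction": shared_mentions / total_mentions,
    }
```

On toy input where only one of four unique names is shared but that name is mentioned often, the shared-name fraction is small while the shared-mention fraction is large, which is the same qualitative pattern as the 11% of names accounting for 57% of mentions reported above.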
Finally, we evaluated the “long tail curve” property of our data given the hypothesis that a majority of resources are introduced, but hardly used again and potentially ignored after that point. We are careful not to extrapolate too much from this analysis as our results are only from two journals. The document level results reflect this hypothesis (see Additional file 2): for example, 95% of resources are mentioned less than 6 times (78% only once) and, on the other end, the top 100 resources account for over 96% of all mentions (9% of the total mentions are of R). There was little difference between the two journals for these figures.
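The long-tail figures quoted above (share of resources mentioned once, share mentioned fewer than a given number of times, and share of mentions taken by the top-ranked resources) can be derived from per-resource counts as in this minimal sketch. The function name, parameters and input format (a mapping from resource name to count) are assumptions for illustration.

```python
def long_tail_summary(counts, top_k, rare_below):
    """Summarise how skewed a {resource name: count} mapping is.

    Returns the fraction of resources counted exactly once, the fraction
    counted fewer than `rare_below` times, and the fraction of all counts
    that belong to the `top_k` most frequent resources.
    """
    values = sorted(counts.values(), reverse=True)
    n_resources = len(values)
    total = sum(values)
    return {
        "once_fraction": sum(1 for v in values if v == 1) / n_resources,
        "rare_fraction": sum(1 for v in values if v < rare_below) / n_resources,
        "top_k_share": sum(values[:top_k]) / total,
    }
```

Calling it with `top_k=100` and `rare_below=6` on document-level counts would yield the same kind of figures as those reported in the text.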
We note several limitations of the analysis presented here. Within this survey, we have considered a resource mention to imply the use of that resource, though we are aware that this is not always the case. There are also several other limitations, due to the nature of the topic. Firstly, we do limited resource aggregation of name variants for our survey. In particular, we aggregate some known name variants involving word-case and acronyms (as automatically recognised by LINNAEUS or BADREX), those linked in our primary dictionary, and the use of spaces versus dashes/hyphens (it is perhaps important to point out that this aggregation combines matches for ClustalW with matches for ClustalX in text, which we have only referred to as ClustalW within this paper). For a fairer analysis, more extensive name normalisation would be required on the data to accurately group all name variations. Second, our study does not directly take into account the creation date of resources. We would expect that resources that have been around longer would generally have more mentions. However, as our analysis only goes back as far as the journals we are looking at, normalising by the number of years that we have mentions for would be unfair. Finally, although we normalise for the number of articles processed in a given year, this does not take into account the number of alternative places to publish in that year, or the general increase in publication rate each year. A far more detailed analysis of the types of trends and their potential reasons, particularly using the resources to characterise the journals, is needed, similar to the review of biomedical corpora usage by Cohen et al.
bioNerDS can recognise mentions of bioinformatics’ databases and software in primary literature with a reasonable accuracy. It achieved an F-measure of between 63% and 91% on different datasets (63%–78% at the document level). Though other NER tasks, like gene name recognition, are now considered mature, this was not always the case, especially when gene recognition was first attempted. For example, in the first BioCreAtIvE task, F-measures ranging from below 50% to just over 80% at best were achieved. bioNerDS is, to the best of our knowledge, the first attempt at comprehensive database and software name recognition at the mention level, and we expect identification accuracy to improve. While further work is required, we think that the approach represents a significant step towards providing a means to explore the usage of databases and tools in bioinformatics.
Still, the accuracy achieved is sufficient to evaluate resource usage across the literature on both the document and mention levels. We have further demonstrated the potential of bioNerDS in exploring similarities and differences between journals and fields through systematic literature analysis of database and software use. The results obtained provide an indication of the similarities and differences between the two journals surveyed.
Finally, additional work is required both to further increase the accuracy of the tool (especially in automated recognition of false-positive results) and in a more comprehensive analysis of the results obtained.
bioNerDS and the data extracted are available at http://bionerds.sourceforge.net/ under the Simplified 2-Clause BSD Licence.
ADAM: another database of abbreviations in MEDLINE
ANNIE: a Nearly-New Information Extraction System
bioNerDS: bioinformatics Named entity recogniser for Databases and Software
BIRI: BioInformatics Resource Inventory
BLAST: Basic Local Alignment Search Tool
GATE: General Architecture for Text Engineering
GEO: Gene Expression Omnibus
HGT: Horizontal Gene Transfer (name of a database or name of a program, or could just be a false positive hit)
JAPE: Java Annotation Patterns Engine
KEGG: Kyoto Encyclopedia of Genes and Genomes
MeSH: Medical Subject Headings
NER: Named entity recognition
OReFiL: Online Resource Finder in Life sciences
PDB: Protein Data Bank
PMCID: PubMed Central Identifier
R: Recall (also used for the statistics program R, which is not an acronym)
S4: Structure-based Sequence alignments of SCOP Superfamilies
SCOP: Structural Classification of Proteins
We would like to thank Daniel Jamieson (University of Manchester) for his help in establishing the inter-annotator agreement, and the many developers at the GATE mailing list for their assistance in various GATE development issues during the project. GD is funded by a studentship from the Biotechnology and Biological Sciences Research Council (BBSRC) to GN, DLR and RS.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.