- Research article
- Open Access
bioNerDS: exploring bioinformatics’ database and software use through literature mining
BMC Bioinformatics volume 14, Article number: 194 (2013)
Biology-focused databases and software define bioinformatics and their use is central to computational biology. In such a complex and dynamic field, it is of interest to understand what resources are available, which are used, how much they are used, and for what they are used. While scholarly literature surveys can provide some insights, large-scale computer-based approaches to identify mentions of bioinformatics databases and software from primary literature would automate systematic cataloguing, facilitate the monitoring of usage, and provide the foundations for the recovery of computational methods for analysing biological data, with the long-term aim of identifying best/common practice in different areas of biology.
We have developed bioNerDS, a named entity recogniser for the recovery of bioinformatics databases and software from primary literature. We identify such entities with an F-measure ranging from 63% to 91% at the mention level and 63-78% at the document level, depending on corpus. Not attaining a higher F-measure is mostly due to high ambiguity in resource naming, which is compounded by the on-going introduction of new resources. To demonstrate the software, we applied bioNerDS to full-text articles from BMC Bioinformatics and Genome Biology. General mention patterns reflect the remit of these journals, highlighting BMC Bioinformatics’s emphasis on new tools and Genome Biology’s greater emphasis on data analysis. The data also illustrates some shifts in resource usage: for example, the past decade has seen R and the Gene Ontology join BLAST and GenBank as the main components in bioinformatics processing.
We demonstrate the feasibility of automatically identifying resource names on a large-scale from the scientific literature and show that the generated data can be used for exploration of bioinformatics database and software usage. For example, our results help to investigate the rate of change in resource usage and corroborate the suspicion that a vast majority of resources are created, but rarely (if ever) used thereafter. bioNerDS is available at http://bionerds.sourceforge.net/.
The fields of bioinformatics and computational biology are established as ones of rapid change with a continued expansion of the available “resourceome” , which includes numerous databases and software [1, 2]. Such resources facilitate research in biology, and many have become “household names” (e.g., BLAST , ClustalW , etc.). Still, the huge resourceome also creates problems for the choice of appropriate methods for performing a particular task, and poses a challenge of identifying “best practice”: a well-known, popular tool may not be the “best” tool currently available . To help with method choice, we first need to determine what software and data resources are available and used in computational analyses. Several inventories and repositories already exist that list available database and software resources. For example, the 2011 special issues of Nucleic Acids Research’s Databases  and the Bioinformatics Links Directory  list over 1,330 databases and over 1,250 web services respectively. However, many of these inventories and repositories are incomplete and require labour intensive manual curation. Similarly, “manual” literature surveys of published tools and databases are time-consuming and often out-of-date by the time they are published. Therefore, large-scale automated ways for extraction of database and software use patterns are needed. As well as helping with maintenance of resource catalogues, such systematic processing could offer insights into the dynamics of software and data resource usage, particularly as many resources are infrequently used . This is not only of interest to users of these resources, who wish to know what is current and most used, but also to any potential new users and resource developers.
In our previous work we used the literature to explore and evaluate methods used in phylogenetics [5, 8]. We implemented a named-entity recognition (NER) system that utilised a controlled vocabulary of terms as specified by a comprehensive software resource dictionary. We also used a semantic-based approach to identify and profile existing and new resources using keyword association . We then attempted to capture phylogenetic methods based on a predefined abstract representation of four stages within phylogenetics (sequence alignment; tree inference; statistical testing and data re-sampling; tree visualisation and annotation ). This approach could be applicable to other fields within bioinformatics, but it first requires an extensive resource repository and the ability to identify mentions of tools and databases in text. This task is far from trivial — in our previous work we have demonstrated the high level of ambiguity and variability of database and software names in the bioinformatics literature .
In this paper we introduce and evaluate bioNerDS, a bioinformatics named-entity recognition system for database and software names, which is used to identify mentions of such entities in the literature. It makes use of a range of both sentence and document-level clues to learn database and software names, while propagating mentions up to the article level. To illustrate its potential, we use bioNerDS to survey software and data resource usage in two journals from computational biology and bioinformatics.
Several other approaches to automated extraction of bioinformatics resources from primary literature have been suggested. For example, OReFiL (Online Resource Finder in Life sciences)  and BIRI (BioInformatics Resource Inventory)  aim to harvest resource names and fill their repositories in order to enable resource discovery. OReFiL uses URLs as a “proxy” to identify mentions of resources by custom regular expressions and by extracting |<url>.. </url>| tags from BioMed Central (BMC) papers. BIRI utilises keywords and sentence structure to identify relevant terms through custom patterns translated into “transition networks”, which match associated regular expressions for resource names, functions and classifications. bioNerDS on the other hand builds on established approaches to NER by using a generally applicable method for identification of software and database mentions. Furthermore, while OReFiL focuses on the abstract and “availability” or “implementation” sections, and BIRI solely looks into abstracts and titles, bioNerDS can detect resource name mentions throughout full-text articles.
We note that throughout this paper we will mention numerous databases and tools by name as examples. A full list of references and web-links to these can be found on our website. Note, we only cite the first mention of the resource within this paper.
bioNerDS is designed and developed as an NER tool that aims to recognise database and software mentions in literature, and to provide a document-level “list” of resources mentioned in a given article. We identify resource names that represent databases, ontologies, classifications, software, programs, tools, web-services or packages, and exclude names of files and file formats, methods, algorithms, identifiers, operating systems and programming languages (see ).
Figure 1 represents a high-level overview of bioNerDS. Each document is first pre-processed using a typical text-mining pipeline consisting of tokenization, sentence splitting and part of speech tagging, all using GATE’s ANNIE plug-in [12, 13]. In the following step, we apply a dictionary look-up to identify candidate mentions of tool/database names. Given the dynamic nature of the bioinformatics resourceome, the dictionary-based approach alone is insufficient for large-scale and dynamic capture of databases and software in the literature . To increase coverage, we make use of several rule-based techniques to recognise unknown database and tool names in text. Therefore, in the third step, we use a number of local clues that indicate that a given phrase might represent a mention of a database/software name. Each mention is assigned a score; these scores can be positive or negative, depending on their indication of a true positive or a false positive match.
While previous steps focus on single mentions, we also collect supporting “weak” evidence across different mentions of the same candidate (within the document) and use it to update/adjust the mention-level scores. Finally, all candidate mentions with a score above a given threshold are propagated through to the whole document and extracted so a document-level list of mentioned resources can be generated. We discuss these steps in the following subsections.
At bioNerDS’ core is a large case-sensitive dictionary of known database and tool names. This dictionary contained 4,871 entries with 4,879 unique name variants compiled from several available online resource repositories and inventories (e.g., Nucleic Acids Research , the Bioinformatics Links Directory , Bioconductor  and Wikipedia ). Initial dictionary-matched candidates are assigned a score (see Table 1) that will be used to estimate our confidence that a given candidate is indeed a tool/database name. We note that names from Bioconductor have been assigned a small negative impact in addition to the initial dictionary score (see Table 1), as they are often homonymic with names of associated general bioinformatics concepts (e.g., aCGH, affy, graph, and ROC). The dictionary was then used as input to LINNAEUS, a dictionary-based mention-level NER tool .
Applying local clues to recognise unknown mentions
We consider several situations where new/unknown database/tool names can be identified. In our previous research, we have shown that such names are typically contiguous sequences of nouns . We find that including other part of speech tags in this definition is unhelpful, as names do not commonly include other token types (e.g., modifiers). Therefore, we focus on noun phrases that contain only nouns.
Each software or database candidate noun phrase mention is then assigned a score, which integrates scores for several clues that are spotted in the neighbourhood of the mention. Table 1 shows the individual scores assigned to particular patterns. Our approach is similar to other rule-based scoring approaches (e.g., species name identification ). Initial scores were generated by ranking the rules according to their extraction potential and assigning numeric values, with the most powerful given the highest score. We took these numbers as weights and adjusted them empirically based on results from the training data. The scores for all patterns that apply to a given mention are then summed up to provide a score for the given mention.
The positive local clues include the following:
Title mentions: Many articles that introduce a new database or program place the name in their paper’s title, often following a standard format where the name begins the title, followed by some punctuation (colon, dash, etc.), and finally with a short description (or the expanded acronym) that typically includes a specific keyword. We have compiled a list of such keywords that indicate the presence of a software/database mention (e.g., database, ontology, web service, etc. [10, 11]; a full-list is available from our website).
Enumerations and Hearst patterns: We have followed a standard approach to identify enumerations of names, primarily using Hearst patterns  — e.g., “tool such as MUMmer or Vmatch” (PMCID: PMC2753849), but also any list of noun phrases is considered, even if not part of a Hearst pattern, if at least one of the members has been recognised as a candidate name from our dictionary.
“Good” head nouns: The list of keywords is re-used in combination with the Stanford dependency parser  to identify noun phrases associated with the keyword heads (e.g., “The PolyFreq program”, PMCID: PMC1239908) in order to “recover” potential database and software names that precede the keyword.
Version mention: Strings that appear to represent a version number (e.g., 2.0) following a candidate noun phrase are considered.
References and URLs: Such mentions are also good indicators of a possible preceding database or program name.
Positive orthographic clues: Words in ALL CAPS and MiXeD CasE gain a small positive boost in their score.
In case a mention matches several clues, they are all combined. Additionally, we take the number of different positive clues fired for the given mention, and multiply this by the Compound Factor, which is then added to the candidate’s score.
There are numerous resource names that are ambiguous; this is a common problem in NER tasks [22, 26]. For example, the two most troublesome tool names that we came across were Network and analysis, also widely used for common concepts in bioinformatics. We have therefore designed a set of rules that suggest a putative resource name would be incorrectly identified as a resource in a given context. The negative local clues include the following:
Common English words and acronyms: Our primary method of filtering names is through term comparison to a common-English dictionary and an acronym list. If a predicted candidate term is either a known English word or known acronym, then it takes a score reduction (see Table 1). The English dictionary is derived from a publicly available list  and the acronym dictionary is derived from ADAM , consisting of 86,308 and 1,933 terms, respectively.
“Negative” head nouns: This approach is similar to the identification of keyword heads that characterise positive mentions, but instead uses a set of “blacklisted” terms with the primary aim of restricting the matches to only those within the scope of the definitions. For example, these heads help filter file formats, programming languages, methods, algorithms and so on (the full-list is available on our website).
Negative orthographic clues: lower case words receive a slight decrease in their final score.
A partial word match: This helps filter out some situations of incorrect tokenization, in particular, for database identifiers. For example, this will help filter out the “GO” in GO:001234.
All rules are designed in JAPE (compound regular expressions) and are matched using GATE .
Cross-mention “weak” clues
In some cases, mentions of tools/databases do not have any of the “strong” clues mentioned above, but are rather used with specific verbs (e.g., record, alignment, develop, ran, use, interface, platform; see our website for the full list) or appear with some indicative, but ambiguous head (e.g., interface, platform). While these clues on their own are insufficient to suggest a resource name mention, when combined as weak clues across several mentions of the same candidate, they can be an indication of a resource. In this step we therefore calculate this combined score as the number of weak clues for that name throughout the entire document (all candidate terms within a document that map to the same putative resource, (e.g., all mentions of BLAST)). This value is multiplied by the score of the weak indicator (+0.50, Table 1) and then added to the individual score of each candidate term for that particular resource.
Applying a threshold and document-level annotations
For each candidate mention whose total score is above a given threshold, all lexically equivalent mentions elsewhere in the document are also tagged. In the experiments reported in this paper, the minimum threshold a candidate term needed to exceed was +5.00. Such names are all added to an internal dictionary for that specific document, and LINNAEUS is then passed this “personalised” dictionary file to match against. In this way we “propagate” names that have been spotted with high confidence to mentions that do not have enough local clues on their own (mention-level propagation to document level). This is of particular importance in papers which have a heavy focus on one or two specific tools and these are then mentioned numerous times, often without other clues (e.g., not with a reference, or version number, or URL). As a result, the chance of bioNerDS matching all these mentions is small, leading to a substantially lower recall performance (see Results for validation).
Results and discussion
Evaluation of bioNerDS
A gold standard comprised of 60 full-text articles was split into a training set of 25 articles, a development set of 5 articles, an evaluation set of 25 articles (all from a random sample of BMC Bioinformatics and PLoS Computational Biology articles), and a set of 5 articles from Genome Biology that was used for evaluation only. A summary of the four corpora is provided in Table 2. The inter-annotator agreement (F-measure) for this corpus has been previously reported as 86% (80% strict; observed agreement 60%) .
We make use of the standard evaluation metrics: precision (P), recall (R), and F-measure (F), based on the numbers of true-positives (TP), false-positives (FP), and false-negatives (FN):
Table 3 shows the results with both strict (exact offset matches between gold standard and system annotation) and lenient matching (annotation overlap is sufficient; this is the main metric mentioned in the rest of the article). While we were able to obtain an F-measure of nearly 80% for the development set, the F-measure for the evaluation set was only 63%. For comparison, a baseline with dictionary-based approach achieved a lenient F-measure of just 54% . After analysing the reasons behind the drop in performance, we concluded that there was a single document (PMC1664705) in the evaluation set responsible for a high number of total errors (32% of the total false positives): by excluding it, the lenient F-measure increased to 66%. The errors came from a series of false positive matches against “P and D modules”, where these protein modules were frequently used with interface and so were incorrectly matched by bioNerDS (protein docking interface, rather than a user/graphical interface). We also note that the performance on the Genome Biology evaluation corpus was more consistent with the training/development dataset, with an F-measure of 91%.
Tools with ambiguous names are a source of lower precision, much like in other related NER tasks [22, 26]. The context that disambiguates databases and software from other bioinformatics concepts proved hard to determine automatically. This is especially true of Bioconductor package names, which are often all in lower case, and use the same name as the corresponding approach, method or data that they are trying to provide (e.g., aCGH, affy, graph, and ROC). This problem also includes other resources such as the database tRNA listed in the BMC Databases catalog , and the ambiguously named analysis and Network programs. Although bioNerDS contains rules to help address these (through a score reduction), they can still be scored high enough to pass the threshold based on other presented evidence.
Fuzzy distinctions between tools, methods and algorithms, and of file formats and programming languages alongside naming inconsistencies between authors also caused some false positive and false negative results in our evaluation.
Effects of cross-mention scoring
To evaluate the effectiveness of the weak clues and score propagation approach, bioNerDS was run on the three corpora (training, evaluation and Genome Biology) with rules for various clues disabled. This allowed us to determine how much of the final accuracy is attributed to each of the clues. We note, however, that no module scores or thresholds were retrained or adjusted for these tests. Table 4 shows the mention-level results for three different settings. They are all having a positive impact to the overall F-score from the corpora. In the cases of weak clues and score propagation, this is down to the large effect that they have on the recall of the system, providing increases ranging from 5 to 33%. As can be expected, they do generally provide some negative impact on precision (up to 12%), but the overall impact on F-score was a positive one in each case. Conversely, disabling the “head pattern” gave mixed results. For the training set, disabling it increased both recall and precision, resulting in a 3% higher F-score. In the evaluation and Genome Biology corpora, disabling it resulted in reduced F-scores (by 2% and 6%).
Few literature surveys of bioinformatics resource usage currently exist. One such example by Southan and Cameron  surveyed the mentions of databases in the literature, but focused only on European nations and articles from the previous 10 years with “database” in their titles. For our survey of software and database usage, we applied bioNerDS to the entire collection of BMC Bioinformatics and Genome Biology full-text journal articles (up to 2011) as downloaded from PubMed Central (PMC) . We selected these two open-access journals as BMC Bioinformatics aims to provide a venue for publishing about resources for bioinformatics research, whereas Genome Biology’s remit is to apply bioinformatics tools to gain biological insight, which seemed a good contrast for comparison. Each journal’s associated scope emphasises this assessment [35, 36]. The experiment was performed on a total of 6,267 full-text open access documents, with 3,746 from BMC Bioinformatics and 2,521 from Genome Biology. A small subset of these were excluded (84 from BMC Bioinformatics and 55 from Genome Biology) due to full-text files that were unavailable at the time or because preprocessing text mining tools were unable to process them.
Before this experiment, we updated the primary dictionary used in bioNerDS by both including all the terms annotated in the gold standard sets, and by updating all of the dictionary file lists to 28th February, 2012 from 12th April, 2011. This resulted in roughly 1,400 additional entries being added to the dictionary (if this updated dictionary is applied to the evaluation set, an F-measure of 72.5% is achieved; 88% recall).
In total, Genome Biology contained over 53,000 mentions (5,284 unique), and BMC Bioinformatics contained over 174,000 mentions (with 13,023 unique). The results revealed that 77.6% of the Genome Biology papers contained at least one resource mention, compared to 97.7% of BMC Bioinformatics papers. Table 5 provides the mean number of documents to contain a mention for the top 10 resources of each journal. Both journals include many of the well-known and well-established databases and tools within the bioinformatics field (e.g., R , Gene Ontology , BLAST and GenBank ). However, some interesting differences between the two journals emerged, with Genome Biology featuring Ensembl , whereas BMC Bioinformatics featured PDB , KEGG  and MySQL  within the top 10. There are also some probable errors resulting from bioNerDS’ possible false-positive hits (e.g., tRNA, S4, Q and Network).
Table 6 provides the results obtained on the mention level for the top 10 resources from each journal. It features many of the same resource names listed as in Table 5, but some notable changes are that KEGG now appears in both journals’ top 10 lists, and SCOP  and PubMed  now appear in BMC Bioinformatics. Additionally, Gene Ontology has overtaken R in both journals. Both tables also show that BMC Bioinformatics tends to have higher overall counts of resource mentions than Genome Biology, with the notable exceptions of GenBank, Ensembl and GEO , which are all higher in Genome Biology.
Mention level counts are, as expected, higher than document level counts, and this is likely to be due to one of a few reasons. First, a resource could be mentioned in the, Background, Methods and Results and discussion sections giving an average of 4-5 times per document. Alternatively, the same database could be used to annotate multiple data entries within a single document, or used for more than one entry/record from that database (e.g., with GO, PubMed, GenBank and Ensembl). R also has very high mention level counts — this could be because it can be used for multiple different analyses (in combination with Bioconductor, for example) and because it is picked up as both a resource, and as a programming language (which is a false-positive by our definition). Conversely, a resource might have relatively low mention level counts compared to document level counts if it is a programming resource, and thus only relevant to a particular part of the articles underlying method (e.g., MySQL). Finally, multiple mentions of a single resource within a paper can be due to a comparison between two (or more) resources made within that paper (for example, comparing BLAST, ClustalW and MUSCLE  multiple-alignment tools).
To evaluate how resource usage has changed over time, we selected some key database and software resources and plotted the relative use (document level mentions in a given year, divided by the total number of journal papers in that year,) for each year from 2001 to 2011 inclusive for each journal on the document level. The resulting graphs (Figures 2 and 3) show that R usage (and, perhaps by relation, Bioconductor) has been consistently increasing for both journals over the 10 year period (more so within BMC Bioinformatics) and that the use of Gene Ontology has steadied off in recent years after a rapid increase in the first few years (of the survey). Most other major resources have started to decline in usage for both journals (e.g., GenBank, Swiss-Prot , BLAST and ClustalW) and this decline is more prevalent in BMC Bioinformatics than in Genome Biology. Finally, in contrast to other database resources, Ensembl has seen a slight increase in usage in Genome Biology.
The large overall usage of R and Gene Ontology within both journals is of interest. This suggests that R is now being accepted as the standard statistical analysis resource for biology and bioinformatics, and that the Gene Ontology is quickly becoming the primary shared vocabulary in the field and is an important resource for bioinformatics and computational biology. The observed relative decline in usage of some “top” resources can, perhaps, be explained by the continued increase in the number of resources becoming available over time, and — as new ones are developed — older ones are slowly phased out. An example is the increase in the use of MUSCLE as an alternative to ClustalW. Similarly, Swiss-Prot fails to feature in the top 50 list for Genome Biology post 2008, possibly as people started to cite Uniprot (a combination of Swiss-Prot and TrEMBL) .
Other interesting findings include, for example, that, for both journals, MUSCLE has only appeared in the top 50 in recent years with a steady increase in ranking. GEO has seen a rise in usage in Genome Biology, while BMC Bioinformatics has MeSH, PubMed and MEDLINE featured much higher than in Genome Biology, suggesting a greater focus on more general biological computational methods and analysis, and a wider bioinformatics scope that, for example, includes text-mining and semantic indexing. Conversely, Genome Biology features Galaxy  within its top 50 (mention level), whereas BMC Bioinformatics does not (ranked 1369). This suggests that Genome Biology articles are more focused on data analysis using established techniques rather than on introducing novel ones.
To test whether the variations seen in usage were those that we might expect by chance, we looked at the normalised usage statistics of the resources and compared the change in usage starting from a normalised baseline (Year 0). Across the resources, the changes in usage had an approximately Gaussian distribution. We could therefore test whether the usage patterns seen across years mapped onto what would be expected from a Gaussian random walk process . The expected variation (95% confidence interval) of the random walk process was therefore added to the graph of the normalised differences. If the variation graphs stayed within these bounds we could infer that the changes could be accounted for by a random walk process. Deviations from these bounds would suggest a process different from a random walk. The resulting graphs showing the results of these analyses are given in Figures 4 and 5. For Genome Biology, there is no significant deviation from the 95% boundaries, suggesting that the variation could be down to random variation alone. However, within BMC Bioinformatics, the Gene Ontology is increasing at a rate that far surpasses the upper 95% boundary, confirming that the Gene Ontology has seen a significant rise in usage over the last 10 years (at least, within BMC Bioinformatics). R is also trailing close to the upper boundary, also suggesting that an external factor outside of general variation is having an influence on its usage. We note that the Gene Ontology is increasing at a higher rate than R in Figure 5, whereas in the initial graph (Figure 3), R is growing faster than the Gene Ontology. This is the case because the Gene Ontology has a lower starting value and so, when normalised by Year 0, sees a higher relative rate of growth. The data used for these tests is provided in Additional file 1.
To further explore the “rate” of change in the bioinformatics resourceome in the two journal corpora, we plotted the normalised frequencies of each resource per year against the sum of the absolute difference (Σ Δ) in those frequencies across the years. So, for the number of mentions xy of a resource x in a given year y, we define:
The graphs in Figures 6 and 7 show that, when comparing the two journals, Genome Biology has higher usage counts of Gene Ontology and GenBank, whereas BMC Bioinformatics has higher usage of R. This seems to be in contradiction to the results in Tables 5 and 6, but we note that the tables are normalised for the whole corpus (per document), whereas the graphs are normalised by the total resource counts of the top 50 resources. Conversely, Genome Biology has a higher variation in use (higher Σ Δ) of Gene Ontology and R, with BMC Bioinformatics having a higher variation in BLAST use. It is also of interest that Bioconductor and GEO feature to the right of the Genome Biology graph confirming their rapid acceleration of usage in recent years (high variation).
With Genome Biology’s greater emphasis on biological insight and biological data over method development, it is no surprise that it gets higher absolute usage results for GenBank (both per journal and per article), and higher normalised counts for GenBank, BLAST and Gene Ontology. This helps suggest a potential common methodological pattern of “computational biology”: get sequence (GenBank and Gene Ontology), characterise and analyse (Gene Ontology) and compare it (BLAST). On the other hand, BMC Bioinformatics “favours” R usage, which could be down to an inherent use of R in Genome Biology (to do statistics, generate ROC curves, etc.), compared to the general use of R as a programming platform in BMC Bioinformatics for method development. Genome Biology also has a wider usage of Gene Ontology (high Σ Δ). This could be because Genome Biology can (as with R) apply the result of a Gene Ontology/R analysis once wrapped up as another tool or database without needing the direct reference, e.g., during an over-expression analysis. Finally, BMC Bioinformatics has high variation in the use of BLAST, which is often used as a comparison for new tools, whereas it would tend to form part of a primary analysis pipeline within Genome Biology articles.
The data also show that 25.3% of BMC Bioinformatics papers potentially mention a new resource in the title, as opposed to only 4.3% of Genome Biology papers, confirming again that BMC Bioinformatics has a much greater focus on resource creation than Genome Biology.
We additionally calculated the resource name union and intersection between the two journals. The intersection covers 34% of the resource mentions in the Genome Biology corpus and 14% of the BMC Bioinformatics corpus. Only 11% of all resource names collected are contained within both journals. These names, however, accounted for 57% of the total mentions extracted. This further highlights how a relatively small number of resources are mentioned very frequently within (and across) the literature. Conversely, 53% of the total number of unique resource names extracted across both journals were only mentioned once (at the mention level).
Finally, we evaluated the “long tail curve” property of our data given the hypothesis that a majority of resources are introduced, but hardly used again and potentially ignored after that point. We are careful not to extrapolate too much from this analysis as our results are only from two journals. The document level results reflect this hypothesis (see Additional file 2): for example, 95% of resources are mentioned less than 6 times (78% only once) and, on the other end, the top 100 resources account for over 96% of all mentions (9% of the total mentions are of R). There was little difference between the two journals for these figures.
We note several limitations of the analysis presented here. Within this survey, we have considered a resource mention to imply the use of that resource, though we are aware that this is not always the case. There are also several other limitations, due to the nature of the topic. Firstly, we do limited resource aggregation of name variants for our survey — in particular, we aggregate some known name variants involving word-case and acronyms (as automatically recognised by LINNAEUS or BADREX ), those linked in our primary dictionary, and the use of spaces verses dashes/hyphens (it is perhaps important to point out that this aggregation combines matches for ClustalW with matches for ClustalX in text, which we have only referred to as ClustalW within this paper). For a fairer analysis, more extensive name normalisation would be required on the data to accurately group all name variations. Second, our study does not directly take into account the creation date of resources. We would expect that resources that have been around longer, would generally have more mentions. However, as our analysis only goes back as far as the journals we are looking at, normalising by the number of years that we have mentions for would be unfair. Finally, although we normalise for the number of articles processed in a given year, this does not take into account the number of alternative places to publish in that year, or the general increase in publication rate each year. A far more detailed analysis of the types of trends and their potential reasons, particularly using the resources to characterise the journals is needed, similar to the review of biomedical corpora usage by Cohen et al..
bioNerDS can recognise mentions of bioinformatics’ databases and software in primary literature with a reasonable accuracy. It achieved an F-measure of between 63% and 91% on different datasets (63%-78% at the document level). Though other NER tasks, like gene name recognition, are now considered mature, this was not always the case, especially when gene recognition was first attempted. For example, in the first BioCreAtIvE task, F-measures ranging from below 50% to just over 80% at best were achieved . bioNerDS is, to the best of our knowledge, the first attempt at comprehensive database and software name recognition at the mention level and identification accuracy will improve. While further work is required, we think that the approach represents a significant step towards providing a means to explore the usage of databases and tools in bioinformatics.
Still, the accuracy achieved is sufficient to evaluate resource usage across the literature on both the document and mention levels. We have further demonstrated the potential of bioNerDS in exploring similarities and differences between journals and fields through systematic literature analysis of database and software use. The results obtained provide an indication of the similarities and differences between the two journals surveyed.
Finally, additional work is required both to further increase the accuracy of the tool (especially in automated recognition of false-positive results) and in a more comprehensive analysis of the results obtained.
bioNerDS and the data extracted are available at http://bionerds.sourceforge.net/ under the Simplified 2-Clause BSD Licence.
another database of abbreviations in MEDLINE
a Nearly-New Information Extraction System
bioinformatics Named entity recogniser for Databases and Software
BioInformatics Resource Inventory
Basic Local Alignment Search Tool
General Architecture for Text Engineering
Gene Expression Omnibus
Horizontal Gene Transfer (name of a database or name of a program, or could just be a false positive hit)
Java Annotation Patterns Engine
Kyoto Encyclopedia of Genes and Genomes
Medical Subject Headings
Named entity recognition
Online Resource Finder in Life sciences
Protein Data Bank
PubMed Central Identifier
Recall (also used for the statistics program R, which is not an acronym)
Structure-based Sequence alignments of SCOP Superfamilies
Structural Classification of Proteins
Cannata N, Merelli E, Altman RB: Time to organize the bioinformatics resourceome. PLoS Comput Biol. 2005, 1 (7): e76-10.1371/journal.pcbi.0010076. [http://www.ncbi.nlm.nih.gov/pubmed/16738704]
Wren JD, Bateman A: Databases, data tombs and dust in the wind. Bioinformatics. 2008, 24 (19): 2127-2128. 10.1093/bioinformatics/btn464. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/19/2127]
Altschul SF, Gish W, Miller W, Myers EW, Lipman D J etal: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics (Oxford, England). 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404. [http://www.ncbi.nlm.nih.gov/pubmed/17846036]
Eales JM, Pinney JW, Stevens RD, Robertson DL: Methodology capture discriminating between the “best” and the rest of community practice. BMC Bioinformatics. 2008, 9: 359-10.1186/1471-2105-9-359. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2553348,
Galperin MY, Cochrane GR: The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res. 2011, 39 (Database issue): D1-D6. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3013748]
Brazas MD, Yim DS, Yamada JT, Ouellette BFF: The 2011 bioinformatics links directory update: more resources, tools and databases and features to empower the bioinformatics community. Nucleic Acids Res. 2011, 39 Suppl 2 (suppl_2): W3-W7. [http://nar.oxfordjournals.org/cgi/content/abstract/39/suppl_2/W3]
Afzal H, Eales J, Stevens R, Nenadic G: Mining Semantic Networks of Bioinformatics e-Resources from the Literature. Proceedings of the SWAT4LS Workshop. 2009, Amsterdam
Duck G, Stevens R, Robertson D, Nenadic G: Ambiguity and Variability of Database and Software Names in Bioinformatics. Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine (SMBM). Edited by: Ananiadou S, Pyysalo S, Rebholz-Schuhmann D, Rinaldi F, Salakoski T. 2012, 2-9. [http://dx.doi.org/10.5167/uzh-64476]
Yamamoto Y, Takagi T: OReFiL: an online resource finder for life sciences. BMC Bioinformatics. 2007, 8: 287-10.1186/1471-2105-8-287. [http://www.biomedcentral.com/1471-2105/8/287]
de la Calle G, García-Remesal M, Chiesa S, de la Iglesia D, Maojo V: BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC Bioinformatics. 2009, 10: 320-10.1186/1471-2105-10-320. [http://www.biomedcentral.com/1471-2105/10/320]
Cunningham H, Maynard D, Bontcheva K, Tablan V: GATE: an architecture for development of robust HLT applications. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02. 2002, Philadelphia, Pennsylvania: Association for Computational Linguistics, 168-175. [http://portal.acm.org/citation.cfm?doid=1073083.1073112]
Cunningham H, Maynard D, Bontcheva K, Tablan V, Aswani N, Roberts I, Gorrell G, Funk A, Roberts A, Damljanovic D, Heitz T, Greenwood MA, Saggion H, Petrak J, Li Y, Peters W, et al: Text Processing with GATE (Version 6). 2011, University of Sheffield Department of Computer Science 15 April
Oxford journals | Life Sciences | Nucleic Acids Research | Database Summary Paper Alpha List. [http://www.oxfordjournals.org/nar/database/a/]
Bioinformatics Links Directory | Bioinformatics.ca Links Directory. [http://bioinformatics.ca/links_directory/]
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5 (10): R80-10.1186/gb-2004-5-10-r80. [https://www.ncbi.nlm.nih.gov/pubmed/15461798]
Wikipedia, the free encylopedia. [https://en.wikipedia.org/wiki/Main_Page]
Bioconductor — aCGH. [http://www.bioconductor.org/packages/release/bioc/html/aCGH.html]
Bioconductor — affy. [http://www.bioconductor.org/packages/release/bioc/html/affy.html]
Bioconductor — graph. [http://www.bioconductor.org/packages/release/bioc/html/graph.html]
Bioconductor — ROC. [http://www.bioconductor.org/packages/release/bioc/html/ROC.html]
Gerner M, Nenadic G, Bergman CM: LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics. 2010, 11: 85-10.1186/1471-2105-11-85. [http://www.biomedcentral.com/1471-2105/11/85]
Wang X, Matthews M: Species disambiguation for biomedical term identification. Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing (BioNLP). 2008, Columbus, Ohio: Association for Computational Linguistics, 71-79. [http://www.aclweb.org/anthology/W/W08/W08-0610]
Hearst MA: Automatic acquisition of hyponyms from large text corpora. Proceedings of the 14th conference on Computational linguistics - Volume 2. 1992, Morristown, NJ, USA: Association for Computational Linguistics, 539-545. [http://portal.acm.org/citation.cfm?id=992133.992154]
de Marneffe MC, MacCartney B, Manning CD: Generating typed dependency parses from phrase structure parses. LREC 2006. 2006, [http://nlp.stanford.edu/pubs/LREC06_dependencies.pdf]
Yeh A, Morgan A, Colosimo M, Hirschman L: BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics. 2005, 6 Suppl 1 (Suppl 1): S2-[http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1869012]
Free phylogenetic network software. [http://www.fluxus-engineering.com/sharenet.htm]
Software based on libsequence. [http://molpopgen.org/software/lseqsoftware.html]
Atkinson K: Kevin’s word list page. [http://wordlist.sourceforge.net/]
Zhou W, Torvik VI, Smalheiser NR: ADAM: Another Database of Abbreviations in MEDLINE. Bioinformatics. 2006, 22 (22): 2813-2818. 10.1093/bioinformatics/btl480. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/22/2813]
Sprinzl M, Vassilenko KS: Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 2005, 33 (Database issue): D139-D140. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC539966/]
BioMed central databases. [http://databases.biomedcentral.com/]
Southan C, Cameron G: Database provider survey. Tech. rep., ELIXIR, EMBL-EBI. 2009, [http://dx.doi.org/10.6084/m9.figshare.106310]
Roberts RJ: PubMed Central: The GenBank of the published literature. Proceedings of the National Academy of Sciences of the United States of America. 2001, 98 (2): 381-382. 10.1073/pnas.98.2.381. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=33354]
BMC Bioinformatics | About. [https://www.biomedcentral.com/bmcbioinformatics/about]
Genome Biology | About. [http://genomebiology.com/about]
R Development Core Team: R: A Language and Environment for Statistical Computing. 2011, Vienna, Austria: R Foundation for Statistical Computing, [http://www.R-project.org ISBN 3-900051-07-0]
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics. 2000, 25: 25-29. 10.1038/75556. [http://dx.doi.org/10.1038/75556]
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2011, 39 (Database issue): D32-D37. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3013681]
Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, et al: The Ensembl genome database project. Nucleic Acids Res. 2002, 30: 38-41. 10.1093/nar/30.1.38. [http://nar.oxfordjournals.org/content/30/1/38.full]
Berman HM: The protein data bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235. [http://nar.oxfordjournals.org/cgi/content/abstract/28/1/235]
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012, 40 (Database issue): D109-D114. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3245020]
MySQL :: The world’s most popualr open source database. [https://www.mysql.com/]
Casbon J, Saqi MAS: S4: structure-based sequence alignments of SCOP superfamilies. Nucleic Acids Res. 2005, 33 (Database issue): D219-D22. [http://nar.oxfordjournals.org/cgi/content/abstract/33/suppl_1/D219]
Q — Analysis software for market research. [http://www.q-researchsoftware.com/]
Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247 (4): 536-540. [http://www.ncbi.nlm.nih.gov/pubmed/7723011]
Home — PubMed — NCBI. [https://www.ncbi.nlm.nih.gov/pubmed]
Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Holko M, Ayanbule O, Yefanov A, Soboleva A: NCBI GEO: archive for functional genomics data sets-10 years on. Nucleic Acids Res. 2011, 39 (Database issue): D1005-D10010. [http://nar.oxfordjournals.org/cgi/content/abstract/39/suppl_1/D1005]
Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340. [http://nar.oxfordjournals.org/cgi/content/abstract/32/5/1792]
Bairoch A, Boeckmann B: The SWISS-PROT protein sequence data bankcurrent status. Nucleic Acids Res. 1994, 22 (17): 3578-3580. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=308324]
The UniProt Consortium: Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40 (Database issue): D71-D75. [http://nar.oxfordjournals.org/cgi/content/abstract/40/D1/D71]
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A: Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005, 15 (10): 1451-1455. 10.1101/gr.4086505. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1240089]
Weiss GH: Aspects and Applications of the Random Walk. 1994, North-Holland: Random Materials and Processes
Gooch P: BADREX: In situ expansion and coreference of biomedical abbreviations using dynamic regular expressions. CoRR. 2012, abs/1206.4522: [http://arxiv.org/abs/1206.4522]
Cohen KB, Fox L, Ogren PV, Hunter L: Corpus Design for Biomedical Natural Language Processing. Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. 2005, Detroit: Association for Computational Linguistics, 38-45. [http://www.aclweb.org/anthology/W/W05/W05-1306]
We would like to thank Daniel Jamieson (University of Manchester) for his help in establishing the inter-annotator agreement, and to the many developers at the GATE mailing list for their assistance in various GATE development issues during the project. GD is funded by a studentship from the Biotechnology and Biological Sciences Research Council (BBSRC) to GN, DLR and RS.
The authors declare that they have no competing interests.
GD programmed and ran bioNerDS on the Genome Biology and BMC Bioinformatics journal corpora, and drafted the manuscript. AB helped with the data analysis. GN, DLR and RS initially conceptualised the project and provided continual guidance and discussion. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: Excel spreadsheet containing the normalised top 50 resource mentions for each journal, and the resulting mean and standard deviations for each year, used to generate Figures 4 and 5. It also contains the plotted data curves showing the approximate Gaussian distribution of the data. (XLSX 165 KB)
Additional file 2: Genome Biology and BMC Bioinformatics dataset Excel spreadsheet containing the top 50 resources mentions for each journal by year on the document level as well as the resulting graphs used within this paper. Also includes long tail curve analysis data. (XLSX 277 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Duck, G., Nenadic, G., Brass, A. et al. bioNerDS: exploring bioinformatics’ database and software use through literature mining. BMC Bioinformatics 14, 194 (2013). https://doi.org/10.1186/1471-2105-14-194
- Gene Ontology
- Noun Phrase
- Head Noun
- Random Walk Process
- Genome Biology