NOBLE – Flexible concept recognition for large-scale biomedical natural language processing
© Tseytlin et al. 2016
Received: 31 July 2015
Accepted: 22 December 2015
Published: 14 January 2016
Natural language processing (NLP) applications are increasingly important in biomedical data analysis, knowledge engineering, and decision support. Concept recognition is an important component task for NLP pipelines, and can be either general-purpose or domain-specific. We describe a novel, flexible, and general-purpose concept recognition component for NLP pipelines, and compare its speed and accuracy against five commonly used alternatives on both a biological and clinical corpus.
NOBLE Coder implements a general algorithm for matching terms to concepts from an arbitrary vocabulary set. The system’s matching options can be configured individually or in combination to yield specific system behavior for a variety of NLP tasks. The software is open source, freely available, and easily integrated into UIMA or GATE. We benchmarked speed and accuracy of the system against the CRAFT and ShARe corpora as reference standards and compared it to MMTx, MGrep, Concept Mapper, cTAKES Dictionary Lookup Annotator, and cTAKES Fast Dictionary Lookup Annotator.
We describe key advantages of the NOBLE Coder system and associated tools, including its greedy algorithm, configurable matching strategies, and multiple terminology input formats. These features provide unique functionality when compared with existing alternatives, including state-of-the-art systems. On two benchmarking tasks, NOBLE’s performance exceeded commonly used alternatives, performing almost as well as the most advanced systems. Error analysis revealed differences in error profiles among systems.
NOBLE Coder is comparable to other widely used concept recognition systems in terms of accuracy and speed. Advantages of NOBLE Coder include its interactive terminology builder tool, ease of configuration, and adaptability to various domains and tasks. NOBLE provides a term-to-concept matching system suitable for general concept recognition in biomedical NLP pipelines.
Key terms and definitions
A shortened form of a word, name, or phrase.
The tagging of words comprising a mention to assign them to a concept or text feature .
A computer-based system that automatically matches text terms to a code or concept.
An “object of interest.” ; the referent in the semiotic triangle.
A list or dictionary of entities .
Different forms of the same term that occur due to variations in spelling, grammar, etc. .
One or more words and or punctuation within a text which refer to a specific entity.
A defined group of terms and their relationships to each other, within the context of a particular domain .
A logical category of related terms .
A word of high frequency but limited information value (e.g.determiners) that is excluded from a vocabulary to improve results of a subsequent task .
A linguistic unit that has a definable meaning and/or function .
Two general approaches have been used for biomedical concept recognition . Term-to-concept matching algorithms (previously called ‘auto coders’) are general-purpose algorithms for matching parsed text to a terminology. They typically require expert selection of vocabularies, semantic types, stop word lists, and other gazetteers, but they do not require training data produced through human annotation.
In contrast, machine learning NLP methods are used to produce annotators for specific well-defined purposes such as annotating drug mentions [3, 4] and gene mentions [5, 6]. Conditional random fields, for example, have produced excellent performance for specific biomedical NER tasks , but these systems often require training data from human annotation specific to domain and document genre. More recently, incremental approaches have been advocated for certain tasks (e.g. de-identification) . These methods may be most appropriate where specific classes of mentions are being annotated .
Widely used concept recognition systems
Terminology building tools
Noun-phrase, lexical variants
Java API for MMTx
Single word variations,
Closed Source Binary Utility
Command line utility (MGrep) integrated with RESTful API in OBA
Custom dictionaries (MGrep) with UMLS and Bioportal in OBA
Concept Mapper 
Word Lookup Table
Open Source 
cTAKES Dictionary Lookup Annotator 
Noun-phrase, dictionary lookup
Open Source 
Java API with full integration in UIMA
UMLS (RRF), Bar Separated Value (BSV) file
Example scripts available 
cTAKES Fast Dictionary Lookup Annotator 
Rare Word index
Open Source 
Java API with full integration in UIMA
UMLS (RRF), Bar Separated Value (BSV) file
Example scripts available 
Word Lookup Table
Bigram Lookup Table
Open Source 
Command line utility (Perl)
Custom dictionary format
MedLEE  concept recognition
XML based input/output
Word Lookup Table
Open Source 
Java API, UIMA and GATE wrappers
UMLS (RRF), OWL, OBO, BioPortal
Terminology Loader UI
Our laboratory has previously utilized several concept recognition systems in our NLP pipelines for processing clinical text for particular document genre (e.g. pathology reports ) or for specific NLP tasks (e.g. ontology enrichment [28, 29] and (in collaboration) coreference resolution ). In our experience, currently available systems are limited in their ability to scale to document sets in the tens and hundreds of millions, and to adapt to new document genre, arbitrary terminologies specific to particular genre, and new NLP tasks. We developed NOBLE to perform general-purpose biomedical concept recognition against an arbitrary vocabulary set with sufficient flexibility in matching strategy to support several types of NLP tasks. The system can be used for concept annotation of terms consisting of single words, multiple words, and abbreviations against a controlled vocabulary.
In this manuscript, we present the algorithmic and implementation details of NOBLE Coder and describe its unique functionality when compared with these other systems. We benchmarked the accuracy, speed, and error profile of NOBLE Coder against MMTx, MGrep, Concept Mapper, cTAKES Dictionary Lookup Annotator (DLA), and cTAKES Fast Dictionary Lookup Annotator (FDLA), all of which are general-purpose term-to-concept matching systems that have been widely used for biomedical concept recognition.
To perform the subsequent match (Fig. 1b), input text is broken into a set of normalized words and stop words are excluded. The word set is then ranked by frequency of associated terms. Each word is looked up in the WT table to find terms that are associated with the word and include all of the other words in the input text. This term is then added to a candidate list. Once all of the words in the input text have been processed and a set of candidate terms has been generated, each candidate term is looked up in the TC table and its concept is added to the results queue.
NOBLE matching options with examples
Example input text
Example input vocabulary
Example Output based on ParameterSetting (T, F)
Only more comprehensive concepts are coded
A mapped concept may be fully or partially within the boundaries of another concept
“Deep lateral margin”
Words in text that map to concept term must be continuous (within word gap)
“Deep lateral margin”
Order of words in text must be the same as in the concept term
Input text that only partially matches the concept term is coded
No concepts coded
Examples of NOBLE matching strategies produced by combinations of matching options
Combination of matching options
Provides the narrowest meaningful match with the fewest candidates. Best for concept coding and information extraction.
Yes (gap = 1)
Provides as many matched candidates as possible. Best for information retrieval and text mining.
Attempts to minimize the number of false positives by filtering out candidates that do not appear in exactly the same form as in controlled terminology. Similar to best match, but increases precision at the expense of recall.
Yes (gap = 0)
Allows matching of concepts even if the entire term representing it is not mentioned in input text. Best for concept coding with small, controlled terminologies and poorly developed synonymy.
NOBLE Coder is most reminiscent of the IndexFinder [14, 15] algorithm and system (Table 2). However, NOBLE Coder also differs from IndexFinder in several important respects, with significant consequences for its performance, scalability, and extensibility.
First, NOBLE’s terminology representation is fundamentally different from IndexFinder’s terminology representation. IndexFinder uses in-memory lookup tables with integer identifiers to look up words and terms. Constituent words are represented in a separate lookup table and string representation terms are not persisted. While this representation minimizes memory requirements, it loses information about composition of individual terms. Consequently, all words associated with a term must be counted before selecting it as a candidate. In contrast, NOBLE uses a WT table that includes the constituent words and terms (Fig. 1). This supports two major algorithmic improvements: (1) it is possible to avoid looking up a list of terms associated with words that are already present in previously selected candidate terms; and (2) the algorithm can first consider more informative words with fewer associated terms, because the number of terms associated with each word in the WT table is known.
To increase processing speed, NOBLE Coder creates a cache of the word-to-term table that contains 0.2 % of the most frequently occurring words in a dictionary. This table contains a reduced set of terms that include only the cached words. NOBLE determines if a word is cached before using the main WT table. The threshold was empirically chosen based on a series of experiments comparing a range of thresholds.
Second, NOBLE Coder uses hash tables to represent the WT table, and relies on the underlying JVM to handle references to string objects internally. NOBLE Coder also uses JDBM (Java Database Management)  technology to optionally persist those hash tables to disk. This enables NOBLE to input a terminology of any size, since it does not need to fit into memory. IndexFinder leverages a sorted table of run-length identifiers, and therefore trades complexity at build time for a smaller memory footprint. In contrast, NOBLE Coder stores a list of term sizes, requiring more memory but eliminating the requirement to load the entire vocabulary each time the system is initialized.
Third, NOBLE Coder is arguably more extensible and customizable when compared with IndexFinder. NOBLE supports multiple matching strategies enabling the optional inclusion or exclusion of candidates based on (1) subsumption, (2) overlap, (3) contiguity, (4) order, and (5) partial match. NOBLE has been implemented to support a variety of terminology import formats including RRF, OWL, OBO, and Bioportal. The published implementation of IndexFinder was specific to UMLS.
Finally, NOBLE supports dynamic addition and subtraction of concepts to the terminology while processing. This feature can be used as part of a cycle of terminology enrichment by human reviewers or by automated methods.
NOBLE Coder is implemented in Java and is part of a suite of Java NLP tools developed at University of Pittsburgh. In addition to the WT and TC tables described earlier, NOBLE Coder uses additional tables to store word statistics, terminology sources, and concept metadata such as hierarchy information, definitions, semantic types, and source vocabularies. NOBLE Coder uses JDBM 3.0 library as a NoSQL solution to persist these tables to disk.
NOBLE Coder supports several additional features and tools that enhance its utility and usability. These features include (1) matching regular expressions, (2) resolving acronyms to their expanded terms, (3) filtering concepts by source or semantic types, (4) skipping problem words, and (5) selecting a single candidate for a matched term. Another feature that expands usability is the User Interface (UI) for running the concept recognition system and generating results as both tab-delimited files and interactive annotated documents in HTML format.
NOBLE Coder is open source and freely available for academic and non-profit use under the terms of the GNU Lesser General Public License v3. Source code, documentation, and executable, with UIMA and GATE wrappers, are available at https://sourceforge.net/projects/nobletools/. For rapid evaluation, users may run NOBLE Coder in Java Webstart at http://noble-tools.dbmi.pitt.edu/noblecoder.
We have used NOBLE Coder as part of a new NLP pipeline for TIES v5.0  used to processed more than 25 million de-identified clinical documents. The TIES system  is used by clinical and translational researchers at our institution and other institutions across the country.
We compared NOBLE Coder to five widely used, general-purpose concept recognition systems: MGrep, MMTx, Concept Mapper, cTAKES Dictionary Lookup Annotator (DLA), and cTAKES Fast Dictionary Lookup Annotator (FDLA). Because NOBLE Coder was originally written to replace MMTx in our TIES system, we chose to evaluate MMTx rather than its Prolog analog, MetaMap. As described by its NLM developers , MMTx is a Java implementation of MetaMap, which produces only minor differences in comparison. These discrepancies result in large part from MMTx’s tokenization and lexicalization routines. MMTx continues to be widely used throughout the biomedical informatics community, in part because of (1) its simpler installation using a single server and (2) its more flexible vocabulary building process.
System versions and configurations
Parameter settings used for concept recognition systems
Concept recognition system
Parameters changed from standard
Best-mapping = false
Longest match = true
Contiguous match = true
cTAKES Dictionary Lookup Annotator
cTAKES Fast Dictionary Lookup Annotator
Best Match Strategy
To minimize differences due to platform, all systems were run in the context of a cTAKES UIMA pipeline that included standard front-end cTAKES Text Annotators for Tokenization and Sentence Detection. Each coder was entirely wrapped as a UIMA Annotator, with the exception of MGrep. MGrep was partially wrapped as a UIMA Annotator as described below.
Corpora and vocabularies
We used two publicly available, human-annotated corpora for comparison of the five systems. These corpora were selected because they represent very different tasks (clinical text annotation versus biomedical literature annotation) and very different vocabularies (OBO ontologies versus SNOMED-CT), allowing us to better assess generalizability during the benchmarking process. For each corpus vocabulary, we built annotator-ready vocabulary footprints as described below.
The Colorado Richly Annotated Full Text (CRAFT) corpus
The Colorado Richly Annotated Full Text (CRAFT) Corpus  is a publicly available, human annotated corpus of full-text journal articles from a variety of disciplines. Articles focus on mouse genomics. All articles are from the PubMed Central Open Access Subset. Human concept annotations are provided as Knowtator XML files with standoff markup. The evaluation set contained a total of 67 document, 438,705 words, and 87,674 human annotations.
Machine annotations by all five systems were generated using the terminologies and version provided with the CRAFT corpus (CHEBI.obo, CL.obo, GO.obo, NCBITaxon.obo, PR.obo and SO.obo). Based on published recommendations by the corpus developers , we elected to exclude Entrez gene from the human annotations. Therefore, we did not utilize the Entrez gene vocabulary, relying instead on the PR ontology for gene mentions.
The Shared Annotated Resources (ShARe) corpus
We obtained the Shared Annotated Resources (ShARe) corpus  under a Data Use Agreement with the Massachusetts Institute of Technology (MIT). The ShARe corpus consists of de-identified clinical free-text notes from the MIMIC II database, version 2.5, that was used for the SemEval 2015 challenge . Document genres include discharge summaries, ECG (electrocardiogram) reports, echocardiogram reports, and radiology reports. Human annotations are provided as Knowtator XML files with standoff markup. For the purposes of this study, we used the entire training set of documents. The evaluation set contained a total of 199 documents, 94,243 words, and 4,211 human annotations.
Machine annotations by all five systems were generated using the same terminology, version, and semantic types as designated by the ShARe schema [37, 38]. SNOMED CT was obtained from UMLS version 2011AA and filtered by semantic type to include only Congenital Abnormality, Acquired Abnormality, Injury or Poisoning, Pathologic Function, Disease or Syndrome, Mental or Behavioral Dysfunction, Cell or Molecular Dysfunction, Experimental Model of Disease, Anatomical Abnormality, Neoplastic Process, and Signs and Symptoms.
Vocabulary build process
For all concept recognition systems except NOBLE Coder, significant additional effort was required to import the vocabulary into a format that could be used by the system. Additionally, some systems only worked when the concept code was in the form of a UMLS CUI string (e.g. cTAKES DLA and FDLA), while others (e.g. MGrep) expected integers. In order to standardize formats and to account for these different representations, we used UMLS Rich Release Format (RRF) as the least common denominator for all vocabulary builds. We note that NOBLE already ingests many different formats (including OBO, OWL, BioPortal and RRF). For the sake of consistency, we utilized RRF as the input format for NOBLE Coder too.
To construct coder-ready terminologies from CRAFT, we developed Java code to convert OBO to Rich Release Format (RRF). These transformational tools are also made publicly available at https://sourceforge.net/projects/nobletools/. We generated UMLS-like identifiers in a cui-to-code mapping table to enable subsequent comparison to the gold annotations.
To construct coder-ready terminologies for ShARe, we used UNIX command line tools (grep and awk) to extract SNOMED CT terminology, filter to the required semantic types , and output RRF.
Vocabulary build steps required for each system
Concept Recognition Systems
Dictionary Data Structure Used by Coder
Used MetamorphoSys to convert RRF to ORF and used bundled data file builder to create terminology for each corpus; this process required significant user interaction and took many hours
Sent RRF files for both corpora to the MGrep authors and received from them a tab delimited text file that could be used with the MGrep system enriched with LVG; there is limited publicly available information about the vocabulary format required by MGrep
Wrote custom Java code to convert RRF files to an XML file formatted in the Concept Mapper valid syntax
cTAKES Dictionary Lookup Annotator
Wrote custom Java code to convert RRF files to seed a Lucene Index
cTAKES Fast Dictionary Lookup Annotator
Wrote custom Java code to convert RRF into Bar Separated Values (BSV) file that FDLA imports
Directly imported RRF files
Expert annotations for both corpora were published as Knowtator XML files. We used the Apache XMLDigester package to read these files and transform their content to a set of comma separated value (CSV) files containing the annotation span, and concept unique identifier. Natural Language Processing was performed with the cTAKES UIMA Processing Pipeline (Fig. 3). Documents were initially pre-processed using cTAKES off-the-shelf Tokenizer and Sentence Splitter modules. They were then processed with each of the individual concept recognition components. Each concept recognition system generated a set of concept annotations that were converted into a CSV format matching the format of the gold annotation set.
Measurement of run times
Run time was defined as the time to process all documents in the corpus (preprocessing and coding), as reported by the NLP Framework Tools. Each concept recognition system was run ten times in the context of a simple UIMA pipeline. Measurements were made on a single PC with Intel I7 × 8 processor and 32 GB of RAM, running Ubuntu 12.04. No significant competing processes were running. A total of 120 runs were performed representing each of the six coders, across both corpora, over ten runs. Runs were randomly sequenced to minimize order effect. Resources were freed and garbage was collected before each new run was started. Results are reported as median milliseconds with interquartile range (IQR).
Because MGrep runs as a command line program, we used a two-step process for MGrep in which we first produced a set of annotations cached in JDBM (now MapDB) , and then wrapped a series of hash map lookups with a UIMA Annotator. Run times for MGrep were calculated as the sum of these two steps. In this way, we avoided overly penalizing MGrep for start-up time, which would be incurred with each text chunk.
NLP performance metrics
Standard NLP evaluation metrics were used, including TP, FP, FN, Precision, Recall, and F-1 measure. The quality of the software annotations was determined via comparison to the Expert Annotation Set using GATE’s AnnotDiffTool , which computes a standard Precision, Recall, and F-1 measure using True Positive, False Positive, and False Negative distributions. We ran in average mode, which also considers partial positives (PP). PPs are defined as any inexact overlap between the software proposed annotation and the expert annotation. Scoring with average mode allocates one-half the Partial Positive count into both the Precision and Recall calculations.
To evaluate the overall error profile of NOBLE Coder against other systems, we examined the distribution of all errors (both false positive and false negative) made by NOBLE Coder and determined when these specific NOBLE Coder errors were also made by one or more of the other systems. From this set of all NOBLE Coder errors, we then randomly sampled 200 errors made by NOBLE Coder on each corpus, including 100 false positives and 100 false negatives and yielding a total of 400 errors. Each error in this sampled set was then manually classified by a human judge to determine the distribution of errors based on 10 categories of error types. Categories included errors that could result in false positives (e.g. word sense ambiguity), false negatives (e.g. incorrect stemming), or both false positives and false negatives (e.g. boundary detection). Our categories built on previous error categories used in a study of MMTx , which we further expanded. All 400 errors studied were classified as one of these error types. We then analyzed the frequency with which these specific NOBLE Coder errors were also made by the other systems. For error analyses, we considered only absolute matches and not partial positives, which would have significantly complicated the determination of overlap in errors among all six systems.
Concept Recognition System
Median runtime over 10 runs (ms)**
cTAKES Dictionary Lookup
cTAKES Fast Lookup
cTAKES Dictionary Lookup
cTAKES Fast Lookup
On CRAFT, NOBLE Coder achieved an F1 value of 0.43, ranking second behind the cTAKES Dictionary Lookup Annotator (F1 = 0.47), and slightly better than MMTx (F1 = 0.42), cTAKES Fast Dictionary Lookup Annotator (F1 = 0.41), and Concept Mapper (F1 = 0.40). All five of these coders markedly outperformed MGrep (F1 = 0.19).
On ShARe, NOBLE Coder achieved an F1 value of 0.59, ranking third behind cTAKES Fast Dictionary Lookup Annotator (F1 = 0.62) and MGrep (F1 = 0.62) which were tied. NOBLE Coder was slightly more accurate than MMTx (F1 = 0.58), and performed better than cTAKES Dictionary Lookup Annotator (F1 = 0.53), and Concept Mapper (F1 = 0.51).
With respect to speed, Concept Mapper was the fastest on both corpora, followed by cTAKES Fast Dictionary Lookup Annotator and NOBLE Coder. All three of the fastest annotators completed the CRAFT corpus with a median run speed of less than twenty seconds, and the ShARe corpus with a median run speed of less than seven seconds. MGrep was only slightly slower than NOBLE Coder was and completed annotations of the CRAFT and ShARe corpora in 27.5 and 7.1 s, respectively. In contrast, MMTx completed the annotation of the CRAFT corpus in 10 min and the ShARe corpus in 1 min. cTAKES Dictionary Lookup Annotator was the slowest on both corpora, completing the annotation of the CRAFT corpus in 68 min and the ShARe corpus in 4.3 min.
Analysis of sampled NOBLE Coder errors
Type of error
Incorrectly incorporates words from earlier or later in the sentence, considering them to be part of the concept annotated
0 (0 %)
3 (1.5 %)
Incorrectly assigns more general or more specific concept than gold standard
FP and FN
18 (9 %)
13 (6.5 %)
Concept annotated incorrectly because context or background knowledge was needed
FP and FN, usually FN
72 (36 %)
81 (40.5 %)
Exact match missed
Concept not annotated despite exactly matching the preferred name or a synonym
2 (1 %)
3 (1.5 %)
Annotated concept was not deemed relevant by gold annotators
33 (16.5 %)
51 (25.5 %)
Abbreviation defined in the dictionary had a case-insensitive match, because it did not match a defined abbreviation pattern
18 (9 %)
0 (0 %)
Alternative application of terminology
Gold used obsolete term, term is not in SNOMED, or same term existed in multiple ontologies, resulting in different annotations for same mention
31 (15.5 %)
10 (5 %)
Concept annotated was identical to gold but text span was different than gold
FP and FN
10 (5 %)
20 (10 %)
Word sense ambiguity
Concept annotated was used in different word sense
4 (2 %)
0 (0 %)
Missing or incorrect annotation due to word inflection mismatch between dictionary term and input text
FP and FN, usually FN
12 (6 %)
19 (9.5 %)
200 (100 %)
200 (100 %)
Among the errors sampled from the CRAFT corpus, other error types identified included (with decreasing frequency) an alternative application of the terminology, concept hierarchy errors, abbreviation detection errors, wording mismatch, text span mismatch, word sense ambiguity, and missed exact matches. Among the errors sampled from the ShARe corpus, other error types identified included (with decreasing frequency) text span mismatch, wording mismatch, concept hierarchy errors, an alternative application of the terminology, and boundary detection.
In this manuscript, we introduce NOBLE Coder, a concept recognition system for biomedical NLP pipelines. The underlying algorithm of NOBLE Coder is based on a word lookup table, making it most similar to the IndexFinder [14, 15] and Concept Mapper  systems. However, NOBLE Coder is different from these systems in a number of respects, with consequences for its performance, scalability, and extensibility (Table 1).
Our benchmarking tests showed that NOBLE Coder achieved comparable performance when compared to widely used concept recognition systems for both accuracy and speed. Our error analysis suggests that the majority of the NOBLE Coder errors could be related to the complexity of the corpus, expert inferences made by the annotator, annotation guideline application of the human annotators, and missing (but reasonable) annotations created by NOBLE Coder that were not identified by the gold annotators. Errors due to known weaknesses of concept recognition systems (such as word sense ambiguity and boundary detection) occurred relatively infrequently in both corpora.
Overall, NLP performance metrics (precision, recall, and F1) were low for all systems, particularly on the CRAFT corpus, which was not unexpected. Results of our component benchmarking tests should not be directly compared to evaluations of entire systems, including those tested against these corpora with reported results in the high 70s F-score [37, 38]. We specifically sought to separate out and test the concept recognition system alone without the use of noun phrasing, part of speech tagging, pre-processing of the vocabulary, or filtering results to specific semantic types. When concept recognition is used within the context of a pipeline that performs these other functions, performance can be expected to be higher.
Nevertheless, our results provide additional data to other comparative studies by evaluating recall, precision, and speed for all of these systems across two very different gold standards. No previous study has compared performance on more than one reference standard. Our results demonstrate that some systems generalize more easily than others do. Additionally, no previous study has directly compared the cTAKES Dictionary Lookup Annotator and Fast Dictionary Lookup Annotator with other alternatives. Our results demonstrate the remarkable speed of the cTAKES Fast Dictionary Lookup Annotator with increased accuracy on the ShARe corpus.
The results of our benchmarking tests are consistent with findings of other recent studies that demonstrate the greater precision and greater speed of MGrep over MetaMap [23–25]. However, the low recall of MGrep, for example on the CRAFT corpus, may limit its utility to tasks in which a UMLS-derived vocabulary is used with lexical variant generation. To evaluate speed fairly across systems, we ran all systems within the context of a cTAKES UIMA pipeline. The command line version of MGrep can be expected to run at faster speeds.
A recent study of concept recognition systems using the CRAFT corpus showed that Concept Mapper achieved higher F1 measures when compared to both MetaMap and NCBO annotator (which implements MGrep) across seven of eight ontologies represented in the gold set annotations . We did not specifically compare accuracy against specific ontology subsets, and therefore our results are not directly comparable.
NOBLE Coder has some similarities to Concept Mapper including the use of a word-to-term lookup table and the ability to create different matching strategies. But there are also a large number of differences between the systems, including NOBLE Coder’s use of JDBM, native imports of vocabularies as NoSQL tables, and additional parameters for setting the mapping strategy (e.g. word order and partial match) when compared with ConceptMapper. A particularly important difference is that the NOBLE algorithm is greedy, which is likely to improve performance with very large terminology sets and corpora.
NOBLE Coder also has significant advantages over in-memory systems. Use of heap memory is an advantage if the system will be run on 32-bit machines or in any way that is RAM limited. Disk-based concept recognition systems such as NOBLE Coder, MMTx, and disk-configured deployments of cTAKES DLA and FDLA would likely run with smaller footprints than their in-memory counterparts. Additionally, NOBLE Coder can enable the application programmer to directly manage heap growth, with its utilization of MapDB just-in-time caching technology. None of the other systems has this capability. Memory management aspects of NOBLE make it an excellent choice for NLP processing in cloud-based environments where higher RAM requirements also increase cost.
In addition to performance metrics and speed, usability within common NLP tasks and scenarios is an important criterion to consider when selecting processing resources for an NLP pipeline. We observed significant differences in the degree of effort required to achieve a specific behavior such as Best Match. Systems that expose and explicitly document parameters that can be altered to achieve such behaviors may be preferred for some applications. The terminology build process remains one of the most complex and challenging aspects of working with any general concept recognition system. Of the systems described in Table 1 that support more than a single pre-determined vocabulary, only MMTx and NOBLE provide both tooling and documentation to build the necessary terminology resources. Most systems also have significant limitations in the formats and types of vocabularies that can be used. In large part, NOBLE was developed to address this limitation. It currently supports a variety of terminology import formats including RRF, OWL, OBO, and Bioportal. Finally, of the systems tested, only NOBLE supports dynamic addition and subtraction of concepts to the terminology while processing. This feature can be used as part of a cycle of terminology enrichment by human reviewers or by automated methods.
Results of our benchmarking tests may provide guidance to those seeking to use a general-purpose concept recognition system within the context of their biomedical NLP pipelines. Given specific use cases, individual researchers must prioritize not only (1) precision, (2) recall, and (3) speed, but also (4) generalizability, (5) ability to accommodate specific terminologies and (6) terminology formats, and (7) ease of integration within existing frameworks. The reader should also be aware that any general-purpose annotator must ultimately be evaluated within the context of a larger pipeline system.
NOBLE Coder is an open source software component that can be used for biomedical concept recognition in NLP pipelines. We describe the system and show that it is comparable to the highest performing alternatives in this category. NOBLE Coder provides significant advantages over existing software including (1) ease of customization to user-defined vocabularies, (2) flexibility to support a variety of matching strategies and a range of NLP tasks, and (3) disk-based memory management, making the software particularly well suited to cloud hosted environments.
Availability and requirements
Project name: NOBLE Coder and NOBLE Tools
Project home page: http://noble-tools.dbmi.pitt.edu/
Operating system(s): Linux, Windows, MacOS
Programming language: Java
Other requirements: None; can be used as standalone application, or within UIMA or GATE
License: Gnu LGPL v3
Any restrictions (to use by non-academics): University of Pittsburgh holds the copyright to the software. Use by non-profits and academic groups is not restricted. Licensing for use by a for-profit commercial entity may be negotiated through University of Pittsburgh Office of Technology Management.
We gratefully acknowledge support of our research by the National Cancer Institute (R01 CA132672 and U24 CA180921). Many thanks to Fan Meng and Manhong Dai (University of Michigan) for the use of the MGrep software and for their assistance with the MGrep vocabulary builds. We also thank Guergana Savova, Sean Finan, and the cTAKES team at Boston Children’s Hospital for assistance in use of the cTAKES pipeline and dictionary annotators. Finally, we are indebted the creators of the ShARe and CRAFT corpora, whose significant efforts make these essential resources available for research use.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Friedman C, Elhadad N. Natural language processing in health care and biomedicine. In: Shortliffe EH, Cimino JJ, editors. Biomedical Informatics. London: Springer; 2014. p. 255–84.View ArticleGoogle Scholar
- Cohen KB, Hunter L. Getting started in text mining. PLoS Comput Biol. 2008;4(1):e20. doi:10.1371/journal.pcbi.0040020.PubMedPubMed CentralView ArticleGoogle Scholar
- Doan S, Collier N, Xu H, Duy PH, Phuong TM. Recognition of medication information from discharge summaries using ensembles of classifiers. BMC Med Inform Decis Mak. 2012;12(36). doi:10.1186/1472-6947-12-36Google Scholar
- Tikk D, Solt I. Improving textual medication extraction using combined conditional random fields and rule-based systems. J Am Med Inform Assoc. 2010;17(5):540–4. doi:10.1136/jamia.2010.004119.PubMedPubMed CentralView ArticleGoogle Scholar
- Hsu C-N, Chang Y-M, Kuo C-J, Lin Y-S, Huang H-S, Chung I-F. Integrating high dimensional bi-directional parsing models for gene mention tagging. Bioinformatics. 2008;24(13):286–94. doi:10.1093/bioinformatics/btn183.View ArticleGoogle Scholar
- Mitsumori T, Fation S, Murata M, Doi K, Doi H. Gene/protein name recognition based on support vector machine using dictionary as features. BMC Bioinformatics. 2005;6 Suppl 1:S8. doi:10.1186/1471-2105-6-S1-S8.PubMedPubMed CentralView ArticleGoogle Scholar
- Hanauer D, Aberdeen J, Bayer S, Wellner B, Clark C, Zheng K, et al. Bootstrapping a de-identification system for narrative patient records: cost-performance tradeoffs. Int J Med Inform. 2013;82(9):821–31. doi:10.1016/j.ijmedinf.2013.03.005.PubMedView ArticleGoogle Scholar
- Lu Z, Kao H-Y, Wei C-H, Huang M, Liu J, Kuo C-J, et al. The gene normalization task in BioCreative III. BMC Bioinformatics. 2011;12 Suppl 8:S2. doi:10.1186/1471-2105-12-S8-S2.PubMedPubMed CentralView ArticleGoogle Scholar
- Aronson AR, editor. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. 2001; 17-21.Google Scholar
- Aronson AR. MetaMap: mapping text to the UMLS metathesaurus. Bethesda, MD: NLM, NIH, DHHS; 2006. p. 1-26. Available at http://skr.nlm.nih.gov/papers/references/metamap06.pdf
- Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010;17(3):229–36. doi:10.1136/jamia.2009.002733.PubMedPubMed CentralView ArticleGoogle Scholar
- MetaMap Transfer (MMTx). http://mmtx.nlm.nih.gov/MMTx/. Accessed 23 Jul 2015.
- Dai M, Shah NH, Xuan W, Musen MA, Watson SJ, Athey BD et al. An efficient solution for mapping free text to ontology terms. AMIA Summit on Translat Bioinforma. 2008Google Scholar
- Zou Q, Chu WW, Morioka C, Leazer GH, Kangarloo H. IndexFinder: a knowledge-based method for indexing clinical texts. AMIA Annu Symp Proc. 2003; 763–767Google Scholar
- Zou Q, Chu WW, Morioka C, Leazer GH, Kangarloo H. IndexFinder: a method of extracting key concepts from clinical texts for indexing. Proc AMIA Symp. 2003:763-7Google Scholar
- Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc. 2004;11(5):392–402. doi:10.1197/jamia.M1552.PubMedPubMed CentralView ArticleGoogle Scholar
- Berman JJ. Doublet method for very fast autocoding. BMC Med Inform Decis Mak. 2004;4:16. doi:10.1186/1472-6947-4-16.PubMedPubMed CentralView ArticleGoogle Scholar
- Berman JJ. Automatic extraction of candidate nomenclature terms using the doublet method. BMC Med Inform Decis Mak. 2005;5:35. doi:10.1186/1472-6947-5-35.PubMedPubMed CentralView ArticleGoogle Scholar
- Tanenblatt MA, Coden A, Sominsky IL. The ConceptMapper approach to named entity recognition. 2010. p. 546–51. Proc of 7th Language Resources and Evaluation Conference.Google Scholar
- Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507–13. doi:10.1136/jamia.2009.001560.PubMedPubMed CentralView ArticleGoogle Scholar
- cTAKES Dictionary Lookup. https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+-+Dictionary+Lookup. Accessed 23 Jul 2015.
- cTAKES 3.2 - Fast Dictionary Lookup. https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+-+Fast+Dictionary+Lookup. Accessed 23 Jul 2015.
- Bhatia N, Shah NH, Rubin DL, Chiang AP, Musen MA. Comparing concept recognizers for ontology-based indexing: MGrep vs. MetaMap. AMIA Summit on Translat Bioinforma. 2009Google Scholar
- Shah NH, Bhatia N, Jonquet C, Rubin D, Chiang AP, Musen MA. Comparison of concept recognizers for building the Open Biomedical Annotator. BMC Bioinformatics. 2009;10 Suppl 9:S14. doi:10.1186/1471-2105-10-S9-S14.PubMedPubMed CentralView ArticleGoogle Scholar
- Stewart SA, Von Maltzahn ME, Abidi SSR. Comparing Metamap to MGrep as a tool for mapping free text to formal medical lexicons. 2012. p. 63–77. Proc of 1st International Workshop on Knowledge Extraction and Consolidation from Social Media.Google Scholar
- Funk C, Baumgartner WJ, Garcia B, Roeder C, Bada M, Cohen KB, et al. Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinformatics. 2014;15:59. doi:10.1186/1471-2105-15-59.PubMedPubMed CentralView ArticleGoogle Scholar
- Crowley RS, Castine M, Mitchell K, Chavan G, McSherry T, Feldman M. caTIES: a grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research. J Am Med Inform Assoc. 2010;17(3):253–64. doi:10.1136/jamia.2009.002295.PubMedPubMed CentralView ArticleGoogle Scholar
- Liu K, Chapman WW, Savova GK, Chute CG, Sioutos N, Crowley RS. Effectiveness of lexico-syntactic pattern matching for ontology enrichment with clinical documents. Methods Inf Med. 2011;50(5):397–407. doi:10.3414/ME10-01-0020.PubMedPubMed CentralView ArticleGoogle Scholar
- Liu K, Mitchell KJ, Chapman WW, Savova GK, Sioutos N, Rubin DL, et al. Formative evaluation of ontology learning methods for entity discovery by using existing ontologies as reference standards. Methods Inf Med. 2013;52(4):308–16. doi:10.3414/ME12-01-0029.PubMedView ArticleGoogle Scholar
- Zheng J, Chapman WW, Miller TA, Lin C, Crowley RS, Savova GK. A system for coreference resolution for the clinical narrative. J Am Med Inform Assoc. 2012;19(4):660–7. doi:10.1136/amiajnl-2011-000599.PubMedPubMed CentralView ArticleGoogle Scholar
- Browne AC, Divita G, Aronson AR, McCray AT. UMLS language and vocabulary tools. AMIA Annu Symp Proc. 2003:798Google Scholar
- The JDBM project. http://jdbm.sourceforge.net/. Accessed 23 Jul 2015.
- TIES. http://ties.dbmi.pitt.edu/. Accessed 23 Jul 2015.
- ShARe/CLEF eHealth evaluation lab. SHARE-Sharing Annotated Resources. https://sites.google.com/site/shareclefehealth/home. Accessed 23 Jul 2015.
- CRAFT: The Colorado Richly Annotated Full Text corpus. SourceForge.net. http://bionlp-corpora.sourceforge.net/CRAFT/. Accessed 23 Jul 2015.
- Bada M, Eckert M, Evans D, Garcia K, Shipley K, Sitnikov D, et al. Concept annotation in the CRAFT corpus. BMC Bioinformatics. 2012;13:161. doi:10.1186/1471-2105-13-161.PubMedPubMed CentralView ArticleGoogle Scholar
- Elhadad N, Pradhan S, Chapman WW, Manandhar S, Savova GK. SemEval-2015 task 14: Analysis of clinical text. Proc of Workshop on Semantic Evaluation. Association for Computational Linguistics. 2015:303-10Google Scholar
- Pradhan S, Elhadad N, Chapman WW, Manandhar S, Savova GK. SemEval-2014 task 7: analysis of clinical text. Proc of Workshop on Semantic Evaluation. Association for Computational Linguistics. 2014:54-62Google Scholar
- GATE annotation diff tool. GATE-General Architecture for Text Engineering. http://gate.ac.uk/sale/tao/splitch10.html#x14-27600010.2. Accessed 23 Jul 2015.
- Divita G, Tse T, Roth L. Failure analysis of MetaMap transfer (MMTx). Medinfo. 2004;11(Pt 2):763–7.Google Scholar
- Definitions of terms used in Information Extraction. Message Understanding Conference. 2005. http://www-nlpir.nist.gov/related_projects/muc/. Accessed 20 Oct 2015.
- de Keizer NF, Abu-Hanna A, Zwetsloot-Schonk J. Understanding terminological systems I: terminology and typology. Methods Inf Med. 2000;39(1):16–21.PubMedGoogle Scholar
- Cimino JJ. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf Med. 1998;37(4-5):394–403.PubMedPubMed CentralGoogle Scholar
- Zeng ML. Construction of controlled vocabularies, a primer. 2005. Online.Google Scholar
- Smith A, Osborne M. Using gazetteers in discriminative information extraction. 2006. p. 133–40. Tenth Conference on Natural Language Learning.Google Scholar
- Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc. 2011;18(5):544–51. doi:10.1136/amiajnl-2011-000464.PubMedPubMed CentralView ArticleGoogle Scholar
- Gruber TR. A translation approach to portable ontology specifications. Knowledge Acquisition. 1993;5(2):199–220.View ArticleGoogle Scholar
- de Keizer NF, Abu-Hanna A. Understanding terminological systems II: experience with conceptual and formal representation of structure. Methods Inf Med. 2000;39(1):22–9.PubMedGoogle Scholar
- Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. 2008. Cambridge University Press.View ArticleGoogle Scholar
- Liu K, Hogan WR, Crowley RS. Natural Language Processing methods and systems for biomedical ontology learning. J Biomed Inform. 2011;44(1):163–79. doi:10.1016/j.jbi.2010.07.006.PubMedPubMed CentralView ArticleGoogle Scholar
- Trask RL. What is a word? University of Sussex Working Papers in Linguistics. 2004.Google Scholar
- MetaMap - a tool for recognizing UMLS concepts in text. http://metamap.nlm.nih.gov/. Accessed 23 Jul 2015.
- Jonquet C, Shah NH, Musen MA. The Open Biomedical Annotator. Summit on Translat Bioinforma. 2009;56–60. Google Scholar
- Index of /d/uima-addons-current/ConceptMapper. http://uima.apache.org/d/uima-addons-current/ConceptMapper. Accessed 23 Jul 2015.
- cTAKES 3.2 Dictionaries and Models. https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+Dictionaries+and+Models. Accessed 23 Jul 2015.
- The Doublet Method medical record scrubber. http://www.julesberman.info/aacom10.htm. Accessed 23 Jul 2015.
- Noble tools. http://noble-tools.dbmi.pitt.edu/noblecoder. Accessed 23 Jul 2015.
- MGrep. Available from University of Michigan.Google Scholar
- Concept Mapper. https://uima.apache.org/sandbox.html#concept.mapper.annotator. Accessed 23 Jul 2015.