Comparison of concept recognizers for building the Open Biomedical Annotator
- Nigam H Shah1,
- Nipun Bhatia†2,
- Clement Jonquet†1,
- Daniel Rubin1,
- Annie P Chiang1 and
- Mark A Musen1
© Shah et al; licensee BioMed Central Ltd. 2009
Published: 17 September 2009
The National Center for Biomedical Ontology (NCBO) is developing a system for automated, ontology-based access to online biomedical resources. The system's indexing workflow processes the text metadata of diverse resources such as datasets from GEO and ArrayExpress to annotate and index them with concepts from appropriate ontologies. This indexing requires the use of a concept-recognition tool to identify ontology concepts in the resource's textual metadata. In this paper, we present a comparison of two concept recognizers – NLM's MetaMap and the University of Michigan's Mgrep. We use a number of data sources and dictionaries to evaluate the concept recognizers in terms of precision, recall, speed of execution, scalability and customizability. Our evaluations demonstrate that Mgrep has a clear edge over MetaMap for large-scale, service-oriented applications. Based on our analysis, we also suggest areas of potential improvement for Mgrep. We have subsequently used Mgrep to build the Open Biomedical Annotator service. The Annotator service has access to a large dictionary of biomedical terms derived from the Unified Medical Language System (UMLS) and NCBO ontologies. The Annotator also leverages the hierarchical structure of the ontologies and their mappings to expand annotations. The Annotator service is available to the community as a REST Web service for creating ontology-based annotations of their data.
Introduction and background
There continues to be a tremendous increase in the amount, diversity, and rate of generation of high-throughput datasets as well as exponential growth in the biomedical literature. Since 1999, Gene Ontology (GO) annotations of gene products have enabled queries to accurately identify gene products associated with a particular cellular component, biological process or molecular function. Similarly, creation of annotations for public data resources based on other shared ontologies would enable researchers to locate datasets, tissue samples, and clinical trials that relate to a given disease. This capability would permit a whole new class of integrative analyses. However, due to the size of the data and the complexity of the task involved, adding ontology-based annotations to online data repositories manually on a case-by-case basis is unlikely ever to scale.
At the National Center for Biomedical Ontology (NCBO), we are developing methods to annotate large numbers of data resources automatically, and have developed a prototype system for ontology-based annotation and indexing of biomedical data. The key functionality of this system is to provide a service that enables users to locate biomedical data resources related to particular ontology concepts. The system processes the textual metadata of diverse biomedical data resources (such as gene-expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed abstracts), annotating and indexing them with concepts from appropriate ontologies.
A critical step that our system performs is to recognize a given ontology concept in the text metadata of a record in some online data resource. This task is generally referred to as concept recognition. A core aspect of concept recognition is a lexicon (or dictionary), usually derived from a taxonomy or ontology, to which text is mapped. In the biomedical domain, the Unified Medical Language System (UMLS) is an extensive resource that incorporates a number of disparate terminologies and ontologies and that provides a cross-referencing of related concepts. However, efforts to map public, open biomedical resources to semantically rich thesauri such as the UMLS metathesaurus have been scattered. Barring a few initiatives [1, 4], most efforts to date have focused on mapping text from patient records to UMLS, rather than on mapping metadata from online biomedical resources [5, 6].
Most previous work in concept recognition in bioinformatics has been restricted to the identification of protein and gene names [7–9], with a few groups attempting to identify concepts representing relationships among entities. This trend is obvious when looking at popular tools such as EBIMed and TextPresso, both of which identify genes or proteins in documents, but struggle to identify disease names [10, 11]. The same emphasis was visible in the BioCreative text-processing challenge, which was primarily concerned with recognizing gene and protein names.
In the field of clinical informatics, the efforts to recognize concepts in text have focused on finding disease names in electronic medical records, discharge summaries, clinical guideline descriptions, and clinical-trial summaries [5, 6, 12]. However, electronic medical records are seldom made "public" as online biomedical resources. As a result, current methods and tools are usually not portable to a different problem category, such as processing the metadata of public, open biomedical resources.
In recent times, there has been a shift in the focus of research from individual genes and proteins to entire biological systems. As a result, researchers need services that can process the metadata of diverse resources to annotate and index them with concepts from appropriate ontologies, and that can enable the researchers to locate resources related to particular ontology concepts. Concept recognition is a key step for such systems.
NLM's MetaMap was one of the first tools for recognizing UMLS concepts. It is widely regarded as the gold standard for this task. Recently, a number of tools such as Mgrep and MTag have appeared that also perform concept recognition. The advent of these new tools has made the task of evaluating concept recognizers particularly important.
We conducted a survey of existing concept recognizers based on their published reports, and selected MetaMap and Mgrep as the two tools to evaluate for our purposes. This paper provides a comparison of NLM's MetaMap and the University of Michigan's Mgrep. We chose Mgrep because it is claimed to be a fast and scalable tool for concept recognition with a high degree of customizability vis-à-vis dictionaries and resources. Considering the vast number of biomedical resources and ontologies available, factors of speed, scalability, and customizability are of prime concern in developing a concept-recognition system.
In the remaining part of the paper, we first give a brief outline of the concept-recognition task and discuss our data sources and dictionaries. We explain the evaluation methodology adopted and the results obtained. We then discuss the performance of concept recognizers based on a number of performance metrics such as precision and recall. We also analyze the suitability of a concept recognizer based on a number of subjective parameters such as ease of use, ability to customize, and scalability. We then describe how we used Mgrep to build the Open Biomedical Annotator Web Service. We conclude with a summary of our findings.
Size and number of elements of data sources
A dictionary with respect to a concept recognizer is the set of terms or concepts that we aim to recognize in the data. Dictionaries for concept recognition are most commonly derived from taxonomies and ontologies about the domain of interest. Analogous to data sources, dictionaries can be specialized along different axes, such as diseases and anatomical parts. As most of the work in biomedical informatics has focused on recognizing genes or proteins, the dictionaries for genomics and proteomics are comprehensive and extensively evaluated. The same is not true for dictionaries pertaining to diseases, body parts, biological processes, drug names, and so on.
The size and number of concepts in each of the dictionaries SNOMED-CT, Diseases, FMA and Biological Processes (from GO)
We constructed a workflow for performing the evaluations. It provides a platform on which to plug in the data sources, run the concept-recognition tools, and map the tool-specific output to a common format. Ideally, this task should have been done using a framework such as IBM's UIMA, but neither MetaMap nor Mgrep is available as a UIMA component. First, we randomly selected 200 lines from each data source and converted these sources from their native format to a format suitable for input to the concept-recognition tool. For example, the ArrayExpress data are commonly available in XML format; however, the University of Michigan's Mgrep requires the data to be in a three-column tab-delimited format. The next step involved running the concept-recognition tool and obtaining the processed file in a format specific to each tool. In the final step, we converted the output files of the different concept recognizers to a common format to ensure uniformity and to aid in performing comparative analysis. A total of four experts examined the resultant files for scoring true positives and false positives. We attempted to estimate recall by assuming a false negative result if no concept was identified. In addition, our evaluation considered customizability and scalability.
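The first workflow step (random sampling followed by conversion to a tab-delimited input file) can be sketched as follows. This is only an illustration: the paper does not state what the three columns contain, so the (record_id, field, text) layout and the function names are assumptions, not Mgrep's documented input format.

```python
import random


def sample_records(records, k=200, seed=0):
    """Randomly pick up to k records, reproducibly for a given seed."""
    if len(records) <= k:
        return list(records)
    return random.Random(seed).sample(records, k)


def to_three_column_tsv(records):
    """Serialize (record_id, field, text) triples as tab-delimited lines.

    The meaning of the three columns is an assumption for illustration.
    """
    lines = []
    for record_id, field, text in records:
        # Collapse tabs and newlines in the text so each record stays on
        # one line with exactly three columns.
        clean = " ".join(str(text).split())
        lines.append(f"{record_id}\t{field}\t{clean}\n")
    return "".join(lines)
```

Collapsing whitespace inside the text field is a design choice of this sketch; it keeps one record per line at the cost of losing the original line breaks.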
We define the qualitative measure of customizability of a concept recognizer as the ease with which a dictionary and a data source can be configured for it.
We define scalability by how easily a concept recognizer handles different sizes of dictionary and resource.
Total number of concepts recognized by Mgrep and MetaMap across all resources using the biological process and diseases dictionaries
Total number of concepts recognized by Mgrep and MetaMap across all resources using the Foundational Model of Anatomy and SNOMED-CT as dictionaries
Precision of Mgrep and MetaMap using Biological Processes as the dictionary
Precision of Mgrep and MetaMap using the 'diseases' dictionary
In general, Mgrep has a higher precision in recognizing Biological Processes. When considering precision, Mgrep outperforms MetaMap in almost all cases, the exception being the recognition of Biological Processes in records from ClinicalTrials.gov.
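Precision here is the fraction of recognized concepts that the experts scored as true positives, TP / (TP + FP). A minimal sketch of how such scores could be tallied per tool and resource is given below; the tuple layout of the judgments is an assumption made for illustration, and no values from the study are reproduced.

```python
from collections import defaultdict


def precision_by_group(judgments):
    """Compute precision per (tool, resource) pair.

    `judgments` is an iterable of (tool, resource, is_correct) tuples,
    one per recognized concept scored by an expert (assumed layout).
    """
    counts = defaultdict(lambda: [0, 0])  # [true positives, false positives]
    for tool, resource, is_correct in judgments:
        counts[(tool, resource)][0 if is_correct else 1] += 1
    # Every key has at least one judgment, so the denominator is never zero.
    return {key: tp / (tp + fp) for key, (tp, fp) in counts.items()}
```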
Building the Open Biomedical Annotator
Currently, there are over 1000 public biomedical data resources listed in the Nucleic Acids Research (NAR) online Molecular Biology Database Collection. There are many more that are not listed by NAR. Across all such databases, ontology-based annotation of their records is not as widespread as desired. There are several reasons for this limitation:
- Annotation often needs to be done manually, either by expert curators or by the authors of the data (e.g., when a new Medline entry is created, it is manually indexed with MeSH terms).
- The number of biomedical ontologies available for use is large; ontologies change often and frequently overlap. The ontologies are not in the same format and are not always accessible via Application Programming Interfaces (APIs).
- Annotation is often a boring additional task without immediate reward for the user.
Even though ontologies are available as a one-stop-shop via BioPortal, the task of actually using the ontologies for annotation is non-trivial. Moreover, it has been shown that manual annotation efforts are unlikely to scale and that automated methods are required. One of the core aims of NCBO is to provide annotation tools that enable the use of ontologies for annotation and that reduce the manual overhead for creating ontology-based annotations.
Based on the results of our comparison between Mgrep and MetaMap (see Discussion), and because Mgrep had significantly faster execution time and can work with non-UMLS dictionary sources (which MetaMap cannot), we decided to use Mgrep as the initial concept recognizer for building the Open Biomedical Annotator Web service. A detailed description of the Annotator Web service is provided elsewhere; we briefly review the key features here.
The Annotator service: (1) processes the raw textual metadata of online biomedical resources and tags them with relevant biomedical ontology concepts and (2) returns the annotations to end users. The Annotator Web service allows end users to utilize ontologies for annotation of biomedical data with minimal effort.
The choice of the set of ontologies used to create the dictionary depends on the type of biomedical data the Web service is used to annotate. For instance, if a user wants to annotate gene-expression datasets with disease names, then SNOMED-CT and the NCI Thesaurus could be used. The output of the first step is a set of direct annotations.
This primary set of annotations serves as input for the semantic expansion components, which enhance the annotations extracted from the first step using the hierarchical structure of ontologies as well as mappings between them. For example, an is-a transitive closure component traverses an ontology parent-child hierarchy to create new annotations with parent concepts. For instance, if data are annotated with a concept from the NCI Thesaurus, such as melanoma, this component generates a new annotation with the term skin neoplasm, because the NCI Thesaurus provides the knowledge that melanoma is a kind of skin neoplasm. A semantic-distance component uses a given notion of concept similarity (or semantic distance) to obtain related concepts and create new annotations. An ontology-mapping component creates new annotations based on existing mappings between different ontologies. For example, an annotation done with concept C0025202 (melanoma) in the NCI Thesaurus can be expanded to another one within SNOMED-CT because the UMLS metathesaurus provides the mapping information. The Annotator Web service is designed in a manner that allows multiple semantic expansion components to be plugged in, selected, and parameterized by a user when requesting the service. As the result of the second step, the direct annotations and several sets of semantically expanded annotations are extracted and returned to the user.
Annotations performed with the service have implicit semantics that declare that a given dataset (or record) is about (or references) a certain concept. Concepts are identified by UMLS Concept Unique Identifier (CUI) or National Center for Biomedical Ontology (NCBO) Uniform Resource Identifier (URI). The context of the annotation asserts whether the annotation is direct or semantically expanded. In the latter case, the component used to produce the expanded annotation is described along with the concept from which the new annotation is derived. For example, the annotation [C0431097-ISA_CLOSURE-C0025202] states that the given text was annotated with the concept C0431097 (malignant melanocytic lesion) using the is-a relations of the concept C0025202 (melanoma).
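The is-a transitive closure component described above can be illustrated with a small sketch. It is not the Annotator's implementation: the `parents` mapping, the function names, and the tuple layout of an expanded annotation are assumptions, and any hierarchy passed in would be a toy stand-in for a real ontology.

```python
def isa_closure(concept, parents):
    """Return every ancestor of `concept` reachable through is-a links.

    `parents` maps a concept to the list of its direct parents.
    """
    ancestors = []
    seen = {concept}
    stack = list(parents.get(concept, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        ancestors.append(current)
        stack.extend(parents.get(current, ()))
    return ancestors


def expand_annotations(direct, parents):
    """Create one expanded annotation per (ancestor, source concept) pair."""
    expanded = []
    for concept in direct:
        for ancestor in isa_closure(concept, parents):
            expanded.append((ancestor, "ISA_CLOSURE", concept))
    return expanded
```

With a toy mapping such as `{"melanoma": ["skin neoplasm"]}`, a direct annotation with melanoma yields an expanded annotation with skin neoplasm, mirroring the example in the text.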
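A client that receives expanded annotations in the bracketed form shown above might split them into their three parts as sketched here. The sketch assumes the pattern generalizes from the single example in the text and handles only CUI-style identifiers (the letter C followed by seven digits); NCBO URIs and direct annotations are out of scope, and the type and function names are hypothetical.

```python
import re
from typing import NamedTuple


class ExpandedAnnotation(NamedTuple):
    concept: str         # concept assigned by the expansion
    component: str       # expansion component, e.g. ISA_CLOSURE
    source_concept: str  # concept from which the annotation was derived


# Pattern inferred from the one example in the text (an assumption).
_EXPANDED = re.compile(r"^\[(C\d{7})-([A-Z_]+)-(C\d{7})\]$")


def parse_expanded(text):
    """Parse a string such as '[C0431097-ISA_CLOSURE-C0025202]'."""
    match = _EXPANDED.match(text.strip())
    if match is None:
        raise ValueError(f"not an expanded annotation: {text!r}")
    return ExpandedAnnotation(*match.groups())
```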
Evaluation of the service API
We have conducted tests simulating both single users and up to 10 concurrent users accessing the service API. For each test, we selected 400 records from Medline and submitted the title and abstract to the Annotator service. The records were selected at random. For each record, we measured the response time and recorded the number of words in the title and abstract. On average, the service responds in 1.8 seconds when the mean input word count is 180 words. The service responds in 2.3 seconds when the mean input word count is 280 words. When simulating 10 simultaneous users, the response time is between 4.5 and 5.0 seconds for 280 words.
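A test of this kind can be approximated with a small timing harness. The sketch below is an assumption about how such a measurement could be set up, not the harness used in the study: `annotate` stands for any callable that submits one text to the service, and no real endpoint or client library is implied.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def timed_call(annotate, text):
    """Return (word_count, elapsed_seconds) for a single request."""
    start = time.perf_counter()
    annotate(text)
    return len(text.split()), time.perf_counter() - start


def run_load_test(annotate, texts, concurrent_users=1):
    """Submit every text, with at most `concurrent_users` requests in flight.

    `texts` must be non-empty, otherwise the means are undefined.
    """
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(lambda t: timed_call(annotate, t), texts))
    return {
        "requests": len(results),
        "mean_words": mean(words for words, _ in results),
        "mean_seconds": mean(seconds for _, seconds in results),
    }
```

A thread pool is used here because a request to a remote service spends most of its time waiting on the network; the number of workers plays the role of the number of simultaneous users.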
Discussion and future work
We identified the following considerations in selecting a concept recognizer for creating an automated ontology-based annotation service: (1) ability to work with non-UMLS terminologies; (2) ability to work offline vs. online (annotation of user-submitted data as a service); (3) high speed as well as accuracy in terms of precision and recall.
By design, NLM's MetaMap is very tightly coupled with the UMLS. This makes mapping text to UMLS concepts very easy. However, generating a custom dictionary for annotation that uses concepts from outside UMLS is non-trivial. MetaMap requires the dictionary to be in a specific format with certain database tables always present. Some applications, such as the Open Biomedical Resources Index under development by the NCBO [1, 17], use a number of different dictionaries from not only UMLS but also other sources for which terms are not present in the UMLS. Formatting such dictionaries into the format required by MetaMap is not always possible without a major effort. With respect to the input data, MetaMap is very adaptable and easy to customize. It does not require the input sources to be structured in any particular way.
In terms of speed of execution, MetaMap requires much more processing time than does Mgrep. For example, Mgrep can process one fifth of the data from ClinicalTrials.gov in 7 seconds, whereas MetaMap runs for over 8 minutes. This makes MetaMap unsuitable for developing an online annotation service. However, the powerful lexical capability of MetaMap allows it to find about four times more concepts than Mgrep.
One of the standout features of Mgrep is its fast execution and scalability across all the dictionaries and data resources tested. However, Mgrep identifies a large number of concepts that are redundant – concepts recognized at the same position in the input string – and overall the number of unique concepts recognized is less than with MetaMap (Tables 3 and 4).
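The distinction between total matches and unique concepts can be made concrete with a short sketch that groups recognizer output by position in the input string. The (concept_id, start, end) tuple layout is an assumption for illustration; it is not the output format of Mgrep or MetaMap.

```python
from collections import defaultdict


def group_by_position(matches):
    """Map each (start, end) span to the set of concepts found there."""
    by_span = defaultdict(set)
    for concept_id, start, end in matches:
        by_span[(start, end)].add(concept_id)
    return dict(by_span)


def summarize(matches):
    """Count raw matches, distinct text positions, and unique concepts."""
    by_span = group_by_position(matches)
    return {
        "total_matches": len(matches),
        "distinct_positions": len(by_span),
        "unique_concepts": len({cid for cid, _, _ in matches}),
    }
```

Spans that map to more than one concept in `group_by_position` correspond to the redundant matches described above.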
Mgrep is easily customizable to accept variation in the formats of both the input data and the dictionary, making it very easy to use for custom applications. It requires the dictionary to be in an easy-to-create, two-column, tab-delimited file and similarly requires the resources to be in tab-delimited files. Mgrep places no rigid requirements on the structure and presence of concepts.
Mgrep shows higher precision than does MetaMap across most resources and dictionary types, possibly at the expense of some loss in recall. In the past we have used sampling along with a simplifying assumption – that at least one concept must be assigned to each record – to estimate recall. In this study, that assumption is not always valid; e.g., there is no expectation that each record from Goldminer will be annotated with a biological process term. Hence we do not provide an estimate of recall. Recently a group from EBI has released a manually annotated corpus used in evaluating recall of recognizing disease names. In future work, we can use this corpus for estimating recall of Mgrep for one dictionary – the 'diseases' dictionary.
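As an illustration of how little preparation a two-column dictionary needs, the sketch below flattens a mapping from concepts to their terms into tab-delimited rows. The text does not say what the two columns hold, so the (concept identifier, term) layout and the function names are assumptions rather than Mgrep's documented format.

```python
def dictionary_rows(concepts):
    """Flatten {concept_id: [term, ...]} into sorted, unique rows."""
    rows = set()
    for concept_id, terms in concepts.items():
        for term in terms:
            # Collapse internal whitespace so a term never contains a tab.
            normalized = " ".join(term.split())
            if normalized:
                rows.add((concept_id, normalized))
    return sorted(rows)


def write_dictionary(rows):
    """Render rows as two-column, tab-delimited text (assumed layout)."""
    return "".join(f"{concept_id}\t{term}\n" for concept_id, term in rows)
```

One row per term means that a concept with several synonyms simply occupies several lines.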
The resulting service can be integrated into current programs and workflows; current response times for the Annotator Web service are about 1.8 seconds for 180 words and 2.3 seconds for 280 words.
Our service uses public ontologies both to create annotations and to expand them.
Our service has access to one of the largest available sets of publicly available biomedical ontologies from the UMLS metathesaurus and the NCBO BioPortal repository. The current implementation of the service uses a selection of 206 biomedical ontologies that gives a dictionary of 4,021,662 unique concepts and 7,637,125 terms.
Future work will concentrate on three main areas that will determine the widespread adoption of the Annotator Web service: (1) enhancement of the concept-recognition step by using advanced natural language processing techniques and eventually recognizing 'relations,' (2) customizability of the service parameters, and (3) ability to plug in concept recognizers other than Mgrep in the service. There are existing groups that already provide concept recognition as a service [22, 23]. However, none of them has access to the scope of ontologies that our service has access to. We are actively working with several such groups to provide access to our ontologies for use in their concept recognition engines as well as to allow access to their concept recognizers within our Annotator Web service.
MetaMap places a rigid constraint on the dictionary structure and cannot be used for applications that require dictionaries outside of the UMLS (such as those from the Open Biomedical Ontology library). Because of its slow speed, it cannot be used for many real-time applications or for applications in which either the data sources or the dictionary changes frequently, requiring recurrent reprocessing. Mgrep has extremely fast execution speed, but fewer concepts are recognized. If future versions of Mgrep provide the ability to generate lexical variants, recall would be enhanced and Mgrep could become the concept recognizer of choice for applications that need to process large datasets, that require large dictionaries, or that involve frequent reprocessing.
Ontology-based annotation of biomedical data plays a crucial role in enabling data interoperability and the making of translational discoveries. This situation is also true for e-science generally. The need to switch from the current Web to a semantic Web with semantically rich content annotated using ontologies has been clearly identified. Meeting this need requires services (usable by humans and software agents) that can be integrated into existing data curation and annotation workflows.
We have used Mgrep to create a Web service for ontology-based annotation of biomedical data. Our Annotator service has access to a large dictionary, which is composed of UMLS and NCBO ontologies. Our Annotator service is not limited to the syntactic recognition of terms, but also leverages the structure of the ontologies to expand annotations.
The Annotator service workflow is currently used in a project within NCBO to annotate a large number of public biomedical resources. The Annotator Web service is also available to the community for creating ontology-based annotation of their data. The service can be customized to their specific needs (in terms of annotation parameters and biomedical ontologies used).
NHS, CJ, NB, DR and MAM acknowledge support from NIH grant U54 HG004028. This work is supported by the National Center for Biomedical Computing (NCBC) National Institutes of Health roadmap initiative; NIH grant U54 HG004028. We acknowledge the assistance of Manhong Dai and Fan Meng at the University of Michigan in setting up and using Mgrep.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 9, 2009: Proceedings of the 2009 AMIA Summit on Translational Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S9.
1. Shah NH, et al.: Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics 2009, 10(Suppl 2):S1.
2. Baumgartner WA Jr, et al.: Manual curation is not sufficient for annotation of genomic databases. Bioinformatics 2007, 23(13):i41–8.
3. Jonquet C, Musen MA, Shah NH: A System for Ontology-Based Annotation of Biomedical Data. In International Workshop on Data Integration in The Life Sciences (DILS). Evry, France 2008.
4. Butte AJ, Kohane IS: Creation and implications of a phenome-genome network. Nat Biotechnol 2006, 24(1):55–62.
5. Reeve LH, Han H: CONANN: An Online Biomedical Concept Annotator. Lecture Notes in Computer Science 2007, 4544: 264.
6. Hersh W, Leone TJ: The SAPHIRE server: a new algorithm and implementation. Proc Annu Symp Comput Appl Med Care 1995, 858–62.
7. Hirschman L, et al.: Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 2005, 6(Suppl 1):S1.
8. Zhou G, et al.: Recognition of protein/gene names from text using an ensemble of classifiers. BMC Bioinformatics 2005, 6(Suppl 1):S7.
9. Settles B: ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 2005, 21(14):3191–2.
10. Muller HM, Kenny EE, Sternberg PW: Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2004, 2(11):e309.
11. Rebholz-Schuhmann D, et al.: Protein annotation by EBIMed. Nat Biotechnol 2006, 24(8):902–3.
12. Moskovitch R, et al.: A Comparative Evaluation of Full-text, Concept-based, and Context-sensitive Search. J Am Med Inform Assoc 2007, 14(2):164–174.
13. Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 2006, 7(2):119–29.
14. Aronson AR: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp 2001, 17–21.
15. Dai M, et al.: An Efficient Solution for Mapping Free Text to Ontology Terms. AMIA Summit on Translational Bioinformatics. San Francisco, CA 2008.
16. Jin Y, et al.: Automated recognition of malignancy mentions in biomedical literature. BMC Bioinformatics 2006, 7: 492.
17. Shah NH, et al.: Annotation and query of tissue microarray data using the NCI Thesaurus. BMC Bioinformatics 2007, 8: 296.
18. Noy NF, Shah NH, Whetzel PL, Dai B, Dorf M, Griffith NB, Jonquet C, Rubin DL, Smith B, Storey MA, Chute CG, Musen MA: BioPortal: Ontologies and Integrated Data Resources at the Click of a Mouse. Nucleic Acids Res 2009.
19. Jonquet C, Shah NH, Musen MA: The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics. San Francisco 2009.
20. NCBO: Annotator Web Service. 2009. [http://www.bioontology.org/wiki/index.php/Annotator_Web_service] [cited 2009 May 31]
21. Jimeno A, et al.: Assessment of disease named entity recognition on a corpus of annotated sentences. BMC Bioinformatics 2008, 9(Suppl 3):S3.
22. Rebholz-Schuhmann D, et al.: Text processing through Web services: calling Whatizit. Bioinformatics 2008, 24(2):296–8.
23. Hancock D, et al.: Terminizer – Assisting Mark-Up of Text Using Ontological Terms. Nature Precedings 2009. [http://precedings.nature.com/documents/3128/version/1]
24. Handschuh S, Staab S: Annotation for the Semantic Web (Frontiers in Artificial Intelligence and Applications). Fairfax, VA: IOS Press, US; 2003.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.