There continues to be a tremendous increase in the amount, diversity, and rate of generation of high-throughput datasets as well as exponential growth in the biomedical literature. Since 1999, Gene Ontology (GO) annotations of gene products have enabled queries to accurately identify gene products associated with a particular cellular component, biological process, or molecular function. Similarly, creation of annotations for public data resources based on other shared ontologies would enable researchers to locate datasets, tissue samples, and clinical trials that relate to a given disease. This capability would permit a whole new class of integrative analyses [1]. However, due to the size of the data and the complexity of the task involved, adding ontology-based annotations to online data repositories manually on a case-by-case basis is unlikely ever to scale [2].
At the National Center for Biomedical Ontology (NCBO), we are developing methods to annotate large numbers of data resources automatically, and have developed a prototype system for ontology-based annotation and indexing of biomedical data [3]. The key functionality of this system is to provide a service that enables users to locate biomedical data resources related to particular ontology concepts. The system processes the textual metadata of diverse biomedical data resources (such as gene-expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed abstracts), annotating and indexing them with concepts from appropriate ontologies.
A critical step that our system performs is to recognize a given ontology concept in the text metadata of a record in some online data resource. This task is generally referred to as concept recognition. A core aspect of concept recognition is a lexicon (or dictionary), usually derived from a taxonomy or ontology, to which text is mapped. In the biomedical domain, the Unified Medical Language System (UMLS) is an extensive resource that incorporates a number of disparate terminologies and ontologies and that provides a cross-referencing of related concepts. However, efforts to map public, open biomedical resources to semantically rich thesauri such as the UMLS Metathesaurus have been scattered. Barring a few initiatives [1, 4], most efforts to date have focused on mapping text from patient records to UMLS, rather than on mapping metadata from online biomedical resources [5, 6].
Most previous work in concept recognition in bioinformatics has been restricted to the identification of protein and gene names [7–9], with a few groups attempting to identify concepts representing relationships among entities [10]. This trend is obvious when looking at popular tools such as EBIMed and TextPresso, both of which identify genes or proteins in documents but struggle to identify disease names [10, 11]. The same emphasis was visible in the BioCreative text-processing challenge, which was primarily concerned with recognizing gene and protein names [7].
In the field of clinical informatics, efforts to recognize concepts in text have focused on finding disease names in electronic medical records, discharge summaries, clinical guideline descriptions, and clinical-trial summaries [5, 6, 12]. However, electronic medical records are seldom made "public" as online biomedical resources. As a result, current methods and tools are usually not portable to a different problem category – such as processing the metadata of public, open biomedical resources.
In recent times, there has been a shift in the focus of research from individual genes and proteins to entire biological systems [13]. As a result, researchers need services that can process the metadata of diverse resources to annotate and index them with concepts from appropriate ontologies, and that can enable the researchers to locate resources related to particular ontology concepts. Concept recognition is a key step for such systems.
NLM's MetaMap was one of the first tools for recognizing UMLS concepts [14]. It is widely regarded as the gold standard for this task. Recently, a number of other tools that also perform concept recognition have appeared, such as Mgrep [15] and MTag [16]. The advent of these new tools has made the task of evaluating concept recognizers particularly important.
We conducted a survey of existing concept recognizers based on their published reports, and selected MetaMap and Mgrep as the two tools to evaluate for our purposes. This paper provides a comparison of NLM's MetaMap and the University of Michigan's Mgrep [15]. We chose Mgrep because it is claimed to be a fast and scalable tool for concept recognition with a high degree of customizability vis-à-vis dictionaries and resources. Considering the vast number of biomedical resources and ontologies available, speed, scalability, and customizability are of prime concern in developing a concept-recognition system.
In the remaining part of the paper, we first give a brief outline of the concept-recognition task and discuss our data sources and dictionaries. We explain the evaluation methodology adopted and the results obtained. We then discuss the performance of concept recognizers based on a number of performance metrics such as precision and recall. We also analyze the suitability of a concept recognizer based on a number of subjective parameters such as ease of use, ability to customize, and scalability. We then describe how we used Mgrep to build the Open Biomedical Annotator Web Service. We conclude with a summary of our findings.
Concept recognition
In the domain of biomedical informatics, the task of concept recognition can be understood as mapping biomedical text to a representation of biomedical knowledge consisting of inter-related concepts, usually codified as an ontology or a thesaurus. Figure 1 illustrates the task of a concept recognizer. Most concept recognizers take as input a resource and a dictionary – which can be a flat list or taxonomy of hierarchically related terms – and produce annotated files. The concept recognizer in Figure 1 recognizes the string 'deficient' in the resource and maps it to the concept 'Deficiency' in the dictionary. Most concept recognizers leverage natural-language processing and computational linguistic techniques to some extent.
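The input/output contract described above (a resource and a dictionary in, annotations out) can be illustrated with a minimal Python sketch. This is not the algorithm used by MetaMap or Mgrep; it is a plain case-insensitive dictionary lookup, written under the assumption that the dictionary lists each concept together with the surface terms that should map to it. The names `Annotation`, `build_lexicon`, and `recognize`, as well as the sample dictionary and sentence, are hypothetical and chosen only for illustration.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int    # character offset where the match begins
    end: int      # character offset just past the match
    text: str     # matched string as it appears in the resource
    concept: str  # dictionary concept the string was mapped to


def build_lexicon(dictionary: dict[str, list[str]]) -> dict[str, str]:
    """Invert {concept: [terms]} into a lowercase {term: concept} lookup.

    The concept name itself is also treated as a term.
    """
    lexicon: dict[str, str] = {}
    for concept, terms in dictionary.items():
        for term in [concept, *terms]:
            lexicon[term.lower()] = concept
    return lexicon


def recognize(resource_text: str, dictionary: dict[str, list[str]]) -> list[Annotation]:
    """Return one annotation per dictionary term found in the text."""
    lexicon = build_lexicon(dictionary)
    if not lexicon:
        return []
    # Try longer terms first so multi-word terms win over their substrings.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        Annotation(m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(resource_text)
    ]


# Hypothetical dictionary: the concept 'Deficiency' lists 'deficient' as a term.
dictionary = {"Deficiency": ["deficient"], "Iron": []}
annotations = recognize("Mice deficient in iron transport", dictionary)
for a in annotations:
    print(f"{a.text!r} [{a.start}:{a.end}] -> {a.concept}")
```

In this sketch the string 'deficient' is mapped to the concept 'Deficiency' only because the dictionary lists it explicitly as a term. A real concept recognizer would typically add linguistic processing, such as normalization of lexical variants, so that such forms need not be enumerated by hand.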