Volume 10 Supplement 9
Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease
© South et al; licensee BioMed Central Ltd. 2009
Published: 17 September 2009
Natural Language Processing (NLP) systems can be used for specific Information Extraction (IE) tasks such as extracting phenotypic data from the electronic medical record (EMR). These data are useful for translational research and are often found only in free text clinical notes. A key step for IE is the manual annotation of clinical corpora and the creation of a reference standard, both (1) to support training and validation tasks and (2) to focus and clarify NLP system requirements. These tasks are time consuming, expensive, and require considerable effort on the part of human reviewers.
Using a set of clinical documents from the VA EMR for a particular use case of interest, we identify specific challenges and present several opportunities for annotation tasks. We demonstrate specific methods using an open source annotation tool, a customized annotation schema, and a corpus of clinical documents for patients known to have a diagnosis of Inflammatory Bowel Disease (IBD). We report clinician annotator agreement at the document, concept, and concept attribute level. We estimate concept yield in terms of annotated concepts within specific note sections and document types.
Annotator agreement at the document level for documents that contained concepts of interest for IBD using estimated Kappa statistic (95% CI) was very high at 0.87 (0.82, 0.93). At the concept level, F-measure ranged from 0.61 to 0.83. However, agreement varied greatly at the specific concept attribute level. For this particular use case (IBD), clinical documents producing the highest concept yield per document included GI clinic notes and primary care notes. Within the various types of notes, the highest concept yield was in sections representing patient assessment and history of presenting illness. Ancillary service documents and family history and plan note sections produced the lowest concept yield.
Challenges include defining and building appropriate annotation schemas, adequately training clinician annotators, and determining the appropriate level of information to be annotated. Opportunities include narrowing the focus of information extraction to use case specific note types and sections, especially in cases where NLP systems will be used to extract information from large repositories of electronic clinical note documents.
Much of the detailed phenotypic information that is necessary for translational research is only available in clinical note documents and the breadth of clinical information that can be extracted from these documents is profound. Over the last decade researchers have employed a variety of methods ranging from simple keyword based approaches to increasingly complex natural language processing (NLP) systems to extract information from electronic clinical note documents [1–4]. However, significant modifications must be made to customize NLP systems to extract relevant phenotypic and other types of clinical data from different electronic medical record (EMR) systems. In addition, highly templated note documents like those that exist in the US Veteran's Administration Health Care System (VA EMR) pose specific challenges, and at the same time provide opportunities for development of NLP systems used for information extraction (IE) tasks. Equally challenging is to apply annotation methods to build annotated corpora and associated tasks that can be used to build reference standards required for performance evaluation of those systems. Manual annotation tasks are time consuming, expensive, and require considerable effort on the part of human reviewers.
The graphical user interface used at all Veteran's Administration Medical Centers in the US (VA) is called the Computerized Patient Records System (CPRS) and it provides several user tools that allow direct entry of free text information. One such tool, called the Text Integration Utilities (TIU) package, provides concurrent charting functions giving users the ability to electronically enter free text information into a diverse range of clinical report types. VA provider notes may contain free text information entered as traditional narratives. They may also contain copied and pasted sections from other provider note documents, or may contain highly templated note sections. The TIU package also allows providers to create custom pre-compiled documents or template structures that can be modified by individual clinicians or tailored for the operational needs of each hospital or specific VA service [5–7].
Templated clinical notes provide pre-defined section headings that require free text entry of information in a narrative style. In addition, long strings of symptoms may be present that require completion of check boxes, and embedded information such as headers that include patient name and demographics, active medications, vital signs, or laboratory results stored elsewhere in the VA EMR. Templated notes may also contain user defined formatting, additional white space denoting note sections, or other visual cues. It is assumed that the use of highly templated note documents encourages consistent data collection, allows data consistency checks, and aids in the process of order generation, clinician reminders, and communication. Use of templated note documents and standard section headings is one example where structured data collection has been applied to unstructured data sources.
Standardized documentation of clinical encounters focuses on the use of a predefined conceptual flow of note sections and logically ordered methods of recording pertinent patient information. These structures provide a defined method of clinical diagnosis, documenting performance of medical procedures, and follow-up of patient care. These expectations for documentation are established by medical education and training, as well as by professional societies and standards organizations, and form the basis for medical communication, coding, billing and reimbursements. More recently, with the adoption of the Clinical Document Architecture (CDA) model, the structure and semantics of clinical documentation are being driven towards greater standardization [8].
This pilot project illustrates a practical approach to annotation methods that may aid in information extraction of clinical information from electronic clinical documents. We also sought to demonstrate an open source tool that can be used to conduct annotation of electronic note documents and identify concepts and attributes of interest for a specific clinical use case. Our goal was to build an annotated corpus identifying specific concepts denoting phenotypic, procedural, and medication use information for Inflammatory Bowel Disease (IBD). This includes the complex diseases of Crohn's and ulcerative colitis that have underlying genetic dispositions and are characterized by episodes of exacerbations, and could be considered representative of chronic diseases of interest to translational research. We focus on evaluating the presence of concepts for IBD in specific note sections and document types and demonstrate a practical approach to manual annotation tasks for a specific clinical use case. This approach may reduce the burden of document review when these methods are applied to large clinical data repositories.
This project was carried out at the VA Salt Lake City Health Care System in Salt Lake City, Utah which provides care for nearly 40,000 patients in Utah and surrounding states. Each year the VA provides care to almost 6 million veterans with an estimated 638,000 note documents entered each day at VA facilities nationwide.
Study population and document corpus
In a previous study we conducted a semi-automated review of note documents extracted from the VA EMR using a combination of NLP and string searching coupled with a negation algorithm to identify patients with Inflammatory Bowel Disease (IBD) (n = 91) [9]. For this pilot study we selected the 62 patients from Salt Lake City and a random sample of associated electronic clinical notes for these patients that were generated in a 6-month period (n = 316).
1) Templated note sections: these are structured note sections that contain check lists and are usually in the form of clinical terms with square brackets, boxes, yes/no pick lists etc. These are usually associated with signs, symptoms and evaluation criteria and are found in documents such as nursing and pre-operative assessments. The individual elements of a templated section must be included to infer clinical information and can only be interpreted as a complete string in the context of the template (Figure 1).
2) Pre-defined headings: these denote semi-structured elements and mainly serve as prompts and placeholders for the provider to complete. Examples include chief complaint, history of present illness, medications, laboratory data, etc. Free text following these headings can stand on its own and be generally interpretable by the reader of the note without the associated heading (Figure 2).
Development of the annotation schema and guidelines
Annotation of clinical documents
Concept relevance – describes how relevant the specific concept is within the context of the heading or template. Answers the question: is the concept necessary and relevant for diagnosis given this clinical use case (Table 1 and Figure 5)?
Table: Examples of concepts by concept class and concept attributes (surviving labels: Granular (clinical inference); Signs and Symptoms).
Developing a rules-based consensus set
We reviewed disagreements identified from the completed and merged clinician annotation projects derived from the annotation task. We then developed specific rules to build a consensus set that we could apply programmatically using the following use case specific logic: 1) We selected annotations where spans from each annotator overlapped and attributes had the same values; 2) Where annotation spans overlapped but were not identical, we selected the shorter span; 3) We preserved concepts where one reviewer identified the concept and the other did not; 4) Where annotations overlapped but there was disagreement at the attribute level, we retained the values selected by the senior physician annotator.
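The four rules above can be sketched programmatically. The following is a minimal illustration in Python, not the authors' implementation: the `Annotation` record, the character-offset spans, the pairing of each senior annotation with the first overlapping junior annotation, and the treatment of concept class as one more value resolved in favor of the senior annotator are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int            # character offset where the span begins
    end: int              # character offset where the span ends (exclusive)
    concept_class: str    # e.g. "diagnosis", "medication" (hypothetical labels)
    attributes: dict = field(default_factory=dict)


def overlaps(a, b):
    """True when the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def build_consensus(senior, junior):
    """Merge two annotators' lists for one document using rules 1-4."""
    consensus = []
    used = set()  # indices of junior annotations already paired
    for s in senior:
        idx = next(
            (i for i, j in enumerate(junior)
             if i not in used and overlaps(s, j)),
            None,
        )
        if idx is None:
            consensus.append(s)  # rule 3: only the senior annotator marked it
            continue
        used.add(idx)
        j = junior[idx]
        # Rule 2: keep the shorter of the two overlapping spans.
        shorter = min((s, j), key=lambda a: a.end - a.start)
        # Rules 1 and 4: when attributes agree the choice is moot;
        # when they differ, the senior annotator's values are retained.
        consensus.append(
            Annotation(shorter.start, shorter.end,
                       s.concept_class, dict(s.attributes))
        )
    # Rule 3: concepts only the junior annotator marked are preserved too.
    consensus.extend(j for i, j in enumerate(junior) if i not in used)
    return sorted(consensus, key=lambda a: (a.start, a.end))
```

Pairing on the first overlapping span keeps the sketch short; a production version would need a policy for one annotation overlapping several from the other annotator, which the rules as stated do not address.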
Annotator agreement and levels of evaluation
We estimate agreement between the two annotators for specific annotation tasks as described by Hripcsak [18, 19] and Roberts [20], using Cohen's Kappa where true negatives were available and F-measure otherwise. We also report the distribution of concepts by concept class and specific attribute, clinical document type, and note section.
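As an illustration of the two agreement measures, the sketch below computes Cohen's Kappa from a 2x2 table of document-level decisions and F-measure from matched and unmatched concept annotations. It uses the standard textbook definitions; the function names and the counts in the usage lines are invented for the example and are not study data, and the confidence interval calculation is omitted because the source does not state how it was derived.

```python
def cohens_kappa(both_yes, only_a, only_b, both_no):
    """Cohen's Kappa for two annotators making a yes/no decision per unit.

    both_yes / both_no: units where the annotators agree.
    only_a / only_b: units marked "yes" by only one annotator.
    Undefined (division by zero) when chance agreement equals 1.
    """
    n = both_yes + only_a + only_b + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + only_a) / n
    b_yes = (both_yes + only_b) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (observed - expected) / (1 - expected)


def f_measure(matched, only_a, only_b):
    """F-measure when true negatives are not countable.

    Treating either annotator as the reference gives the same value:
    2 * matched / (2 * matched + only_a + only_b).
    """
    return 2 * matched / (2 * matched + only_a + only_b)


# Illustrative counts only (not taken from the study).
print(round(cohens_kappa(40, 5, 5, 50), 2))
print(round(f_measure(80, 10, 10), 2))
```

Kappa needs the `both_no` cell, which exists for documents (a document either contains a concept of interest or does not) but not for free-text spans, where the number of things neither annotator marked is unbounded. That is why F-measure is the fallback at the concept level.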
The note corpus corresponding with the patient encounters selected for this pilot study included 316 notes with 92 unique note titles. We classified note documents into the following categories: primary care associated including new and established patient visits (40%), ancillary services for occupational therapy, nutrition and short addenda (31%), specialty clinic including the gastrointestinal (GI) clinic (15%), emergency department (8%) and peri-procedure related notes (6%). Clinician annotators completed a total of 1,046 annotations related to our specific use case (IBD) that included annotations for concepts indicating signs and symptoms (395, 38%), diagnoses (249, 24%), procedures (239, 23%), and medications (163, 15%). The annotation task took a total of 28 hours with each annotation requiring an average of 50 seconds to identify a concept and associated attributes.
Annotator agreement estimates
Table: Estimated agreement across various levels of analysis. Column headings: Unit of Analysis; Kappa (95% CI). Surviving row labels: Signs and Symptoms; Reason for Service. Surviving values, in source order: 0.61 (0.57, 0.86); 0.83 (0.80, 0.87); 0.63 (0.56, 0.68); 0.82 (0.76, 0.86); 0.72 (0.70, 0.74).
Concept and concept attribute level analysis
Table: Yield of concept classes by document type. Column heading: Annotated Concepts per Document (# concepts). Surviving labels: Signs and Symptoms; Other Specialty Clinic.
In addition, we examined the occurrence of concepts annotated within different sections of the clinical documents. Major note sections where clinicians annotated concepts included assessment, chief complaint, family history, health care maintenance (HCM), history of presenting illness (HPI), medications, past medical history, plan, problem lists, review of systems, and physical examination. Of these sections, assessment contained the largest share of annotated concepts (171, 16.3%), with the HPI section following closely (167, 16.0%). Family history and plan sections contained the fewest annotated concepts, having 1 (0.1%) and 9 (0.9%) concepts respectively.
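The section-level percentages can be reproduced with a short calculation. The sketch below assumes the denominator is the 1,046 total annotations reported earlier, which is consistent with the figures given; the function and variable names are illustrative.

```python
TOTAL_ANNOTATIONS = 1046  # total annotations reported for the corpus

# Counts for the four sections whose numbers are given in the text.
section_counts = {
    "assessment": 171,
    "history of presenting illness": 167,
    "plan": 9,
    "family history": 1,
}


def share_of_total(count, total=TOTAL_ANNOTATIONS):
    """Percentage of all annotations, rounded to one decimal place."""
    return round(100 * count / total, 1)


# List sections from highest to lowest concept yield.
for section, count in sorted(
    section_counts.items(), key=lambda item: item[1], reverse=True
):
    print(f"{section}: {count} concepts ({share_of_total(count)}%)")
```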
Table: Concept classes and note sections by affirmed concept attributes (surviving labels: Granular (clinical inference); Signs and Symptoms).
All annotated medications, and the majority of annotated diagnoses (98%), procedures (87%), and signs and symptoms (65%), were deemed granular at the atomic level (the concept stands on its own). However, none of the identified concepts denoting signs and symptoms were considered granular enough at the level of clinical inference for IBD. In contrast, clinician reviewers determined that most annotated medications (82%) and diagnoses (77%) were granular at the clinical inference level. Over 95% of annotated concepts were considered relevant to IBD because the notes were drawn from encounters of patients known to have IBD.
Table: Distribution of contextual attributes by concept classes (surviving labels: Signs and Symptoms; Reason for service; 1st degree relative; 2nd degree relative).
We have identified specific challenges and opportunities posed by highly templated clinical note documents including identifying note types or sections that will provide the highest concept yield, and adequately training NLP systems to accurately process templated note sections. "Unchecked" boxes in checklists also pose a dilemma for clinical inferencing. Depending on the clinical question, resources could be directed to process and review those note types with the highest expected yield. Moreover, other types of information could certainly be extracted from clinical narratives besides those in our annotation schema. Also algorithmic approaches could be developed and applied to identify specific note sections and templated note structures. There may also be opportunities to code section headings and template types using the UMLS or a terminology such as SNOMED-CT that allows coordination of concepts. Note sections could also be extracted in a standardized format using the HL7 CDA model.
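One simple algorithmic approach to identifying note sections is pattern matching on known section headings. The sketch below is only an illustration of that idea, not a method used in this study: it assumes headings appear at the start of a line and end with a colon, and the heading list is taken from the note sections named in the results. Real VA notes vary far more in formatting than this pattern allows.

```python
import re

# Section names drawn from the note sections listed in the results.
SECTION_NAMES = [
    "assessment", "chief complaint", "family history",
    "history of presenting illness", "medications",
    "past medical history", "plan", "problem list",
    "review of systems", "physical examination",
]

# Assumed layout: heading at the start of a line, followed by a colon.
HEADING = re.compile(
    r"^\s*(" + "|".join(re.escape(name) for name in SECTION_NAMES) + r")\s*:",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(note_text):
    """Map each recognized heading to the list of text bodies under it."""
    sections = {}
    matches = list(HEADING.finditer(note_text))
    for i, match in enumerate(matches):
        # A section runs until the next recognized heading or end of note.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note_text)
        name = match.group(1).lower()
        body = note_text[match.end():end].strip()
        sections.setdefault(name, []).append(body)
    return sections


# Synthetic note text, invented for the example.
note = (
    "CHIEF COMPLAINT: abdominal pain\n"
    "History of Presenting Illness: flare of Crohn's disease\n"
    "ASSESSMENT: ulcerative colitis, stable\n"
    "Plan: continue current medication\n"
)
for name, bodies in split_sections(note).items():
    print(name, "->", bodies)
```

Restricting IE to the sections returned for high-yield headings such as assessment and HPI is one way the review burden described above might be reduced.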
Our results and conclusions are drawn from data representing an example of only one chronic disease. We purposefully selected documents from patients known to have IBD and did not review documents for patients not known to have IBD. We arrived at a rules-based consensus set that was derived by looking at a subset of note documents containing the highest number of concepts. This was a practical approach considering the duration of time required for clinician annotators to individually annotate the full corpus of 316 documents.
There is also an implied need to add a measure of uncertainty to our annotation schema since agreement was low at the concept attribute level. Additionally, annotators need rigorous discussion, both before and during annotation tasks, of the lexicon used and of shared interpretations and definitions of how concept attributes are to be applied [11, 19, 21]. It became evident that clinicians over the course of the annotation task used an evolving understanding of our annotation schema and developed internal definitions that may have drifted over time. We could not quantify this drift given our study design and data from the resulting annotated corpus.
The results of this pilot study will inform further work at the VA, where major efforts are underway to build annotated corpora and apply NLP methods to large data repositories. We provide an example of a fairly complex annotation schema applied to highly templated note documents. When confronted with a large data repository of electronic clinical documents, it is likely that it is only necessary to apply IE tools on certain note types and/or note sections to identify phenotypic information useful for translational research. However, defining specific information to be annotated depends on the clinical questions asked and at what level one wishes to extract information from clinical text.
These methods could be expanded to further enhance medical terminologies with the goal of building ontologic representations and knowledge bases for specific medical domains. Active learning methods could also be applied to combine the tasks of expert human annotation and training of NLP systems. Finally, we propose that the CDA could be used to identify specific note types and sections to reduce the burden of searching notes for relevant clinical question dependent information.
This study was supported using resources and facilities at the VA Salt Lake City Health Care System, the Consortium for Healthcare Informatics Research (CHIR), VA HSR HIR 08-374, and the CDC Utah Center of Excellence in Public Health Informatics 1 PO1 CD000284-01. The authors also wish to thank Stephane Meystre and Charlene Weir for their helpful comments on revisions to this manuscript.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 9, 2009: Proceedings of the 2009 AMIA Summit on Translational Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S9.
1. Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF: Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform 2008, 128–44.
2. Brown S, Elkin P, Rosenbloom ST, Fielstein EM, Speroff T: eQuality for All: Extending Automated Quality Measurement of Free Text Clinical Narratives. AMIA 2009, in press.
3. Fielstein EM, Brown SH, McBrine CS, Clark TK, Hardenbrook SP, Speroff T: The effect of standardized, computer-guided templates on quality of VA disability exams. AMIA Annu Symp Proc 2006, 249–53.
4. Penz JF, Wilcox AB, Hurdle JF: Automated identification of adverse events related to central venous catheters. J Biomed Inform 2007, 174–82. doi:10.1016/j.jbi.2006.06.003
5. Weir CR, Hurdle JF, Felgar MA, Hoffman JM, Roth B, Nebeker JR: Direct text entry in electronic progress notes. An evaluation of input errors. Methods Inf Med 2003, 42(1):61–7.
6. Brown SH, Lincoln M, Hardenbrook S, Petukhova ON, Rosenbloom ST, Carpenter P, et al.: Derivation and evaluation of a document-naming nomenclature. J Am Med Inform Assoc 2001, 8(4):379–90.
7. Brown SH, Lincoln MJ, Groen PJ, Kolodner RM: VistA – U.S. Department of Veterans Affairs national-scale HIS. Int J Med Inf 2003, 69(2–3):135–56. doi:10.1016/S1386-5056(02)00131-4
8. Dolin RH, Alschuler L, Boyer S, Beebe C, Behlen FM, Biron PV, et al.: HL7 Clinical Document Architecture, Release 2. J Am Med Inform Assoc 2006, 13(1):30–9. doi:10.1197/jamia.M1888
9. Gundlapalli AV, South B, Phansalkar S, Kinney A, Shen S, Delisle S, et al.: Application of Natural Language Processing to VA Electronic Health Records to Identify Phenotypic Characteristics for Clinical and Research Purposes. Proc AMIA Trans Bioinf 2008, 836–40.
10. Musen MA, Gennari JH, Eriksson H, Tu SW, Puerta AR: PROTEGE-II: computer support for development of intelligent systems from libraries of components. Medinfo 1995, 8(Pt 1):766–70.
11. Ogren PV, Savova G, Buntrock JD, Chute CG: Building and evaluating annotated corpora for medical NLP systems. AMIA Annu Symp Proc 2006, 1050.
12. Chapman W, Chu D, Dowling JN: ConText: An Algorithm for Identifying Contextual Features from Clinical Text. BioNLP 2007: Biological, translational, and clinical language processing. Prague, CZ 2007.
13. Kashyap V, Turchin A, Morin L, Chang F, Li Q, Hongsermeier T: Creation of structured documentation templates using Natural Language Processing techniques. AMIA Annu Symp Proc 2006, 977.
14. Tange HJ, Schouten HC, Kester AD, Hasman A: The granularity of medical narratives and its effect on the speed and completeness of information retrieval. J Am Med Inform Assoc 1998, 5(6):571–82.
15. Smith A, editor: Information retrieval in medicine: The electronic medical record as a new domain. 69th Annual Meeting of the American Society of Information Science and Technology (ASIST); Austin, Texas 2006.
16. SNOMED CT User Guide – July 2008 International Release Journal [serial on the Internet] 2008. [http://www.ihtsdo.org/fileadmin/user_upload/Docs_01/SNOMED_CT_Publications/SNOMED_CT_User_Guide_20080731.pdf].
17. Cimino JJ, Zhu X: The practical impact of ontologies on biomedical informatics. Yb Med Inform 2006, 124–35.
18. Hripcsak G, Wilcox A: Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance. J Am Med Inform Assoc 2002, 9(1):1–15. doi:10.1197/jamia.M1217
19. Hripcsak G, Heitjan DF: Measuring agreement in medical informatics reliability studies. J Biomed Inform 2002, 35(2):99–110. doi:10.1016/S1532-0464(02)00500-2
20. Roberts A, Gaizauskas R, Hepple M, et al.: The CLEF corpus: semantic annotation of clinical text. AMIA Annu Symp Proc 2007, 625–9.
21. Chapman WW, Dowling JN, Hripcsak G: Evaluation of training with an annotation schema for manual annotation of clinical conditions from emergency department reports. Int J Med Inform 2008, 77(2):107–13. doi:10.1016/j.ijmedinf.2007.01.002
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.