Automatic classification of sentences to support Evidence Based Medicine

Kim, Su Nam; Martinez, David; Cavedon, Lawrence; Yencken, Lars

doi:10.1186/1471-2105-12-S2-S5

Volume 12 Supplement 2

Fourth International Workshop on Data and Text Mining in Biomedical Informatics (DTMBio) 2010

Proceedings
Open access
Published: 29 March 2011

BMC Bioinformatics volume 12, Article number: S5 (2011)

Abstract

Aim

Given a set of pre-defined medical categories used in Evidence Based Medicine, we aim to automatically annotate sentences in medical abstracts with these labels.

Method

We constructed a corpus of 1,000 medical abstracts annotated by hand with specified medical categories (e.g. Intervention, Outcome). We explored the use of various features based on lexical, semantic, structural, and sequential information in the data, using Conditional Random Fields (CRF) for classification.

Results

For the classification tasks over all labels, our systems achieved micro-averaged f-scores of 80.9% and 66.9% over datasets of structured and unstructured abstracts respectively, using sequential features. In labeling only the key sentences, our systems produced f-scores of 89.3% and 74.0% over structured and unstructured abstracts respectively, using the same sequential features. The results over an external dataset were lower (f-scores of 63.1% for all labels, and 83.8% for key sentences).

Conclusions

Of the features we used, the best for classifying any given sentence in an abstract were based on unigrams, section headings, and sequential information from preceding sentences. These features resulted in improved performance over a simple bag-of-words approach, and outperformed feature sets used in previous work.

Introduction

Evidence Based Medicine (EBM) is an approach to clinical practice whereby medical decisions are informed by primary evidence, such as the results of randomized controlled trials (RCTs). Evidence-based practice requires efficient access to such evidence, as well as retrieval and analysis of documents relevant to a specified clinical topic. Evidence-based practitioners use specific criteria when judging whether an RCT is relevant to a given question. They generally follow the PICO criteria [1]: Population (P) (i.e., participants in a study); Intervention (I); Comparison (C) (if appropriate); and Outcome (O) (of an Intervention). Variations and extensions of this classification have been proposed, such as the PECODR tagset [2]. To better serve the information needs of the EBM community, we explore the use of classification techniques to identify relevant key sentences in a given document, and classify these against specified medical criteria. Such information could be leveraged in various ways: e.g., to improve search performance; to enable structured querying with specific categories; and to help users make judgements against specified PICO criteria more quickly.

In this paper, we build a classifier that performs two tasks. First, it identifies the key sentences in an abstract, filtering out those that do not provide the most relevant information. Second, it classifies sentences according to medical tags (based on the PICO criteria) used by our medical research partners. We project these two tasks into an (N+1)-way classification task, with N semantic labels for key sentences and 1 label (i.e. Other) for labeling non-key sentences. For this purpose, we have built a corpus of 1,000 medical abstracts, hand-annotated at the sentence level by domain experts, which we use to develop and evaluate our system.
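
The (N+1)-way formulation can be illustrated with a minimal Python sketch. It assumes the five key-sentence labels listed in the Annotation section (so N = 5) plus Other; the names `KEY_LABELS`, `to_targets`, and `is_key_sentence` are hypothetical and are not taken from the authors' system.

```python
# Hypothetical sketch of the (N+1)-way label scheme: N semantic labels for
# key sentences plus a single Other label for non-key sentences.
KEY_LABELS = ["Background", "Population", "Intervention", "Outcome", "Study Design"]
OTHER = "Other"
ALL_LABELS = KEY_LABELS + [OTHER]  # N + 1 classes in total


def to_targets(annotated):
    """Map the set of key labels annotated on a sentence to classifier targets.

    A sentence with no key label is treated as a non-key sentence and
    receives the single target Other.
    """
    unknown = set(annotated) - set(KEY_LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    picked = [label for label in KEY_LABELS if label in annotated]
    return picked or [OTHER]


def is_key_sentence(targets):
    """A sentence is a key sentence unless its only target is Other."""
    return targets != [OTHER]
```

Under this sketch, key-sentence identification falls out of the same prediction: any sentence whose target is not Other is a key sentence, so no separate filtering classifier is needed.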

A major difference between our approach and previous work is that we combine key-sentence identification and classification, whereas others (e.g., [3, 4]) have separated these tasks and assumed that all sentences are relevant at the classification step. Many sentences in abstracts do not fall into any of the pre-defined categories (due to vagueness, diversion from the central topic, etc.), and the identification of such extraneous material is useful.

Our classification techniques use an extensive set of features, derived from context, semantic relations, structure and sequencing of the text. In particular, for sequence information we use features from previous sentences in the given abstract, and use predicted labels as features in a novel way. We employ Conditional Random Fields (CRF) [5], which are well suited for learning over sequential data (such as cohesive, structured text). In the following sections, we first describe related work. We then provide details of our experimental setup, including the construction of the corpus, the learners, and the features. Finally, we present our results and error analysis, and conclude with a discussion of future directions.
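
As a rough illustration of how features from previous sentences might be attached to each sentence before sequence labeling, consider the following Python sketch. It is an assumption-laden simplification, not the authors' feature extractor: the function name `sentence_features`, the feature-name prefixes, whitespace tokenization, and the `window` parameter are all hypothetical, and the use of predicted labels as features is not shown.

```python
from typing import Dict, List, Optional


def sentence_features(
    sentences: List[str],
    headings: Optional[List[Optional[str]]] = None,
    window: int = 1,
) -> List[Dict[str, float]]:
    """Build one sparse feature dict per sentence of a single abstract.

    Each sentence gets its own unigrams, its section heading (when the
    abstract is structured), and the unigrams of up to `window` preceding
    sentences, prefixed so they stay distinct from the current unigrams.
    """
    all_feats = []
    for i, sent in enumerate(sentences):
        feats: Dict[str, float] = {}
        # Lexical features: lower-cased whitespace tokens (a simplification).
        for tok in sent.lower().split():
            feats["w=" + tok] = 1.0
        # Structural feature: section heading, only for structured abstracts.
        if headings is not None and headings[i] is not None:
            feats["heading=" + headings[i].lower()] = 1.0
        # Sequential features: unigrams copied from preceding sentences.
        for k in range(1, window + 1):
            j = i - k
            if j < 0:
                feats[f"prev{k}=<BOS>"] = 1.0  # marks the start of the abstract
                continue
            for tok in sentences[j].lower().split():
                feats[f"prev{k}_w={tok}"] = 1.0
        all_feats.append(feats)
    return all_feats
```

The resulting list of dicts, one per sentence in abstract order, is the kind of input that a linear-chain CRF toolkit typically accepts for one sequence.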

Related work

The widespread use of PICO and similar schemas by clinicians when performing search, and the performance improvements these schemas have shown in user studies [6], have fueled interest in the development of automatic aids for this task.

Demner-Fushman and Lin [7] were the first to present automatic classifiers for PICO-elements. In their work, they used the MetaMap parser [8], hand-built rules, and statistical classifiers to identify sentences or phrases in abstracts relevant to each PICO element. Only for the element Outcome did they use a supervised classifier (Naive Bayes) with a large set of features, including n-grams, position, and semantic information from MetaMap. They trained this classifier over 275 hand-annotated abstracts, and reported accuracies in the range of 74%-93% depending on the type of abstract and the evaluation threshold. It is important to note that this is the only previous work in the literature that uses the Other tag as we do. Demner-Fushman and Lin [7] also applied their final PICO classifiers to a novel weighting formula for medical information retrieval (IR), significantly improving the baseline for the task. In a related paper [9], the same authors applied PICO classification to the task of clustering medical results, showing that it improved information delivery. The main limitations of their classifiers were the small size of the annotated data, and the reliance on hand-crafted rules for some of the PICO classes. One drawback of their IR system was the use of parameters that were hand-assigned or estimated over a small dataset.

More recently, Chung [4] performed PICO classification by combining rhetorical roles with PICO elements, in order to achieve higher performance and alleviate the hand-annotation cost. Chung uses four rhetorical roles, namely Aim, Method, Results, and Conclusions; she requires that each sentence in an abstract fall into one of these roles. Chung then focuses on categorising one PICO class at a time, for a more fine-grained analysis. There are two limitations to this approach: (i) the overall classification performance across medical tags is not known; and (ii) sentences are forced to always have one semantic tag.

We address these limitations by focusing directly on labels of interest for EBM, allowing sentences to be labeled as Other, and by allowing multiple labels per sentence when required. We believe that this is a more realistic setting than the one presented in previous work, and will provide better insight on the performance we should expect for this kind of task. This makes our approach not directly comparable to [4], but we are able to apply her technique and features and evaluate the performance of her system over medical labels only. We also extend her approach by adding new types of features.

Other work on sentence classification has focused on rhetorical role classification, which aims at identifying the roles of sentences in text (e.g. Motivation, Result, etc.). Training and test data for this task are easy to obtain from structured scientific abstracts, which provide section headings. This approach has been used in many supervised systems [3, 10–13]. With respect to feature representations, previous work has relied mostly on contextual features, such as n-grams and words in specific locations. Heuristics derived from sequential features of abstracts, such as the relative location of sentences and section headings, have recently been explored [3, 4]. In terms of machine learners, well-known techniques have been applied to these tasks, including Naive Bayes (NB) [7], Support Vector Machines (SVM) [4], Hidden Markov Models (HMM) [14], and Conditional Random Fields (CRF) [4]. Also, [12] proposed a probability-based learner inspired by the sequence of abstracts.

Recent work by [15] has shown the difficulty of identifying PICO elements in text, and has proposed a location-based information retrieval weighting strategy, motivated by the distribution of PICO elements. The authors also applied a weighting model based on the PICO information from the query, obtaining significant improvements from both approaches. However, their annotation of PICO tags was based on open text, disregarding sentence boundaries, which led to agreement problems between the annotators. Further, their classifier was built using the section headings of structured abstracts (e.g. Patients, Outcome, etc.) without human supervision, which could introduce noise.

Method

In this section we describe the construction of the corpus, the classifiers and features, and the experimental setting.

Data collection

We extracted 1,000 abstracts from MEDLINE for annotation. Our focus was on medical research, and in order to extract relevant abstracts we used queries from two institutions that develop systematic reviews of the literature: The Global Evidence Mapping Initiative (GEM) [16], and The Agency for Healthcare Research and Quality (AHRQ) [17]. GEM focuses on traumatic brain injury and spinal cord injury, and they provided the results of hand-constructed queries targeting diverse aspects of this subdomain. We randomly extracted 500 abstracts from a list of 74,000 query results for our annotation.

In order to diversify the contents of the corpus, the remaining 500 abstracts were randomly sampled from a set of AHRQ queries covering different medical issues (e.g. “Systematic Review of the Literature Regarding the Diagnosis of Sleep Apnoea”).

Some of the abstracts used in our experiments (376 out of 1,000) are structured, which means that they contain section headings (e.g. Aim, Method, etc.). These headings are helpful in capturing the rhetorical structure of the text, and we use them as features (when available). The abstract of this paper is an example of a structured abstract; see Figure 1 for an example of an unstructured abstract.

Annotation

In order to define our tagset we first adopted the 7-way annotation scheme presented in [18]. (We thank the authors for kindly providing a sample of their data for our work, as well as initial definitions for semantic tags.) After analysing this data, we decided to drop two of their categories (“Statistics” and “Supposition”) because their work showed significant agreement problems on those classes. We also decided to add the category Study Design, based on feedback by medical experts at GEM on the utility of this category. Thus, our annotation categories are as follows:

Background: Material that informs and may place the current study in perspective, e.g. work that preceded the current; information about disease prevalence; etc;
Population: The group of individual persons, objects, or items comprising the study’s sample, or from which the sample was taken for statistical measurement;
Intervention: The act of interfering with a condition to modify it or with a process to change its course (includes prevention);
Outcome: The sentence(s) that best summarizes the consequences of an intervention;
Study Design: The type of study that is described in the abstract;
Other: Any sentence not falling into one of the other categories and presumed to provide little help with clinical decision making, i.e. non-key or irrelevant sentences.

The 1,000 abstracts were annotated by a medical student over 80 hours, with the continuous collaboration of a senior medical expert. Each sentence could be annotated with multiple classes. In order to make annotation easier, we built the “Annotex” tool, which provides an interface to the sentence-segmented corpus. A screenshot of the tool interface is shown in Figure 1.

In order to measure agreement, 60 of the abstracts were blindly annotated by one of the authors, and Cohen’s kappa was calculated. The original annotation was not changed in any case. The averaged score over all classes was 0.62, which indicates “substantial agreement” [19]. The kappa values for the different classes are given in Table 1. The table shows that most classes have good agreement scores, and only Study Design seems problematic. This annotated data is available for further research, and can be obtained by emailing the contact author.
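
For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from each annotator's label frequencies. The Python sketch below computes it for two parallel lists of decisions. The function name `cohens_kappa` and the toy data are illustrative assumptions; applying it once per class to binary "has this label / does not" decisions is one plausible way to obtain per-class values, not necessarily the procedure used here.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' parallel decision sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the chance agreement implied by each annotator's label
    distribution.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: binary decisions for one class over four sentences.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)  # p_o = 0.75, p_e = 0.5
```

In the toy example the annotators agree on three of four sentences, but half of that agreement is expected by chance, giving a kappa of 0.5.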

Table 1 Kappa values per class.
