The past decade has seen great progress in the field of biomedical text mining (BIO-TM). This progress has been stimulated by the rapid publication rate in the biosciences and the need to improve access to the growing body of textual information available via resources such as the National Library of Medicine's PubMed system. In the recent past, considerable work has been conducted in many areas of BIO-TM. Basic domain resources such as biomedical dictionaries, ontologies, and annotated corpora have grown increasingly sophisticated, and a variety of novel techniques have been proposed for the processing, extraction and mining of information from biomedical literature. Current systems range from those capable of named-entity recognition to those dealing with document classification, information extraction, segmentation, and summarization, among many others [2–6].
While much of the early research on BIO-TM concentrated on technical developments (i.e. adapting basic language processing techniques for biomedical language), in recent years there has been increasing interest in users' needs. Studies exploring the TM needs of biomedical researchers have appeared [8–10], along with practical tools for use by scientists [11–14]. However, user-centered studies are still lacking in many areas of research, and further evaluation of existing technology in the context of real-life tasks is needed to determine which tools and techniques are actually useful.
In this article we focus on one active area of BIO-TM research, the textual information structure of scientific documents, and investigate its practical usefulness for a real-life biomedical task. The interest in information structure (also called discourse, rhetorical, argumentative or conceptual structure, depending on the theory or framework in question) stems from the fact that scientific documents tend to be fairly similar in terms of how their information is structured. For example, many documents provide some background information before defining the precise objective of the study in question, and conclusions are typically preceded by a description of the results obtained. Many readers of scientific literature are interested in specific information in certain parts of documents (e.g. in the general background of the study, the methods used in the study, or the results obtained). Accordingly, many BIO-TM tasks have focused on the extraction of information from the relevant parts of documents only. Classification of documents according to the categories of information structure has proved useful for tasks such as question-answering, summarization and information retrieval [16–18].
To date, a number of different schemes have been proposed for (typically) sentence-based classification of scientific literature according to categories of information structure, e.g. [16, 19–25]. The simplest of these schemes merely classify sentences according to section names seen in scientific documents, for example, the Objective, Methods, Results and Conclusions sections appearing frequently (with different variations) in biomedical abstracts [20, 21, 24]. Some other schemes are based on components of scientific argumentation. A well-known example of such a scheme is the Argumentative Zoning (AZ) scheme originally developed by Teufel and Moens, which assumes that the act of writing a scientific paper corresponds to an attempt to claim ownership of a new piece of knowledge. Including categories such as Other, Own, Basis and Contrast, AZ aims to model the argumentative or rhetorical process of convincing the reviewers that the knowledge claim of the document is valid.
Schemes based on the conceptual structure of documents also exist, for example, the recent Core Scientific Concepts (CoreSC) scheme. CoreSC treats scientific documents as humanly readable representations of scientific investigations. It seeks to retrieve the structure of an investigation from the paper in the form of generic high-level concepts such as Hypothesis, Model, and Experiment (among others). Furthermore, schemes aimed at classifying statements made in scientific literature along qualitative dimensions have been proposed. The multi-dimensional classification system of Shatkay et al., developed for the needs of diverse users, classifies sentences (or other fragments of text) according to dimensions such as Focus, Polarity, Certainty, Evidence and Trend.
Different schemes of information structure have been evaluated in terms of inter-annotator agreement, i.e. the extent to which two or more human judges label the same element of text with the same categories. Some of the schemes have been further evaluated in terms of machine learning, i.e. the accuracy with which an automatic classifier trained on human-annotated data assigns text to scheme categories, e.g. [16, 21, 24, 26]. Evaluation in the context of BIO-TM tasks such as question-answering, summarization, and information retrieval has also been conducted [16–18]. These evaluations have produced promising results. However, evaluation in the context of real-life tasks in biomedicine has been lacking, although such evaluation would be important for determining the practical usefulness of the schemes for end-users.
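To make the notion of inter-annotator agreement concrete, the following minimal sketch computes Cohen's kappa for two annotators who label the same sentences. This is an illustration under assumptions: this section does not name the agreement statistic, so kappa is used here only as one widely used chance-corrected measure, and the function name, category labels and annotations are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sentences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of sentences given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of six abstract sentences (illustrative only).
annotator_1 = ["Objective", "Methods", "Methods", "Results", "Results", "Conclusions"]
annotator_2 = ["Objective", "Methods", "Results", "Results", "Results", "Conclusions"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Unlike raw percentage agreement, kappa discounts the agreement that two annotators would reach by chance given how often each uses each category.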
In this paper, we investigate the usefulness of information structure for Cancer Risk Assessment (CRA). Performed manually by human experts (e.g. toxicologists, biologists), this real-life task involves examining scientific evidence in biomedical literature (e.g. that available in the MEDLINE database) to determine the relationship between exposure to a substance and the likelihood of developing cancer from that exposure. The starting point of CRA is a large-scale literature review which focuses, in the first instance, on scientific abstracts published on the chemical in question. Risk assessors read these abstracts, looking for a variety of information in them, ranging from the overall aim of the study to specific methods, experimental details, results and conclusions. This process can be extremely time-consuming, since thorough risk assessment requires considering all the published literature on the chemical in question. A well-studied chemical may well have tens of thousands of abstracts available (e.g. MEDLINE includes over 27,500 articles for cadmium). CRA is therefore an example of a task which might well benefit from annotations according to textual information structure.
Our study focuses on three schemes: one based on section names, AZ, and CoreSC. We examine the applicability of these schemes to biomedical abstracts used for CRA purposes. Since AZ and CoreSC have been developed for full journal articles, our study provides an idea of their applicability to tasks involving abstracts. We describe the annotation of a corpus of CRA abstracts according to the three schemes, and compare the resulting annotations in terms of inter-annotator agreement and the distribution and overlap of scheme categories. Our evaluation shows that for all the schemes, the majority of categories appear in scientific abstracts and can be identified by human annotators with good or moderate agreement (depending on the scheme in question). Interestingly, although the three schemes are based on entirely different principles, our comparison of annotations reveals a clear subsumption relation between them.
We then introduce a machine learning approach capable of automatically classifying sentences in the CRA corpus according to scheme categories. Our results show that the categories of all three schemes can be identified using automatic techniques, with an accuracy of 89%, 90% and 81% for section names, AZ and CoreSC, respectively. This is an encouraging result, particularly considering the fairly small size of the CRA corpus and the challenge it poses for automatic classification.
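As an illustration of what such an accuracy figure measures, the sketch below compares automatically assigned sentence categories with human (gold) annotations. It shows only the scoring step, not the classifier itself. The function names, the section-name style labels and the example data are hypothetical and are not taken from the CRA corpus.

```python
from collections import defaultdict

def accuracy(gold, predicted):
    """Fraction of sentences whose predicted category matches the gold one."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def per_category_recall(gold, predicted):
    """For each gold category, the fraction of its sentences labelled correctly."""
    totals, hits = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        totals[g] += 1
        hits[g] += int(g == p)
    return {category: hits[category] / totals[category] for category in totals}

# Hypothetical gold and predicted labels for eight sentences (illustrative only).
gold_labels = ["Background", "Objective", "Methods", "Methods",
               "Results", "Results", "Results", "Conclusions"]
predicted_labels = ["Background", "Objective", "Methods", "Results",
                    "Results", "Results", "Results", "Conclusions"]

overall = accuracy(gold_labels, predicted_labels)
by_category = per_category_recall(gold_labels, predicted_labels)
print(f"accuracy: {overall:.1%}")
for category, recall in by_category.items():
    print(f"{category}: {recall:.1%}")
```

A per-category breakdown is useful alongside overall accuracy because infrequent categories can be classified poorly without noticeably lowering the overall figure.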
Finally, we introduce a user test, conducted by experts in CRA, which evaluates the usefulness of the different schemes for real-life CRA. This test focuses on two schemes: the coarse-grained scheme based on section names and the finest-grained CoreSC scheme. It evaluates whether risk assessors find relevant information in literature faster when presented with unannotated abstracts or with abstracts annotated (manually or automatically) according to one of the schemes. The results of this test are promising: both schemes lead to significant savings in risk assessors' time. Although manually annotated abstracts yield the biggest savings in time (16-46%, compared with the time it takes to locate information in unannotated abstracts), considerable savings are also obtained with automatically annotated abstracts (11-33%). Interestingly, although CoreSC helps to save more time than section names, the difference between the two schemes is so small that it is not statistically significant.
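The time savings above are expressed relative to the time needed with unannotated abstracts. A minimal sketch of that arithmetic follows, assuming the saving is computed as the reduction in time divided by the baseline time. The function name and the timings are hypothetical and are not measurements from the user test.

```python
def relative_saving(baseline_seconds, annotated_seconds):
    """Percentage of time saved relative to reading unannotated abstracts."""
    if baseline_seconds <= 0:
        raise ValueError("baseline time must be positive")
    return 100.0 * (baseline_seconds - annotated_seconds) / baseline_seconds

# Hypothetical average times per abstract in seconds (illustrative only).
baseline = 60.0    # unannotated abstracts
manual = 45.0      # manually annotated abstracts
automatic = 51.0   # automatically annotated abstracts

saving_manual = relative_saving(baseline, manual)
saving_automatic = relative_saving(baseline, automatic)
print(f"manual: {saving_manual:.0f}%, automatic: {saving_automatic:.0f}%")
```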
In sum, our work shows that existing schemes aimed at capturing information structure can be applied to biomedical abstracts relatively straightforwardly and identified automatically with an accuracy which is high enough to benefit a real-life task.
The rest of this paper is organized as follows. The Methods section introduces the CRA corpus, the annotation tool, and the annotation guidelines, together with the automatic classification methods and the methods of direct and user-based evaluation. The Results section first describes the annotated corpus. It then reports the results of the inter-annotator agreement tests, the comparison of the schemes in annotated data, the automatic classification experiments, and the user test. The Discussion and Conclusions section concludes the paper with a comparison to related research and directions for future work.