A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment

Background Many practical tasks in biomedicine require accessing specific types of information in scientific literature; e.g. information about the results or conclusions of the study in question. Several schemes have been developed to characterize such information in scientific journal articles. For example, a simple section-based scheme assigns individual sentences in abstracts under sections such as Objective, Methods, Results and Conclusions. Some schemes of textual information structure have proved useful for biomedical text mining (BIO-TM) tasks (e.g. automatic summarization). However, user-centered evaluation in the context of real-life tasks has been lacking. Methods We take three schemes of different type and granularity - those based on section names, Argumentative Zones (AZ) and Core Scientific Concepts (CoreSC) - and evaluate their usefulness for a real-life task which focuses on biomedical abstracts: Cancer Risk Assessment (CRA). We annotate a corpus of CRA abstracts according to each scheme, develop classifiers for automatic identification of the schemes in abstracts, and evaluate both the manual and automatic classifications directly as well as in the context of CRA. Results Our results show that for each scheme, the majority of categories appear in abstracts, although two of the schemes (AZ and CoreSC) were developed originally for full journal articles. All the schemes can be identified in abstracts relatively reliably using machine learning. Moreover, when cancer risk assessors are presented with scheme annotated abstracts, they find relevant information significantly faster than when presented with unannotated abstracts, even when the annotations are produced using an automatic classifier. Interestingly, in this user-based evaluation the coarse-grained scheme based on section names proved nearly as useful for CRA as the finest-grained CoreSC scheme. Conclusions We have shown that existing schemes aimed at capturing information structure of scientific documents can be applied to biomedical abstracts and can be identified in them automatically with an accuracy which is high enough to benefit a real-life task in biomedicine.


Background
The past decade has seen great progress in the field of biomedical text mining (BIO-TM). This progress has been stimulated by the rapid publication rate in biosciences and the need to improve access to the growing body of textual information available via resources such as the National Library of Medicine's PubMed system [1]. In recent past, considerable work has been conducted in many areas of BIO-TM. Basic domain resources such as biomedical dictionaries, ontologies, and annotated corpora have grown increasingly sophisticated, and a variety of novel techniques have been proposed for the processing, extraction and mining of information from biomedical literature. Current systems range from those capable of named-entity recognition to those dealing with e.g. document classification, information extraction, segmentation, and summarization, among many others [2][3][4][5][6].
While much of the early research on BIO-TM concentrated on technical developments (i.e. adapting basic language processing techniques for biomedical language), in recent years, there has been an increasing interest in users' needs [7]. Studies exploring the TM needs of biomedical researchers have appeared [8][9][10], along with practical tools for the use of scientists [11][12][13][14]. However, user-centered studies are still lacking in many areas of research and further evaluation of existing technology in the context of real-life tasks is needed to determine which tools and techniques are actually useful [15].
In this article we will focus on one active area of BIO-TM researchtextual information structure of scientific documents -and will investigate its practical usefulness for a real-life biomedical task. The interest in information structure (also called discourse, rhetorical, argumentative or conceptual structure, depending on the theory or framework in question) stems from the fact that scientific documents tend to be fairly similar in terms of how their information is structured. For example, many documents provide some background information before defining the precise objective of the study in question, and conclusions are typically preceded by a description of the results obtained. Many readers of scientific literature are interested in specific information in certain parts of documents, e.g. in the general background of the study, the methods used in the study, or the results obtained). Accordingly, many BIO-TM tasks have focused on the extraction of information from the relevant parts of documents only. Classification of documents according to the categories of information structure has proved useful e.g. for question-answering, summarization and information retrieval [16][17][18].
To date, a number of different schemes have been proposed for (typically) sentence-based classification of scientific literature according to categories of information structure, e.g. [16,[19][20][21][22][23][24][25]. The simplest of these schemes merely classify sentences according to section names seen in scientific documents, for example, the Objective, Methods, Results and Conclusions sections appearing frequently (with different variations) in biomedical abstracts [20,21,24]. Some other schemes are based on components of scientific argumentation. A well-known example of such a scheme is the Argumentative Zoning (AZ) scheme originally developed by Teufel and Moens [16] which assumes that the act of writing a scientific paper corresponds to an attempt of claiming ownership for a new piece of knowledge. Including categories such as Other, Own, Basis and Contrast, AZ aims to model the argumentative or rhetorical process of convincing the reviewers that the knowledge claim of the document is valid.
Also schemes based on conceptual structure of documents exist -for example, the recent Core Scientific Concepts (CoreSC) scheme [25]. CoreSC treats scientific documents as humanly readable representations of scientific investigations. It seeks to retrieve the structure of an investigation from the paper in the form of generic high-level concepts such as Hypothesis, Model, and Experiment (among others). Furthermore, schemes aimed at classifying statements made in scientific literature along qualitative dimensions have been proposed. The multi-dimensional classification system of Shatkay et al. [23], developed for the needs of diverse users, classifies sentences (or other fragments of text) according to dimensions such as Focus, Polarity, Certainty, Evidence and Trend.
Different schemes of information structure have been evaluated in terms of inter-annotator agreement, i.e. the agreement with which two or several human judges label the same element of text with the same categories. Some of the schemes have been further evaluated in terms of machine learning: the accuracy with which an automatic classifier trained on human-annotated data is capable of assigning text to scheme categories, e.g. [16,21,24,26]. Also evaluation in the context of BIO-TM tasks such as question-answering, summarization, and information retrieval has been conducted [16][17][18]. These evaluations have produced promising results. However, evaluation in the context of real-life tasks in biomedicine has been lacking, although such evaluation would be important for determining the practical usefulness of the schemes for end-users.
In this paper, we will investigate the usefulness of information structure for Cancer Risk Assessment (CRA). Performed manually by human experts (e.g. toxicologists, biologists), this real-life task involves examining scientific evidence in biomedical literature (e.g. that available in the MEDLINE database [27]) to determine the relationship between exposure to a substance and the likelihood of developing cancer from that exposure [28]. The starting point of CRA is a large-scale literature review which focuses, at the first instance, on scientific abstracts published on the chemical in question. Risk assessors read these abstracts, looking for a variety of information in them, ranging from the overall aim of the study to specific methods, experimental details, results and conclusions [29]. This process can be extremely time consuming since thorough risk assessment requires considering all the published literature on a chemical in question. A well-studied chemical may well have tens of thousands of abstracts available (e.g. MED-LINE includes over 27,500 articles for cadmium). CRA is therefore an example of a task which might well benefit from annotations according to textual information structure.
Our study focuses on three different schemes: those based on section names, AZ and CoreSC, respectively. We examine the applicability of these schemes to biomedical abstracts used for CRA purposes. Since AZ and CoreSC have been developed for full journal articles, our study provides an idea of their applicability to tasks involving abstracts. We describe the annotation of a corpus of CRA abstracts according to the three schemes, and compare the resulting annotations in terms of inter-annotator agreement and the distribution and overlap of scheme categories. Our evaluation shows that for all the schemes, the majority of categories appear in scientific abstracts and can be identified by human annotators with good or moderate agreement (depending on the scheme in question). Interestingly, although the three schemes are based on entirely different principles, our comparison of annotations reveals a clear subsumption relation between them.
We introduce then a machine learning approach capable of automatically classifying sentences in the CRA corpus according to scheme categories. Our results show that all the schemes can be identified using automatic techniques, with the accuracy of 89%, 90% and 81% for section names, AZ and CoreSC, respectively. This is an encouraging result, particularly considering the fairly small size of the CRA corpus and the challenge it poses for automatic classification.
Finally, we introduce a user test -conducted by experts in CRA -which evaluates the usefulness of the different schemes for real-life CRA. This test focuses on two schemes: the coarse-grained scheme based on section names and the finest-grained CoreSC scheme. It evaluates whether risk assessors find relevant information in literature faster when presented with unannotated abstracts or abstracts annotated (manually or automatically) according to one of the schemes. The results of this test are promising: both schemes lead into significant savings in risk assessors' time. Although manually annotated abstracts yield biggest savings in time (16-46%, compared with the time it takes to locate information in unannotated abstracts), considerable savings are also obtained with automatically annotated abstracts (11-33% in time). Interestingly, although Cor-eSC helps to save more time than section names, the difference between the two schemes is so small that it is not statistically significant.
In sum, our work shows that existing schemes aimed at capturing information structure can be applied to biomedical abstracts relatively straightforwardly and identified automatically with an accuracy which is high enough to benefit a real-life task.
The rest of this paper is organized as follows: The Methods section introduces the CRA corpus, the annotation tool, and the annotation guidelines, together with the automatic classification methods and the methods of direct and user-based evaluation. The Results section describes first the annotated corpus. The results of the inter-annotator agreement tests, comparison of the schemes in annotated data, the automatic classification experiments, and the user-test are then reported. The Discussion and Conclusions section concludes the paper with comparison to related research and directions for future work.

Methods
The three schemes Full journal articles are more complex and richer in information than abstracts [30]. As a distilled summary of key information in full articles, abstracts may exhibit an entirely different distribution of scheme categories than full articles. For the practical tasks involving abstracts, it would be useful to know which of the existing schemes are applicable to abstracts and which of them can be identified in them automatically with sufficient accuracy. We chose three different schemes for our investigation -those based on section names, Argumentative Zones, and Core Scientific Concepts: • Section Names -S1: The first scheme differs from the other two in the sense that it is actually developed for abstracts. It is based on section names found in some scientific abstracts. We use the 4-way classification from [21] where abstracts are divided into Objective, Method, Results and Conclusions. Hirohata et al. show that this 4-way classification is the most frequently used classification in MEDLINE abstracts [27]. They also provide a mapping of the four section names to their synonymous names appearing in MEDLINE. Table 1 provides a short description of each category and its abbreviation (for this and for other schemes). For example, the Objective category (OBJ) of this scheme aims to capture the background and the aim of the research described in abstracts.
• Argumentative Zoning -S2: The second scheme is based on Argumentative Zoning (AZ) of documents. AZ provides an analysis of the rhetorical progression of the scientific argument. It follows the knowledge claims made by authors. Teufel and Moens [16] introduced AZ and applied it first to computational linguistics papers. Mizuta et al. [19] modified the scheme for biology papers. More recently, Teufel et al. [22] introduced a refined version of AZ and applied it to chemistry papers. As the recent refined version of AZ is too fine-grained for abstracts (many of the categories do not appear in abstracts at all) and is not directly applicable to biomedical texts (the annotation guidelines need to be supplemented with domain-specific terminology and rules for individual categories -see [22] for details), we adopt the earlier version of AZ developed for biology papers [19]. From the ten categories of Mizuta et al., we select seven which (according to our pilot investigation) actually appear in abstracts: those shown in Table 1. Note that we have re-named some of the original category names, mostly for improved annotation accuracy.
• Core Scientific Concepts -S3: The third scheme is the recent concept-driven and ontology-motivated scheme of Liakata et al. [25]. This scheme views papers as written representations of scientific investigations and aims to uncover the structure of the investigation as Core Scientific Concepts (CoreSC). Like AZ, CoreSC has been previously applied to chemistry papers [25,31]. The CoreSC is a 3-layer annotation scheme but we only consider the first layer in the current work. The second layer pertains to properties of the categories (e.g. "advantage" vs. "disadvantage" of METH, "new" vs. "old" METH). Such level of granularity is rare in abstracts. The 3rd layer involves co-reference identification between the same instances of each category, which is also not of concern in abstracts. We adopt the eleven categories in the first layer of CoreSC (shown in Table 1). S3 is thus the most fine-grained of our schemes.

Data: cancer risk assessment abstracts
We used as our data the corpus of CRA abstracts described in [29]. It contains MEDLINE abstracts from 15 biomedical journals (e.g. Carcinogenesis, Chemicobiological Interaction, Environmental Health Perspectives, Mutation Research, Toxicological Sciences) which are used frequently for CRA purposes and which jointly provide a good coverage of the main types of scientific evidence relevant for the task. From these 15 journals, all the abstracts from years 1998 to 2008 which include one of the following eight chemicals were included: 1,3-Butadiene, Benzo(a)pyrene, Chloroform, Diethylnitrosamine, Diethylstilbestrol, Fumonisin B1, Phenobarbital, and Styrene. These chemicals were selected by CRA experts on the basis that they (i) are well-researched using a range of scientific tests (human, animal and cellular) and (ii) represent the two most frequent Mode of Action types (MOAs): genotoxic and non-genotoxic. A MOA is an important concept in CRA: it determines the key events leading to cancer formation. Chemicals acting by a genotoxic MOA induce cancer by interacting with DNA, while chemicals acting by a nongenotoxic MOA induce cancer without interfering directly with DNA (see [29] for details). We selected 1000 abstracts

Annotation of abstracts Annotation guidelines
We annotated the 1000-abstract version of the CRA corpus according to each of the schemes. We used the annotation guidelines of Liakata for S3 [unpublished data by Maria Liakata] and developed the guidelines for S1 and S2 ourselves. The new guidelines were developed via trial annotations and discussions. They provide a generic description of each scheme and its purpose, define the unit of annotation (a sentence), introduce all the scheme categories, provide advice for conflict resolution (e.g. which categories to prefer when two or several are possible within the same sentence), and include examples of annotated abstracts. Each guideline is 15 pages long. We have made them available at http:// www.cl.cam.ac.uk/~yg244/10crab.html.

Annotation tool
We used the annotation tool of Korhonen et al. [29] for corpus annotation. This tool was originally developed for the annotation CRA abstracts according to the scientific evidence they contain. We modified it so that it could be used to annotate abstracts according to our schemes. It works as a Firefox plug-in. Using this tool, experts can open each MEDLINE abstract assigned to them, assign a scheme category to each sentence by highlighting it and selecting the appropriate category from a menu with a single mouse click. Highlighted sentences are displayed using colors which correspond to the different scheme categories as defined in the annotation guidelines. A screen-shot illustrating the annotation tool is provided in Figure 1. The figure shows an example abstract annotated according to each of the three schemes.

Description of annotation
Using the guidelines and the tool, the CRA corpus was annotated according to each of the schemes. Previous related annotation efforts have varied in terms of the expertise required from annotators. For example, Mizuta et al. [19] used a single annotator to annotate full biology articles according to S2. This person was a PhD level linguist with no training in biology. In contrast, Liakata et al. [25] used domain experts (mainly PhD students in chemistry) to annotate chemistry papers according to S3. Teufel et al. [22], in turn, used a mixed group of three annotators to annotate chemistry papers according to their recent refined AZ scheme: a PhD level computational linguist, a chemist, and a computational linguist with some experience in chemistry. We used a single annotator (a PhD student in computational linguistics with no training in biomedical sciences) to annotate the whole CRA corpus. However, following Teufel et al., we measured inter-annotator agreement between annotators who have different expertises: the computational linguist, one domain expert (a PhD level toxicologist who is also a CRA expert) and one PhD level linguist with no training in biomedicine. The inter-annotator agreement was measured on a subset of the corpus as described later in Results section.
The annotation proceeded scheme by scheme, independently, so that annotations of one scheme were not based on any of the other two. The annotation started from the coarse-grained S1, then proceeding to S2 and finally to the finest-grained S3. The inter-annotator agreement was measured using Cohen's Kappa [32], which is the portion of agreement (P (a)) corrected for chance (P(e)):

Comparison of annotations
The three schemes we investigate were developed independently and have separate guidelines. Thus, even though they seem to have some categories in common (e.g. METH, RES, CON) this does not necessarily guarantee that the categories cover the same information across all three schemes. We therefore investigated the relation between the schemes and the degree of overlap or complementarity between them. We created contingency tables and calculated the chisquared Pearson statistic, the chi-squared likelihood ratio, the contingency coefficient and Cramer's V for pairwise comparison of schemes. However, since none of these measures give an indication of the differential association between schemes (i.e. whether it goes both directions and to what extent) we also calculated the Goodman-Kruskal lambda L statistic [33]. This gives us the reduction in error for predicting the categories of one annotation scheme, if we know the categories assigned according to the other.
In addition we examined the correspondence between the actual categories of the three schemes using the paradigm of Kang et al. [34]. Kang et al. discuss a framework for subsumption checking between classes in different ontologies. They argue that if for the set of mutual instances between two classes, instances of one consistently belong to the other, we can assume that a subsumption relation holds. They suggest setting a fault tolerance threshold to cater for erroneous annotations: where T k is the threshold, w(a i ), w(b j ) are the weights for instances and X, Y are the classes. We set w(a i ) = w (b j ) = 1, and set T k = 0.1 as a reasonable threshold, allowing at most 10% of instances to be allocated to other categories for the subsumption to hold.

Automatic identification of information structure
Use of information structure in real-life biomedical applications requires a method capable of automatically assigning sentences in documents to appropriate scheme categories. To find out whether our schemes are machine learnable in the CRA abstract corpus, we conducted a series of classification experiments. These experiments involved extracting a range of linguistic features from each sentence in our corpus and given these features and the scheme labels in the annotated corpus, using supervised machine learning to automatically assign each sentence to the most likely category (e.g. BKG, METH, RES) of the scheme in question. Previous works in this research area have used standard text classification features (ranging from bags of words to more sophisticated features such as grammatical relations in sentences) and various well-known classifiers such as Naive Bayes [16], Support Vector Machines [26], Maximum Entropy [35], Hidden Markov Models [20] and Conditional Random Fields [21]. We used for our experiment mainly features and classifiers which have proved successful in previous works. These will be described in detail in the subsequent sections.

Features
The first step in automatic classification is to select features for classification. We chose a number of general purpose features suitable for all the three schemes. With the exception of our novel verb class feature, these features are similar to those employed in related works, e. g. [16,20,21,26,35]: • History. There are typical patterns in the information structure so that certain categories tend to appear before others. For example, RES tends to be followed by CON rather than by BKG. Therefore, we used the category assigned to the previous sentence as a feature.
• Location. Categories tend to appear in typical positions in a document, e.g. BKG occurs often in the beginning and CON at the end of the abstract. We divided each abstract into ten equal parts (1-10), measured by the number of words, and defined the location (of a sentence) feature by the parts where the sentence begins and ends.
• Word. Like many text classification tasks, we employed all the words in the corpus as features.
• Bi-gram. We considered each bi-gram (combination of two adjacent word features) as a feature.
• Verb. Verbs are central to the meaning of sentences, and can vary from one category to another. For example, experiment is frequent in METH and conclude in CON. Previous works have used the matrix verb of each sentence as a feature. Because the matrix verb is not the only meaningful verb, we used all the verbs instead.
• Verb Class. Because individual verbs can result in sparse data problems, we also experimented with a novel feature: a lexical-semantic verb class (e.g. the class of EXPERIMENT verbs for verbs such as measure and inject). We obtained 60 classes by clustering verbs appearing in full cancer risk assessment articles using the approach of Sun and Korhonen [36].
• Part-of-Speech -POS. Tense tends to vary from one category to another, e.g. past is common in RES and past participle in CON. We used the part-ofspeech (POS) tag of each verb assigned by the C&C tagger [37] as a feature.
• Grammatical Relation -GR. Structural information about heads and dependents has proved useful in text classification. We used grammatical relations (GRs) returned by a parser as features. They consist of a named relation, a head and a dependent, and possibly extra parameters depending on the relation involved, e.g. (dobj investigate mouse). We created features for each subject (ncsubj), direct object (dobj), indirect object (iobj) and second object (obj2) relation in the corpus. • Subj and Obj. As some GR features may suffer from data sparsity, we collected all the subjects and objects (appearing with any verbs) from GRs and used them as features. The value of such a subject (or object) feature equals 1 if it occurs in a particular sentence (and 0 if it does not occur in the sentence).
• Voice. There may be a correspondence between the active and passive voice and categories (e.g. passive is frequent in METH). We therefore used voice as a feature.

Pre-processing and feature extraction
We developed a tokenizer to detect the boundaries of sentences and to perform basic tokenization, such as separating punctuation from adjacent words e.g. in tricky biomedical terms such as 2-amino-3,8-diethylimidazo[4,5-f]quinoxaline. We used the C&C tools [37] adapted to biomedical literature for POS tagging, lemmatization and parsing. The lemma output was used for extracting Word, Bi-gram and Verb features. The parser produced GRs for each sentence from which we extracted the GR, Subj, Obj and Voice features. We only considered the GRs relating to verbs. The "obj" marker in a subject relation indicates a verb in passive voice (e. g. (ncsubj observed_14 differenc_5 obj)). To control the number of features we removed the words and GRs with fewer than 2 occurrences and bi-grams with fewer than 5 occurrences, and lemmatized the lexical items for all the features.

Classifiers
We used Naive Bayes (NB), Support Vector Machines (SVM), and Conditional Random Fields (CRF) for classification. These methods have been used to discover information structure in previous related works, e.g. [16,21,26]. NB is a simple and fast method, and SVM and CRF have been used successfully in a wide range of text classification tasks.
NB applies Bayes' rule and Maximum Likelihood Estimation with strong independence assumptions. It aims to select the class c with maximum probability given the feature set F: arg max c P(c|F) SVM constructs hyperplanes in a multidimensional space that separates data points of different classes. Good separation is achieved by the hyperplane that has the largest distance from the nearest data points of any class. The hyperplane has the form w · x -b = 0, where w is the normal vector to the hyperplane. We want to maximize the distance from the hyperplane to the data points, or the distance between two parallel hyperplanes each of which separates the data. The parallel hyperplanes can be written as: w · x -b = 1 and w · x -b = 1, and the distance between the two is 2 |w| . The problem reduces to: Minimize |w| Subject to w · x ib ≥ 1 for x i of one class, and w · x ib ≤ -1 for x i of the other.
CRF is an undirected graphical model which defines a distribution over the hidden states (e.g. label sequences) given the observations. The probability of a label sequence y given an observation sequence x can be written as: where F j (y, x, i) is a real-valued feature function of the states, observations, and the position in the sequence; l j is the weight of F j , and Z(x) is a normalization factor. The l parameters can be learned using the LBFGS algorithm, and arg max y p(y|x) can be inferred using the Viterbi algorithm.

Evaluation methods
The results were measured in terms of accuracy (acc), precision (p, degree of correctness), recall (r, degree of completeness), and F-Measure (f, harmonic mean of p and r): We used 10-fold cross validation to avoid the possible bias introduced by relying on any one particular split of the data. The data were randomly divided into ten parts of approximately the same size. Each individual part was retained as test data and the remaining nine parts were used as training data. The process was repeated ten times with each part used once as the test data. The resulting ten estimates were then combined to give a final score. We compare our classifiers against a baseline method based on random sampling of category labels from training data and their assignment to sentences on the basis of their observed distribution.

User test in the context of Cancer Risk Assessment
We developed a user test so that we could evaluate and compare the practical usefulness of information structure schemes for CRA.
Two schemes were selected for this test: the coarsegrained S1 and the fine-grained S3. S2 was excluded because it proved fairly similar to S1 in terms of its performance in machine learning experiments (e.g. the number and type of categories which were actually identified in abstracts; see the Results section for details).
The user test was designed independently from the schemes. The idea was to ask cancer risk assessors to look for the information they typically look for in biomedical abstracts during an early stage of their work (when seeking to obtain an overview of the scientific data available on a chemical in question). The test was designed to compare the time it takes for risk assessors to find relevant information in (i) unannotated abstracts and (ii) abstracts annotated according to the schemes. Longer reading times have been shown to indicate greater cognitive load during language comprehension [40]. Minimizing the reading time is desirable as it can help to reduce the high cost of manual CRA. Intuitively, when risk assessors look for information about e.g. the methods used in a study, they should find this information faster when pointed to those sentences which discuss methods according to our schemes. However, whether this really helps to a significant degree (in particular when using automatically annotated abstracts), was an open question -along with which scheme (the coarse or fine-grained) might be more useful for the task.
As a starting point, cancer risk assessors working in Karolinska Institutet (Stockholm, Sweden) provided us with a list of questions they consider when studying abstracts for CRA. As the questions were of varying style and granularity and focused on various parts of abstracts, they seemed ideal for the evaluation of the schemes. The majority were adopted for the user test; however, some of the open-ended questions requiring text-based inference and more elaborate answers (e.g. The endpoints of the study?) were simplified to merely test whether and how fast the information in question could be found (e.g. Are endpoints mentioned?). This yielded a more controlled experiment which was better suited for comparing the performance of different users. We ended up with the following questionnaire containing seven questions where each question has either a verbal, 'yes' or 'no', or multiple choice answer: Q1 What was the aim(s) of the study?
Verbal answer Q2 What was the main type of the study?
Four possible answers from which users have to select one: an animal study, human study, In Vitro study or a combined study. Depending on the answer selected, three follow-up questions apply which each require a 'yes' or 'no' answer. For example, the following questions apply to a human study: Q3a Is exposure length mentioned? Q3b Is group size mentioned? Q3c Are endpoints mentioned? Q4 Positive Results?

Verbal answer
We designed an on-line form which shows an abstract (and the name of the chemical which the abstract focuses on) on the top of the page and each question on the bottom of the page. The questions are displayed to experts one at a time, in the sequential order shown above. The idea of the test is to record the time it takes for an expert to answer each question. This is done by asking them to press 'start', 'next' and 'complete' buttons during different phases of the test, as appropriate.
Two screen-shots illustrating the test are shown in Figure 2. They show the same abstract annotated according to S3 for questions 1 and 2, respectively. As illustrated in the screen-shorts, although the whole abstract is shown to experts with each question, only 1-2 scheme categories are highlighted (with colors) per question, as to attract experts' attention. Those are the scheme categories which are most likely to contain an answer to the particular question. We settled for this option after conducting a pilot study which showed that users found abstracts annotated according to all the (or all the potentially relevant) scheme categories confusing rather than helpful.
Highlighting only the 1-2 most relevant categories required creating a mapping between the questions and the potentially relevant scheme categories and investigating which of the categories are really the most important ones for answering each question. We asked an expert (one of the risk assessors working at Karolinska Institutet) to examine 30 abstracts which had been manually annotated for S1 and S3 and to indicate, for each question, all the possible categories where an answer to the question could be found. This pilot study showed that although it was often possible to find an answer in several categories, there were 1-2 dominant categories which nearly always included the answer. For S1, a single dominant category could be identified for each of the questions. For S3, a single category was found for five of the questions, and two questions had two equally dominating categories which were both included because they were usually mutually exclusive. Table 2 shows all the possible categories per question, and the 1-2 dominant ones per scheme which we used in our test.
Three experts participated in our test: two professor level experts with a long experience in CRA (over 25 years each) -A and B -and one more junior expert: C who has a PhD in toxicology and over 5 years of experience in CRA. We selected 120 abstracts from the CRA corpus for this test, in random but subject to the constraint that they were similar in length and focused on one of the four chemicals: butadiene, diethylnitrosamine, diethylstilbestrol, and phenobarbital. Each expert was presented with the same set of 120 abstracts. The abstracts were divided into 5 groups (each including around 24 abstracts) so that each expert was presented with: S0: unannotated abstracts, S1: abstracts annotated manually according to S1, S3: abstracts annotated manually according to S3, S1': abstracts automatically annotated according to S1 using the SVM classifier, and S3': abstracts automatically annotated according to S3 using the SVM classifier.
The results were measured in terms of the (i) total time it took for the experts to examine each abstract in the five groups above, and (ii) the percentage of time each expert saved when examining scheme annotated abstracts vs. unannotated ones. We also measured the statistical significance of the differences using the Mann-Whitney U Test [41,42]. The results were measured in p-value, and the chosen significance level was 0.05. Finally, we examined whether automatic annotations affected the quality of the expert answers. We did this by comparing the agreement in expert answers between S0, S1' and S3' annotated abstracts.

The annotated corpus and inter-annotator agreement
The corpus annotation work took 45, 50 and 90 hours in total for S1, S2 and S3, respectively. Table 3 shows the distribution of sentences per scheme category in the resulting corpus. We see that for S1, all the four categories appear in abstracts with sufficient frequency, with RES being the most frequent category (accounting for 40% of the corpus). For S2, RES is also the most frequent category (again accounting for 40% of the corpus). Four other S2 categories appear in the corpus data with reasonable frequency: BKG, OBJ, METH and CON, which cover 8-18% of the corpus each. Two categories  are very low in frequency, only covering 1% of the corpus each: REL and FUT. Also for S3, RES is the most frequent category (accounting for 32% of the corpus). For S3, six other categories cover 6-14% of the corpus each (BKG, OBJT, EXP, METH, OBS and CON), while four categories cover 1-4% (HYP, MOT, GOAL, and MOD). All the scheme categories we set to explore thus did appear in abstracts, but some categories belonging to the schemes that have been developed for full papers are rare. However, some of these categories have proven infrequent also in full papers [25,26]. We measured the inter-annotator agreement on 300 abstracts (i.e. a third of the corpus) using three annotators: one linguist, one expert in CRA, and the computational linguist who annotated all the corpus. We calculated Cohen's kappa [32] between each pair of annotators and averaged the results. The inter-annotator agreement was = 0.84, = 0.85, and = 0.50 for S1, S2, and S3, respectively. According to [43], the agreement 0.81-1.00 is perfect and 0.41-0.60 is moderate. S1 and S2 are thus the easiest schemes for the annotators and S3 the most challenging. This is not surprising as S3 is the scheme with the finest granularity. Its reliable identification may require a longer period of training and possibly improved guidelines. Moreover, previous annotation efforts using S3 have used domain experts for annotation [25,31]. For S3 the best agreement was between the domain expert and the linguist ( = 0.60). For S1 and S2 the best agreement was between the linguist and the computational linguist ( = 0.87 and = 0.88, respectively). Table 4, 5, and 6 present a confusion matrix for S1, S2 and S3, respectively. A confusion matrix shows the categories the domain expert (E) and the linguist (L) (dis) agreed on. For S1, we can see that the annotators had trouble distinguishing between OBJ and METH, and RES and CON. For instance, there are 88 sentences labeled with METH by the domain expert and with OBJ by the linguist. Also there are 158 sentences labeled with CON by the expert and with RES by the linguist. Similar confusions can be observed between OBJ and METH, and RES and CON for S2, and between GOAL and OBJT, EXP and METH, OBS and RES, and RES and CON for S3. These problems may arise from the sentences that have two (or more) parts representing different categories (e.g. RES and CON in a single sentence). In addition to improving guidelines on these cases and providing annotators with longer training, one possible solution would be to improve the annotation strategy, for example, to use a smaller unit of annotation than a sentence, e.g. a clause.

Comparison of the schemes in terms of annotations
We used the resulting annotations to compare the degree of overlap between the schemes. Table 7 shows the results of our pairwise comparison. The chi-squared Pearson statistic, the chi-squared likelihood ratio, the contingency coefficient and Cramer's V each show a definite correlation between the rows and columns for the three schemes. When calculating the Goodman-Kruskal lambda L statistic [33], using the categories of S1 as the independent variables, we obtained a lambda of over 0.72 which suggests a 72% reduction in error in predicting S2 categories and 47% reduction in error in predicting S3 categories. With S2 categories being the independent variables, we obtain a reduction in error of 88% when predicting S1 and 55% when predicting S3 categories. The lower lambdas for predicting S3 are hardly surprising as S3 has 11 categories as opposed to  4 and 7 for S1 and S2 respectively. S3 on the other hand has strong predictive power in predicting the categories of S1 and S2 with lambdas of 0.86 and 0.84 respectively. In terms of association, S1 and S2 seem to be thus more strongly associated, followed by S1 and S3 and then S2 and S3. The correspondence between the actual categories of the three schemes is visualized in Figure 3. Take S3 BKG and S1 OBJ for example. The former maps to the latter for 96% of cases, whereas the latter maps to a number of categories in S3, namely 49% BKG, 19% GOAL, 11% MOT and 19% OBJT. It therefore would seem that S3 BKG is subsumed by S1 OBJ but not the other way round. According to the subsumption checking approach of [34], if we take X to be S3 BKG and Y to be S1 OBJ, we get 0.039 < 0.1; therefore the subsumption relation S3 BKG ⊆ S1 OBJ holds. Take, for another example, S3 BKG and S2 BKG. The former maps to the latter in 97% of cases, whereas the latter maps to 78% BKG, 11.6% HYP, 11% MOT, 9% METH in S3. The subsumption relation is one-way: S3 BKG ⊆ S2 OBJ (with 0.03 < 0.1). Similarly, S2 BKG maps to S1 OBJ in 97% of cases, whereas S1 OBJ maps to 61.4% BKG, 30% OBJ and 9% METH in S2. The subsumption relation S2 BKG ⊆ S1 OBJ holds (with 0.029 < 0.1). Therefore, we have a subsumption relation of the type: S3 BKG ⊆ S2 BKG ⊆ S1 OBJ.
We follow the same procedure for the rest of the categories. The subsumption relations between scheme categories are summarized below: S3 OBS ⊆ S2 RES ≡ S1 RES S3 RES ⊆ S2 RES ≡ S1 RES S3 CON ⊆ (S2 CON ∪ S2 RES ∪ S2 FUT ∪ S2 REL) ⊆ (S1 CON ∪ S1 RES) Based on the above analysis, it is clear that all categories in S3 are subsumed by categories in S2 which are in turn subsumed or equivalent to categories in S1. It is therefore reasonable to assume a subsumption relation between the three schemes of the type S3 ⊆ S2 ⊆ S1. This also agrees with the values of the Kruskall-lambda statistic above, according to which if we know S3    Table 7 Association measures between schemes S1, S2, S3 S1 vs S2 S1 vs S3 S2 vs S3 categories the likelihood of predicting S2 and S1 categories is high (84% and 86% reduction in error respectively) and decreases if we try to predict S3 when knowing S2 (55% error reduction) or S1 (47% error reduction).
This subsumption relation is an interesting outcome given that the three different schemes have such different origins. Table 8 shows F-measure results when using each individual feature alone, and Table 9 when using all the features but the individual feature in question. In these two tables, we only report the results for SVM which performed better than other methods. Although we have results for most scheme categories, the results for some are missing due to the lack of sufficient training data (see Table 3), or due to a small feature set (e.g. History alone).

Automatic classification
Looking at individual features alone, Word, Bi-gram and Verb perform the best for all the schemes, and History and Voice perform the worst. In fact History performs very well on the training data, but on the test data we can only use estimates rather than the actual labels; an uncertain estimate of the feature at the beginning of the abstract will introduce further uncertainty later on, leading to poor overall results. The Voice feature works only for RES and METH for S1 and S2, and for OBS for S3. This feature is probably only meaningful for some of the categories. When using all but one of the features, S1 and S2 suffer the most from the absence  Figure 3 Comparison of the three schemes in terms of manual annotations. The figure shows pairwise interpretation of categories of one scheme in terms of the categories of the other: S2 to S1 mapping in A, S3 to S1 mapping in B and S3 to S2 mapping in C.  of Location, while S3 from the absence of Word/POS. Verb Class on its own performs worse than Verb, however when combined with other features it performs better: leave-Verb-out outperforms leave-Verb Class-out. Table 10 shows the results for the baseline (BL), and the best results for NB, SVM, and CRF. NB, SVM, and CRF perform clearly better than BL for all the schemes. SVM performs the best among all the classifiers. CRF performs fairly well on S1 and S2 but not on the most fine-grained S3. NB performs well on S1, but not equally well on S2 and S3, which have a higher number of categories, some of which are low in frequency (see Table 3).
For S1, SVM finds all the four scheme categories with the accuracy of 89%. F-measure is 90 for OBJ, RES and CON and 81 for METH. For S2, the classifier finds six of the seven categories, with the accuracy of 90% and the average F-measure of 91 for the six categories. As with S2, METH has the lowest performance (at 85 Fmeasure); the one missing category (REL) appears in our abstract data with very low frequency (see Table 3).
For S3, SVM uncovers as many as nine of the eleven categories with the accuracy of 81%. Six categories perform well, with F-measure higher than 80. EXP, BKG and GOAL have F-measure of 70, 62 and 62, respectively. Like the missing categories HYP and MOD, GOAL is very low in frequency. The lower performance of the higher frequency EXP and BKG is probably due to low precision in distinguishing between EXP and METH, and BKG and other categories, respectively.

User test
The results of the user test are presented in Table 11 and 12. Table 11 shows the total time it took for the experts (A, B, and C) to do the user test for abstracts belonging to groups S0, S1, S3, S1' and S3', respectively (see the User Test in the Methods section for details of the experts and abstract groups), along with the percentage of time the experts saved when examining scheme annotated abstracts vs. unannotated ones. Columns 2-8 show the results for each individual question and column 9 shows the overall performance. TIME stands for the sample mean (measured in seconds), and SAVE for    the percentage of the time savings. Table 12 shows, for the three experts, the statistical significance of the differences between all the eight scheme pairs (e.g. S0 vs. S1, S0 vs. S3, etc). The statistical significance is indicated using p-values of the Mann-Whitney U Test (see Methods section for details).
Looking at the overall performance figures, the average time spent with unannotated abstracts (S0) was 69.5 seconds for A, 116.1 for B, and 102.9 for C. All the experts spent significantly less time with scheme annotated abstracts (S1, S3, S1' and S3') than with unannotated ones (S0): the percentage of time saved ranges between 11% and 46%. Even A, who was the fastest expert with unannotated abstracts, saved 16%, 22%, 11% and 17% time with S1, S3, S1' and S3', respectively. For other users the savings in time were bigger.
As expected, the more accurate manually annotated abstracts (S1, S3) help save more time (33% on average per expert) than automatically annotated ones (S1', S3') (19.5% on average per expert). For instance, in the case of C, S1 and S3 saved 36% and 38% of time, respectively, whereas S1' and S3' saved 17%. However, automatic annotations still clearly helped experts conduct their task faster.
Looking at individual questions, for all the users, no significant difference (p > .05) was found in results between S1' and S1 for Q3a, Q4, and Q5, and between S3' and S3 for Q3c and Q4. These are the questions which map to the frequent scheme categories with high F-Measures in machine learning experiments: RES, CON for S1', and OBS for S3', as shown in Table 10. We can therefore expect future improvements in the automatic detection of lower frequency scheme categories lead to improved performance also in user tests.
Comparing the two schemes, for each user, S3 (the fine-grained CoreSC scheme) saved more time than S1 (the coarse-grained section name -based scheme) for the majority of questions: Q1, Q3a, Q3b, Q4, and Q5. Similarly, S3' saved more time than S1' for Q1, Q2, Q3b, Q3c, and Q4 for the majority of users. These include both broader questions requiring verbal answers like Q1 and Q5, and more specific questions requiring 'yes' vs. 'no' answers like Q3a-c. Although the majority of these differences between the two schemes are not statistically significant (see Table 12), the small benefit of S3 (and the fact that S1 rarely beats S3) is still a clear trend in the data, and shows also in the TOTAL results.
We finally examined whether using automatically annotated abstracts had an impact on the experts' accuracy. We took the abstracts annotated according to S1' and S3', respectively, and compared the results B and C obtained when using these abstracts against the results A obtained when using S0 annotations of the abstracts. Interestingly, when using S1' annotations, 83-85% of the answers produced by B and C agreed with the answers produced by A. When using S3' annotations, the agreement of B and C with A was 93%. This demonstrates that the use of automatic annotations does not result in a significant drop in experts' accuracy, in particular when a fine-grained scheme such as S3 is used.

Discussion and conclusions
The results from our corpus annotation (see Table 3) show that for the coarse-grained S1, all the four Table 11 Time measures for the user test   Q1  Q2  Q3a  Q3b  Q3c  Q4  Q5  TOTAL   TIME SAVE TIME SAVE TIME SAVE TIME SAVE TIME SAVE TIME SAVE TIME SAVE TIME  The table shows the time it takes for the CRA experts (A, B and C) to answer the questions in the questionnaire (sample mean) and the percentage of time they save using scheme annotations.
categories appear frequently in biomedical abstracts. This is not surprising because S1 was actually developed for abstracts. The inter-annotator agreement on this scheme was good and all the categories were also identified by machine learning (three of them with F of 90) yielding high overall accuracy (89%). For S2, all the seven categories appeared in the CRA corpus and six were found by the SVM classifier. Also this scheme had good inter-annotator agreement and obtained very similar accuracy in machine learning experiments than S1: 90%.
For S3, all the eleven categories appeared in the corpus and nine of them (even some of the low frequency ones) were identified using the SVM classifier with the overall accuracy of 81%. This accuracy is surprisingly good given the high number of categories, many of which were low in frequency in the CRA corpus, and considering the low inter-annotator agreement on this scheme (note, however, that we used data annotated according to a single annotator in our machine learning experiments -therefore, the inter-annotator agreement result does not provide a reliable human upper bound for our experiments).
These results show that all the three schemes are applicable to abstracts and can be identified in them automatically with relatively high accuracy. Interestingly, our analysis in section 'Comparison of the schemes in terms of annotations' demonstrates that there is a subsumption relation between the categories of the three schemes. This is surprising since the three schemes were developed based on different principles: S1 on section names, S2 on following the knowledge claims made by authors, and S3 on tracking the structure of a scientific investigation at the level of scientific concepts. Our comparison shows that the main practical difference between the schemes is that S2 and S3 provide finergrained information about the information structure of abstracts than S1 (even with their 2-3 low frequency or missing categories).
Ultimately, an optimal scheme will depend on the level of detail required by the application at hand. Similarly, the level of accuracy required in machine learning performance may be application-dependent. To shed light on these issues, we conducted evaluation in the context of the real-life task of CRA. In this evaluation we focused on the most general S1 and the most detailed S3 schemes only.
The user test was designed independently of the two schemes. Three cancer risk assessors were presented with a questionnaire which involved looking for information (both detailed and general) relevant for CRA in different parts of biomedical abstracts. We evaluated the time it took for the experts to answer the questions when presented with plain unannotated abstracts and those annotated manually and automatically according to S1 and S3.
The results show that all the experts saved significant amounts of time when examining abstracts highlighted for the most relevant scheme categories per question. Although manually annotated abstracts are more useful (yielding overall savings of 16-46%), automatically annotated ones lead to significant savings in time as well (11-33% overall) in comparison with unannotated abstracts. Although no statistically significant differences could be observed between S1 and S3, all the experts performed faster with the majority of questions when presented with S3-labeled abstracts. Interestingly, this tendency could be observed with both manual and automatically annotated abstracts.
It is obvious, looking at the 1-2 dominant scheme categories (which we showed to the experts per question) and comparing them to the full set of possible categories in Table 2 that our CRA questionnaire would not realize the full potential of S3. However, the fact that S3 proved at least as helpful for users as S1 despite the lower machine learning performance is promising. On the other hand, it is encouraging that a scheme as simple as S1 can be used to aid a real-world task with a significant saving in users' time. Our user test -which is, to our knowledge, the first attempt to evaluate information structure schemes directly in the context of real-life biomedical tasksfocused on one step of CRA. This step involves looking for relevant information in abstracts, mainly to determine the usefulness of abstracts for the task. Other steps of CRA, in particular those which focus on more detailed information in full biomedical journal articles, are likely to benefit from the schemes of information structure to a greater degree. We intend to explore this avenue of work in our future experiments.
For real-life tasks involving abstracts, it would be useful to further improve machine learning performance. Previous works have not evaluated S2 or S3 on biomedical abstracts. However, Hirohata et al. [21] have evaluated S1. They showed that the amount of training data used can have a big impact on the task. They used c. 50,000 MEDLINE abstracts annotated (by the authors of the abstracts) as training data for S1. When using a small set of standard text classification features and CRF for classification, they obtained 95.5% per-sentence accuracy on 1000 abstracts. However, when only 1000 abstracts were used for training the accuracy was considerably worse; their reported per-abstract accuracy dropped from 68.8% to less than 50%. This contrasts with our CRF accuracy of 85% on 1000 abstracts. Although it would be difficult to obtain similarly huge training data for S2 and S3, this result suggests that one key to improved performance is larger training data, and this is what we plan to explore especially for S1.
In addition we plan to improve our method. We showed that our schemes are partly overlapping and that similar features and methods tend to perform the best/worst for each of the schemes. It is therefore unlikely that considerable scheme specific tuning will be necessary. However, we plan to develop our features further and to make better use of the sequential nature of information structure. Although CRF proved disappointing in our experiment, it may be worth investigating it further (by e.g. using features gathered from surrounding sentences) and also comparing it (and the SVM method) against methods such as Maximum Entropy which have proved successful in recent related works [35]. The resulting models will be evaluated both directly and in the context of CRA to provide an indication of their practical usefulness for real-world tasks.

Availability and requirements
• Project name: Tool for the user test • Project home page: http://www.cl.cam.ac.uk/ yg244/10crab.html • Operating System(s): Platform independent (tested on Windows and Linux) • Programming language: XUL and Javascript, Perl and Java for classification • Other requirements: Firefox 2.0 (available from http://www.mozilla.com/en-US/firefox/all-older.html) • License: The application will be freely accessible for all users.
• Any restrictions to use by non-academics: none