Enriching a biomedical event corpus with meta-knowledge annotation
© Thompson et al; licensee BioMed Central Ltd. 2011
Received: 28 June 2011
Accepted: 10 October 2011
Published: 10 October 2011
Biomedical papers contain rich information about entities, facts and events of biological relevance. To discover these automatically, we use text mining techniques, which rely on annotated corpora for training. In order to extract protein-protein interactions, genotype-phenotype/gene-disease associations, etc., we rely on event corpora that are annotated with classified, structured representations of important facts and findings contained within text. These provide an important resource for the training of domain-specific information extraction (IE) systems, to facilitate semantic-based searching of documents. Correct interpretation of these events is not possible without additional information, e.g., does an event describe a fact, a hypothesis, an experimental result or an analysis of results? How confident is the author about the validity of her analyses? These and other types of information, which we collectively term meta-knowledge, can be derived from the context of the event.
We have designed an annotation scheme for meta-knowledge enrichment of biomedical event corpora. The scheme is multi-dimensional, in that each event is annotated for 5 different aspects of meta-knowledge that can be derived from the textual context of the event. Textual clues used to determine the values are also annotated. The scheme is intended to be general enough to allow integration with different types of bio-event annotation, whilst being detailed enough to capture important subtleties in the nature of the meta-knowledge expressed in the text. We report here on both the main features of the annotation scheme, as well as its application to the GENIA event corpus (1000 abstracts with 36,858 events). High levels of inter-annotator agreement have been achieved, falling in the range of 0.84-0.93 Kappa.
By augmenting event annotations with meta-knowledge, more sophisticated IE systems can be trained, which allow interpretative information to be specified as part of the search criteria. This can assist in a number of important tasks, e.g., finding new experimental knowledge to facilitate database curation, enabling textual inference to detect entailments and contradictions, etc. To our knowledge, our scheme is unique within the field with regards to the diversity of meta-knowledge aspects annotated for each event.
Due to the rapid advances in biomedical research, scientific literature is being published at an ever-increasing rate. This makes it highly important to provide researchers with automated, efficient and accurate means of locating the information they require, allowing them to keep abreast of developments within biomedicine [1–5]. Such automated means can be facilitated through text mining, which is receiving increasing interest within the biomedical field [6, 7]. Text mining enriches text via the addition of semantic metadata, and thus permits tasks such as analysing molecular pathways and semantic searching.
Information extraction (IE) systems facilitate semantic searching by producing classified, structured, template-like representations of important facts and findings contained within documents, called events. As the features of texts and the types of events to be recognised vary between different subject domains, IE systems must be adapted to deal with specific domains. A well-established method of carrying out this adaptation is through training using annotated corpora (e.g., [10–12]). Accordingly, a number of corpora of biomedical texts annotated for events have been produced (e.g., [13–15]), on which IE systems in the biomedical domain can be trained.
Event annotation in these corpora typically includes the identification of the trigger, type and participants of the event. The event trigger is a word or phrase in the sentence which indicates the occurrence of the event, and around which the other parts of the event are organised. The event type (generally assigned from an ontology) categorises the type of information expressed by the event. The event participants, i.e., entities or other events that contribute towards the description of the event, are also part of the event representation, and are often categorised using semantic role labels such as CAUSE (i.e., the entity or other event that is responsible for the event occurring) and THEME (i.e., the entity or other event that is affected by the event) to indicate their contribution towards the event description. Events that contain further events as participants are often referred to as complex events, while simple events only contain entities as participants. Usually, semantic types (e.g. gene, protein, etc.) are also assigned to the named entities (NEs) participating in the event. Other types of participants are also possible, corresponding, for example, to the location or environmental conditions under which the event took place.
In order to illustrate this typical event representation, consider sentence (1).
(1) The results suggest that the narL gene product activates the nitrate reductase operon.
The typical structured representation of the biomedical event described in this sentence is as follows:
THEME: nitrate reductase operon:operon
CAUSE: narL gene product:protein
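The event representation described above (trigger, type and role-labelled participants) can be sketched as a small data structure. This is only an illustrative sketch, not the corpus format: the class and field names are hypothetical, the trigger "activates" is read off sentence (1), and the event type label "Positive_regulation" is an assumed placeholder, since the listing above gives only the THEME and CAUSE participants.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Entity:
    text: str           # surface string of the named entity
    semantic_type: str  # NE type, e.g. "protein" or "operon"


@dataclass
class Event:
    trigger: str     # word or phrase that signals the event
    event_type: str  # category, normally drawn from an ontology
    # Semantic role label -> participant. A participant may itself be an
    # Event, in which case the containing event is a complex event.
    participants: Dict[str, Union[Entity, "Event"]] = field(default_factory=dict)

    def is_complex(self) -> bool:
        """Return True if at least one participant is itself an event."""
        return any(isinstance(p, Event) for p in self.participants.values())


# Sentence (1): "... the narL gene product activates the nitrate reductase operon."
event_1 = Event(
    trigger="activates",
    event_type="Positive_regulation",  # assumed label, for illustration only
    participants={
        "THEME": Entity("nitrate reductase operon", "operon"),
        "CAUSE": Entity("narL gene product", "protein"),
    },
)
```

Because both participants here are entities, `event_1.is_complex()` is false; an event whose CAUSE or THEME is another `Event` would count as complex.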
The automatic recognition of such structured events facilitates sophisticated semantic querying of documents, which provides much greater power than conventional search techniques. Rather than simply searching for keywords in documents, users can search for specific types of events in documents, through (partial) completion of a template. This template allows different types of restrictions to be placed on the events that are required to be found, e.g.:
The type of event to be retrieved.
The types of participants that should be present in the event.
The values of these participants, which could be specified in terms of either specific values or NE types.
The fact that event and NE types are often hierarchically structured can provide the user with a large amount of flexibility regarding the generality or specificity of their query.
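Template-based querying of this kind can be sketched as follows. The sketch assumes events stored as plain dictionaries and a hypothetical child-to-parent type hierarchy (`PARENT`); none of the type names or function names come from an actual search system. A template may restrict the event type, the roles that must be present, and each participant's text or NE type, and a constraint on a general type also accepts its more specific subtypes.

```python
from typing import Dict, Optional

# Hypothetical child -> parent links covering both event types and NE types.
PARENT: Dict[str, str] = {
    "Positive_regulation": "Regulation",
    "Negative_regulation": "Regulation",
    "protein": "gene_or_gene_product",
}


def is_a(type_name: str, ancestor: str) -> bool:
    """Return True if type_name equals ancestor or descends from it."""
    current: Optional[str] = type_name
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def matches(event: dict, template: dict) -> bool:
    """Check one event against a partially completed template."""
    wanted_type = template.get("event_type")
    if wanted_type is not None and not is_a(event["event_type"], wanted_type):
        return False
    for role, constraint in template.get("participants", {}).items():
        participant = event["participants"].get(role)
        if participant is None:
            return False  # the required role is absent from the event
        text, ne_type = participant
        wanted_text = constraint.get("text")
        wanted_ne = constraint.get("ne_type")
        if wanted_text is not None and text != wanted_text:
            return False
        if wanted_ne is not None and not is_a(ne_type, wanted_ne):
            return False
    return True


events = [
    {
        "event_type": "Positive_regulation",
        "participants": {
            "THEME": ("nitrate reductase operon", "operon"),
            "CAUSE": ("narL gene product", "protein"),
        },
    },
]

# "Any regulation event whose CAUSE is a protein": a general query that
# still retrieves the more specific Positive_regulation event.
template = {"event_type": "Regulation",
            "participants": {"CAUSE": {"ne_type": "protein"}}}
hits = [e for e in events if matches(e, template)]
```

Leaving a field out of the template simply removes that restriction, which is what makes partial completion possible.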
Despite the increased power and more focussed searching that event-based searching can provide over traditional keyword-based searches, typical event annotations do not capture contextual information from the sentence, which can be vital for the correct interpretation of the event. Let us consider again sentence (1), and in particular the phrase at the beginning of the sentence, i.e., The results suggest that ... This phrase allows us to determine the following about the event that follows:
It is based on an analysis of experimental results.
It is stated with a certain amount of speculation (according to the use of the verb suggest, rather than a more definite verb, such as demonstrate).
Altering the words in the context of the event can affect its interpretation in both subtle and significant ways. Consider the examples below:
(2a) It is known that the narL gene product activates the nitrate reductase operon.
(2b) We examined whether the narL gene product activates the nitrate reductase operon.
(2c) The narL gene product did not activate the nitrate reductase operon.
(2d) These results suggest that the narL gene product might be activated by the nitrate reductase operon.
(2e) The narL gene product partially activated the nitrate reductase operon.
(2f) Previous studies have shown that the narL gene product activates the nitrate reductase operon.
If only the event type and participants are considered, then the events in sentences (2a-f) are identical to the event in sentence (1). However, the examples clearly illustrate that it is important to consider the context in which the event occurs, since a wide range of different types of information may be expressed that relate directly to the interpretation of the event.
In sentence (2a), the word known tells us that the event is a generally accepted fact, while in (2b), the interpretation is completely different. The word examined shows that the event is under investigation, and hence the truth value of the event is unknown. The presence of the word not in sentence (2c) shows that the event is negated, i.e. it did not happen. In sentence (2d), the presence of the word might (in addition to suggest) adds further speculation regarding the truth of the event. The word partially in (2e) does not challenge the truth of the event, but rather conveys the information that the strength or intensity of the event is less than may be expected by default. Finally, the phrase previous studies in (2f) shows that the event is based on information available in previously published papers, rather than relating to new information from the current study.
We use the term meta-knowledge to collectively refer to the different types of interpretative information available in the above sentences. There are several tasks in which biologists have to search and review the literature that could benefit from the automatic recognition of meta-knowledge about events. These tasks include building and updating models of biological processes, such as pathways, and curation of biological databases [18, 19]. Central to both of these tasks is the identification of new knowledge that can enhance these resources, e.g., to build upon an existing, but incomplete model of a biological process, or to ensure that a database is kept up to date. New knowledge should correspond to experimental findings or conclusions that relate to the current study, which are stated with a high degree of confidence, rather than, e.g., more tentative hypotheses. In the case of an analytical conclusion, it may be important to find appropriate evidence that supports this claim before allowing it to be added to the database.
Other users may be interested in checking for inconsistencies or contradictions in the literature. The identification of meta-knowledge could also help to flag such information. Consider, for example, the case where an event with the same ontological type and identical participants is stated as being true in one article and false in another. If the textual context of both events shows them to have been stated as facts, then this could constitute a serious contradiction. If, however, one of the events is marked as being a hypothesis, then the consequences are not so serious, since the hypothesis may have been later disproved. The automatic identification of meta-knowledge about events can clearly be an asset in such scenarios, and can prevent users from spending time manually examining the textual context of each and every event that has been extracted from a large document collection in order to determine the intended interpretation.
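The contradiction check described above can be sketched as follows. The sketch assumes each extracted event is a dictionary with hypothetical fields `event_type`, `participants`, `polarity` and `stated_as` (the last one standing in for whatever meta-knowledge indicates a fact or a hypothesis); these names are illustrative only. Events are grouped by type and participants, and pairs with opposite polarity are graded as serious only when both are stated as facts.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def core_key(event: dict) -> Tuple:
    """Event type plus participants, ignoring all meta-knowledge."""
    return (event["event_type"], tuple(sorted(event["participants"].items())))


def find_conflicts(events: List[dict]) -> List[Tuple[dict, dict, str]]:
    """Pair up same-core events with opposite polarity and grade each pair."""
    groups: Dict[Tuple, List[dict]] = defaultdict(list)
    for event in events:
        groups[core_key(event)].append(event)

    conflicts = []
    for group in groups.values():
        for a, b in combinations(group, 2):
            if a["polarity"] == b["polarity"]:
                continue  # same truth value, so nothing to flag
            both_facts = a["stated_as"] == "fact" and b["stated_as"] == "fact"
            conflicts.append((a, b, "serious" if both_facts else "minor"))
    return conflicts


participants = {"CAUSE": "narL gene product", "THEME": "nitrate reductase operon"}
events = [
    {"doc": "A", "event_type": "Positive_regulation", "participants": participants,
     "polarity": "positive", "stated_as": "fact"},
    {"doc": "B", "event_type": "Positive_regulation", "participants": participants,
     "polarity": "negative", "stated_as": "fact"},
    {"doc": "C", "event_type": "Positive_regulation", "participants": participants,
     "polarity": "negative", "stated_as": "hypothesis"},
]
conflicts = find_conflicts(events)
```

In this toy collection, documents A and B form a serious conflict, whereas A and C form a minor one because C only states a hypothesis.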
In response to the issues outlined above, we have developed a new annotation scheme that is specifically tailored to enriching biomedical event corpora with meta-knowledge, in order to facilitate the training of more useful systems in the context of various IE tasks performed on biomedical literature. As illustrated by the example sentences above, a number of different types of meta-knowledge may be encoded in the context of an event, e.g., general information type (fact, experimental result, analysis of results), level of confidence/certainty towards the event, polarity of the event (positive or negative), etc. In order to account for this, our annotation scheme is multi-dimensional, with each dimension encoding a different type of information. Each of the 5 dimensions has a fixed set of possible values. For each event, the annotation task consists of determining the most appropriate value for each dimension. Textual clue expressions that are used to determine the values are also annotated, when they are present.
Following an initial annotation experiment by two of the authors to evaluate the feasibility of the scheme, we applied our scheme to the complete GENIA event corpus. This consists of 1000 MEDLINE abstracts, containing a total of 36,858 events. The annotation was carried out by two annotators, who were trained in the application of the scheme, and provided with a comprehensive set of annotation guidelines. The consistency and quality of the annotations produced were ensured through double annotation of a portion of the corpus.
To our knowledge, the enriched corpus represents a unique effort within the domain, in terms of the amount of meta-knowledge information annotated at such a fine-grained level of granularity (i.e., events). As the GENIA event corpus is currently the largest biomedical corpus annotated with events, the enrichment of this entire corpus with meta-knowledge annotation constitutes a valuable resource for training IE systems to recognise not only the core information about events and their participants, but also additional information to aid in their correct interpretation and provide enhanced search facilities. The corpus and annotation guidelines may be downloaded for academic purposes from http://www.nactem.ac.uk/meta-knowledge/.
Construction of classified inventories of lexical markers (i.e., words or phrases) which can accompany statements to indicate their intended interpretation.
Production of corpora annotated with various different types of meta-knowledge at differing levels of granularity.
The presence of specific cue words and phrases has been shown to be an important factor in classifying biomedical sentences automatically according to whether or not they express speculation [22, 23]. Corpus-based studies of hedging (i.e., speculative statements) in biological texts [24, 25] reinforce the above experimental findings, in that 85% of hedges were found to be conveyed lexically, i.e., through the use of particular words and phrases, rather than through more complex means, e.g., by using conditional clauses. The lexical means of hedging in biological texts have also been found to be quite different to academic writing in general, with modal auxiliaries (e.g., may, could, would, etc.) playing a more minor role, and other verbs, adjectives and adverbs playing a more significant role . It has additionally been shown that, in addition to speculation, specific lexical markers can denote other types of information pertinent to meta-knowledge identification, e.g., markers of certainty , as well as deductions or sensory (i.e. visual) evidence .
Based on the above, we can determine that lexical markers play an important role in distinguishing several different types of meta-knowledge, and also that there is a potentially wide range of different markers that can be used. For example, one study identified 190 hedging cues that are used in biomedical research articles. Our own previous work on identifying and categorising lexical markers of meta-knowledge demonstrated that such markers are to some extent domain-dependent. In contrast to other studies, we took a multi-dimensional approach to the categorisation, acknowledging that different types of meta-knowledge may be expressed through different words in the same sentence. As an example, consider sentence (3).
(3) The DNA-binding properties of mutations at positions 849 and 668 may indicate that the catalytic role of these side chains is associated with their interaction with the DNA substrate.
Firstly, the word indicate denotes that the statement following that is to be interpreted as an analysis based on the evidence given at the beginning of the sentence (rather than, e.g., a well-known fact or a direct experimental observation). Secondly, the word may conveys the fact that the author only has a medium level of confidence regarding this analysis.
Although such examples serve to demonstrate that a multi-dimensional approach recognising meta-knowledge information is necessary to correctly capture potential nuances of interpretation, it is important to note that taking a purely lexical approach to recognising meta-knowledge is not sufficient (i.e., simply looking for words from lists of cues that co-occur in the same sentences as events of interest). The reasons for this include:
a) The presence of a particular marker does not guarantee that the "expected" interpretation can be assumed . Some markers may have senses that vary according to their context. As noted in , "Every instance should ... be studied in its sentential co-text" (p.125).
b) Although lexical markers are an important part of meta-knowledge recognition, there are other ways in which meta-knowledge can be expressed. This has been demonstrated in a study involving the annotation of rhetorical zones in biology papers (e.g., background, method, result, implication, etc.) , based on a scheme originally proposed in . An analysis of features used to determine different types of zone in the annotated papers revealed that, in addition to explicit lexical markers, features such as the main verb in the clause, tense, section, position in the sentence within the paragraph and presence of citations in the sentence can also be important.
Thus, rather than assigning meta-knowledge based only on categorised lists of clue words and expressions, there is a need to produce corpora annotated with meta-knowledge, on which enhanced IE systems can be trained. By annotating meta-knowledge information for each relevant instance (e.g., an event), regardless of the presence of particular lexical markers, systems can be trained to recognise other types of features that can help to assign meta-knowledge values. However, given that the importance of lexical markers in the recognition of meta-knowledge has been clearly illustrated, explicit annotation of such markers should be carried out as part of the annotation process, whenever they are present.
There are several existing corpora with some degree of meta-knowledge annotation. These corpora vary in both the richness of the annotation added, and the type/size of the units at which the meta-knowledge annotation has been performed. Taking the unit of annotation into account, we can distinguish between annotations that apply to continuous text spans, and annotations that have been performed at the event level.
Annotations applied to continuous text spans most often cover only a single aspect of meta-knowledge, and are most often carried out at the level of the sentence. The most common types of meta-knowledge annotated correspond to either speculation/certainty level, e.g., [22, 23], or general information content/rhetorical intent, e.g., background, methods, results, insights, etc. This latter type of annotation has been attempted both on abstracts [33, 34] and full papers [31, 32, 35], using schemes of varying complexity, ranging from 4 categories for abstracts, up to 14 categories for one of the full paper schemes. Accurate automatic categorisation of sentences in abstracts has been shown to be highly feasible , and this functionality has been integrated into the MEDIE intelligent search system .
A few annotation schemes consider more than one aspect of meta-knowledge. For example, the ART corpus and its CoreSC annotation scheme [38, 39] augment general information content categories with additional attributes, such as New and Old, to denote current or previous work. The corpus described in  annotates both speculation and negation, together with their scopes. Uniquely amongst the corpora mentioned above,  also annotates the clue expressions (i.e. the negative and speculative keywords) on which the annotations are based.
Although sentences or larger zones of text constitute straightforward and easily identifiable units of text on which to perform annotation, a problem is that a single sentence may express several different pieces of information, as illustrated by sentence (4).
(4) Inhibition of the MAP kinase cascade with PD98059, a specific inhibitor of MAPK kinase 1, may prevent the rapid expression of the alpha2 integrin subunit.
This sentence contains at least 3 distinct pieces of information:
Description of an experimental method: Inhibition of the MAP kinase cascade with PD98059.
A general fact: PD98059 is a specific inhibitor of MAPK kinase 1.
A speculative analysis: Inhibition of the MAP kinase may prevent the expression of the alpha2 integrin subunit.
The main verb in the sentence (i.e., prevent) describes the speculative analysis. In a sentence-based annotation scheme, this is likely to be the only information that is encoded. However, this means that other potentially important information in the sentence is disregarded. Some annotation schemes have attempted to overcome such problems by annotating meta-knowledge below the sentence level, i.e., clauses [41, 42] or segments . In the case of the latter scheme, a new segment is created whenever there is a change in the meta-knowledge being expressed. The scheme proposed for segments is more complex than the sentence-based schemes, in that it covers multiple types of meta-knowledge, i.e., focus (content type), polarity, certainty, type of evidence and direction/trend (either increase or decrease in quantity/quality). It has, however, been shown that training a system to automatically annotate along these different dimensions is highly feasible .
At the level of biomedical events, annotation of meta-knowledge is generally very basic, and is normally limited to negation, e.g., . Negation is also the only attribute annotated in the corpus described in , even though a more complex scheme involving certainty, manner and direction was also initially proposed. To our knowledge, only the GENIA event corpus  goes beyond negation annotation, in that different levels of certainty (i.e. probable and doubtful) are also annotated.
Despite this current paucity of meta-knowledge annotation for events, our earlier examples have demonstrated that further information can usefully be specified at this level, including at least the general information content of the event, e.g. fact, experimental observation, analysis, etc. A possibility would be to "inherit" this information from a system trained to assign such information at the text span level (e.g. sentences or fragments), although this would not provide an optimal solution. The problem lies in the fact that text spans constitute continuous stretches of text, but events do not. The different constituents of an event annotation (i.e., trigger and participants) can be drawn from multiple, discontinuous parts of a sentence. There are almost always multiple events within a sentence, and the different participants of a particular event may be drawn from multiple sentence fragments. This means that mapping between text span meta-knowledge and event-level meta-knowledge cannot be carried out in a straightforward manner. Thus, for the purposes of training more sophisticated event-based information search systems, annotation of meta-knowledge directly at the event level can provide more precise and accurate information that relates directly to the event.
Based on the above findings, we embarked upon the design of an event-based meta-knowledge annotation scheme specifically tailored for biomedical events. In the remainder of this paper, we firstly cover the key aspects of this annotation scheme, followed by a description of the recruitment and training of annotators. We follow this by providing detailed statistics, results and evaluation of the application of the scheme to the GENIA event corpus. Finally, we present some conclusions and directions for further research.
In this section, we begin by providing a general overview of our annotation scheme, followed by a more detailed description of each annotation dimension. Following a brief overview of the software used to perform the annotation, we describe how we conducted an annotation experiment to test the feasibility and soundness of our scheme, prior to beginning full-scale annotation. The section concludes with a brief explanation of the recruitment and training of our annotators.
The aim of our meta-knowledge scheme is to capture as much useful information as possible that is specified about individual events in their textual context, in order to support users of event-based search systems in a number of tasks, including the discovery of new knowledge and the detection of contradictions. In order to achieve this aim, our annotation scheme identifies 5 different dimensions of information for each event, taking inspiration from previous multi-dimensional schemes (e.g. [39, 43, 45]). In addition to allowing several distinct types of information to be encoded about events, a multi-dimensional scheme is advantageous, in that the interplay between the different dimension values can be used to derive further useful information (hyper-dimensions) regarding the interpretation of the event.
Each dimension of the meta-knowledge scheme consists of a set of complete and mutually-exclusive categories, i.e., any given bio-event belongs to exactly one category in each dimension. The set of possible values for each dimension was determined through a detailed study of over 100 event-annotated biomedical abstracts. In order to minimise the annotation burden, the number of possible categories within each dimension has been kept as small as possible, whilst still respecting important distinctions in meta-knowledge that have been observed during our corpus study. Due to the demonstrated importance of lexical clues in the identification of certain meta-knowledge categories, the annotation task involves identifying such clues, when they are present.
The Knowledge Type (KT) dimension captures the general information content of the event. The type of information encoded is at a slightly different level to some of the comparable sentence-based schemes, which have categories relating to structure or "zones" within a document, e.g. background or conclusion. Rather, our KT dimension attempts to identify a small number of more general information types that can be used to characterise events, regardless of the zone in which they occur. As such, our scheme can be seen as complementary to structure or zone-based schemes, providing a finer-grained analysis of the different types of information that can occur within a particular zone. The KT features we have defined are as follows:
Investigation: Enquiries or investigations, which have either already been conducted or are planned for the future, typically accompanied by lexical clues like examined, investigated and studied, etc.
Observation: Direct observations, sometimes represented by lexical clues like found, observed and report, etc. Event triggers in the past tense typically also describe observations.
Analysis: Inferences, interpretations, speculations or other types of cognitive analysis, always accompanied by lexical clues, typical examples of which include suggest, indicate, therefore and conclude, etc.
Method: Events that describe experimental methods. Denoted by trigger words that describe experimental methods, e.g., stimulate, addition.
Fact: Events that describe general facts and well-established knowledge, typically denoted by present tense event triggers that describe biological processes, and sometimes accompanied by the lexical clue known.
Other: The default category, assigned to events that either do not fit into one of the above categories, do not express complete information, or whose KT is unclear or is not assignable from the context. These are mostly non-propositional events, i.e., events which cannot be ascribed a truth value due to lack of available (contextual) information.
The Certainty Level (CL) dimension identifies the level of certainty associated with the occurrence of the event, as ascribed by the authors. It comes into play whenever there is explicit indication that there is less than complete confidence that the specified event will occur. This could be because:
There is uncertainty regarding the general truth value ascribed to the event.
It is perceived that the event may not take place all of the time.
Different degrees of uncertainty and frequency can be considered as points on a continuous scale, and there is an ongoing discussion regarding whether it is possible to partition the epistemic scale into discrete categories . However, the use of a number of distinct categories is undoubtedly easier for annotation purposes and has been proposed in a number of previous schemes. Although recent work has suggested the use of four or more categories [28, 42, 44], our initial analysis of bio-event corpora showed that only three levels of certainty seem readily distinguishable for bio-events. This is in line with , whose analysis of general English showed that there are at least three articulated points on the epistemic scale.
Like the scheme described in , we have chosen to use numerical values for the CL dimension, in order to reduce potential annotator confusions or biases that may be introduced through the use of labels corresponding to particular lexical markers of each category, such as probable or possible. Such labels could in any case be misleading, given that frequency can also come into play in assigning the correct category. Our chosen values of the CL dimension are defined as follows:
L3: The default category. No explicit expression that either:
There is uncertainty or speculation towards the event.
The event does not occur all of the time.
L2: Explicit indication of either:
High (but not complete) confidence or slight speculation towards the event. Typical lexical clues include likely, probably, suggest and indicate.
The event occurs frequently, but not all of the time. Typical lexical clues include normally, often, frequently.
L1: Explicit indication of either:
Low confidence or considerable speculation towards the event. Typical lexical clues include may, might and perhaps.
The event occurs infrequently or only some of the time. Typical lexical markers may include sometimes, rarely, scarcely, etc.
The Polarity dimension has been designed to capture the truth value of the assertion encapsulated by the event. We define a negated event as one that describes the absence or non-existence of an entity or a process. That is to say, the event may describe that a process does not or did not happen, or that an entity is absent or does not exist. The recognition of such information is vital, as the interpretation of a negated event instance is completely opposite to the interpretation of a non-negated (positive) instance of the same event. Our scheme permits the following two values for this dimension:
Positive: No explicit negation of the event (default)
Negative: The event has been negated according to the description above. The negation may be indicated through lexical clues such as no, not, fail, lack, etc.
The Manner dimension identifies the rate, level, strength or intensity of the event (in biological terms). Such information has previously been shown to be relevant for biologists. The event annotation scheme for the GREC corpus, which was designed in consultation with biologists, identified expressions of manner as one of the semantic roles associated with events. A previous proposal for the annotation of protein-protein interactions also lists manner as a potentially useful attribute to annotate. Inspired by these works, we build upon the types of manner annotation available in the GREC corpus by adopting a three-way categorisation of manner, as shown below:
High: Explicit indication that the event occurs at a high rate, level, strength or intensity. Clue expressions are typically adjectives or adverbs such as high, strongly, rapidly, potent, etc.
Low: Explicit indication that the event occurs at a low rate, level, strength or intensity. Clue expressions are typically adjectives and adverbs such as slightly, partially, small, etc.
Neutral: The default category. Assigned when there is no explicit indication of either high or low manner, but also in the rare cases when neutral manner is explicitly indicated, using clue words such as normal or medium, etc.
This dimension denotes the source or origin of the knowledge being expressed by the event. Specifically, we distinguish between events that can be attributed to the current study, and those that are attributed to other studies. The importance of information about knowledge source is demonstrated by its annotation in both the Gene Ontology and other existing corpora. This dimension can help in distinguishing new experimental knowledge from previously reported knowledge. Two possible values are distinguished, as follows:
Other: The event is attributed to a previous study. In this case, explicit clues are normally present, and can be indicated either by the use of clue words such as previously, recent studies, etc., or by the presence of citations.
Current: The event makes an assertion that can be attributed to the current study. This is the default category, and is assigned in the absence of explicit lexical or contextual clues, although explicit clues such as the present study may be encountered.
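As a minimal sketch of how the five annotation dimensions might be represented in software, the Python fragment below models each dimension as an enumeration and bundles the values assigned to one event. The class and field names are illustrative assumptions and do not correspond to the GENIA or X-Conc file formats. The KT labels Method and Other, the use of Other as the KT default and the use of L3 as the CL default are also assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class KT(Enum):
    ANALYSIS = "Analysis"
    INVESTIGATION = "Investigation"
    OBSERVATION = "Observation"
    FACT = "Fact"
    METHOD = "Method"  # assumed label
    OTHER = "Other"    # assumed label and assumed default


class CL(Enum):
    L1 = "L1"  # low confidence or considerable speculation
    L2 = "L2"  # slight speculation or hedging
    L3 = "L3"  # assumed default: no uncertainty expressed


class Polarity(Enum):
    POSITIVE = "Positive"  # default
    NEGATIVE = "Negative"


class Manner(Enum):
    HIGH = "High"
    LOW = "Low"
    NEUTRAL = "Neutral"  # default


class Source(Enum):
    CURRENT = "Current"  # default
    OTHER = "Other"


@dataclass
class MetaKnowledge:
    """Meta-knowledge values for a single event (hypothetical container)."""
    kt: KT = KT.OTHER
    cl: CL = CL.L3
    polarity: Polarity = Polarity.POSITIVE
    manner: Manner = Manner.NEUTRAL
    source: Source = Source.CURRENT
    clues: List[str] = field(default_factory=list)  # annotated clue expressions


# An event reported as not happening in a previous study
example = MetaKnowledge(
    kt=KT.OBSERVATION,
    polarity=Polarity.NEGATIVE,
    source=Source.OTHER,
    clues=["not", "previously"],
)
```

Only the dimensions that deviate from their defaults need to be set explicitly, which mirrors the way default categories are assigned in the absence of clue expressions.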
A defining feature of our annotation scheme is the fact that, in addition to the explicitly annotated dimensions, further information can be inferred by considering combinations of some of these dimensions. We refer to these additional types of information as the hyper-dimensions of our scheme, of which we have identified two.
New Knowledge - The isolation of events describing new knowledge is, as we have described earlier, important for certain tasks undertaken by biologists. However, it is not possible to determine whether an event represents new knowledge by considering only a single annotation dimension. For example, events that have been assigned KT = Observation could correspond to new knowledge, but only if they represent observations from the current study, rather than observations cited from elsewhere. In a similar way, a KT = Analysis event drawn from experimental results in the current study could be treated as new knowledge, but generally only if it represents a straightforward interpretation of results, rather than something more speculative. Thus, we consider New Knowledge to be a hyper-dimension, whose value (either Yes or No) can be inferred by considering a combination of the values assigned to the KT, Source and CL dimensions. Table 1 is an inference table that can be used to obtain the appropriate value for New Knowledge, based on the values assigned to the three dimensions mentioned above.
Table 1: Inference table for New Knowledge hyper-dimension (inferred column: New Knowledge)
Hypothesis - The binary value of this hyper-dimension can be inferred by considering the values of the KT and CL dimensions. Events with a KT value of Investigation can always be assumed to be a hypothesis. However, if the KT value is Analysis, then only those events with a CL value of L1 or L2 (speculative inferences made on the basis of results) should be considered as hypotheses, to be matched with more definite experimental evidence when available. A value of L3 in this instance would normally be classed as an instance of new knowledge, as indicated in Table 1. The cases in which an event can be assumed to be a hypothesis are summarised in Table 2.
Table 2: Inference table for Hypothesis hyper-dimension
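The inference rules described in the text for the two hyper-dimensions can be sketched as two small Python functions. This is an illustration under stated assumptions, not a transcription of the inference tables: the function names are hypothetical, dimension values are passed as plain strings, and the requirement that Observation events also have CL = L3 to count as new knowledge is a conservative assumption, since the text only states this condition explicitly for Analysis events.

```python
def is_hypothesis(kt: str, cl: str) -> bool:
    """Hypothesis hyper-dimension, following the prose description.

    Investigation events are always hypotheses; Analysis events are
    hypotheses only when they are speculative (CL of L1 or L2).
    """
    if kt == "Investigation":
        return True
    return kt == "Analysis" and cl in {"L1", "L2"}


def is_new_knowledge(kt: str, source: str, cl: str) -> bool:
    """New Knowledge hyper-dimension, following the prose description.

    Assumption: Observation events are required to have CL = L3 as well;
    the text states this condition explicitly only for Analysis events.
    """
    return (
        source == "Current"
        and cl == "L3"
        and kt in {"Observation", "Analysis"}
    )


print(is_hypothesis("Analysis", "L2"))                   # True
print(is_new_knowledge("Analysis", "Current", "L3"))     # True
print(is_new_knowledge("Observation", "Other", "L3"))    # False
```

Under these rules an Analysis event from the current study is either new knowledge (L3) or a hypothesis (L1 or L2), but never both.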
The original annotation of the GENIA event corpus was performed using the X-Conc suite. This is a collection of XML-based tools that are integrated to support the development and annotation of corpora, running as a Java plug-in within the Eclipse software development platform. Customisation of the information to be annotated, and of the way in which it is displayed, is controlled completely through XML DTD and stylesheet (CSS) files. We decided to use this tool to carry out meta-knowledge annotation of events in the GENIA event corpus, as only minimal customisation of the existing DTD and CSS files would be required.
Prior to annotation of the full GENIA event corpus, a small annotation experiment was conducted to verify the feasibility and soundness of the meta-knowledge annotation scheme. Two of the authors independently applied the annotation scheme to 70 abstracts selected at random from the GENIA pathway corpus, using the annotation manual we had developed. The experiment helped to demonstrate the soundness of both the scheme itself and the guidelines, given that Kappa scores of 0.89-0.95 were achieved. Also, the fact that all categories within all dimensions were annotated, at least to a certain extent, suggested that none of the proposed categories was redundant.
In order to ensure the efficacy of the guidelines and the reproducibility of the annotation task, we recruited 2 external annotators to carry out the annotation of our gold standard corpus. An important consideration was the type of expertise required by the annotators. It has previously been found that at least negations and speculations in biomedical texts can be reliably detected by linguists. The scope of our meta-knowledge annotation is wider, involving some scientifically motivated aspects (i.e., KT and Manner), but the assignment of certain dimension values is somewhat linguistically oriented, e.g., it is often the case that clue expressions have a grammatical relationship to the event trigger that they modify. In order to verify the extent to which either domain-specific biological knowledge or linguistic knowledge is required to perform the annotation accurately, we recruited a biology expert and a linguistics expert to carry out the task. Both annotators have near-native competency in English, which we considered to be important for carrying out the task accurately.
The annotators undertook training prior to commencing the annotation of the gold standard corpus. This training began with initial introductory sessions, in which the annotation scheme and guidelines were explained, and the X-Conc annotation tool was demonstrated. Subsequently, the annotators carried out practice annotation tasks. For this purpose, we used the same corpus of 70 abstracts from the GENIA pathway corpus that was used to test the feasibility of the scheme, as described above. Both annotators were given the same sets of abstracts to annotate, independently of each other. This allowed us to detect a maximal number of potential annotation errors and discrepancies produced by the annotators, as we could conduct comparisons not only between the annotators themselves, but also against a gold standard corpus. The annotators returned a set of abstracts each week, in response to which we produced detailed feedback reports highlighting annotation errors. These reports were thoroughly discussed with the annotators, in order to maximally enhance and accelerate the learning process. Often, errors made by the annotators revealed potential problems with the annotation guidelines, which were addressed by updating the guidelines accordingly.
In this section, we firstly provide key statistics regarding the meta-knowledge annotations produced, together with a brief discussion regarding the salient characteristics of the corpus. This is followed by a report on the level of agreement achieved between the annotators in the double-annotated part of the corpus, and an examination of the different kinds of discrepancies that were found within these abstracts.
Below, we discuss the general distribution of the annotations amongst the different categories for each dimension, and also provide lists of the most commonly annotated clue expressions.
Table: Distribution of annotated categories for Knowledge Type (KT) (% of total events)
The proportion of Analysis events is much smaller but still quite significant, since most abstracts contain at least some analysis of the experimental results obtained. The usual inclusion of a small amount of background factual information to put the current study into context accounts for the average of 3 events per abstract (8% of all events) that are assigned the Fact category. Even briefer are the descriptions of what is to be investigated, with an average of 2 Investigation events per abstract (5% of all events). The scarcity of events describing methods (2.6% of events, or less than 1 event per abstract) shows that providing details of experimental setup is very rare within abstracts.
Table 4: Most common KT clue expressions
For both Investigation and Observation, the top three most common clue expressions are past tense verbs, while the use of the present tense appears to be more dominant for describing Analysis events. The use of infinitive forms (e.g., to investigate) as clues seems to be a particular feature of the Investigation category. Whilst most clues are verbal forms, words with other parts of speech can sometimes constitute reliable clues (e.g., thus for Analysis, and detectable for Observation).
Table: Distribution of annotated categories for Certainty Level (CL) (% of total events)
However, the marking of slight uncertainty is sometimes necessary. The authors' analyses of experimental results may have produced important outcomes, yet the authors may not be confident that their analysis is completely reliable. As has been noted elsewhere, "Scientists gain credibility by stating the strongest claims they can for their evidence, but they also need to insure against overstatement" (p. 257). Such insurance can often be achieved by the use of slight hedging (CL = L2). Greater speculation (CL = L1) is less common, as credibility is reduced in this case.
As part of the original GENIA event annotation, Uncertainty was annotated as an event attribute. The default value is Certain and the other two values are Probable and Doubtful. In the GENIA event annotation guidelines, these attributes do not have clear definitions. However, Probable can be defined loosely as something that is hypothesized by the author, while Doubtful is something that is investigated. Probable has more in common with our CL dimension, while Doubtful is more closely linked to the Investigation category of our KT dimension. Therefore, the GENIA Uncertainty attribute does not distinguish between degrees of uncertainty in the same way as our meta-knowledge scheme. Comparison of results confirms this - of the events annotated with Uncertainty = Probable, there are similar proportions of events that have been annotated with CL = L1 (33.6% of Probable events) and CL = L2 (42.2% of Probable events). It is also worth noting that the total percentage of events identified with some degree of uncertainty using our scheme (CL = L1 or CL = L2) is 8.1%. This is almost double the percentage of events annotated as Probable (4.3% of all events), showing that our more detailed guidelines for CL annotation have helped to identify a far greater number of events expressing some degree of speculation.
Discrepancies can also be found regarding the Doubtful category. Events annotated with this category constitute 3.7% of all events in the corpus. Whilst, as expected, the vast majority of these correspond to events that have been annotated as KT = Investigation in our meta-knowledge scheme (1022 out of a total of 1349 Doubtful events, i.e. 75.8%), some Doubtful events also correspond to events with other KT values (most notably Analysis with CL values of L3, L2 or L1, which can also occur within the Probable category). This provides evidence that the boundary between Doubtful and Probable may not always have been clear to GENIA corpus annotators. In addition, our scheme identified 1948 events (5.3% of all events) with KT = Investigation, meaning that there were some 900 investigative events, i.e., 2.4% of all events, which were not identified during the original GENIA event annotation.
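The proportions quoted in this comparison can be re-derived from the raw counts given in the text. The short Python sketch below does so; the variable names are illustrative, and the only inputs are the figures stated above (36,858 events in total, 1349 Doubtful events, 1022 of which are KT = Investigation, and 1948 Investigation events overall).

```python
TOTAL_EVENTS = 36858                # events in the GENIA event corpus
DOUBTFUL = 1349                     # events with Uncertainty = Doubtful
DOUBTFUL_AND_INVESTIGATION = 1022   # Doubtful events with KT = Investigation
INVESTIGATION = 1948                # all events with KT = Investigation


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


print(pct(DOUBTFUL, TOTAL_EVENTS))                # 3.7
print(pct(DOUBTFUL_AND_INVESTIGATION, DOUBTFUL))  # 75.8
print(pct(INVESTIGATION, TOTAL_EVENTS))           # 5.3

# Investigation events that were not marked as Doubtful in the original annotation
missed = INVESTIGATION - DOUBTFUL_AND_INVESTIGATION
print(missed, pct(missed, TOTAL_EVENTS))          # 926 2.5
```

The exact difference is 926 events, or about 2.5% of the corpus; the figure of 2.4% quoted above corresponds to the rounded count of 900 events.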
Table: Most common CL clue expressions
All of the other words in the L2 list express slight speculation or hedging, mostly corresponding to different forms of the verbs suggest and indicate. In Table 4, it was seen that these verbs also rank amongst the most common Analysis clues, showing that it is common for analysis and slight speculation to be simultaneously expressed using a single clue word. For the indication of L1 certainty, modal auxiliary verbs are particularly common, with may accounting for 67.4% of all annotated L1 clues, and might and could constituting a significant proportion of the remainder. The L1 category has a very small number of distinct clue expressions (23), compared to 121 distinct expressions for L2.
Table: Distribution of annotated categories for Polarity (% of total events)
(5) AP-1 but not NF-IL-6 DNA binding activity was also detected in C5a-stimulated PBMC; however, its delayed expression (maximal at 4 hours) suggested a less important ROLE in the rapid production of IL-8.
The event encodes the fact that the expression of AP-1 only has a minor role in the rapid production of IL-8. As the GENIA annotation had no special means to encode that an event has low intensity or impact, the original annotator chose to annotate it as a negative event, even though this is not strictly correct. Our annotation scheme, with its Manner dimension, allows the subtle difference between an event having a low impact and an event not happening at all to be encoded. Our scheme annotates low impact events such as the above with Polarity = Positive but Manner = Low.
Table: Distribution of negated events among KT categories (negated events as % within category)
The low occurrence of negative instances amongst events with KT = Investigation is quite intuitive - it is the norm to investigate why/whether something does take place, although in some instances there can be investigation into why something does not take place, for example in response to a previous negative finding, as in (6).
(6) To determine why alveolar macrophages do not EXPRESS AP-1 DNA binding activity, ...
Also, for methods, it is highly unusual to say that a particular method was not applied, unless in contrast to the case where the method was applied, as is the case in (7).
(7) For comparison, we recruited a control group consisting of 32 healthy males and females with similar age distribution and without a history of EXPOSURE to MTBE or benzene.
Table: Most common clue expressions for Polarity = Negative
The word not constitutes around half of all clue expressions for negation (50.4%), and is over 5 times more common than the next most common clue expression, no. Although most of the words in the list have an inherently negative meaning, the third most common word, i.e., independent (together with its associated adverb independently), does not. Closer examination shows that this negative meaning is quite context-dependent, in that it only denotes a negative meaning for events of type Correlation and Regulation (together with its sub-type Positive_Regulation). For Regulation events, a typical example is shown in (8).
(8) An alteration in the E2F-4 profile was INDEPENDENT of viral gene expression.
In (8), the word independent acts as both the event trigger and the negative clue expression. The event denotes the fact that the alteration in the E2F-4 profile was not dependent on viral gene expression occurring. In other words, it is not the case that viral gene expression regulates the alteration in the E2F-4 profile. Events of type Correlation are annotated when there is some kind of association that holds between entities and/or other events. Sentence (9) shows an example of both a positive Correlation event and a negated Correlation event.
(9) LPS-INDUCED NF-kappaB activation is protein tyrosine kinase DEPENDENT and protein kinase C INDEPENDENT
There are three relevant events in (9). Firstly, the word induced is the trigger for a Positive_Regulation event in which NF-kappaB activation is regulated by LPS. The word dependent is the trigger for the second event, which is a positive Correlation event. It shows that there is an association between the Positive_Regulation event and the protein tyrosine kinase. In contrast, the third event, triggered by independent, shows that no such association holds between the Positive_Regulation event and the protein kinase C. Hence, this is a negated Correlation event.
Some less commonly occurring negative clue expressions also only have negative meanings in very specific contexts. Consider (10).
(10) These cells are DEFICIENT in FasL expression and apoptosis induced upon TCR triggering, although their cytokine (IL-2 and IFN-gamma) production is NORMAL.
In (10), the word deficient indicates a Negative_Regulation event. However, the word normal indicates that no such negative regulation occurs in the case of IL-2 and IFN-gamma production. In the few contexts in which normal occurs as a negative polarity marker, it is used in a similar way, i.e., to contrast with a previously stated Negative_Regulation event. The word silent appears to be usable in similar contexts to negate events of type Positive_Regulation, in contrast to a positive occurrence of such an event.
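The context-dependence of negative clue expressions discussed above suggests that a recogniser cannot rely on a flat word list. The Python sketch below illustrates one possible lookup in which some clues are negative only for particular event types. The function and table names are hypothetical, the word lists contain only the examples mentioned in the text, and the lookup deliberately ignores the further requirement that normal and silent contrast with a previously stated event.

```python
# Clues listed in the text as negative regardless of event type
ALWAYS_NEGATIVE = {"no", "not", "fail", "lack"}

# Clues described in the text as negative only for certain event types
CONTEXT_NEGATIVE = {
    "independent": {"Correlation", "Regulation", "Positive_Regulation"},
    "independently": {"Correlation", "Regulation", "Positive_Regulation"},
    "normal": {"Negative_Regulation"},
    "silent": {"Positive_Regulation"},
}


def is_negative_clue(word: str, event_type: str) -> bool:
    """Return True if `word` can act as a negation clue for `event_type`."""
    w = word.lower()
    if w in ALWAYS_NEGATIVE:
        return True
    return event_type in CONTEXT_NEGATIVE.get(w, set())


print(is_negative_clue("INDEPENDENT", "Correlation"))      # True
print(is_negative_clue("independent", "Gene_Expression"))  # False
print(is_negative_clue("normal", "Negative_Regulation"))   # True
```

A lookup of this kind would only propose candidate clues; deciding whether a candidate really negates a given event still depends on the sentence context, as examples (8) to (10) show.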
Table: Distribution of annotated categories for Manner (% of total events)
Table: Distribution of events with explicit Manner annotated among KT categories (events with High or Low Manner annotated, as % within category)
Table: Most common Manner clue expressions (columns: High Manner Clue, Low Manner Clue; surviving entries: little or no, only a partial, to a lesser extent)
In the High manner clue word list, a notable item is overexpression. Unlike the other clues in the list, which are independent of event type, this word is specific to events of type Gene_Expression, as it combines the meaning of the event type with the expression of High manner. Comparable examples appear to be very rare.
Some of the annotated clues for both High and Low manner contain numerical values, meaning that a pattern matching approach may be required when trying to recognise them in unseen texts. For example, the expression n-fold is often used to denote High Manner (often preceding the word increase or decrease), where n may be any numeric value. In other cases, by n% may follow one of these words. To indicate Low manner, the expressions n-fold less or n-fold lower are sometimes used.
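The pattern matching approach mentioned above could be sketched with regular expressions, as in the following Python fragment. The patterns and the function name are illustrative assumptions rather than part of the annotation scheme. In particular, treating "by n%" after increase or decrease as a High manner clue is one reading of the description above, and the patterns handle only digits, not spelled-out numbers.

```python
import re
from typing import Optional, Tuple

NUM = r"\d+(?:\.\d+)?"  # integer or decimal value, e.g. 5 or 2.5

# "n-fold less" / "n-fold lower" are described as Low manner clues
LOW_FOLD = re.compile(rf"\b{NUM}-fold\s+(?:less|lower)\b", re.IGNORECASE)
# A bare "n-fold" is described as a High manner clue
HIGH_FOLD = re.compile(rf"\b{NUM}-fold\b", re.IGNORECASE)
# "by n%" following increase/decrease (assumed here to signal High manner)
BY_PERCENT = re.compile(
    rf"\b(?:increase|decrease)[ds]?\s+by\s+{NUM}\s?%", re.IGNORECASE
)


def numeric_manner_clue(text: str) -> Optional[Tuple[str, str]]:
    """Return (manner, matched clue) for the first numeric clue found, else None."""
    # The Low pattern is checked first because it contains the High pattern
    match = LOW_FOLD.search(text)
    if match:
        return ("Low", match.group(0))
    match = HIGH_FOLD.search(text)
    if match:
        return ("High", match.group(0))
    match = BY_PERCENT.search(text)
    if match:
        return ("High", match.group(0))
    return None


print(numeric_manner_clue("a 5-fold increase in expression"))  # ('High', '5-fold')
print(numeric_manner_clue("binding was 3-fold lower"))         # ('Low', '3-fold lower')
```

Checking the more specific Low pattern before the general n-fold pattern prevents "3-fold lower" from being misread as a High manner clue.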
Table: Distribution of annotated categories for Source (% of total events)
Table: Most common clue expressions for Source = Other (surviving entry: our previous studies)
Table: Distribution of categories for the two hyper-dimensions (% of total events)
Table: Inter-annotator agreement rates
High levels of agreement were achieved in each annotation dimension, with generally only very small differences between the agreement rates for different dimensions. This provides strong evidence that consistent annotation of meta-knowledge is a task that can be reliably undertaken by following the annotation guidelines, regardless of background (biology or linguistics).
The Polarity dimension has the highest rates of agreement. This could be because it is one of the two dimensions that have only two possible values (together with Source, which has the second highest agreement rate). The two dimensions with three possible values (i.e., CL and Manner) have virtually identical rates of agreement, while KT has the lowest agreement rate (albeit only by a small amount). This is, however, to be expected - KT has 6 possible values and, in many cases, contextual information other than clue expressions is required to determine the correct value. Therefore, it can be a more demanding task than the assignment of other dimensions.
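For readers unfamiliar with the agreement measure, the following Python sketch shows how a Kappa score for one dimension could be computed from two annotators' labels for the same events. It assumes that the statistic in question is Cohen's kappa for two annotators; the function name and the toy labels are illustrative and are not taken from the corpus.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy Polarity labels for ten events (invented for illustration)
annotator_1 = ["Positive"] * 8 + ["Negative"] * 2
annotator_2 = ["Positive"] * 7 + ["Negative"] * 3
print(round(cohen_kappa(annotator_1, annotator_2), 2))  # 0.74
```

In this toy case the annotators agree on 9 of 10 events, but because most events are Positive, a large share of that agreement is expected by chance, which is why Kappa is lower than the raw agreement of 0.9. The same effect is relevant to dimensions with a dominant default category.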
We have studied the cases where there is a discrepancy between the two annotators. Whilst a number of these discrepancies are simple annotation errors, in which a particular dimension value was mistakenly selected during the annotation task, other discrepancies occur when a dimension value is identified by means of a clue expression that is not present in the list provided in the guidelines. In some cases, one of the annotators would notice the new clue, and use it to assign an appropriate category, but the other annotator would miss it. In order to minimise the occurrence of such cases, annotators were asked to flag new clue expressions, so that the lists of clue expressions in the guidelines could be updated to be as comprehensive as possible, and so ease the task of accurate annotation.
One of the largest areas of disagreement was between the KT categories of Observation and Fact. For a number of reasons, distinguishing between these types can often be quite tricky, and sometimes there is no clear evidence to suggest which of the categories should be chosen. Events of both types can be indicated using the present tense, and explicit clue expressions are more frequently absent than present. Often, the extended context of the event (possibly including other sentences) has to be considered before a decision can be made. In some cases, it appears that domain knowledge is required to make the correct decision.
In the remainder of this section, we look at some particular cases of annotation discrepancies, some of which appear to be influenced by the expertise of the annotator.
Long sentences seemed to prove more problematic for the biologist annotator, and meta-knowledge information was sometimes missed when there was a large gap between the clue expression and the event trigger. Consider sentence (11), in which the word indicated should cause both the event with the trigger prevented and the event with the trigger activated to be annotated with KT = Analysis.
(11) Accordingly, electrophoretic mobility shift assays (EMSAs) indicated that pyrrolidine DTC (PDTC) PREVENTED NF-kappaB, and NFAT DNA-binding activity in T cells stimulated with either phorbol myristate acetate plus ionophore or antibodies against the CD3-T-cell receptor complex and simultaneously ACTIVATED the binding of AP-1.
Whilst it is straightforward to understand that indicated affects the interpretation of the event triggered by prevented, it is less easy to spot the fact that it also applies to the event triggered by activated, due to the long description of the T cells, which precedes this trigger.
It appears that having some linguistic expertise is an advantage in order to cope with such cases. The biologist would often fail to consider a clue word as potentially affecting the interpretation of an event unless it occurred in close proximity to the event itself. In contrast, the linguist would normally detect long distance dependencies between clue expressions and triggers without difficulty. This is to be expected, given that the linguist is familiar with grammatical rules. However, given the generally high levels of agreement, such complex cases appear to be reasonably rare.
Other annotation discrepancies reveal further differences in the approaches of the annotators. Whilst some grammatical knowledge appears to be advantageous, using a purely grammatical approach to the recognition of meta-knowledge is not always correct. The semantic viewpoint appears to be the one most naturally taken by the biologist annotator, as is evident in sentences such as (12):
(12) This study demonstrates that GC act as a primary INDUCER of sialoadhesin expression on rat macrophages, and that the response can be ENHANCED by IFN-beta, T cell-derived cytokines, or LPS.
In (12), we focus on the events triggered by inducer and enhanced, which are both of type Positive_Regulation. The word demonstrates is a clue expression for KT = Analysis. Taking a purely grammatical approach, the word demonstrates affects the interpretation of the verbs act and enhanced. Of these, only enhanced is an event trigger. Accordingly, both annotators marked the event triggered by enhanced as KT = Analysis. However, the biologist also annotated the inducer event with KT = Analysis, also marking demonstrates as the clue expression. Considering semantics, this is correct - the actual meaning of the first part of the sentence is that This study demonstrates that GC induces sialoadhesin expression on rat macrophages.
Example (13) illustrates the need to carefully consider the meaning of words and phrases in the context of the event, as well as simply looking for relevant keywords.
(13) Changes of any cysteine residue of the hRAR alpha-LBD had no significant INFLUENCE on the binding of all-trans RA or 9-cis RA.
In (13), one of the annotators had annotated the Regulation event with the trigger influence with Polarity = Negative (clue word: no) and Manner = High (clue word: significant). However, this is incorrect - it is the word significant that is negated, rather than the event itself. As significant would normally be a marker of High manner, negating it means that it should be treated as a Low manner marker. Accordingly, the other annotator correctly identified no significant as the clue phrase for Manner = Low, with the polarity of the event correctly remaining positive.
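The distinction drawn in (13) between negating the event and negating a manner clue can be illustrated with a small Python sketch. This is a simplified heuristic under stated assumptions, not the annotators' procedure: the function name is hypothetical, the word lists contain only clues mentioned in this paper, and a negator is taken to scope over a manner clue only when it immediately precedes it.

```python
from typing import List, Optional, Tuple

HIGH_MANNER = {"significant", "high", "strongly", "rapidly", "potent"}
NEGATORS = {"no", "not"}


def interpret(tokens: List[str]) -> Tuple[str, str, Optional[str]]:
    """Return (polarity, manner, clue) for the tokens preceding an event trigger."""
    lowered = [t.lower() for t in tokens]
    for i, tok in enumerate(lowered):
        if tok in HIGH_MANNER:
            if i > 0 and lowered[i - 1] in NEGATORS:
                # The negator modifies the manner clue, not the event itself
                return ("Positive", "Low", f"{lowered[i - 1]} {tok}")
            return ("Positive", "High", tok)
    for tok in lowered:
        if tok in NEGATORS:
            # No manner clue present, so the negator applies to the event
            return ("Negative", "Neutral", tok)
    return ("Positive", "Neutral", None)


print(interpret("had no significant".split()))
# ('Positive', 'Low', 'no significant')
```

Applied to the words preceding the trigger influence in (13), the heuristic yields Polarity = Positive and Manner = Low with the clue phrase no significant, matching the annotation described above as correct.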
The interplay between events in the GENIA event corpus can be complex, especially as events can sometimes occur that have no trigger phrase. The links between different events in a sentence often have to be understood before it can be determined to which of these events a particular piece of meta-knowledge should apply. In such cases, a detailed understanding of the domain could be considered to be an advantage. The following sentence fragment (14) illustrates such a case, in which absence constitutes a clue expression for Polarity = Negative for one of the events.
(14) In the absence of TCR- MEDIATED activation, Vpr INDUCES apoptosis...
Three events were identified as part of the original GENIA event annotation:
1) A Positive_Regulation event with the trigger mediated (i.e., positive regulation of activation by TCR). At first glance, it is to this event that the negative polarity appears to apply.
2) A second Positive_Regulation event, with the trigger induces (i.e. positive regulation of apoptosis by Vpr).
3) A Correlation event with no trigger, providing a link between events 1) and 2). In fact, it is this third event to which the negative polarity applies. The event conveys the fact that Vpr induces apoptosis even when there is no TCR-mediated activation, indicating that there is no correlation between events 1) and 2).
The above examples demonstrate that accurate meta-knowledge annotation can be a complex task, which, according to the event in question, may have to take into account the structure and semantics of the sentence in which the event is contained, as well as the semantics of the event itself, and possibly the interplay between events.
Our inter-annotator agreement results suggest, however, that the annotation task can be accurately undertaken, given appropriate guidelines and training. Furthermore, the results provide evidence that high quality meta-knowledge annotations can be produced regardless of the expertise of the annotator. Although we have highlighted certain cases where either domain knowledge or linguistic expertise appears to be a distinct advantage, neither seems to be a prerequisite. This is in agreement with previous work, in which biologist annotators were trained to carry out linguistically-motivated annotation of biomedical events, with good levels of agreement.
We have designed an annotation scheme to enrich corpora of biomedical events with information about their characterisation or interpretation (meta-knowledge), based on their textual context. The scheme is unique within the field in that it allows detailed meta-knowledge to be annotated at the level of the event, through the use of multiple annotation dimensions. These different dimensions, and the interplay between them, aim to facilitate the training of advanced event extraction systems that can detect various differences between events, both subtle and substantial, which existing systems would fail to recognise.
The scheme is designed to be portable, in order to allow integration with the various different schemes for event annotation that are currently in existence. As a first major effort, our scheme has been applied by 2 external annotators to the largest currently available corpus of biomedical events, i.e., the GENIA event corpus, which consists of 1000 MEDLINE abstracts, annotated with a total of 36,858 events. The annotators achieved inter-annotator agreement rates of between 0.84 and 0.93 Kappa (according to annotation dimension), demonstrating that high levels of annotation quality and consistency can be achieved by following the annotation guidelines. Furthermore, it appears that, subject to the provision of these guidelines and a suitable training programme, meta-knowledge annotation can be performed to a high standard by annotators without specific areas of expertise, as long as they have a good command of the English language.
An examination of the characteristics of the annotated corpus has revealed that, although all categories within all dimensions have been annotated to a certain extent, their distribution is somewhat skewed, with a heavy emphasis on events that describe observations, relatively few speculative events, and a very low percentage of events that can be attributed to work outside the current study. These results correlate with the general characteristics of scientific abstracts. Although we have so far only applied our scheme to abstracts, it is intended also to be suitable for application to full papers, and we hypothesise that some of the categories of our scheme may be more frequently annotated in this context. For example, the background section of a full paper consists mainly of descriptions of work carried out in previous studies, meaning that a greater proportion of events with Source = Other should be observable. The GENIA event annotation scheme is currently being applied to full papers, and it is our intention to apply our meta-knowledge scheme to these papers, both to ensure that our meta-knowledge scheme is scalable to longer texts, and also to test our hypotheses regarding the different distributions of the annotation dimensions in this context.
As further directions of future work, and inspired by the favourable results of previous work in training a system to recognise several annotation dimensions, we plan to work on the development of a machine learning system that can predict meta-knowledge information for events, trained on our annotated corpus. It is hoped that the comprehensive annotation of clue expressions for the different annotation dimensions, together with the observations we have made about other relevant features, e.g., tense or prototypical positions of particular event types, will constitute useful features that can be used by the system. In addition, we plan to apply our meta-knowledge scheme to event corpora that use different event annotation schemes, such as GREC or BioInfer, as well as to protein-protein interaction corpora, such as AIMed. Finally, we plan to investigate to what extent our scheme is portable to other scientific domains.
The work described in this paper has been funded by the Biotechnology and Biological Sciences Research Council through grant numbers BBS/B/13640, and BB/F006039/1 (ONDEX) and through the MetaNet4U project (ICT PSP Programme, Grant Agreement: No 270893). The research team at the University of Manchester is hosted by the JISC funded National Centre for Text Mining (NaCTeM). We would like to thank our annotators, Dr. Maria Aretoulaki and Dr. Syed Amir Iqbal, for their hard work and dedication to the task.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.