Biological event composition
© Kilicoglu and Bergler; licensee BioMed Central Ltd. 2012
Published: 26 June 2012
In recent years, biological event extraction has emerged as a key natural language processing task, aiming to address the information overload problem in accessing the molecular biology literature. The BioNLP shared task competitions have contributed to this recent interest considerably. The first competition (BioNLP'09) focused on extracting biological events from Medline abstracts from a narrow domain, while the theme of the latest competition (BioNLP-ST'11) was generalization and a wider range of text types, event types, and subject domains were considered. We view event extraction as a building block in larger discourse interpretation and propose a two-phase, linguistically-grounded, rule-based methodology. In the first phase, a general, underspecified semantic interpretation is composed from syntactic dependency relations in a bottom-up manner. The notion of embedding underpins this phase and it is informed by a trigger dictionary and argument identification rules. Coreference resolution is also performed at this step, allowing extraction of inter-sentential relations. The second phase is concerned with constraining the resulting semantic interpretation by shared task specifications. We evaluated our general methodology on core biological event extraction and speculation/negation tasks in three main tracks of BioNLP-ST'11 (GENIA, EPI, and ID).
We achieved competitive results in the GENIA and ID tracks, while our results in the EPI track leave room for improvement. One notable feature of our system is that its performance across abstracts and article bodies is stable. Coreference resolution results in a minor improvement in system performance. Due to our interest in discourse-level elements, such as speculation/negation and coreference, we provide a more detailed analysis of our system performance in these subtasks.
The results demonstrate the viability of a robust, linguistically-oriented methodology, which clearly distinguishes general semantic interpretation from shared task specific aspects, for biological event extraction. Our error analysis pinpoints some shortcomings, which we plan to address in future work within our incremental system development methodology.
The overwhelming amount of new knowledge in molecular biology and its exponential growth necessitate sophisticated approaches to managing molecular biology literature. By providing efficient access to the relevant literature, such approaches are expected to assist researchers in generating new hypotheses and, ultimately, new knowledge. Natural language processing (NLP) techniques increasingly support advanced knowledge management and discovery systems in the biomedical domain [1, 2]. In biomedical NLP, biological event extraction is one task that has been attracting great interest recently, largely due to the availability of the GENIA event corpus and the resulting shared task competition (BioNLP'09 Shared Task on Event Extraction). In addition to systems participating in the shared task competition, several studies based on the shared task corpus have been reported [5–7], the top shared task system has been applied to PubMed scale, and biological corpora targeted for event extraction in other biological subdomains have been constructed. Furthermore, UCompare, a meta-service providing access to some of the shared task systems, has been made available.
One criticism of the corpus annotation/competition paradigm in biomedical NLP has been that it is concerned with narrow domains and specific representations, and may not generalize well. The GENIA event corpus, for instance, contains only Medline abstracts on transcription factors in human blood cells. Whether models trained on this corpus would perform well on full-text articles or on text focusing on other aspects of biomedicine (e.g., treatment or etiology of disease) remains largely unclear. Since annotated corpora are not available for every conceivable subdomain of biomedicine, it is desirable for automatic event extraction systems to be generally applicable to different types of text and domains without requiring much training data or customization.
[Table: An overview of BioNLP-ST'11 tracks. Recoverable column header: Number of core events.]
BioNLP-ST'11 provides a good platform to validate some aspects of our general research, in which we are working towards a linguistically-grounded, bottom-up semantic interpretation scheme. In particular, we focus on lower level discourse phenomena, such as modality, negation, and causation and investigate how they interact with each other, as well as their effect on basic propositional semantic content (who did what to whom?) and higher discourse/pragmatics structure. We subsume these phenomena of study under the notion of embedding. In our model, we distinguish two layers of predications: atomic and embedding. An atomic predication corresponds to the elementary unit and lowest level of relational meaning: in other words, a semantic relation whose arguments correspond to ontologically simple entities. Atomic predications form the basis for embedding predications, that is, predications taking as arguments other predications. We hypothesize that the semantics of the embedding layer is largely domain-independent and that treating this layer in a unified manner can benefit a number of natural language processing tasks, including event extraction and speculation/negation detection.
We participated in three BioNLP-ST'11 tracks: GENIA, EPI, and ID. In the spirit of the competition, we aimed to demonstrate a generalizable methodology that separated domain-independent linguistic aspects from task-specific concerns and that required little, if any, customization or training for individual tracks. Towards this goal, we use a two-phase approach. The first phase (Composition) is an implementation of the bottom-up semantic interpretation scheme mentioned above. It takes the concerns of general English into account and is intended to be fully general. It is syntax-driven, presupposes simple entities, a trigger dictionary and syntactic dependency relations, and creates a partial semantic representation of the text. Addressing coreference resolution to some extent at this phase, we also aim to move to the inter-sentential level. Our overall structural approach in the composition phase is in the tradition of graph-based semantic representations  and its output bears similarities to the deep-syntactic level of representation proposed in the Meaning-Text Theory . In the second phase (Mapping), we rely on shared task event specifications to map relevant parts of this semantic representation to event annotations. This phase is more domain-specific, although the kind of domain-specific knowledge it requires is largely limited to event specifications and event trigger expressions. In addition to extracting core biological events, our system also addresses speculation and negation detection within the same framework. We achieved competitive results in the shared task competition, demonstrating the feasibility of a general, rule-based methodology while avoiding low recall, often associated with rule-based systems, to a large extent. 
In this article, we extend the work reported in our previous shared task article by integrating coreference resolution into the system, providing a more extensive and formal description of the framework, and extending the error analysis.
Definition 2. A semantic object T is ontologically simple if it takes no arguments. A predication takes arguments and is an ontologically complex object.
Definition 5. A surface element SU is a single token or a contiguous multi-token unit, which may be associated with an abstract semantic object SEM.
A surface item that is associated with a semantic object is said to be semantically bound (⟦SU⟧ = SEM).
Otherwise, it is said to be semantically free (⟦SU⟧ = ∅).
Atomic predications in the same sentence are indicated with the identifiers e 1, e 2, and e 3 in Figure 2. The predicates that trigger the atomic predications in the sentence are shown in bold. At the syntactic level, atomic predications prototypically correspond to verbal and nominalized predicates and their syntactic arguments. We denote atomic predications as m: SEM (id,t 1..n ), where m corresponds to the predicate mention and t 1..n refer to ontologically simple argument(s) of the atomic predication. SEM is the semantic type of the predicate, and by extension, of the predication. Semantic types of atomic predications are event types from the shared task specifications, where applicable.
Underlined expressions in the sentence (leads, presumed, important, and subsequent) trigger embedding predications (em 4..7) and indicate higher level information relating biological processes indicated by atomic predications: leads, important and subsequent are used to make causal and temporal connections between these processes and presumed to indicate an assumption, though seemingly unproven, towards one of these connections. Syntactically, in addition to verbal and nominalized predicates and their syntactic arguments, embedding predications are also realized via subordination, complementation, and various syntactic modifications. For instance, in Figure 2, em 6 is triggered by adjectival modification and em 7 by infinitival complementation.
In the shared task setting, embedding predications correspond to complex regulatory events (e.g., POSITIVE_REGULATION, CATALYSIS) as well as event modifications (NEGATION and SPECULATION), whereas atomic predications largely correspond to simple event types (e.g., GENE_EXPRESSION, PHOSPHORYLATION).
In this paper, MV and POL values will be omitted from representation of atomic predications when they are not relevant to the discussion. We describe the embedding categorization scheme, as well as the modality value and polarity elements in more detail in the next section.
Note that the level of embedding in a sentence can be arbitrarily deep. For example, em 7 takes as argument another embedding predication, em 5, which, in turn, takes atomic predications e 2 and e 3 as arguments.
Definition 7. A predication Pr2 is within the scope of a predication Pr1 (written as Pr1 >Pr2), if one of the following conditions is met:
Pr1 embeds Pr2.
Incorporating entity annotations provided by the shared task organizers (ontologically simple, semantically bound entities of PROTEIN type), the first phase of our system (composition) is essentially concerned with compositionally building predications, illustrated in the first column of Figure 2. The second phase, mapping, deals with converting and filtering these predications to obtain the shared task-specific annotations in the second column of Figure 2.
Definition 8. A modal predicate modifies the status of the embedded predication with respect to a modal scale (e.g., certainty, possibility, necessity).
Four generally accepted types of modal predicates are given below (cf. ), and they are illustrated with sentences from the shared task corpus. The embedded predicate is in bold, and the modal predicate is underlined.
EPISTEMIC indicates judgement about the status of the embedded predication, and affects its factuality. Subtypes include ASSUMPTIVE and SPECULATIVE.
(1) (a) ... phosphorylation of IκBα, which is presumed to be important for the subsequent degradation.
(b) presume: ASSUMPTIVE(em 1 ,0.7,positive,em 2) ^ important(em 2 ...)
EVIDENTIAL indicates evidence surrounding the predication, indirectly affecting its factuality according to the evidence source and reliability. Subtypes are DEDUCTIVE, DEMONSTRATIVE, and REPORTING.
(2) (a) Our previous results show that recombinant gp41 ... stimulates interleukin-10 (IL-10) production ...
(b) show: DEMONSTRATIVE(em 1 ,1,positive,em 2 ,t 1) ^ our-previous-results(t 1 ) ^ stimulate(em 2 ...)
DYNAMIC indicates ability or willingness of an agent towards an event, corresponding to POTENTIAL and VOLITIVE categories, respectively.
(3) (a) Other unidentified ETS-like factors ... are also capable of binding GM5.
(b) capable: POTENTIAL(em 1 ,1,positive,e 2 ) ^ bind(e 2 ...)
DEONTIC indicates obligation or permission from an external authority for an event, corresponding to OBLIGATIVE and PERMISSIVE categories, respectively.
(4) (a) ... future research in this area should be directed toward the understanding ...
(b) should: OBLIGATIVE(em 1 ,0.7,positive,e 2 ) ^ direct(e 2 ...)
We consider three additional modal types: INTENTIONAL, INTERROGATIVE, and SUCCESS. These types are mentioned in discussions of modality and are sometimes adopted as separate categories; however, there appears to be no firm consensus on their modal status. We chose to include them in our categorization, since corpus analysis provides clear evidence that they affect the status of predications they embed and that they occur frequently.
(5) (a) ... we tried to identify downstream target genes regulated by TAL1 ...
(b) try: INTENTIONAL(em 1 ,1,positive,e 2) ^ identify(e 2 ...)
(6) (a) ... we examined whether ... IL-10 up-regulation is mediated by the ... synergistic activation of cAMP and NF-κB pathways.
(b) examine: INTERROGATIVE(em 1 ,1,positive,em 2) ^ mediate(em 2 ...)
(7) (a) In contrast, gp41 failed to stimulate NF-κB binding activity ... (SUCCESS)
(b) fail: SUCCESS(em 1 ,1,negative,em 2) ^ stimulate(em 2 ...)
In the shared task context, the embedding predications of MODAL semantic type are most relevant to the speculation/negation detection task.
Definition 9. A modal predication, Pr MODAL , associates the predication it embeds, Pr e , with a modality value on a context-dependent scale. The scale (S) is determined by the semantic type of the modal predicate, P MODAL . The modality value (MV S ) is a numerical value between 0 and 1 and corresponds to how strongly Pr e is associated with the scale S, 1 indicating strongest association and 0 negative association.
The scalar modality value is partially modeled after the modality value proposed by Nirenburg and Raskin . In this view, a modality value of zero on the EPISTEMIC scale, for example, corresponds to "The speaker does not believe that P", while a value of 0.6 roughly indicates that "The speaker believes that possibly P". More often, modality values are represented discretely, when a single modality-related phenomenon is investigated (certain, possible, probable etc. on the factuality scale [16, 17, 22]). In our framework, we favor a contextual scale rather than a fixed one since it is more general and flexible.
Definition 10. An attributive predicate links an embedded predication with one of its semantic arguments and specifies its semantic role.
Consider the fragment in Example (8a). The verbal predicate (undergo) takes a nominalized predicate (degradation) as its syntactic object. The other syntactic argument of the verbal predicate, p105, serves as the semantic argument of the embedded predicate (degradation) with the semantic role PATIENT. Example (8b) corresponds to the representation after the composition phase and Example (8c) shows the result of the mapping phase.
(8) (a) ... p105 undergoes degradation ...
(b) p105: PROTEIN (t 1 ) ^ undergo: PATIENT (em 1 ,e 1 ,t 1 ) ^ degradation: PROTEIN_CATABOLISM (e 1 ,- )
(c) p105: PROTEIN (t 1 ) ^ degradation: PROTEIN_CATABOLISM (e 1 ,t 1 )
Many verbs function in this way (e.g., perform corresponding to the AGENT role, experience to the EXPERIENCER role). Derivational forms of these verbs also function in the same way (e.g., p105's undergoing of degradation). With respect to the shared task, we found that the usefulness of the ATTRIBUTIVE type of embedding was largely limited to two verbal predicates, involve and require, and their nominalizations.
Definition 11. A relational predicate semantically links two predications, providing a discourse/coherence function between them.
Discourse/coherence relations, discussed in various discourse models (e.g., Rhetorical Structure Theory , Penn Discourse TreeBank ), are typically indicated by syntactic classes such as subordinating and coordinating conjunctions (e.g., although and and, respectively), or discourse adverbials (e.g., then). However, they may also permeate to the subclausal level, often signalled by "discourse verbs"  (e.g., cause, mediate, lead, correlate), their nominal forms or other abstract nouns, such as role. These subclausal realizations appear frequently in biological research articles. We subcategorize the RELATIONAL type into CAUSAL, TEMPORAL, CORRELATIVE, COMPARATIVE, and SALIENCY types. We exemplify subclausal realizations of these categories in the shared task corpus below (See Figure 2 for the relevant logical forms for the sentence in Example (9a)):
(9) (a) Stimulation of cells leads to a rapid phosphorylation of IκBα, which is presumed to be important for subsequent degradation. (CAUSAL, SALIENCY, and TEMPORAL, respectively)
(b) This increase in p50 homodimers coincides with an increase in p105 mRNA, ... coincide: CORRELATIVE(em 1 ,0.5,positive,em 2 ,em 3) ^ increase(em 2 ...) ^ increase(em 3 ...)
(c) Cotransfection with ... expression vectors produced a 5-fold increase compared with cotransfection with the ... expression vectors individually.
compare: COMPARATIVE(em 1 ,1,positive,em 2 ,em 3) ^ cotransfection(em 2 ...) ^ cotransfection(em 3 ...)
Not all the subtypes of this class were relevant to the shared task: for example, comparative predications are not of interest. However, we found that CAUSAL, CORRELATIVE, and SALIENCY subtypes play a role, particularly in complex regulatory events.
Definition 12. Valence shifting describes the sentiment or polarity shift in a clause engendered by particular words, called valence shifters .
Three types of valence shifters are generally defined: NEGATOR (e.g., not), INTENSIFIER (e.g., strongly), and DIMINISHER (e.g., barely) [27–29]. The type of embedding introduced by such words is crucial in semantic composition, as they behave similarly to MODAL predicates in changing the scalar modality value associated with the embedded predication. In Example (10a), the negative determiner no makes the binding event indicated by the verbal predicate bound non-factual. Example (10b) illustrates a diminishing effect, introduced by the adverb slightly.
(10) (a) ... no NF-κB bound to the main NF-κB-binding site 2 of the IL-10 promoter ...
(b) FOXP3 was only slightly reduced after RUNX1 silencing.
In the shared task setting, this type of embedding plays a role in speculation and negation detection.
Our methodology relies on a trigger dictionary, in which trigger expressions (predicates) are mapped to relevant atomic or embedding predication types. Previously, we relied on training data and simple statistical measures to identify good trigger expressions for biological event types and used a list of triggers that we manually compiled for speculation and negation detection (see [19, 20] for details).
Embedding trigger dictionary entries
Lemma The lemma of the trigger expression.
Part-of-speech The POS tag of the trigger.
Semantic type One or more atomic/embedding predicate types.
Polarity Whether the meaning contribution of the predicate is positive, negative, or neutral. For instance, with respect to the DYNAMIC:POTENTIAL category, the adjectival predicate capable has positive polarity, while the polarity of unable is negative.
Category strength How strongly the trigger is associated with its semantic type. For example, the evidential predicate show is more strongly associated with the EVIDENTIAL:DEMONSTRATIVE category than the predicate suggest.
Negative raising Whether the trigger allows transfer of negation to its complement. For example, think, believe allow negative raising. (I don't think P ≡ I think ¬P).
Polarity, category strength and negative raising features interact with semantic types to associate a context-dependent scalar modality value with predications, as indicated earlier. We denote the value of a feature F of a trigger P as F(P) (e.g., Lemma(P), Sem(P)).
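The entry features listed above can be pictured as a small record type. The following is only a minimal sketch and not the system's actual data structure: the class name TriggerEntry, the field names, the Penn Treebank style POS tags, and the polarity given to may are illustrative assumptions. The polarity of capable and unable and the category strength of 0.7 for may are taken from the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TriggerEntry:
    """One trigger dictionary entry (hypothetical layout)."""
    lemma: str
    pos: str                        # part-of-speech tag of the trigger
    sem_types: Tuple[str, ...]      # one or more atomic/embedding types
    polarity: str = "neutral"       # "positive", "negative" or "neutral"
    strength: float = 1.0           # category strength
    negative_raising: bool = False  # does negation transfer to the complement?


# Toy dictionary keyed by (lemma, POS); the POS tags are assumed.
DICTIONARY = {
    ("capable", "JJ"): TriggerEntry("capable", "JJ", ("DYNAMIC:POTENTIAL",), "positive"),
    ("unable", "JJ"): TriggerEntry("unable", "JJ", ("DYNAMIC:POTENTIAL",), "negative"),
    # The polarity of "may" is an assumption; its strength of 0.7 is from the text.
    ("may", "MD"): TriggerEntry("may", "MD", ("EPISTEMIC:SPECULATIVE",), "positive", 0.7),
}


def sem(entry: TriggerEntry) -> Tuple[str, ...]:
    """Mirror of the Sem(P) notation: the semantic type(s) of a trigger."""
    return entry.sem_types
```

Keeping the entries immutable reflects the fact that the dictionary is consulted, not modified, during composition.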
The semantic types of atomic predicates are simply shared task event types determined from training data using maximum likelihood estimation, as before [19, 20]. Using event types as semantic types of atomic predicates reflects our hypothesis that atomic predications are concerned with domain-specific events. Polarity values of atomic predicates are by default neutral, unless the trigger involves an affix which explicitly has positive or negative polarity (e.g., nonexpression (negative), upregulation (positive)). Category strength is simply set to 1, and negative raising is false by default.
On the other hand, we have been independently extending our manually compiled list of speculation/negation triggers to include other types of embedding predicates and to encode finer grained distinctions in terms of their categorization and trigger behaviors. This portion of the dictionary is composed of: (a) expressions compiled from relevant literature and linguistic classifications, (b) expressions automatically extracted from the shared task corpus as well as the GENIA event corpus , (c) limited extension based on lexical resources, such as WordNet  and UMLS Specialist Lexicon . Some polarity values are derived from a polarity lexicon  and extended by using heuristics involving the predicate. For example, if the most likely event type associated with the predicate is NEGATIVE_REGULATION in the shared task corpus, we assume its polarity to be negative. Others are assigned manually. Similarly, some category strength values are based on our prior work , while others were manually assigned.
The trigger dictionary incorporates ambiguity; however, for the shared task, we limit ourselves to one semantic type per predicate to avoid the issue of disambiguation. For ambiguous triggers extracted from the training data, the semantic type with the maximum likelihood is used. This works well in practice, since the distribution of event types for a trigger word is generally skewed in favor of a single event type . On the other hand, we manually determined the semantic type to use for triggers that we compiled independently of the training data. In this way, we use 466 atomic predicates and 908 embedding ones. All atomic predicates and 152 of the embedding predicates are drawn specifically from the shared task corpus.
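Selecting the semantic type with the maximum likelihood for an ambiguous trigger amounts to picking the event type most frequently annotated for that trigger in the training data. A minimal sketch follows; the function name and the toy counts are made up for illustration and are not drawn from the shared task corpus.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def most_likely_types(observations: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each trigger lemma to its most frequent event type.

    observations: (trigger_lemma, event_type) pairs, one per annotated
    trigger occurrence in the training data.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for lemma, event_type in observations:
        counts[lemma][event_type] += 1
    # most_common(1) returns the type with the highest relative frequency;
    # on ties, the type seen first is kept.
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


# Toy, invented annotations illustrating a skewed distribution.
TOY = (
    [("expression", "GENE_EXPRESSION")] * 9
    + [("expression", "TRANSCRIPTION")] * 1
    + [("phosphorylation", "PHOSPHORYLATION")] * 4
)
TYPES = most_likely_types(TOY)
```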
As mentioned above, the composition phase builds on simple entities, syntactic dependency relations and a trigger dictionary. Using these elements, we first construct a semantic embedding graph representing the content of the document, making semantic dependencies explicit. Entity semantics are provided in the shared task annotations. To obtain syntactic dependency relations, we segment each document into sentences, parse them using the re-ranking parser of Charniak and Johnson  adapted to the biomedical domain  and extract syntactic dependencies from the resulting parse trees using the Stanford dependency parser , which also provides token information, including lemma and positional information. We use the default Stanford dependency representation, collapsed dependencies with propagation of conjunct dependencies. We consult the trigger dictionary to identify predicate mentions in the document. After the semantic embedding graph for a document is constructed, we compose predications by traversing the graph in a bottom-up manner. We present a high level description of the composition phase below.
We convert syntactic dependencies into a directed acyclic semantic embedding graph whose nodes correspond to surface elements of the document and whose labeled arcs correspond to semantic embedding relations between surface elements.
An embedding relation is clearly similar to a syntactic dependency. However, in contrast to a syntactic dependency, direction of an embedding relation reflects the semantic dependency between its elements, rather than a syntactic one, and a semantic dependency can cross sentence boundaries. We distinguish embedding relations from syntactic dependencies by capitalizing their types (labels).
Application of intra-sentential transformation rules
... CD40 ligand interactions play a key role ...
NN(interactions, CD40 ligand)
... specifically binds and phosphorylates IκBα
... possible involvement of HCMV ...
AMOD(possible,involvement) PREP_OF(involvement, HCMV)
... Tat and Sp1 proteins ...
In addition to collapsing several syntactic dependencies into one embedding relation (row 1), a transformation rule may result in splitting one into several embedding relations (row 2) (Coordination Transformation), or in switching the direction of the dependency (row 3) (Dependency Direction Inversion). In addition to capturing semantic dependency behavior explicitly and incorporating semantic information (entity and predicate mentions) into the embedding structure, these transformations also allow us to correct syntactic dependencies that are systematically misidentified, such as those that involve modifier coordination (row 4) (Corrective Transformation). Also note that a transformation is not necessary when the syntactic dependency under consideration is isomorphic to an embedding relation, that is, it reflects the direction of the semantic dependency accurately (prep_of dependency in row 3). We currently use 13 such transformation rules, hand-crafted by analyzing the relevant syntactic constructions and the corresponding syntactic dependency configurations.
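The Dependency Direction Inversion of row 3 can be sketched as follows. This is an illustration of a single rule under stated assumptions, not the hand-crafted rule set itself: the tuple layout (label, head, dependent) and the condition that an amod dependency is inverted when its modifier is a trigger are assumptions. Capitalized labels mark embedding relations, as in the text.

```python
from typing import List, Set, Tuple

Dep = Tuple[str, str, str]  # (label, head, dependent)


def to_embedding(deps: List[Dep], triggers: Set[str]) -> List[Dep]:
    """Convert syntactic dependencies into embedding relations.

    An amod dependency whose modifier is a trigger is inverted, so that
    the modifier embeds the word it modifies. All other dependencies are
    treated as isomorphic and are only relabeled in capitals.
    """
    relations = []
    for label, head, dependent in deps:
        if label == "amod" and dependent in triggers:
            relations.append((label.upper(), dependent, head))  # inverted
        else:
            relations.append((label.upper(), head, dependent))  # isomorphic
    return relations


# "... possible involvement of HCMV ..."
SYNTACTIC = [("amod", "involvement", "possible"), ("prep_of", "involvement", "HCMV")]
EMBEDDING = to_embedding(SYNTACTIC, triggers={"possible", "involvement"})
```

On this input the result matches the embedding relations shown for row 3: AMOD(possible, involvement) and PREP_OF(involvement, HCMV).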
Once these intra-sentential transformations are complete, we finalize the document embedding graph by considering two types of special embedding relations:
PREV A semantic dependency that holds between the topmost nodes associated with adjacent sentences, so as to reflect the sequence of sentences.
COREF A coreference relation that holds between an anaphoric element and its antecedent. The antecedent may be in the same sentence as the anaphor or in a prior sentence.
The polarity value can be positive, negative or neutral. For simplicity, we limit scalar modality values here to the [0, 1] range and compute them for predications that are in the scope of a MODAL or VALENCE_SHIFTER predicate. Atomic predications initially take the polarity value assigned to their trigger in the dictionary and a modality value of 1.0.
Definition 14. An argument identification rule R:Q→A is a typing function. Q is a 4-tuple 〈T, POS, IN, EX〉, where
T is an embedding relation type
POS is a part-of-speech category
IN and EX are sets denoting inclusion and exclusion constraints, respectively
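Definition 14 can be read as a lookup from a 4-tuple to a logical argument type. The sketch below is a hypothetical rendering: it assumes that IN and EX are sets of predicate lemmas, that an empty IN set means no restriction, and that POS is a coarse category such as "VB". The single rule shown is taken from the discussion of bind later in the text (Object argument via PREP_TO).

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class ArgRule:
    relation: str             # T: embedding relation type
    pos: str                  # POS: part-of-speech category of the predicate
    include: FrozenSet[str]   # IN: assumed to hold lemmas; empty = unrestricted
    exclude: FrozenSet[str]   # EX: assumed to hold lemmas
    arg_type: str             # A: logical argument type (Object, Subject, Adjunct)


def identify(rules: Iterable[ArgRule], relation: str, pos: str, lemma: str) -> Optional[str]:
    """Return the argument type licensed for this configuration, if any."""
    for rule in rules:
        if rule.relation != relation or rule.pos != pos:
            continue
        if rule.include and lemma not in rule.include:
            continue
        if lemma in rule.exclude:
            continue
        return rule.arg_type
    return None  # no rule applies, so no argument is licensed


RULES = [ArgRule("PREP_TO", "VB", frozenset({"bind"}), frozenset(), "Object")]
```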
[Table: Argument identification rules. Recoverable column header: Embedding Relation Type.]
Polarity composition is relevant in the context of embedding predications. The polarity value of such a predication depends on:
The polarity value of its trigger (from the dictionary)
The embedded polarity value associated with the embedded predication
[Table: Polarity value composition. Recoverable column header: Embedded polarity value.]
(11) (a) ... Bcl-2 overexpression leads to the prevention of chemotherapy (paclitaxel)-induced expression...
(b) prep_to(leads,prevention)
Polarity(lead) = positive
Polarity(prevention) = negative
(c) lead: CAUSAL(em 1 ,1,positive,... em 2 ...) ^ prevention: CAUSAL(em 2 ,1,negative,... em 3) ...
(d) lead_prevention: CAUSAL(em 1 ,1,negative, ... em 3)
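Example (11) is consistent with a sign-like combination of the two polarity values. The sketch below encodes that reading as an assumption: only the positive/negative case is attested by the example, while the treatment of neutral values and of two negative values is a guess, not a reproduction of the composition table.

```python
def compose_polarity(trigger: str, embedded: str) -> str:
    """Combine a trigger polarity with an embedded polarity (assumed rule).

    Values are "positive", "negative" or "neutral".
    """
    if trigger == "neutral":      # assumption: neutral leaves the other value as is
        return embedded
    if embedded == "neutral":
        return trigger
    # Like signs give positive, unlike signs give negative.
    return "positive" if trigger == embedded else "negative"


# Example (11): lead (positive) embedding prevention (negative).
LEAD_PREVENTION = compose_polarity("positive", "negative")
```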
Modality value composition is only relevant for MODAL and VALENCE_SHIFTER type predicates, because only these predicates have scale-shifting properties. When a predicate of one of these types is encountered during graph traversal, we percolate its modality effect down to update the scalar modality values of the predications in its scope. This procedure, illustrated in Example (12) below, is affected by several factors:
The semantic type of the current predicate (Sem(P))
Its category strength as specified in the dictionary (Strength(P))
The embedded scalar modality value of the embedded predication (MV S (Pr e ))
Let us consider the underlined fragments in Example (12a) to characterize modality value composition. The embedding relations between the underlined fragments are given in Example (12b) and syntactic embeddings in (12c).
(12) (a) Thus , ... IL-10 upregulation in monocytes may not involve NF-κB, MAPK, or PI 3-kinase activation, ...
(c) Thus > s may > s not > s involve > s upregulation
Valence shifting and modality are encoded with not and may nodes in the graph, respectively. They affect the scalar modality value directly: not changes the modality value of the predication bound to its child, involve, from 1 (default modality value for the predicate involve) to 0, since it is a negative valence shifter (Definition 15).
The may node, parent of the not node, shifts the modality value of Pr e from 0 to 0.3 (see Definition 16). Note that may is a predicate of SPECULATIVE type and has a category strength of 0.7. The increase illustrates the fact that while a modal predicate like may normally lowers the modality value of an embedded predication in a positive context, its effect is to increase this value when the embedded predication is initially in a negative context (not involve).
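The worked example can be traced with a small sketch. The two update rules below are assumptions chosen only to reproduce the values stated in the text (1 to 0 under not, then 0 to 0.3 under may with category strength 0.7); they are not the formulations of Definitions 15 and 16, and the function names are hypothetical.

```python
from typing import List, Tuple


def apply_negator(mv: float) -> float:
    """Assumed rule for a negative valence shifter: mirror the value on [0, 1]."""
    return round(1.0 - mv, 6)


def apply_speculative(mv: float, strength: float) -> float:
    """Assumed rule for a SPECULATIVE predicate with the given category strength.

    In a positive context the value is lowered to at most `strength`;
    in a negative context it is raised to at least 1 - `strength`.
    """
    if mv >= 0.5:
        return round(min(mv, strength), 6)
    return round(max(mv, 1.0 - strength), 6)


def percolate(chain: List[Tuple[str, float]], start: float = 1.0) -> float:
    """Apply predicates from the innermost to the outermost one.

    chain: (kind, category_strength) pairs; `start` is the default
    modality value of the embedded predication.
    """
    mv = start
    for kind, strength in chain:
        if kind == "NEGATOR":
            mv = apply_negator(mv)
        elif kind == "SPECULATIVE":
            mv = apply_speculative(mv, strength)
    return mv


# "may not involve": not is applied first, then may (strength 0.7).
MAY_NOT_INVOLVE = percolate([("NEGATOR", 1.0), ("SPECULATIVE", 0.7)])
```

Under these assumed rules, may applied directly to involve would yield 0.7, which matches the statement that a modal predicate normally lowers the value in a positive context.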
Argument propagation is concerned with determining whether a descendant of the current node can serve as its argument, when the intermediate nodes between them are semantically free.
Consider the sentence in Example (13a). The entities associated with the fragment are underlined, the embedding relations are given in (13b), and the result of the composition in (13c).
(13) (a) ... no NF-κB bound (C) to the main NF-κB-binding site (B) 2 of the IL-10 (A) promoter after addition of gp41.
(c) bind: BINDING (e 1 ,t 1 ) ^ IL-10: PROTEIN (t 1)
When traversing the embedding graph, checking the daughter nodes of the node bound (corresponding to C in Definition 17) for arguments invokes an argument identification rule, which stipulates that bind can link to an argument of Object type via an embedding relation of PREP_TO type, which in this case is site (B), a nonentity. At this point, argument propagation makes the nodes in scope of the daughter node accessible, which results in finding the node IL-10 (A), corresponding to a PROTEIN term. Thus, IL-10 is allowed as an Object argument of bound. On the other hand, another semantically bound node, gp41, cannot be an argument of bound, since the type of the relevant embedding relation is PREP_AFTER, which does not license an argument identification rule.
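Argument propagation, as described for Example (13), can be sketched as a search through semantically free nodes. The graph below is a hypothetical reconstruction for illustration only: the edge labels other than PREP_TO and PREP_AFTER are assumed, the NF-κB node is omitted, and the function names are not from the system.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, str]]]  # node -> [(relation, child)]


def nearest_bound(graph: Graph, bound: Set[str], node: str) -> List[str]:
    """Return the closest semantically bound nodes at or below `node`."""
    if node in bound:
        return [node]
    found: List[str] = []
    for _, child in graph.get(node, []):   # descend through free nodes only
        found.extend(nearest_bound(graph, bound, child))
    return found


def find_arguments(graph: Graph, bound: Set[str], node: str,
                   licensed: Dict[str, str]) -> List[Tuple[str, str]]:
    """Collect (argument type, node) pairs for the predicate at `node`.

    licensed: relation type -> argument type, i.e. the relations for
    which an argument identification rule exists for this predicate.
    """
    args = []
    for relation, child in graph.get(node, []):
        arg_type = licensed.get(relation)
        if arg_type is None:               # e.g. PREP_AFTER: no rule applies
            continue
        args.extend((arg_type, n) for n in nearest_bound(graph, bound, child))
    return args


GRAPH: Graph = {
    "bound": [("PREP_TO", "site"), ("PREP_AFTER", "addition")],
    "site": [("PREP_OF", "promoter")],
    "promoter": [("NN", "IL-10")],
    "addition": [("PREP_OF", "gp41")],
}
BOUND = {"IL-10", "gp41"}                  # PROTEIN entities
ARGS = find_arguments(GRAPH, BOUND, "bound", {"PREP_TO": "Object"})
```

Because the embedding graph is acyclic, the recursion terminates without a visited set.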
Besides these compositional operations, this phase also deals with coordination of entities and triggers. This phase results in a set of predications, forming a directed acyclic graph of fully composed predications. For the sentence depicted in Figure 4, duplicated in (14a), the relevant resulting embedding and atomic predications are given in (14b). Note that the first argument corresponds to Object, the second to Subject, and the rest to Adjunct arguments.
(14) (a) Our previous results show that recombinant gp41 (aa565-647), the extracellular domain of HIV-1 trans-membrane glycoprotein, stimulates interleukin-10 (IL-10) production in human monocytes.
(b) show: DEMONSTRATIVE(em 1 ,1,positive,em 2,t3) ^ our-previous-result(t 3 )
stimulate: CAUSAL(em 2 ,1,positive,e 2 , t 1) ^ gp41: PROTEIN (t 1 )
production: GENE_EXPRESSION(e 2 ,t 2,-,t4) ^ interleukin-10: PROTEIN (t 2 ) ^ human-monocyte(t 4 )
The mapping phase imposes shared task constraints on the partial interpretation obtained in the composition phase. We achieve this in three steps.
[Table: Mapping from embedding predications to events. Recoverable column header: Correspond. Event (Mod.) Type.]
[Table: Mapping logical arguments to semantic roles.]
Finally, we prune event participants that do not conform to the event definition or are semantically free as well as the predications whose types could not be mapped to a shared task event type. Thus, a Cause participant for a GENE_EXPRESSION event is pruned, since only Theme participants are annotated as relevant for the shared task; likewise, a predication with DEONTIC semantic type is pruned, because such predications are not considered for the shared task. Furthermore, the adjunct argument of the GENE_EXPRESSION event (t 4) is pruned since (a) it is semantically free, and (b) we are not dealing with non-core arguments at the moment. The Infectious Diseases track (ID) event type PROCESS is exceptional, because it may take no participants at all, and we deal with this idiosyncrasy at this step, as well. This concludes the progressive transformation of the graph to event and event modification annotations. The annotations corresponding to the predications in Example (14) are given below. Note that triggers are not shown as separate term annotations for simplicity.
(15) (a) E1 Positive_regulation:stimulates Theme:E2 Cause:T1
(b) E2 Gene_expression:production Theme:T2
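The pruning step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the names TYPE_MAP, ROLE_MAP, ALLOWED_ROLES, MAY_BE_EMPTY, Predication, and map_and_prune are hypothetical, the tables cover only the types needed for Example (14), the Object-to-Theme and Subject-to-Cause correspondences are inferred from Examples (14) and (15), and polarity is ignored. Dropping events that end up with no participants (other than PROCESS) is also an assumption.

```python
from dataclasses import dataclass

# Hypothetical lookup tables, limited to what Example (14) needs.
TYPE_MAP = {"CAUSAL": "Positive_regulation", "GENE_EXPRESSION": "Gene_expression"}
ROLE_MAP = {"Object": "Theme", "Subject": "Cause"}  # adjuncts are left unmapped
ALLOWED_ROLES = {
    "Positive_regulation": {"Theme", "Cause"},
    "Gene_expression": {"Theme"},
}
MAY_BE_EMPTY = {"Process"}  # assumed: only PROCESS may have no participants


@dataclass
class Predication:
    ident: str
    sem_type: str
    args: list  # tuples of (logical role, argument id, semantically bound?)


def map_and_prune(predications):
    """Map predications to events and prune non-conforming participants."""
    events = []
    for pred in predications:
        event_type = TYPE_MAP.get(pred.sem_type)
        if event_type is None:
            continue  # e.g., DEMONSTRATIVE or DEONTIC: no shared task event type
        participants = []
        for logical_role, arg_id, bound in pred.args:
            role = ROLE_MAP.get(logical_role)
            # Keep only semantically bound arguments in roles the event allows.
            if bound and role in ALLOWED_ROLES.get(event_type, set()):
                participants.append((role, arg_id))
        if participants or event_type in MAY_BE_EMPTY:
            events.append((pred.ident, event_type, participants))
    return events


# Predications of Example (14b); t3 and t4 are semantically free.
example_14 = [
    Predication("em1", "DEMONSTRATIVE", [("Object", "em2", True), ("Subject", "t3", False)]),
    Predication("em2", "CAUSAL", [("Object", "e2", True), ("Subject", "t1", True)]),
    Predication("e2", "GENE_EXPRESSION", [("Object", "t2", True), ("Adjunct", "t4", False)]),
]
events_15 = map_and_prune(example_14)
```

On this input, the sketch keeps the two events of Example (15): the CAUSAL predication with its Theme and Cause, and the GENE_EXPRESSION predication with its Theme only.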
The inability to resolve coreference emerged as a factor that hindered event extraction in the BioNLP'09 Shared Task on Event Extraction. Coreference resolution is essentially a recall-increasing measure: in the following fragment, recognizing that Eotaxin is the antecedent of the pronominal anaphor Its would allow our system to identify this term as the Theme participant of the GENE_EXPRESSION event triggered by the nominalization expression, which would otherwise remain unidentified.
(16) (a) Eotaxin is an eosinophil specific beta-chemokine assumed to be involved in eosinophilic inflammatory diseases such as atopic dermatitis, allergic rhinitis, asthma and parasitic infections. Its expression is stimulus- and cell-specific.
(b) expression: GENE_EXPRESSION(e1, t1) ^ eotaxin: PROTEIN(t1)
The Protein Coreference Task was proposed as a supporting task in BioNLP-ST'11. The performance of participating systems in this supporting task was not particularly encouraging with regard to their ability to support event extraction, with the best system achieving an F1-score of 34.05. Post-shared task, we extended our embedding framework with coreference resolution and examined the effect of different classes of anaphora on event extraction. In the description of the Protein Coreference Task, four main classes of coreference are identified:
RELAT Coreference indicated by relative pronouns and adjectives (e.g., that, which, whose)
PRON ( pronominal anaphora ) Coreference indicated by personal and possessive pronouns (e.g., it, its, they, their)
DNP ( sortal anaphora ) Coreference indicated by definite and demonstrative noun phrases (NPs that begin with the, these, this, etc.)
APPOS Coreference in appositive constructions
Our embedding framework performs coreference resolution as a subtask of the composition phase. It accommodates RELAT and APPOS classes naturally, since they are intra-sentential and they can largely be identified based on embedding relations alone. For the more complex anaphoric classes (PRON and DNP), we extended our framework. Our extension is partially inspired by the deterministic coreference resolution system described in Haghighi and Klein. To summarize, for each anaphoric mention identified in the text, their system selects an antecedent among the prior mentions by utilizing syntactic constraints and assessing the semantic compatibility between mentions. Of the remaining possible antecedents, the one with the shortest path from the anaphoric mention in the parse tree is selected as the best antecedent. The syntactic constraints used by their system include number, person, and entity type agreement as well as recognition of appositive constructions. On the other hand, their semantic compatibility filter aims to pair hypernyms, such as AOL and company. They extract such pairs from their corpus using bootstrapping. We provide more details about our treatment of the four coreference classes below.
The PRON type is the second most frequent type of coreference annotated for the Protein Coreference Task (35% of all training instances), while the DNP type corresponds to 9% of the training instances. With respect to the PRON type, we only consider personal and possessive pronouns of the third person (it/its, they/their) as anaphora, since others do not seem relevant to the event extraction task (e.g., Our results). For sortal anaphora, the DNP type, we require that the anaphoric noun phrases are not associated with entities, allowing expressions such as these factors as anaphora while ruling out those like the TRADD protein.
Coreference resolution begins by identifying the set of candidate antecedents. We define the candidate antecedent set for a given anaphor as the set of embedding graph nodes which appear in the discourse prior to the anaphor and which are either semantically bound or involve hypernyms or conjunctions. The prior discourse includes the sentence in which the anaphor occurs as well as those preceding it in the paragraph.
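The candidate selection just described can be illustrated with a short Python sketch. It is a simplified reading of the definition, not the system's code: the Node class, its fields, and the function candidate_antecedents are hypothetical names, and each node is assumed to carry precomputed flags for being semantically bound, involving a hypernym, or being a conjunction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    paragraph: int
    sentence: int          # sentence index within the document
    offset: int            # token offset within the sentence
    bound: bool = False    # semantically bound (assumed to be precomputed)
    hypernym: bool = False
    conjunction: bool = False


def candidate_antecedents(nodes, anaphor):
    """Nodes in the same paragraph that precede the anaphor and qualify semantically."""
    return [
        n for n in nodes
        if n.paragraph == anaphor.paragraph
        and (n.sentence, n.offset) < (anaphor.sentence, anaphor.offset)
        and (n.bound or n.hypernym or n.conjunction)
    ]


# Toy data loosely following Example (16); offsets are illustrative only.
its = Node("Its", paragraph=1, sentence=1, offset=0)
nodes = [
    Node("IL-2", paragraph=0, sentence=0, offset=3, bound=True),       # other paragraph
    Node("Eotaxin", paragraph=1, sentence=0, offset=0, bound=True),    # qualifies
    Node("diseases", paragraph=1, sentence=0, offset=12),              # semantically free
    Node("expression", paragraph=1, sentence=1, offset=1, bound=True), # follows the anaphor
]
candidates = [n.name for n in candidate_antecedents(nodes, its)]
```

Comparing (sentence, offset) pairs keeps nodes from the anaphor's own sentence as long as they precede it, which matches the definition of prior discourse above.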
The candidate antecedents are then evaluated for their syntactic and semantic compatibility. PRON requires person and number agreement, while DNP requires number agreement and one of the following constraints:
The head word constraint The head of the anaphoric NP and the antecedent NP are the same. This constraint allows "CD4 gene" as an antecedent for the anaphor "the gene".
The singular hypernymy constraint The head of the anaphoric NP is a hypernym of the antecedent, which involves an entity. This constraint accepts any Protein term as an antecedent for the anaphoric NP "this protein".
The plural hypernymy constraint ( set-instance anaphora ) The head of the anaphoric NP is a plural hypernym of the antecedent, which corresponds to a conjunction of entities. This constraint accepts "CD1, CD2, and CD3" as antecedent for "these factors".
The meronymy constraint The head of the anaphoric NP is a meronym and the antecedent corresponds to a conjunction of entities. This constraint allows "IBR/F" as antecedent for the anaphoric NP "the dimer".
The event constraint The head of the anaphoric NP is associated with a trigger, P1, and the antecedent with another trigger, P2, where P1 and P2 are lexicalizations of the same event. This constraint aims to capture the coreference between, for instance, the anaphor the phosphorylation and the antecedent phosphorylated.
We induced the hypernym list from the training corpus automatically by considering the heads of the NPs with entities in modifier position. Such words include gene, protein, factor, and cytokine. Similarly, we induced the meronym list from the training data of the Static Relations supporting task. These words essentially correspond to triggers for SUBUNIT-COMPLEX relations in that task, and include words such as complex, dimer, and subunit.
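The five DNP constraints and the two induced word lists can be put together in a small Python sketch. It is an illustration under assumptions rather than the actual rule set: Mention, dnp_constraint, and TRIGGER_EVENTS are hypothetical names, the word lists contain only the examples quoted above, head words are assumed to be lower-cased and singularized beforehand, and number agreement is assumed to be checked separately.

```python
from dataclasses import dataclass
from typing import Optional

HYPERNYMS = {"gene", "protein", "factor", "cytokine"}   # examples from the text
MERONYMS = {"complex", "dimer", "subunit"}              # examples from the text
# Hypothetical trigger dictionary: lexical form -> event type.
TRIGGER_EVENTS = {"phosphorylation": "PHOSPHORYLATION",
                  "phosphorylated": "PHOSPHORYLATION"}


@dataclass
class Mention:
    head: Optional[str]    # lower-cased, singularized head word (None if no single head)
    plural: bool = False
    entity_count: int = 0  # entities involved; more than one means a conjunction


def dnp_constraint(anaphor, antecedent):
    """Return the name of the first constraint that licenses the pair, else None."""
    if antecedent.head is not None and anaphor.head == antecedent.head:
        return "head_word"
    event = TRIGGER_EVENTS.get(anaphor.head)
    if event is not None and event == TRIGGER_EVENTS.get(antecedent.head):
        return "event"
    if anaphor.head in HYPERNYMS:
        if not anaphor.plural and antecedent.entity_count == 1:
            return "singular_hypernymy"
        if anaphor.plural and antecedent.entity_count > 1:
            return "plural_hypernymy"
    if anaphor.head in MERONYMS and antecedent.entity_count > 1:
        return "meronymy"
    return None


the_gene = Mention("gene")
cd4_gene = Mention("gene", entity_count=1)
these_factors = Mention("factor", plural=True)
cd1_cd2_cd3 = Mention(None, plural=True, entity_count=3)
```

With these definitions, dnp_constraint(the_gene, cd4_gene) yields "head_word" and dnp_constraint(these_factors, cd1_cd2_cd3) yields "plural_hypernymy", mirroring the examples given for those constraints.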
Several structural constraints over the embedding graph block some of the possible antecedents for both coreference types:
The antecedent directly embeds or is directly embedded by the anaphor.
The antecedent is the subject and the anaphor is the object of the same relation. In addition, the anaphor is not reflexive (e.g., itself).
The anaphor is in an adjunct position and the antecedent is in subject position of the same relation.
The candidate that is closest to the anaphor in the embedding graph is selected as the antecedent and a COREF embedding relation is created between the anaphor and the antecedent. For plural anaphora, multiple entities or triggers may be considered as antecedents, and thus multiple COREF relations may be created.
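The blocking constraints and the closest-candidate selection can be sketched as follows. The sketch makes several assumptions that the text does not state: the embedding graph is represented as a dictionary from a node to its (role, child) pairs, distance is measured as the shortest undirected path, sentences are connected through a hypothetical document node, and ties are broken by node name. The function names are illustrative, and only the single-antecedent case is covered.

```python
from collections import deque


def children(graph, node):
    return {child for _, child in graph.get(node, [])}


def is_blocked(graph, anaphor, candidate, reflexive=False):
    """Apply the three structural constraints that rule out a candidate."""
    # 1. Direct embedding in either direction.
    if candidate in children(graph, anaphor) or anaphor in children(graph, candidate):
        return True
    for args in graph.values():
        roles = {child: role for role, child in args}
        if roles.get(candidate) == "Subject":
            # 2. Subject/object of the same relation, unless the anaphor is reflexive.
            if roles.get(anaphor) == "Object" and not reflexive:
                return True
            # 3. Anaphor in adjunct position of the same relation.
            if roles.get(anaphor) == "Adjunct":
                return True
    return False


def distance(graph, source, target):
    """Shortest undirected path length between two nodes (None if unreachable)."""
    neighbours = {}
    for parent, args in graph.items():
        for _, child in args:
            neighbours.setdefault(parent, set()).add(child)
            neighbours.setdefault(child, set()).add(parent)
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def select_antecedent(graph, anaphor, candidates, reflexive=False):
    scored = []
    for cand in candidates:
        if is_blocked(graph, anaphor, cand, reflexive):
            continue
        dist = distance(graph, anaphor, cand)
        if dist is not None:
            scored.append((dist, cand))
    return min(scored)[1] if scored else None


# Toy graph: "C activates it", preceded by a sentence in which "bind" embeds A.
toy_graph = {
    "doc": [("Member", "bind"), ("Member", "activate")],
    "bind": [("Object", "A"), ("Subject", "complex")],
    "complex": [("Object", "B")],
    "activate": [("Object", "it"), ("Subject", "C")],
}
chosen = select_antecedent(toy_graph, "it", ["A", "B", "C"])
```

In the toy graph, C is the subject of the relation whose object is the anaphor, so it is blocked, and A is selected because it is closer to the anaphor than B.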
The integration of coreference information into the event extraction pipeline is trivial for all coreference types. In the composition phase, when an anaphoric expression appears in the argument position of a predication, it is naturally substituted by its antecedent(s) through argument propagation.
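A minimal Python sketch of this substitution is given below. It assumes that COREF relations are available as a mapping from an anaphor to its antecedent(s); the function name substitute_anaphora and the decision to follow chains of COREF links are assumptions made for illustration, and how the resulting arguments are distributed over events is left to later processing.

```python
def substitute_anaphora(args, coref):
    """Replace each anaphoric argument with its antecedent(s), keeping order."""
    resolved = []
    for arg in args:
        stack, seen = [arg], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # guard against accidental COREF cycles
            seen.add(node)
            antecedents = coref.get(node)
            if antecedents:
                # Push in reverse so antecedents are emitted in their listed order.
                stack.extend(reversed(antecedents))
            else:
                resolved.append(node)
    return resolved


coref_links = {"their": ["GATA3", "FOXP3"]}
resolved_args = substitute_anaphora(["their"], coref_links)
```

Arguments without a COREF relation pass through unchanged, so the function can be applied to every argument list uniformly.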
Official GENIA track results
Official EPI and ID track results
A particularly encouraging outcome for our system is that our results on the GENIA development and test sets were very close (F1-scores of 51.03 and 50.32, respectively), indicating that our general approach avoided overfitting while capturing the linguistic generalizations, as we intended. We observe similar trends in the other tracks as well. In the EPI track, the development and test F1-scores were 29.1 and 27.88, respectively; in the ID track, interestingly, our test set performance was better (39.64 on the development set vs. 44.21 on the test set). We also obtained the highest recall in the ID track (49), despite the fact that our system typically favors precision. We attribute this somewhat idiosyncratic performance in the ID track partly to the fact that we did not use a track-specific trigger dictionary for the official submission. All but one of the ID track event types are the same as those of the GENIA track, which led to identification of some ID events with triggers consistently annotated only in the GENIA corpus and to low precision, particularly in complex regulatory events. A post-shared task re-evaluation confirms this: the F1-score for the ID track increases from 44.21 to 48.9 when only triggers extracted from the ID track corpus are used; recall decreases from 49 to 45.26, while precision increases from 40.27 to 53.18. It is unclear to us why a reliable trigger in one corpus is not reliably annotated in another, even though the same event types are considered in both corpora. One possibility is that different annotators may have a different conceptualization of the same event types. Consider the following sentences: Example (17a) is from the GENIA corpus and Example (17b) from the ID corpus. Even though the verbal predicate lead appears in similar contexts in both sentences, it is annotated as an event trigger only in Example (17a).
(17) (a) Costimulation of T cells through both the Ag receptor and CD28 leads to high level IL-2 production ...
lead: POSITIVE_REGULATION(em1, em2)
high_level: POSITIVE_REGULATION(em2, e3)
production: GENE_EXPRESSION(e3, t1) ^ IL-2: PROTEIN(t1)
(b) ... the two-component regulatory system PhoR-PhoB leads to increased hilE P2 expression ...
increased: POSITIVE_REGULATION(em1, e2, t1) ^ PhoR-PhoB: PROTEIN(t1)
expression: GENE_EXPRESSION(e2, t2) ^ hilE: PROTEIN(t2)
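The ID track figures quoted above are mutually consistent, which can be checked with the standard F1 formula (the harmonic mean of precision and recall). The short Python check below uses only the precision and recall values reported in the text; the function name f1_score is our own, and the reported recall of 49 is taken at face value.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Official ID submission: precision 40.27, recall 49.
official_f1 = f1_score(40.27, 49.0)
# Re-evaluation with ID-only triggers: precision 53.18, recall 45.26.
id_only_f1 = f1_score(53.18, 45.26)

print(round(official_f1, 2), round(id_only_f1, 2))
```

The check illustrates the trade-off discussed above: the ID-only trigger dictionary gives up some recall but gains enough precision to raise the F1-score.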
One of the interesting aspects of the shared task was its inclusion of full-text articles in training and evaluation. Cohen et al. show that structure and content of biomedical abstracts and article bodies differ markedly and suggest that some of these differences may pose problems in processing full-text articles. Since one of our goals was to determine the generality of our system across text types, we did not perform any full text-specific optimization. Our results on article bodies are notable: our system had stable performance across text types (in fact, we had a very slight F1-score improvement on full-text articles: 50.28 to 50.4). This contrasts with the drop of a few points that seems to occur with other well-performing systems. Taking only full-text articles into consideration, we would be ranked 4th in the GENIA track. Furthermore, a preliminary error analysis with full-text articles indicates that parsing-related errors are more prevalent in the full-text article set than in the abstract set, consistent with Cohen et al.'s findings. At the same time, our results confirm that we were able to abstract away from such errors by a careful, selective use of syntactic dependencies and correcting them with heuristic transformation rules, when necessary.
The regulatory events in the GENIA track may take Cause arguments as core participants. They are annotated much less frequently than the other core argument, Theme, and therefore, it may be more challenging for machine-learning based methods to extract Cause arguments than to extract Theme arguments. Since our methodology is less reliant on the training data with respect to argument identification, we find it informative to compare our results in identifying Cause participants to the results of other systems. The comparison reveals that our system performs the best in identifying the Cause participants (F1-score of 43.71), confirming our intuition that linguistically-grounded methods may perform better in the absence of large amounts of annotated data.
Our core module can extract adjunct arguments, using ABNER as its source for additional biological named entities. We experimented with mapping these arguments to non-core event participants (Site, toLoc, etc.); however, we did not include them in our official submission, because they seemed to require more work with respect to mapping to shared task specifications. Due to this shortcoming, the performance of our system suffered significantly in the EPI track, in which the primary evaluation criterion involves non-core event participants as well as the core participants.
GENIA Task 3 results based on gold event annotations
Task 3-specific precision errors included cases in which speculation or negation was debatable, as the examples below show. In Example (18a), our system detected a SPECULATION instance, due to the verbal predicate suggesting, which scopes over the event indicated by role. In Example (18b), our system detected a NEGATION instance, due to the verbal predicate lack, which scopes over the events indicated by expression. Neither was annotated as such in the shared task corpus. Annotating negation and speculation is clearly nontrivial, as there seems to be some subjectivity involved, and such errors seem acceptable to a certain extent.
(18) (a) ... suggesting a role of these 3' elements in beta-globin gene expression.
(b) ... DT40 B cell lines that lack expression of either PKD1 or PKD3 ...
Another class of precision errors was due to argument propagation. The current algorithm appears to be too permissive in some cases and a more refined approach to argument propagation may be necessary. In the following example, while suggest, an epistemic predicate, does not s-embed induction (as shown in (19b)), the intermediate nodes simply propagate the predication associated with the induction node up the graph. As a result, we conclude that the predication triggered by induction is speculated, which is a precision error.
(19) (a) ... these findings suggest that PWM is able to initiate an intracytoplasmic signaling cascade and EGR-1 induction ...
(b) suggest >s able >s initiate >s induction
Simply restricting argument propagation to one level increases precision and raises the F1-score slightly (from 66.67 to 66.93). Disallowing it altogether (that is, using only the immediate daughters as arguments), however, increases precision while lowering recall and the F1-score significantly (from 66.67 to 61.31). This result indicates that the types of the embedding relations along the path from the trigger node to the target node play a larger role in determining whether the target node can act as an argument than the length of the path.
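The three propagation settings compared above (unrestricted, one level, none) can be illustrated with a depth-limited traversal of the embedding graph. The Python sketch below is a deliberate simplification: it treats propagation purely as a matter of path length, whereas the discussion above suggests that the types of the embedding relations matter more. The function name propagated_arguments and the max_levels parameter are hypothetical.

```python
def propagated_arguments(embeds, trigger, max_levels=None):
    """Nodes below `trigger` that may act as its arguments, nearest first.

    max_levels=0 keeps immediate daughters only, max_levels=1 allows one
    level of propagation, and None places no limit on propagation.
    """
    found, frontier, depth = [], [trigger], 0
    seen = {trigger}
    while frontier and (max_levels is None or depth <= max_levels):
        next_frontier = []
        for node in frontier:
            for child in embeds.get(node, []):
                if child not in seen:
                    seen.add(child)
                    found.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
        depth += 1
    return found


# Embedding chain of Example (19b).
chain_19b = {"suggest": ["able"], "able": ["initiate"], "initiate": ["induction"]}
unrestricted = propagated_arguments(chain_19b, "suggest")
one_level = propagated_arguments(chain_19b, "suggest", max_levels=1)
daughters_only = propagated_arguments(chain_19b, "suggest", max_levels=0)
```

On the chain of Example (19b), only the unrestricted setting lets induction reach suggest, which corresponds to the precision error described above.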
Some of the recall errors were due to shortcomings in the argument identification rules, as they are currently implemented. One recall problem involved the embedding status of, and the rules concerning, copular constructions, which we had not yet addressed. Therefore, we miss the relatively straightforward SPECULATION instance in the following example.
(20) ... the A3G promoter appears constitutively active.
Similarly, the lack of a trigger expression in our dictionary causes recall errors. One example below (21a) shows an instance where this occurs, in addition to lack of an appropriate argument identification rule, while the recall error in (21b) is solely due to the lack of the trigger expression:
(21) (a) mRNA was quantified by real-time PCR for FOXP3 and GATA3 expression.
(b) To further characterize altered expression of TCRζ, p56(lck) ...
Our system also missed an interesting, domain-specific type of negation, in which the minus sign acts similarly to a negative determiner (e.g., no) and indicates negation of the event that the entity participates in.
(22) ... CD14- surface Ag expression ...
Coreference resolution on GENIA development set (configurations: Base + RELAT, Base + APPOS, Base + PRON, Base + DNP, Base + ALL)
Coreference resolution on test sets (configurations: GENIA + COREF, EPI + COREF, ID + COREF, ID-T + COREF)
It is interesting to note that while the APPOS type of coreference was rarely annotated in the Protein Coreference Task corpus, resolving it had the biggest effect on event extraction. This is in contrast to the RELAT type, which had the highest percentage of instances in the corpus but had little effect on event extraction. We were particularly interested in the results involving PRON and DNP types, since the participants of events resulting from resolving these types can potentially span multiple sentences, playing a role in our higher level goal of discourse interpretation. We manually analyzed the events extracted through resolution of PRON and DNP types of coreference. We found that 32.5% of such events were correct; however, the positive effect was largely limited to intra-sentential coreference resolution (43.2% vs. 16% for inter-sentential resolution). Among the events correctly identified due to intra-sentential coreference resolution, 56% involved coreference of PRON type. On the other hand, among those due to inter-sentential coreference resolution, 84% involved the DNP type. In the following example, the possessive adjective their (PRON type) refers to the proteins GATA3 and FOXP3 and we extract the relevant events shown in (23b).
(23) (a) Thus, although GATA3 and FOXP3 showed similar kinetics, their expression polarizes at the end ...
(b) expression: GENE_EXPRESSION(e1, t1) ^ GATA3: PROTEIN(t1)
expression: GENE_EXPRESSION(e2, t2) ^ FOXP3: PROTEIN(t2)
In Example (24), we correctly identify the event in (24b) from the sentence in (24a) by resolving the inter-sentential coreference between this restriction factor and APOBEC3G:
(24) (a) APOBEC3G (A3G), a member of the recently discovered family of human cytidine deaminases, is expressed in peripheral blood lymphocytes and has been shown to be active against HIV-1 and other retroviruses. To gain new insights into the transcriptional regulation of this restriction factor, ...
(b) transcriptional_regulation: REGULATION(e1, t1) ^ APOBEC3G: PROTEIN(t1)
Among the misidentified events, we observe that some are due to shortcomings of the event extraction algorithm, rather than coreference resolution. In the following example, the coreference between the expression these receptors and the entities CD3, CD2, and CD28 is correctly identified; however, we extract the event annotation in (25b), since we ignore the quantifier any. The gold standard annotations are as given in (25c).
(25) (a) CD3, CD2, and CD28 are functionally distinct receptors on T lymphocytes. Engagement of any of these receptors induces the rapid tyrosine phosphorylation of a shared group of intracellular signaling proteins, ...
(b) engagement: BINDING(e1, t1, t2) ^ CD2: PROTEIN(t1) ^ CD28: PROTEIN(t2)
(c) engagement: BINDING(e1, t1) ^ CD2: PROTEIN(t1)
engagement: BINDING(e2, t2) ^ CD28: PROTEIN(t2)
We also noted cases in which the events that our system identifies due to coreference resolution seem correct, even though they are not annotated as such in the gold standard, as exemplified below. In this example, the anaphoric expression their is found to corefer with IL-2 and IFN-γ, and therefore, the event annotations in (26b) are extracted, whereas the gold standard only includes the event annotation in (26c).
(26) (a) Runx1 activates IL-2 and IFN-γ gene expression in conventional CD4+ T cells by binding to their respective promoter ...
(b) binding: BINDING(e1, t1, t2) ^ Runx1: PROTEIN(t1) ^ IL-2: PROTEIN(t2)
binding: BINDING(e2, t1, t3) ^ Runx1: PROTEIN(t1) ^ IFN-γ: PROTEIN(t3)
(c) binding: BINDING(e1, t1) ^ Runx1: PROTEIN(t1)
However, the shortcomings of the coreference resolution are evident in most error cases. The fact that we only consider semantically bound elements as potential antecedents leads to a considerable number of errors. In such cases, the actual antecedent closer to the anaphoric expression may be ignored, in favor of a more distant entity. In the following example, we identify as antecedent PKD1, PKD2, and PKD3 for the pronoun they, because the actual antecedent, PKD enzymes, is semantically free. This leads to three false positive errors shown in (27b).
(27) (a) The protein kinase D (PKD) serine/threonine kinase family has three members: PKD1, PKD2, and PKD3. Most cell types express at least two PKD isoforms but PKD enzymes are especially highly expressed in haematopoietic cells, where they are activated in response to antigen receptors stimulation.
(b) activated: POSITIVE_REGULATION(e1, t1) ^ PKD1: PROTEIN(t1)
activated: POSITIVE_REGULATION(e2, t2) ^ PKD2: PROTEIN(t2)
activated: POSITIVE_REGULATION(e3, t3) ^ PKD3: PROTEIN(t3)
Our two-phase, compositional approach to event extraction clearly distinguishes general linguistic principles from task-specific aspects. Our results demonstrate the viability of our approach on both abstracts and article bodies. The fact that we perform similarly on abstracts and article bodies is a particularly important aspect of our system. Our system also performs consistently between test sets and development sets, suggesting that it is robust and does not suffer from the brittleness and low recall often attributed to rule-based systems. We consider this robustness a result of the generality of the underlying rules, partially aided by syntactic dependency parsing as it normalizes much of the syntactic variation. The results also reveal some of the shortcomings of our approach. For example, our error analysis shows that some aspects of our semantic composition algorithm (argument propagation, in particular) require more refinement. We also find that learning trigger expressions for the common event types in ID and GENIA tracks from both training corpora has a negative effect on the ID track results; however, more research is needed to determine whether GENIA and ID texts really constitute two different sublanguages or whether the differences are simply due to annotation inconsistencies.
While biological event extraction at the sentence level is already a challenging task, we believe that future research should also focus on moving beyond sentence level to wider discourse context. An important step in this direction is coreference resolution, a problem that we investigated post-shared task. We observed only a minor improvement due to coreference resolution; however, our experiments allowed us to identify several areas of improvement. For example, the underspecified nature of our current coreference resolution algorithm (that it only targets PROTEIN and predicate terms as antecedents) leads us to miss some relatively easy cases of PRON and DNP types of coreference and lowers precision. Integrating a named-entity recognizer (NER) into our system would allow us to impose more semantics on our system, and thus, could improve coreference resolution performance. We expect that a general NER system such as MetaMap, which provides access to the rich semantics of UMLS, would be particularly useful. In addition, coreference resolution interacts with higher level discourse constraints in significant ways (see, for example, ), and we are currently exploring this further. Our modular, incremental approach ensures that new capabilities can be added and their effect on overall system performance can be measured. With these improvements, we plan to make our system available to the scientific community as a robust baseline system in the near future.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 11, 2012: Selected articles from BioNLP Shared Task 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S11.