The Genia Event and Protein Coreference tasks of the BioNLP Shared Task 2011

Background The Genia task, when it was introduced in 2009, was the first community-wide effort to address fine-grained, structural information extraction from biomedical literature. Arranged for the second time as one of the main tasks of BioNLP Shared Task 2011, it aimed to measure the progress of the community since 2009 and to evaluate the generalization of the technology to full-text papers. The Protein Coreference task was arranged as one of the supporting tasks, motivated by one of the lessons of the 2009 task: the abundance of coreference structures in natural language text hinders further improvement on the Genia task. Results The Genia task received final submissions from 15 teams. The results show that the community has made significant progress, with the best F-score reaching 74% in extracting bio-molecular events of simple structure, e.g., gene expressions, and 45% to 48% in extracting those of complex structure, e.g., regulations. The Protein Coreference task received 6 final submissions. The results show that coreference resolution performance in the biomedical domain lags behind that in the newswire domain, cf. 50% vs. 66% in MUC score. In particular, for protein coreference resolution the best system achieved an F-score of 34%. Conclusions Detailed analysis of the results improves our insight into the problem and suggests directions for further improvement.


Background
The BioNLP Shared Task (BioNLP-ST, hereafter) is a series of efforts to promote community-wide collaboration towards fine-grained information extraction (IE) in the biomedical domain. The first event, BioNLP-ST 2009, which introduced a bio-molecular event (bio-event) extraction task, attracted wide attention, with 24 teams submitting final results [1].
To establish a community effort, the organizers provided the task definition, benchmark data, and evaluations, and the participants competed in developing systems to perform the task. Meanwhile, participants and organizers communicated to develop a better setup of evaluation. Some participants provided their tools and resources for others, making it a collaborative competition.
The final results showed that the automatic extraction of simple events (those with unary arguments, e.g., gene expression) could be achieved at an F-score of about 70%, but the extraction of complex events, e.g., binding and regulation, was much more challenging, reaching only about 40%.
After BioNLP-ST 2009, all the resources from the event were released to the public to encourage continued efforts toward further advancement, and the online evaluation service has been kept open to provide reliable evaluation. Since then, several improvements have been reported [2][3][4][5][6]. For example, Miwa et al. [2] reported a significant improvement on binding events, reaching the 50% performance level.
The task introduced in BioNLP-ST 2009 was renamed the Genia event (GE) task and was hosted again in BioNLP-ST 2011, which also hosted four other main tasks and three supporting tasks [7]. As the sole task repeated in the two events, the GE task took the role of connecting the results of the 2009 event to the other main tasks of 2011. The GE task in 2011 received final submissions from 15 teams. The results show that the community made significant progress on the task, and that the technology can be generalized to full papers at a moderate cost in performance.
One of the lessons from BioNLP-ST is that coreference structures in biomedical text substantially hinder the progress of fine-grained IE [1]. To address the problem, the Protein Coreference (CO) task was arranged as one of the supporting tasks of BioNLP-ST 2011. While the task itself is not an IE task, it is expected to be a useful component in performing the main IE tasks more effectively. To establish a stable evaluation and to observe the effect of the task results on the main IE tasks, the CO task focused specifically on finding anaphoric protein references. After a 7-week system development phase, six teams submitted their final results. According to our primary evaluation criteria, the best system found 22.18% of anaphoric protein references at a precision of 73.26%. This paper presents the results of BioNLP-ST 2011, extending the GE and CO task overview papers [8,9] in the BioNLP-ST 2011 workshop proceedings. In particular, the paper focuses on providing more data and analyses to support the improvement and generalization achieved on the GE task, and to show the problems of current approaches to CO together with possible future directions.

Results and discussions
The results of BioNLP-ST 2011 are summarized below in terms of the task definitions, the resources, and the results.

Task definitions

GE task
The GE task follows the task definition of BioNLP-ST 2009, which is briefly described in this section. For more detail, please refer to [1,10]. Table 1 shows the event types to be addressed in the task. For each event type, the primary and secondary arguments to be extracted with an event are defined. For example, a Phosphorylation event is primarily extracted with the protein to be phosphorylated. As secondary information, the specific site to be phosphorylated needs to be extracted when expressed in text.
From a computational point of view, the event types represent different levels of complexity. When only primary arguments are considered, the first five event types in Table 1 are classified as simple event types, requiring only unary arguments. The Binding and Regulation types are more complex: Binding requires detection of an arbitrary number of arguments, and Regulation requires detection of recursive event structure.
Based on the definition of event types, the entire task is divided into three sub-tasks addressing event extraction at different levels of specificity:
Task 1. Core event extraction addresses the extraction of typed events together with their primary arguments.
Task 2. Event enrichment addresses the extraction of secondary arguments that further specify the events extracted in Task 1.
Task 3. Negation/speculation detection addresses the detection of negations and speculations over the extracted events.
Task 1 serves as the backbone of the GE task and is mandatory for all participants, while the other two are optional. Figure 1 shows an example of event annotation. The event encoded in the text is represented in a standoff-style annotation as follows: T1Protein 15  Finding the full representation of E1 is the goal of Task 2. In the example, the localization event, E1, is negated, as expressed by the phrase "the failure of". Finding the negation, M1, is the goal of Task 3.
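A standoff representation of this kind lends itself to simple line-oriented parsing. The following is a minimal sketch, assuming the tab-separated layout commonly used for BioNLP-ST data files (an ID, then a type with offsets or arguments). The sample lines, their offsets and texts, and the names `TextBound`, `Event`, `Modification` and `parse_line` are hypothetical illustrations and do not reproduce the annotation of Figure 1.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TextBound:          # e.g. a protein or an event trigger
    id: str
    type: str
    begin: int
    end: int


@dataclass(frozen=True)
class Event:
    id: str
    type: str
    trigger: str                        # ID of the trigger text entity
    args: Tuple[Tuple[str, str], ...]   # (role, referenced ID) pairs


@dataclass(frozen=True)
class Modification:       # negation or speculation over an event
    id: str
    type: str
    target: str


def parse_line(line):
    """Parse one standoff line into a TextBound, Event or Modification."""
    ident, body = line.rstrip("\n").split("\t")[:2]
    fields = body.split()
    if ident.startswith("T"):
        return TextBound(ident, fields[0], int(fields[1]), int(fields[2]))
    if ident.startswith("E"):
        etype, trigger = fields[0].split(":")
        args = tuple((role, ref) for role, ref in (f.split(":") for f in fields[1:]))
        return Event(ident, etype, trigger, args)
    if ident.startswith("M"):
        return Modification(ident, fields[0], fields[1])
    raise ValueError(f"unrecognized annotation line: {line!r}")


# Hypothetical sample lines (not the lines of Figure 1).
SAMPLE = [
    "T1\tProtein 15 18\tXYZ",
    "T2\tLocalization 19 32\ttranslocation",
    "T3\tEntity 40 47\tnucleus",
    "E1\tLocalization:T2 Theme:T1 ToLoc:T3",
    "M1\tNegation E1",
]
annotations = {a.id: a for a in map(parse_line, SAMPLE)}
```

In terms of the three sub-tasks, a Task 1 system would be expected to produce the trigger and primary argument of `E1`, a Task 2 system its secondary argument as well, and a Task 3 system the modification `M1`.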

CO task
The CO task is newly defined in BioNLP-ST 2011. Figure 2 shows an example text that is segmented into four sentences, S2 to S5, where anaphoric coreferences are illustrated with colored extents and arrows. In the figure, protein names are highlighted in purple (T4 to T10), and anaphoric protein references, e.g., pronouns and definite noun phrases, are highlighted in red (T27, T29, T30, T32), with their antecedents indicated by arrows if found in the text. In the example, the definite noun phrase (NP), this transcription factor (T32), is a coreference to p65 (T10). Without knowing the coreference structure, it becomes hard to capture the information written in the phrase, nuclear exclusion of this transcription factor, which is localization of p65 (out of nucleus) according to the framework of BioNLP-ST.
A standard approach would include a step to find candidate anaphoric expressions that may refer to proteins. In this task, pronouns, e.g., it or they, and definite NPs that may refer to proteins, e.g., the transcription factor or the inhibitor, are regarded as candidates for anaphoric protein references. This step corresponds to the markable detection and the anaphoricity determination steps in the jargon of MUC [11]. The next step would be to find the antecedents of the anaphoric expressions. This step corresponds to the anaphora resolution step.
The protein annotation for the example text in Figure 2 is as follows: T4 The first line indicates that there is a protein reference, T4, in the span that begins at the 275th character and ends before the 278th character, whose text is p65.
The coreference annotation consists of three types of annotations. The first type is the annotation of anaphoric protein references. For example, those in red in Figure 2  The first line indicates that there is an anaphoric protein reference in the specified span, whose text is the NF-kappa B transcription factor complex (here truncated to five characters due to limited space), and that its minimal expression is complex. The second type is the annotation of the noun phrases that are antecedents of the anaphoric references. For example, T28 and T31 (highlighted in blue) are antecedents of T29 and T32, respectively: Note that due to limited space, argument names are abbreviated, e.g., "Ana" for "Anaphora" and "Ant" for "Antecedent". The first line indicates that there is a coreference relation, R1, whose anaphor is T29 and whose antecedent is T28, and that the antecedent contains two protein names, T5 and T4.
Note that an anaphoric expression, e.g., which (T29), is sometimes connected to more than one protein name, e.g., p65 (T4) and p50 (T5). Sometimes, coreference structures do not involve any specific protein name, e.g., T30 and T27. In order to establish a stable evaluation, our primary evaluation focuses only on coreference structures that involve specific protein names, e.g., T29 and T28, and T32 and T31. Among the three relations, only two, R1 and R3, involve specific protein references: T4 and T5, and T10, respectively. Thus, finding R2 is ignored in the primary evaluation. However, relations not involving specific protein references are also provided in the training data to help system development, and are considered in the secondary evaluation mode.

Task resources
In order to guide and promote the development of systems to perform the GE and CO tasks, benchmark data sets were developed and provided to the participants. The benchmark data for the GE task was initially prepared for the first BioNLP Shared Task in 2009. At that time, the data included only titles and abstracts of papers from Medline. For BioNLP-ST 2011, full papers have been added, and the benchmark data now consist of two collections. The abstract collection is the same as the data for BioNLP-ST 2009, and is meant to be used to measure the progress of the community. The full text collection contains newly annotated full papers, and is meant to be used to measure the generalization of the technology to full papers. Both collections include annotations for events as defined in Table 1. The abstract collection also includes annotations for coreferences, making it the benchmark data set for the CO task.
The whole data set is divided into three subsets for the purposes of training, tuning and testing. The training and tuning sets are provided to the shared task participants with the full annotations. For the test set, however, only the protein annotations are provided, and the participants are expected to produce the remaining annotations. Table 2 shows basic statistics of the annotations in the benchmark data sets. The number of words shows that there is much more training data from abstracts than from full papers. The numbers of annotated coreferences are shown by type, indicating that relative pronouns, pronouns, and definite noun phrases are the three major types of anaphoric expression in the data set. The number of event annotations shows that Gene_expression, Binding and Regulation (including its subtypes) are the most frequent event types in the data sets.
As the full paper collection is a newly added portion for BioNLP-ST 2011, its statistics are examined in more detail across the different sections of full papers. As different sections of scientific papers, e.g., title, abstract, introduction, results, and conclusions, are written for different purposes, the types of information expected to be found in those sections differ. Table 3 shows detailed statistics of annotated entities in different sections. For the examination, the sections are roughly classified into TIAB (titles and abstracts), Intro. (introduction and background), R/D/C (results, discussions and conclusions), Methods (methods and experimental procedures), and Caption (captions of figures and tables). The statistics show that the Methods and Caption sections mention the events defined in Table 1 much less frequently than the other sections: on average, only one and four events are mentioned per 100 words in the Methods and Caption sections, respectively, compared with 5 to 8 events in the other sections. Events in these two sections are also mentioned in more coordinated structures: on average, 1.46 and 1.61 events are coordinated in the Methods and Caption sections, respectively, compared with 1.23 to 1.37 events in the other sections. This agrees with the intuition that the two sections usually describe things concretely and in much more detail than other sections, enumerating relevant entities as exactly as possible. It is therefore expected that IE from the two sections will benefit from improved processing of coordinated linguistic structures.
The events and the coreference annotations are used for the GE and CO tasks, respectively.
The distribution of annotated events across the five different sections is illustrated in Figure 3. Notably, the TIAB, Intro. and R/D/C sections show similar distributions of annotated events, but the Methods and Caption sections show significantly different distributions. In particular, the ratio of Gene_expression is significantly high in the latter two sections, and the ratio of Negative_regulation is quite high in the Methods section. A possible explanation is that Methods sections often describe experimental procedures designed to cause negative regulatory effects, e.g., mutation or addition of inhibitor proteins, and that the results of molecular biology experiments are often observed at the gene expression level. This observation suggests that a different event annotation scheme, or a different event extraction strategy, would be required for the Methods and Caption sections.

Task results
The participants in the GE and CO tasks were given three months and seven weeks, respectively, for system development. After that, 19 teams submitted their final results: 13 to the GE task, 4 to the CO task, and 2 to both tasks. Table 4 describes the teams who participated in the tasks, except three who wanted to remain anonymous. Table 5 shows brief profiles of the systems. This section presents the final results and their analysis. The performance is reported in recall, precision, and F-score, based on the Approximate recursive matching and the Protein coreference evaluation for the GE and CO tasks, respectively. Readers are referred to the Methods section for more detail.

GE Task 1 results
Among the sub-tasks of the GE task, Task 1 was mandatory, and 15 teams made their final submissions to it. Table 6 shows the evaluation results of Task 1. For the evaluation results of the individual simple-type events, e.g., Gene_expression, please refer to the GE task overview paper of the BioNLP-ST 2011 workshop [8].
The Abstract column shows the statistics of the abstract collection (1210 titles and abstracts), and the following columns show those of the full paper collection.
For reference, the reported performance of two previous systems, UTurku09 and Miwa10, is shown at the top. UTurku09 was the winning system of Task 1 in the 2009 shared task [12], and Miwa10 was the best system reported after BioNLP-ST 2009 [2]. The best overall performance on Task 1 (56.04%) in BioNLP-ST 2011 was achieved by the FAUST system, which adopted a combination model of UMass and Stanford. In terms of improvement, the performance of FAUST on the abstract collection (57.46%) demonstrates a significant improvement of the community on the GE task, when compared to the performance of UTurku09 (51.95%) and Miwa10 (53.29%). The biggest improvement was made on the Regulation events (from 40.11% and 40.60% to 46.97%), whose extraction requires complex modeling of recursive event structure, since an event may become an argument of another event. In terms of generalization, the performance of UMass on the full paper collection (53.14%) suggests that the technology, which began with only abstracts, can be generalized to full papers without a big loss of accuracy. Note, however, that this observation contrasts with the recent report of a substantial performance drop in protein mention detection in full papers [13], and that the performance reported in this paper is obtained when the gold protein annotation is given. Therefore, the performance of event extraction in a fully automatic system needs to be investigated and discussed more carefully.
The ConcordU system is notable as it is the sole rule-based system that is ranked above the average. The performance of the system demonstrates both the pros and cons of a typical rule-based approach. It showed the best precision in extracting simple-type events, but was not very successful with complex-type events, suggesting that when the problem is simple, rules may be developed effectively, but this becomes difficult as the problem gets complex. The performance of the system on the two collections shows that the rules of the ConcordU system generalize well to full papers. The generalization of performance can be further investigated by observing the performance on different sections. As noted in the previous section, the Methods sections of full papers are significantly different from other sections, while the Intro. and R/D/C sections, as well as TIAB, are relatively similar to the abstract collection. It is thus expected that generalization to the Methods sections is more difficult than to other sections.
The '09 column indicates whether at least one team member participated in BioNLP-ST 2009. In the Background column, C=Computer Scientist, BI=Bioinformatician, B=Biologist, L=Linguist.
At a glance, the overall performance reported in Table 6 does not seem worse on the Methods sections than on other sections. However, it needs to be considered that the Methods sections do not include as many complex-type events as other sections. In fact, in the Methods sections of the test data set, the numbers of Binding and Regulation events are both less than 10, which prevents a reliable analysis; this is why the corresponding performance figures are italicized in the table. In the Methods sections, simple-type events account for almost 80% of all events, so the overall performance is dominated by the performance on simple-type events. In such a case, it is more reasonable to refer to the simple event performance than to the overall performance. The simple event performance reported in Table 6 supports the hypothesis that performance is harder to generalize to the Methods sections.
The two rule-based systems, ConcordU and TM-SCS, however, seem to be exceptions to the hypothesis, showing the best performances (81.82% and 72.00%) on the Methods sections. A likely reason for their good performance on these sections is that rule-based approaches are in general less aggressive than machine learning approaches in optimizing to the training data. Note that there is usually a trade-off between optimization and generalization.
This time, three teams achieved better results than Miwa10, which suggests that focused efforts like BioNLP-ST play a role in advancing the technology. The comparison between the performance on the abstract and full paper collections shows that generalization to full papers is feasible with a very modest loss in performance.

GE Task 2 results
Task 2 of the GE task was optional and 4 teams submitted their final results. Table 7 shows the final evaluation results. For reference, the reported performance of the task-winning system in 2009, UT+DBCLS09 [14], is shown at the top. The top two systems on Task 1, FAUST and UMass, marked the best performances on Task 2, too. The performances of the two systems on the abstract collection demonstrate a significant improvement of the technology in two years (from 44.52% to 52.77% and 52.12%).
In detail, a significant improvement was made for Location arguments (from 36.59% to 50.00%) by the top two systems. The performance of site argument extraction was also improved significantly. A breakdown of the evaluation results of site argument extraction, shown in Table 8, indicates that for Phosphorylation events the performance of finding the site arguments is approaching a level of practical use (82.93% by UTurku). As for Regulation, the performance was significantly improved (41.03% by FAUST), while the performance improvement for Binding was only marginal (13.33% by BMI@ASU).
The performance on the full paper collection shown in Table 7 seems to suggest that the extraction of secondary arguments from full text is much more challenging, but Table 8 shows that there is no particular performance degradation on the full paper collection, except for the Regulation events. The low overall performance figures on the full paper collection in Table 7 can be explained by referring to Table 2. In the full paper portion of the test set, there was an extraordinarily large number of Binding events with site arguments (70%), which dominated the overall performance figures. Note that this overturns the conclusion made in [10] that extraction of secondary arguments from full papers might be much more challenging than from abstracts.

GE Task 3 results
Table 9 shows the final evaluation results of Task 3. For reference, the reported performance of the task-winning system in 2009, ConcordU09 [15], is shown at the top. Of the two teams that participated in the task, UTurku showed better performance in extracting negated events, while ConcordU showed better performance in extracting speculated events. Both demonstrate significant improvements over the ConcordU09 system, but the performances are still far from a level of practical use. Generalization to full papers seems to be somewhat more challenging for speculation extraction.

CO task results
Table 10 shows the evaluated performance of the six systems that participated in the CO task. The UUtah team, who already had experience in developing a coreference resolution system for the newswire domain, achieved the best performance (34.05%). The authors reported a performance degradation (from 66.38% to 49.64% in MUC score) caused by the change of domain. For a more detailed analysis, the performance was evaluated for different types of anaphoric expressions. As shown in the table, the three top-ranked teams, UUtah, UZurich and ConcordU, achieved the best performances in finding coreferences of definite noun phrases (10.8%), pronouns (26.7%) and relative pronouns (66.2%), respectively. It is not surprising that coreference resolution for relative pronouns showed the highest accuracy, as in many cases relative pronouns immediately follow their antecedents. Coreference resolution of definite noun phrases showed very low accuracy even for the top-ranked system. The reason may be that most systems relied on syntactic features, e.g., part-of-speech or syntactic parse, for coreference resolution without differentiating the type of coreference. However, in many cases there is only a semantic connection between a definite noun phrase and its antecedent. The UUtah system does incorporate some semantic features that were originally designed for the newswire domain, but it was once reported that semantic features are not domain portable while syntactic features are [16]. So, it

Conclusions
The Genia event task, which was arranged for both BioNLP-ST 2009 and 2011, took the role of measuring the progress of the community and the generalization of IE technology to full papers. The results from the 15 teams who made final submissions to the task show clear progress of the community, both in performance on a focused domain and in generalization to full papers. The coreference resolution supporting task of BioNLP Shared Task 2011 drew attention from researchers with different interests. Although the overall results are not yet good enough to be helpful for the main shared tasks as expected, the analysis shows the problems to be solved and directions for improvement.

GE task evaluation
The shared task requires participants to predict event annotation for the test data. The evaluation is carried out by comparing the predicted annotation to the gold annotation. For the comparison, equality of annotations is first defined at various levels as follows:
(1) Event equality holds between any two events when (1-1) the event types are the same, (1-2) the event triggers are the same, and (1-3) the arguments are fully matched.
(2) Argument equality holds between any two arguments when (2-1) the role types are the same, and (2-2-1) both are text entities in equality, or (2-2-2) both are events in equality.
(3) Text entity equality holds between any two text entities when (3-1) the entity types are the same, and (3-2) the spans are the same.
In condition (1-3), a full matching of arguments between two events means there is a perfect one-to-one mapping between the two sets of arguments, where the equality of individual arguments is defined by argument equality. Due to condition (2-2-2), event equality is defined recursively for events referring to events. Any two text spans (beg1, end1) and (beg2, end2) are the same iff beg1 = beg2 and end1 = end2. Note that event triggers are also text entities, so their equality is defined by text entity equality. Various evaluation modes can be defined by varying the condition of equality. In the following, we describe five fundamental variants applied in the evaluation.
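The three equality definitions above can be written down almost literally. The following is a minimal sketch, not the official evaluation software; the class and function names (`Entity`, `Event`, `entity_equal`, `argument_equal`, `event_equal`) are hypothetical. Because argument equality as defined is an equivalence relation, a greedy pairing is sufficient to test for a perfect one-to-one mapping.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Entity:
    type: str
    begin: int
    end: int


@dataclass(frozen=True)
class Event:
    type: str
    trigger: Entity
    # (role, value) pairs; a value is either a text entity or another event
    args: Tuple[Tuple[str, Union["Entity", "Event"]], ...]


def entity_equal(a, b):
    # (3-1) same entity type and (3-2) same span
    return a.type == b.type and (a.begin, a.end) == (b.begin, b.end)


def argument_equal(a, b):
    (role_a, val_a), (role_b, val_b) = a, b
    if role_a != role_b:                                   # (2-1)
        return False
    if isinstance(val_a, Entity) and isinstance(val_b, Entity):
        return entity_equal(val_a, val_b)                  # (2-2-1)
    if isinstance(val_a, Event) and isinstance(val_b, Event):
        return event_equal(val_a, val_b)                   # (2-2-2), recursive
    return False


def event_equal(a, b):
    if a.type != b.type:                                   # (1-1)
        return False
    if not entity_equal(a.trigger, b.trigger):             # (1-2)
        return False
    if len(a.args) != len(b.args):                         # (1-3)
        return False
    remaining = list(b.args)
    for arg in a.args:
        for i, candidate in enumerate(remaining):
            if argument_equal(arg, candidate):
                del remaining[i]    # each argument may be paired only once
                break
        else:
            return False
    return True
```

With this sketch, two Binding events that list the same two Theme proteins in a different order compare as equal, and a Regulation event is equal to another only if the events it refers to are themselves equal.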

Strict matching
The strict matching mode requires exact equality, as defined in the previous section. As some of its requirements may be viewed as unnecessarily precise, practically motivated relaxed variants, described in the following, are also applied.

Approximate span matching
The approximate span matching mode is defined by relaxing the requirement for text span matching for text entities. Specifically, a given span is equivalent to a gold span if it is entirely contained within an extension of the gold span by one word both to the left and to the right, that is, beg1 ≥ ebeg2 and end1 ≤ eend2, where (beg1, end1) is the given span and (ebeg2, eend2) is the extended gold span.
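The relaxed span condition can be sketched as follows. This is only an illustration: it assumes that a "word" is a whitespace-delimited token, which may differ from the tokenization used in the actual evaluation, and the function names and the sample sentence are hypothetical.

```python
import re


def extend_span(text, beg, end):
    """Extend (beg, end) by one whitespace-delimited word on each side."""
    left = re.search(r"\S+\s*$", text[:beg])    # last word before the span
    right = re.match(r"\s*\S+", text[end:])     # first word after the span
    ebeg = left.start() if left else beg
    eend = end + right.end() if right else end
    return ebeg, eend


def approx_span_match(text, given, gold):
    """True if the given span lies within the extended gold span."""
    beg1, end1 = given
    ebeg2, eend2 = extend_span(text, *gold)
    return beg1 >= ebeg2 and end1 <= eend2


text = "the failure of p65 translocation to the nucleus"
gold = (15, 18)                                   # "p65"
print(extend_span(text, *gold))                   # (12, 32): "of p65 translocation"
print(approx_span_match(text, (12, 18), gold))    # True:  "of p65"
print(approx_span_match(text, (4, 18), gold))     # False: "failure of p65"
```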

Approximate recursive matching
In strict matching, for a regulation event to be correct, the events it refers to as theme or cause must also be strictly correct. The approximate recursive matching mode is defined by relaxing the requirement for recursive event matching, so that an event can match even if the events it refers to are only partially correct. Specifically, for partial matching, only Theme arguments are considered: events can match even if referred events differ in non-Theme arguments.
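One way to read this relaxation is sketched below, under stated assumptions: events are plain dictionaries, text entities are `(type, begin, end)` tuples, the top-level event must match on all of its arguments, and any event reached through an argument is compared on its Theme arguments only. The function names are hypothetical, and treating every role that starts with "Theme" as a Theme argument is an assumption of this sketch.

```python
def _args(event, themes_only):
    """Arguments taken into account; only Theme roles in partial mode."""
    return [(role, value) for role, value in event["args"]
            if not themes_only or role.startswith("Theme")]


def _values_match(x, y):
    if isinstance(x, dict) and isinstance(y, dict):
        # Referred events are compared partially (Theme arguments only).
        return events_match(x, y, themes_only=True)
    return x == y   # text entities: same type and same span


def events_match(a, b, themes_only=False):
    if a["type"] != b["type"] or a["trigger"] != b["trigger"]:
        return False
    mine, rest = _args(a, themes_only), _args(b, themes_only)
    if len(mine) != len(rest):
        return False
    for role, value in mine:
        for i, (role2, value2) in enumerate(rest):
            if role == role2 and _values_match(value, value2):
                del rest[i]
                break
        else:
            return False
    return True


prot_a, prot_b = ("Protein", 15, 18), ("Protein", 40, 43)
inner_gold = {"type": "Positive_regulation", "trigger": ("Positive_regulation", 20, 30),
              "args": [("Theme", prot_a), ("Cause", prot_b)]}
inner_resp = {"type": "Positive_regulation", "trigger": ("Positive_regulation", 20, 30),
              "args": [("Theme", prot_a)]}            # Cause was missed
outer_gold = {"type": "Regulation", "trigger": ("Regulation", 0, 10),
              "args": [("Theme", inner_gold)]}
outer_resp = {"type": "Regulation", "trigger": ("Regulation", 0, 10),
              "args": [("Theme", inner_resp)]}

print(events_match(inner_gold, inner_resp))   # False: the inner event itself is incomplete
print(events_match(outer_gold, outer_resp))   # True: the referred event agrees on its Theme
```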

Event decomposition mode
Many events are expressed with more than one argument, e.g., binding of multiple proteins or regulation with a theme and a cause. Such events are inherently more difficult to extract than events with a single argument. In the Event decomposition mode, events with multiple arguments are decomposed into multiple single-argument events. Specifically, in this mode, each multi-argument event
(trigger, arg1-type:arg1-value, arg2-type:arg2-value, ...)
is decomposed into the single-argument events
(trigger, arg1-type:arg1-value)
(trigger, arg2-type:arg2-value)
...
The resulting single-argument events are treated as separate events in evaluation, thus allowing recognition of partially correct events and rewarding the recognition of complex events more highly. Note that the Event decomposition mode is used in combination with other matching modes.
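The decomposition itself is a one-line transformation. The sketch below is illustrative only; the function name `decompose` and the tuple layout are hypothetical, and the protein names are taken from the coreference example earlier in the paper.

```python
def decompose(trigger, arguments):
    """Split one multi-argument event into single-argument events."""
    return [(trigger, role, value) for role, value in arguments]


gold = decompose("binding", [("Theme", "p65"), ("Theme", "p50")])
response = decompose("binding", [("Theme", "p65")])   # one argument was missed

# Under decomposition the response is credited for the part it got right.
correct = [event for event in response if event in gold]
print(len(correct), "of", len(gold), "gold single-argument events found")
```

Without decomposition, the same response would count as a single incorrect Binding event.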

CO task evaluation
The coreference resolution performance is evaluated in two modes.
The Surface coreference mode evaluates the performance of finding anaphoric protein references and their antecedents, regardless of whether the antecedents actually embed protein names or not. In other words, it evaluates the ability to predict the coreference relations as provided in the gold coreference annotation file, which we call surface coreference links.
The protein coreference mode evaluates the performance of finding anaphoric protein references with their links to actual protein names (protein coreference links). In the implementation of the evaluation, the chain of surface coreference links is traced until an antecedent embedding a protein name is found. If a protein-name-embedding antecedent is connected to an anaphor through only one surface link, we call the antecedent a direct protein antecedent. If a protein-name-embedding antecedent is connected to an anaphor through more than one surface link, we call it an indirect protein antecedent, and the antecedents in the middle of the chain intermediate antecedents. The performance evaluated in this mode may be directly connected to the potential performance in the main IE tasks: the more (anaphoric) protein references are found, the more protein-related events may be found. For this reason, the protein coreference mode is chosen as the primary evaluation mode.
Evaluation results for both evaluation modes are given in standard recall, precision and F-score.
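For reference, the three measures can be computed as in the following sketch, which assumes the usual balanced F-score (the harmonic mean of precision and recall); the function names are hypothetical. As a check, the recall and precision reported for the best CO system (22.18% and 73.26%) give an F-score consistent with the 34.05% reported in the CO task results.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(n_gold, n_response, n_correct):
    """Return (recall, precision, F-score) from raw counts."""
    recall = n_correct / n_gold if n_gold else 0.0
    precision = n_correct / n_response if n_response else 0.0
    return recall, precision, f_score(precision, recall)


best_co = f_score(precision=0.7326, recall=0.2218)
print(f"{100 * best_co:.2f}")   # 34.05
```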

Surface coreference
Surface coreference links are links between target anaphors and their antecedents or intermediate antecedents. Note that the shared task development and training data include not only the target protein coreference links but also other pronoun and definite noun phrase coreference links.
A response expression is matched with a gold one following a partial match criterion. In particular, a response expression is considered correct when it covers the minimal boundary and is included in the maximal boundary of the gold expression. Note that the maximal boundary is the span of the expression annotation, and the minimal boundary is the head of the expression, as defined in the MUC annotation schemes [17]. A response link is correct when its two argument expressions are correctly matched with those of a gold link.
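The partial match criterion amounts to two containment checks on character spans. The sketch below is a hedged illustration: the function names and the dictionary layout of a gold expression are hypothetical, and the offsets are computed within the example phrase itself rather than taken from the task data.

```python
def expression_matches(response, minimal, maximal):
    """Response must cover the minimal span and stay inside the maximal span."""
    beg, end = response
    covers_minimal = beg <= minimal[0] and end >= minimal[1]
    inside_maximal = beg >= maximal[0] and end <= maximal[1]
    return covers_minimal and inside_maximal


def link_matches(response_link, gold_link):
    """Both expressions of the response link must match the gold link."""
    return all(expression_matches(resp, gold["min"], gold["max"])
               for resp, gold in zip(response_link, gold_link))


phrase = "the NF-kappa B transcription factor complex"
maximal = (0, len(phrase))                        # full annotated expression
minimal = (phrase.index("complex"), len(phrase))  # head of the expression


def span_of(sub):
    start = phrase.index(sub)
    return (start, start + len(sub))


print(expression_matches(span_of("transcription factor complex"), minimal, maximal))  # True
print(expression_matches(span_of("the NF-kappa B"), minimal, maximal))                # False
```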

Protein coreference
This is the primary evaluation perspective of the protein coreference task. In this mode, we ignore coreference links that do not refer to proteins. Intermediate antecedents are also ignored.
Protein coreference links are generated from the surface coreference links. A protein coreference link is composed of an anaphoric expression and a protein reference that appears in its direct or indirect antecedent. Below is an example.
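The generation of protein coreference links from surface links can be sketched as a chain traversal. The sketch assumes that each anaphor has at most one antecedent; the function name and data layout are hypothetical. The relations among T28, T29, T31 and T32 and their proteins follow the description of Figure 2 in the CO task definition, the T30 to T27 link assumes T30 is the anaphor of that pair, and the expression `X1` is invented here to illustrate an indirect antecedent.

```python
def protein_links(surface, proteins_in):
    """surface: anaphor -> antecedent; proteins_in: expression -> protein IDs."""
    links = set()
    for anaphor in surface:
        current, seen = anaphor, {anaphor}
        while current in surface:
            current = surface[current]
            if current in seen:          # guard against cyclic chains
                break
            seen.add(current)
            if proteins_in.get(current):
                # First antecedent embedding protein names ends the trace.
                links.update((anaphor, p) for p in proteins_in[current])
                break
    return links


surface = {"T29": "T28", "T32": "T31", "T30": "T27",
           "X1": "T32"}                   # X1 is a hypothetical extra anaphor
proteins_in = {"T28": ["T5", "T4"], "T31": ["T10"]}

print(sorted(protein_links(surface, proteins_in)))
# [('T29', 'T4'), ('T29', 'T5'), ('T32', 'T10'), ('X1', 'T10')]
```

Here T31 is a direct protein antecedent of T32 and an indirect one of X1, with T32 acting as an intermediate antecedent; the T30 to T27 link yields no protein coreference link and is therefore ignored in this mode.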
Response expressions and generated response result links are matched with gold expressions and links, respectively, in a way similar to the surface coreference evaluation mode.

and analyzed the results. NN implemented the Protein Coreference task evaluation and analyzed the results. JD and NN prepared the manuscript. JT, TT and AY participated in the discussions and finalization of the manuscript.