Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011

We present the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI), and the supporting task on Entity Relations (REL). The two main tasks represent extensions of the event extraction model introduced in the BioNLP Shared Task 2009 (ST'09) to two new areas of biomedical scientific literature, each motivated by the needs of specific biocuration tasks. The ID task concerns the molecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria. The EPI task is dedicated to the extraction of statements regarding chemical modifications of DNA and proteins, with particular emphasis on changes relating to the epigenetic control of gene expression. By contrast to these two application-oriented main tasks, the REL task seeks to support extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems. Seven groups participated in each of the two main tasks and four groups in the supporting task. The participating systems indicated advances in the capability of event extraction methods and demonstrated generalization in many aspects: from abstracts to full texts, from previously considered subdomains to new ones, and from the ST'09 extraction targets to other entities and events. The highest performance achieved in the supporting task REL, 58% F-score, is broadly comparable with levels reported for other relation extraction tasks. For the ID task, the highest-performing system achieved 56% F-score, comparable to the state-of-the-art performance at the established ST'09 task. In the EPI task, the best result was 53% F-score for the full set of extraction targets and 69% F-score for a reduced set of core extraction targets, approaching a level of performance sufficient for user-facing applications. In this study, we extend on previously reported results and perform further analyses of the outputs of the participating systems. We place specific emphasis on aspects of system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation. The manually annotated corpora, supporting resources, and evaluation tools for all tasks are available from http://www.bionlp-st.org and the tasks continue as open challenges for all interested parties.


Background
The biomedical scientific literature is growing at an exponential rate, far outstripping the capacity of individual researchers to fully process in any but the narrowest of subfields. To address the challenges of information overload and to improve access to the wealth of knowledge in this literature, there have been substantial efforts over the previous 15 years to develop automatic methods for the analysis of medical and biomolecular scientific publications [1,2].
Much of this work has focused on information extraction (IE) and text mining, applying natural language processing (NLP) methods to analyse domain texts, extract structured, computer-readable representations of key information, and compile extracted information into knowledge bases. Until recently, domain IE efforts concentrated primarily on foundational tasks, such as entity mention detection, and on the extraction of simple representations of entity associations, most typically the detection of mentions of names of proteins and the extraction of protein mention pairs representing protein-protein interactions.
However, in recent years there has been increasing interest in the application of more expressive representations in domain IE to address the requirements of tasks such as pathway curation, Gene Ontology term annotation, and semantic literature search [3,4]. To support the development and evaluation of methods for such tasks, a number of recently introduced corpus resources have been manually annotated using event representations that capture structured associations of arbitrary numbers of participants in specific roles [5][6][7][8][9][10]. The community took a decisive step toward the introduction of practical tools capable of extracting information using such representations in the BioNLP Shared Task 2009 (BioNLP ST'09) [11,51].
Shared tasks have been instrumental in the development of general domain IE technology by introducing new tasks, resources and evaluation standards [12,13]. Also in the biomedical domain, shared tasks such as JNLPBA [14], LLL [15], TREC Genomics [16] and Bio-Creative [17][18][19] have played a central role in focusing the efforts of the community to new timely tasks and challenges. The BioNLP ST'09, the first shared task in its series, sought to advance the state of the art in structured event extraction by providing a shared task definition and annotated data as well as evaluation criteria and tools for the task. The task met with enthusiastic response from the community: 24 groups participated in the task, proposing a variety of approaches for automatic event extraction. Interest in event extraction continued past the original shared task, whose data and setup have supported further advances in extraction methods and the introduction of automatically annotated literature-scale resources [20][21][22][23][24][25][26].
Although successful in introducing structured event representations to the general community and promoting the development of practically applicable methods for event extraction, the resources of the BioNLP ST'09 were somewhat limited in their scope. The task data was prepared on the basis of the GENIA corpus [6,27], an annotated resource of publication abstracts in the domain of transcription factors in human blood cells, and the event types targeted in the task were chosen by relevance to its topics. These limitations raised the question whether the findings of the shared task and the methods introduced to address the task can generalize beyond this narrow domain. To address such questions, generalization was chosen as the main theme of the second event in the series, the BioNLP Shared Task 2011 (BioNLP ST'11) [28]. This was emphasized in task selection and design and data preparation, targeting new domains, an extended set of extraction targets, and new text types, including full-text articles.
In this paper, we present three of the eight tasks in BioNLP ST'11, including two tasks that attracted the widest participation among newly introduced main tasks, EPI and ID. The Epigenetics and Post-translational Modifications (EPI) task focuses on events relating to epigenetic change and encompasses also common protein post-translational modifications, reactions that are critical for the control or gene expression and protein function. The Infectious Diseases (ID) task concerns the biomolecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria but as of yet incompletely understood. In addition to these two main tasks, we introduce the Entity Relations (REL) supporting task, which seeks to assist extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems whose analyses can then be used to support the recognition of various event extraction targets.
Each of these tasks follows the general design of the BioNLP ST'09, providing participants with an extensive, fully annotated corpus with manually curated examples of the extraction targets for method development and training, with evaluation of final submissions received from participants against a separate held-out test set prepared in similar fashion.
We extend on the EPI, ID and REL task results previously reported in the BioNLP Shared Task 2011 workshop proceedings [29][30][31][32] in particular in performing further analyses of the outputs of the participating systems, placing specific emphasis on aspects of event extraction system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation.
In the following, we first briefly motivate each of the EPI, ID and REL tasks and introduce the task setting. We then describe the task data annotation and evaluation criteria before presenting the results of each task. Finally, we present an extended analysis of the outputs of the participating systems.

EPI task
The Epigenetics and Post-translational Modifications (EPI) task is an information extraction task focusing on events relating to epigenetic change, including DNA methylation and histone methylation and acetylation (see e.g. [33,34]), as well as other common protein posttranslational modifications (PTMs) [35]. PTMs are chemical modifications of the amino acid residues of proteins, and DNA methylation a parallel modification of the nucleotides on DNA. While these modifications are chemically simple reactions and can thus be straightforwardly represented in full detail, they have a crucial role in the regulation of gene expression and protein function: the modifications can alter the conformation of DNA or proteins and thus control their ability to associate with other molecules, making PTMs key steps in protein biosynthesis for introducing the full range of protein functions. For instance, protein phosphorylation -the attachment of phosphate -is a common mechanism for activating or inactivating enzymes by altering the conformation of protein active sites [36,37], and protein ubiquitination -the post-translational attachment of the small protein ubiquitin -is the first step of a major mechanism for the destruction (breakdown) of many proteins [38].
Many of the PTMs targeted in the EPI task involve modification of histone, a core protein that forms an octameric complex that has a crucial role in packaging chromosomal DNA. The level of methylation and acetylation of histones controls the tightness of the chromatin structure, and only "unwound" chromatin exposes the gene packed around the histone core to the transcriptional machinery. Since histone modification is of substantial current interest in epigenetics, we designed aspects of the EPI task to capture the full detail in which histone modification events are stated in text. Finally, the DNA methylation of gene regulatory elements controls the expression of the gene by altering the affinity with which DNA-binding proteins (including transcription factors) bind, and highly methylated genes are not transcribed at all [39,40]. DNA methylation can thus "switch off" genes in a way that is reversible through DNA demethylation.
The specificity with which protein modifications can be described in text makes them promising IE targets (see Figure 1), and there have been many studies of automatic extraction of PTMs from the scientific literature in support of modification database curation [41][42][43][44]. However, these have generally targeted only single PTM types such as phosphorylation or ubiquitination, frequently using highly customized rule-based systems that are not readily adaptable to other extraction targets. The BioNLP ST'09 involved the extraction of nine event types including one PTM type, PHOS-PHORYLATION, which was found to be the single most reliably extracted event type in the task, with the bestperforming system for the type achieving 83% F-score in its extraction [45]. The results suggest both that the event representation is well applicable to PTM extraction and that current extraction methods are capable of reliable PTM extraction. Many of the most successful systems participating in the ST'09 further involved general machine learning-based approaches, suggesting that their scope could be extended to PTM extraction more broadly. The EPI task follows up on these opportunities, introducing specific, strongly biologically motivated extraction targets that are expected to be both feasible for high-accuracy event extraction, relevant to the needs of present-day molecular biology, and closely applicable to biomolecular database curation needs.

ID task
The Infectious Diseases (ID) task is an event extraction task focusing on the biomolecular mechanisms of infectious diseases. The task concentrates on the specific domain of two-component systems (TCSs, or two-component regulatory systems), a mechanism widely used by bacteria to sense and respond to the environment [46]. Typical TCSs consist of two proteins, a membrane-associated sensor kinase and a cytoplasmic response regulator. The sensor kinase monitors changes in the environment while the response regulator mediates an adaptive response, usually through differential expression of target genes [47]. TCSs have many functions, but those of particular interest for infectious disease researchers include virulence, response to antibiotics, quorum sensing, and bacterial cell attachment [48]. Not all TCS functions are well known: in some cases, TCSs are involved in metabolic processes that are difficult to precisely characterize [49]. TCSs are of interest also as drugs designed to disrupt TCSs may reduce the virulence of bacteria without killing it, thus avoiding the potential selective pressure of antibiotics lethal to some pathogenic bacteria [50]. Information extraction techniques may support better understanding of these fundamental systems by identifying and structuring the molecular processes underlying two component signaling.
The ID task seeks to support efforts to build a systemic understanding of the molecular-level pathways relating to these mechanisms of infectious diseases by adapting the BioNLP ST'09 event extraction model to domain scientific publications. The adaptation of the model originally introduced to represent biomolecular events relating to transcription factors in human blood cells to a domain that centrally concerns both bacteria and their hosts involves a variety of novel challenges, such as events concerning whole organisms, the chemical environment of bacteria, prokaryote-specific concepts (e.g. regulons as units of gene regulation), as well as the effects of biomolecules on larger-scale processes involving hosts, such as virulence. In addition to supporting an application of significant public health interest, the ID task also provides opportunities to study the ability of event extraction technology to generalize in a number of aspects.

REL supporting task
The Entity Relations (REL) supporting task focuses on the extraction of specific binary relations between biomolecular entities. The motivation for the task draws in part from analysis of the results of the BioNLP ST'09, which suggested that events that involve coreference or entity relations represent particular challenges for extraction [51]. To help address these challenges and encourage modular extraction approaches, increased sharing of successful solutions, and an efficient division of labor, the two were separated into independent supporting tasks on Coreference (CO) [52,53] and Entity Relations [32] for BioNLP ST'11. To allow participants in main tasks to benefit from successful approaches to the supporting tasks, the ST'11 was arranged in two distinct stages, with supporting tasks carried out before the main tasks.

Methods
In this section, we introduce the general representation of the IE tasks, the specific realization of this representation applied in each task, and the task evaluation criteria.

Representation
While the EPI, ID and REL tasks differ substantially in the specifics of their extraction targets, the three share the same basic representation of extracted information, an extension of the representation introduced for the BioNLP ST'09 [11] and applied also in the ST'11 GE task [53,54]. In this section, we first present the shared aspects of this event representation, in particular the four annotation primitives entities, relations, events and event modifications, briefly illustrate the format in which these are stored in the task, and then describe the specific set of annotations involved in each of the tasks.

Entities
All of the tasks build on the basis of entity annotations, which capture mentions of entities of interest in text using a simple typed-span representation. Each entity annotation consists of a type (e.g. PROTEIN) and a (start, end) offset pair identifying the span of text containing the entity mention. The entity annotations thus mark contiguous sequences of characters. All entity annotations further follow the constraint that no two entities of the same type overlap in their spans, and that the spans of no two overlapping annotations cross. By contrast, entity annotation spans may be nested so that one span completely contains another. Figure 2a shows examples of entity annotation.
Entity mention detection and normalization are arguably the most frequently studied IE-related tasks in the domain, and the target of numerous previous and ongoing shared tasks [14,[55][56][57][58]. Further, a wealth of systems addressing these tasks have been introduced (e. g. [59][60][61][62]). To focus on the novel aspects of event extraction, the BioNLP Shared Task series has adopted the general policy of providing task participants with manual "gold standard" annotation identifying the primary entities relevant to each task as a starting point for extraction, thus isolating effects of entity mention detection from IE performance.
The three tasks presented in this paper share the definitions of two entity types, PROTEIN and ENTITY, defined similarly as for the BioNLP ST 2009. Mentions of specific names of genes and gene products are annotated as PROTEIN in all tasks, with some task-specific exceptions to the precise scope of the annotation (see the sections on each task). Gold annotation for PRO-TEIN entities is provided to participants in all tasks. The generic type ENTITY, by contrast, is defined for marking additional entity annotations generated by participants, such as the specific protein domains or DNA regions involved in modification or binding events. Annotations of this type are only provided in training data, and must thus be detected for test data by systems addressing the full main tasks or the supporting task. The non-specific type ENTITY is selected in part to reduce the demands of this entity recognition component of the tasks by removing the need to differentiate between specific types.

Relations
Relations are typed binary associations of entities and may be either directed or undirected. While relations are only a target of extraction in the REL task, all of the three tasks involve a specific relation, EQUIV. This is a binary, symmetric, transitive relation that defines two entities to be equivalent [63]. The relation is used in the gold annotation to mark local aliases such as the full and abbreviated forms of a protein name as referring to the same real-world entity. In evaluation, references to any of a set of equivalent entities are treated identically. Figure 2b shows examples of equiv annotation.
In addition to the general EQUIV relation, the REL task defines directed part-of relations that are the targets of extraction in the supporting task, introduced in the section defining the task.

Events
Events are typed, n-ary associations of entities or other events, each identified as participating in a specific role. Events are bound to specific expressions in text (the event trigger or text binding) and are primary objects of annotation; that is, annotations may refer to event annotations. Event triggers identify the word or words stating the occurrence of the event in text. Like entities, triggers are represented using a (start, end) offset pair. Event types (e.g. BINDING, ACETYLA-TION) are drawn from a fixed set separately defined for each task. Each event typically takes one or more arguments (participants in specific roles): for example, an ACETYLATION event may be defined as requiring a single Theme identifying the PROTEIN entity that is acetylated. Events may involve other events as participants, thus creating complex event structures. For example, a REGULATION event may have a BINDING event as its Theme, thus specifying a compound "regulation of binding" event. Figure 2c shows examples of event annotation.
The main tasks differ in the details of event arguments, but share the definition of the basic core argument roles Theme and Cause as well as the additional argument role Site. As the terms suggest, Theme identifies the participant or participants that undergo the primary effects of the event, and Cause a participant that causes the event to occur. Site identifies a specific part of another participant that is involved in the event, such as the modified residue in a PHOSPHORYLATION event. Event arguments may be specified as being either mandatory or optional, where mandatory arguments must be identified for an event to be extracted. Events typically take a mandatory Theme, reflecting a specificity constraint on the extracted information: while statements regarding e.g. the phosphorylation of specific proteins are targeted for extraction, statements regarding phosphorylation in general are not.
The event arguments vary by event type and task, and the specification of event types and arguments largely defines the differences between the different main tasks of the BioNLP ST'11.

Event modifications
Event modification annotations are used to specify further aspects of event statements beyond the core propositional content, for example identifying an event as being negated. Event modifications are represented as simple binary "flags" attached to events. Both of the main event extraction tasks EPI and ID follow the BioNLP ST'09 setting in defining two event modification extraction targets: NEGATION and SPECULATION. The former marks an event as being explicitly negated (e.g. H2A is not methylated) and the latter as stated in a speculative context (e.g. H2A may be methylated). An event may be simultaneously marked as both negated and speculated (e.g. H2A may not be methylated). Unlike in the representation used for events and the cue-scope model applied for negation and speculation annotation in e.g. the BioScope corpus and the CoNLL 2010 shared task [64,65], no "trigger expressions" are marked for event modifications.

Format
The above presentation of the represented information content abstracts away the specific file format in which this information is stored in the task. While secondary to this information content, the specifics of the format may be of interest for assessing the technical requirements of the task; we thus include in Figure 3 an illustration of the applied file format. In brief, this is a standoff annotation format in which all references to the text are stored as offsets. Each annotation is given an ID that is used to refer to that annotation, and the ID assignment follows a simple scheme to assist identifying entity types (e.g. IDs beginning with "E" for events). For a detailed description of the format, we refer to the BioNLP ST'09 overview [51].

EPI task setting
The EPI task focuses on the extraction of information on statements regarding change in the chemical modification state of proteins and DNA. The task involves the two generally applied entity types PROTEIN and ENTITY, where annotations of the PROTEIN type are provided as part of the input. By contrast to its standard entity definition, the EPI task introduces considerable novelty in the targeted events, involving a total of 14 event types and two participant roles not considered in other BioNLP ST'11 tasks. Table 1 summarizes the targeted event types and their arguments. In addition to the standard Theme, Cause and Site, EPI defines the task-specific arguments Sidechain and Contextgene.
Sidechain, specific to GLYCOSYLATION and DEGLY-COSYLATION among the targeted events, identifies the moiety attached or removed in the event (in glycosylation, the sugar). Note that while arguments similar to Sidechain could be defined for other event types also, their extraction would provide no additional information: the attached molecule is always acetyl in acetylation, methyl in methylation, etc. Contextgene, specific to ACETYLATION and METHYLATION events and their reverse reactions, identifies the gene whose expression is controlled by these modifications. This argument applies specifically for histone protein modification: the modification of the histones that form the nucleosomes that structure DNA are key to the epigenetic control of the expression of the genes encoded in that segment of DNA. Theme is required for all events in the EPI task, but the Site, Sidechain and Contextgene arguments are not mandatory, and should only be extracted when explicitly stated. For CATALYSIS events, representing the catalysis of protein or DNA modification by another protein, both Theme and Cause are mandatory. Figure 4 illustrates some of the EPI task extraction targets.
While the EPI task setting has few extraction targets in common with the BioNLP ST'09 and the ST'11 GE task, its entity types follow the same scheme as these tasks and the general definition aims to preserve compatibility with their setting. The basic modification events in EPI are defined similarly to the PHOSPHORY-LATION event type targeted in ST'09, and while CATA-LYSIS is a new event type, it is related to the ST'09 POSITIVE REGULATION type by a class-subclass relation: any CATALYSIS event is a POSITIVE REGULA-TION event in the ST'09 task terms (but not vice versa).
As described in the section on representation, the EPI task targets also the two event modifications NEGATION and SPECULATION, and involves EQUIV relations in its evaluation. The task extraction targets do not include any relations.

ID task setting
The ID task concerns the molecular mechanisms of infectious diseases, which involve various associations between multiple types of molecular entities, diseasecausing microorganisms and other organisms undergoing the diseases. To support the extraction of information from domain publications, the task extends the basic entity types with multiple new categories. The ID event types extend on those defined in the BioNLP ST'09, broadening the scope of previously defined entities to encompass the new entity types and introducing a new class of events, high-level biological processes. In addition to PROTEIN, the ID task defines four additional types of core entities: TWO-COMPONENT-SYSTEM, REGULON-OPERON, CHEMICAL and ORGAN-ISM. As in the other tasks, mentions of names of genes and their products (RNA and proteins) are annotated with the PROTEIN type. Two-component systems, consisting of two proteins, frequently have names derived from the names of the proteins involved (e.g. PhoP-PhoR or SsrA/SsrB). Mentions of TCSs are annotated as TWO-COMPONENT-SYSTEM, nesting PROTEIN annotations if present. Regulons and operons are collections of genes whose expression is jointly regulated. Like the names of TCSs, their names may derive from the names of the involved genes and proteins, and are annotated as embedding PROTEIN annotations when they do. The annotation does not differentiate between the two, marking both with a single type REGULON-OPERON.
In addition to these three classes relating to genes and proteins, the core entity annotation recognizes the classes CHEMICAL and ORGANISM. All mentions of formal and informal names of atoms, inorganic compounds, carbohydrates and lipids as well as organic compounds other than amino acid and nucleic acid compounds (i.e. gene/protein-related compounds) are annotated as CHEMICAL. Mentions of names of families, genera, species and strains, as well as nonname references with comparable specificity are annotated as ORGANISM. The recognition of these core entities is not part of the ID task: gold annotation for these entities is provided to participants also for test data. As in the other tasks, the non-specific type ENTITY is defined for marking entities that specify additional details of events, ENTITY annotations are not provided for test data and must be detected by participants addressing the full task.
The primary extraction targets of the ID task are the event types summarized in Table 2. These are a superset of those targeted in the BioNLP ST'09 and the ST'11 GE task [54]. This design makes it possible to study aspects of domain adaptation by having the same extraction targets in two subdomains of biomedicine, that of transcription factors in human blood cells (GE) and infectious diseases. The events in the ID task extend on those of GE in the inclusion of additional entity types as participants in previously considered event types and the introduction of the new PROCESS type. The semantics of these events are defined (as in GE) with reference to the community-standard Gene Ontology [66]. We refer to [6,11] for the ST'09/GE definitions.
The definitions of the first four types in Table 2 are otherwise unchanged from the ST'09 definitions except that GENE EXPRESSION and TRANSCRIPTION extend on the former definition in recognizing REGULON-OPERON as an alternative unit of expression.
The type of entity allowed as argument is specified in parenthesis. LOCALIZATION, taking only PROTEIN type arguments in the ST'09 definition, is allowed to take any core entity argument. This expanded definition remains consistent with the scope of the corresponding GO term (GO:0051179). BINDING is similarly extended, giving it a scope largely consistent with GO:0005488 (binding) but also encompassing GO:0007155 (cell adhesion) (e.g. a bacterium binding another) and protein-organism binding. The three regulation types (REGULATION, POSITIVE REGULATION, and NEGATIVE REGULA-TION) likewise allow the new core entity types as arguments, but their definitions are otherwise unchanged from those in ST'09, that is, the GENIA ontology definitions. As in these resources, regulation types are used not only for the biological sense but also to capture statements of general causality [6]. As in ST'09, all events of types discussed above require a Theme argument: only events involving an explicitly stated theme (of an appropriate type) should be extracted. All other arguments are optional. The PROCESS type, new to ID, is used to annotate high-level processes such as virulence, infection and resistance that involve infectious organisms. This type differs from the others in that it has no mandatory arguments: the targeted processes should be extracted even if they have no explicitly stated participants, reflecting that they are of interest even without the further specification. When stated, the involved participants are captured using the generic role type Participant. Figure 5 shows an illustration of some of the ID task extraction targets.
We term the first five event types in Table 2 taking exactly one Theme argument as their core argument simple events. In analysis we further differentiate nonregulation events (the first seven) and regulation (the last three). The latter category is known to represent particular challenges for extraction in involving nested event structures where events take events as arguments.
The ID task event modifications and relations are defined similarly as for GE and EPI: the two modifications NEGATION and SPECULATION are targeted for extraction in the full task, and EQUIV relations are used in evaluation, but no relations are included among extraction targets.

REL task setting
The REL task aims to support the main event extraction tasks by isolating the recognition of entity relations as an independent subtask whose results can be used to resolve event structures. Following the results and analysis from previous studies [67,68], in the design of the REL task we chose to limit the task specifically to relations involving a gene/protein named entity (NE) and one other entity. Fixing one entity involved in each relation to an NE helps assure that the relations are "anchored" to real-world entities, and the specific choice of the gene/protein NE class further provides a category with several existing systems and substantial ongoing efforts addressing the identification of those referents through named entity recognition and normalization.
The recognition of biologically relevant associations of gene/protein NEs is a key focus of the main event extraction tasks of the shared task. As marking identifying mentions of these entities (PROTEIN annotations) are further included as part of the input of the main tasks, the availability of these annotations could be assumed also in the REL task without introducing additional requirements on the task input. However, in the The type of entity allowed as argument is specified in parenthesis. "Core entity" is any of PROTEIN, TWO-COMPONENT-SYSTEM, REGULON-OPERON, CHEMICAL, or ORGANISM. Arguments that can be filled multiple times marked with "+", non-mandatory core arguments with "?" (all additional arguments are nonmandatory).

Figure 5
Example ID event annotation. The association of a TCS with an organism is captured through an event structure involving a PROCESS ("virulence") and POSITIVE REGULATION. Regulation types are used to capture also statements of general causality such as "is essential for" here. (Simplified from PMC ID 2358977).
REL task setting, only one participant in each binary relation is a gene/protein NE, while the other can be either a non-name reference such as promoter or the name of an entity not of the gene/protein type (e.g. a protein complex). Motivated in part by the design goal to avoid introducing additional assumptions on input and the relatively limited number of existing methods for the detection of such entity references, their detection is included in the task: participants must recognize these secondary entities in addition to extracting the relations they participate in. These entities are assigned the generic type ENTITY. The general task setting encompasses a rich set of potential relation extraction targets. For REL, we aimed to select relations that minimize overlap between the targets of other BioNLP ST'11 tasks while maintaining relevance as a supporting goal. As the main tasks primarily target events ("things that happen") involving change in entities, we chose to focus in the REL task on what we have previously termed "static relations" [67], that is, relations such as part-of that hold between entities without necessary implication of causality or change. A previous study further indicated that this class of relations may benefit event extraction [69]. We based our choice of specific target relations on previous studies of entity relations domain texts [67,68], which indicated that part-whole relations are by far the most frequent class of relevant relations for the task setting and proposed a classification of these relations for biomedical entities. We further found that -in terms of the taxonomy of Winston et al. [70] -object-component and collection-member relations account for the great majority of part-of relations relevant to the domain. For REL, we chose to omit collection-member relations in part to minimize overlap with the targets of the coreference task. Instead, we focused on two specific types of objectcomponent relations, that holding between a gene or protein and its part (domains, regions, promoters, amino acids, etc.) and that between a protein and a complex that it is a subunit of. Following the biological motivation and the general practice in the shared task to term genes and gene products PROTEIN for simplicity, we named these two relations PROTEIN-COMPO-NENT and SUBUNIT-COMPLEX. Figure 6 shows an illustration of a simple relation with an associated event (not part of REL). Events with Site arguments such as that shown in the figure are targeted in the GE, EPI, and ID tasks that REL is intended to support.

Corpora
Each of the three tasks makes use of manual annotations created specifically for the shared task, either entirely or in primary part. This section presents the three corpus resources on which these tasks are based.

EPI corpus
The primary EPI task data were annotated specifically for the BioNLP Shared Task 2011 and are not based on any previously released resource. Before starting this annotation effort, we performed two preparatory studies using related, previously released datasets: one considering the extraction of four protein post-translational modification event types [8], with reference to annotations originally created for the Protein Information Resource (PIR) [71,72], and one studying the annotation and extraction of DNA methylation events [10], with reference to annotations created for the PubMeth [73,74] database. The EPI corpus text selection and annotation scheme were then defined following the understanding formed in these studies.

EPI document selection
The texts for the EPI task corpus were drawn from PubMed abstracts. In selecting the primary corpus texts, we aimed to gather a representative sample of all PubMed documents relevant to selected modification events, avoiding bias toward, for example, specific genes/proteins, species, forms of event expression, or subdomains. We primarily targeted DNA methylation and the "prominent PTM types" identified in [8]. We defined the following document selection protocol: for each of the targeted event types, 1. Select a random sample of PubMed abstracts annotated with the MeSH term corresponding to the target event (e.g. Acetylation) 2. Automatically tag protein/gene entity mentions in the selected abstracts, removing abstracts where fewer than a specific cutoff are found 3. Perform manual filtering removing documents not relevant to the targeted topic (optional).
MeSH is a controlled vocabulary of over 25,000 terms that is used to manually annotate each document in PubMed. By performing initial document retrieval using MeSH terms it is possible to select relevant documents without bias toward specific expressions in text. While search for documents tagged with e.g. the Acetylation MeSH term is sufficient to select documents Figure 6 Example REL relation annotation. Simple REL annotation example showing a PROTEIN-COMPONENT (PR-CO) relation between "histone H3" and "lysine 9". An associated METHYLATION event and its arguments (shaded, not part of the REL task targets) shown for context. relevant to the modification, not all such documents necessarily concern specifically protein modification, necessitating a filtering step. Following preliminary experiments, we chose to apply the BANNER named entity tagger [59] trained on the GENETAG corpus [75] and to filter documents where fewer than five entities were identified. Finally, for some modification types this protocol selected also a substantial number of non-relevant documents. In these cases a manual filtering step was performed prior to full annotation to avoid creating detailed annotation for large numbers of non-relevant abstracts.
This primary corpus text selection protocol does not explicitly target reverse reactions such as deacetylation, and the total number of these events in the resulting corpus was low for many types. To be able to measure the extraction performance for these types, we defined a secondary selection protocol that augmented the primary protocol with a regular expression-based filter removing documents that did not (likely) contain mentions of reverse reactions. This protocol was used to select a secondary set of test abstracts enriched in mentions of reverse reactions. Performance on this secondary test set was also evaluated, but is not part of the primary task evaluation. Results on the secondary test set are reported to provide additional perspective the performance of systems at the EPI task.

EPI annotation
Annotation was performed manually. The gene/protein entities automatically detected in the document selection step were provided to annotators for reference for creating PROTEIN annotations, but all entity annotations were checked and revised to conform to the specific guidelines for the task (Due to differences in annotation criteria [76], substantial annotation was required). For the annotation of PROTEIN entities, we adopted the GENIA gene/gene product (GGP) annotation guidelines [77], adding one specific exception: while the primary guidelines require that only specific individual gene or gene product names are annotated, we allowed also the annotation of mentions of histone protein families or the entire histone superfamily to capture histone modification events also in cases where only the family is mentioned.
All event annotations were created from scratch without automatic support to avoid bias toward specific automatic extraction methods or approaches. The event annotation follows the GENIA event corpus annotation guidelines [78] as they apply to protein modifications, with CATALYSIS being annotated following the criteria for the POSITIVE REGULATION event type, with the additional constraints that the Cause of the event is a PROTEIN entity and the form of regulation is catalysis of a modification reaction. The manual annotation was performed by three experienced annotators with a molecular biology background. Annotator training and the overall annotation process was organized and supervised by an annotator with extensive experience in domain event annotation (TO).
After completion of primary annotation, we performed a final check targeting simple human errors using an automatic extraction system as follows: High-confidence system predictions differing from gold annotations were provided to a human annotator for re-evaluation (not used directly to change corpus data). To further reduce the risk of bias, we only informed the annotator of the entities involved, not of the predicted event structure. This correction process resulted in the revision of approximately 2% of the event annotations. To evaluate the consistency of the annotation, we performed independent event annotation (taking PROTEIN annotations as given) for a random sample of 10% of the corpus documents. Comparison of the two manually created sets of event annotations under the primary task evaluation criteria gave an F-score of 82% for the full task and 89% for the core task (due to symmetry of precision/ recall and the applied criteria, this score was not affected by the choice of which set of annotations to consider as "gold" for the comparison). We found that CATALYSIS events were particularly challenging, showing just 65% agreement for the core task. Table 3 shows the statistics of the primary EPI task data. We note that while the corpus is broadly comparable in size to the BioNLP ST'09 dataset [11] in terms of the number of abstracts and annotated entities, the number of annotated events in the EPI corpus is approximately 20% of that in the ST'09 dataset, reflecting the more focused event types.

ID corpus
The ID task data were newly annotated for the BioNLP Shared Task and are not based on any previously released resource. Annotation was performed by two teams, one in Tsujii laboratory (University of Tokyo) and one in Virginia Bioinformatics Institute (Virginia Tech). The entity and event annotation design was guided by previous studies on NER and event extraction in a closely related domain [79,80].

ID document selection
The training and test data were drawn from the primary text content of recent full-text PMC open access documents selected by infectious diseases domain experts (Virginia Tech team) as representative publications on two-component regulatory systems. Table 4 presents some characteristics of the corpus composition. To focus efforts on natural language text likely to express novel information, we excluded tables, figures and their captions, as well as methods sections, acknowledgments, authors' contributions, and similar meta-content.

ID annotation
Annotation was performed in two primary stages, one for marking core entities and the other for events and secondary entities. As a preliminary processing step, initial sentence segmentation was performed with the GENIA Sentence Splitter [81]. Segmentation errors were corrected during core entity annotation. Core entity annotation was performed from the basis of an automatic annotation created using selected existing taggers for the target entities. The following tools and settings were adopted, with parameters tuned on initial annotation for two documents: PROTEIN: NeMine [82] trained on the JNLPBA data [14] with threshold 0.05, filtered to only GENE and Protein types.
Initial automatic tagging was not applied for entities of the REGULON-OPERON type or the generic ENTITY type (for additional event arguments). All automatically generated annotations were at least confirmed through manual inspection, and the majority of the automatic annotations were revised in manual annotation. Table 5 summarizes the tagging performance of the automatic tools as measured against the final human-annotated training and development datasets. It should be noted that these results are low in part due to differences in annotation criteria (see e.g. [76]) and to data tagged using the ID task annotation guidelines not being applied for training; training on the newly annotated data is expected to allow notably more accurate tagging.
Annotation for the task extraction targets -events and event modifications -was created entirely manually without automatic annotation support to avoid any possible bias toward specific extraction methods or approaches. The Tsujii laboratory team organized the annotation effort, with a coordinating annotator with extensive experience in event annotation leading annotator training and annotation scheme development, similarly as for the EPI corpus annotation. Detailed annotation guidelines [85] extending on the GENIA annotation guidelines [78] were developed jointly with all annotators and refined throughout the annotation effort. Based on measurements of inter-annotator consistency between annotations independently created by the two teams, made throughout annotator training and primary annotation (excluding final corpus cleanup), we estimate the consistency of the final entity annotation to be no lower than 90% F-score and that of the event annotation to be no lower than 75% F-score for the primary evaluation criteria (see the section on Evaluation).

ID datasets and statistics
Initial annotation was produced for the selected sections in 33 full-text articles, of which 30 were selected for the final dataset as representative of the extraction targets. These documents were split into training, development and test sets of 15, 5 and 10 documents, respectively. Participants were provided with all training and development set annotations and test set core entity annotations. The overall statistics of the datasets are given in Table 6.   As the corpus consists of full-text documents, it contains a somewhat limited number of articles, but in other terms it is of broadly comparable size to the largest of the BioNLP ST corpora: the corpus word count, for example, corresponds to that of a corpus of approximately 800 PubMed abstracts, and the core entity count is comparable to that in the ST'09 data. However, for reasons that may relate in part to the domain, the event count is approximately a third of that for the ST'09 data. In addition to having less training data, the entity/ event ratio is thus considerably higher (i.e. there are more candidates for each true target), suggesting that the ID data could be expected to provide a more challenging extraction task.

REL corpus
The REL task dataset consists of new annotations for the GENIA corpus [6], building on the existing biomedical term annotation [27], the gene and gene product name annotation [77] and the syntactic annotation [86] of the corpus. The general features of the annotation are presented in [67], describing a previous release of a subset of the data. The REL task annotation effort extended the coverage of the previously released annotation to all relations of the targeted types stated within sentence scope in the GENIA corpus.
For compatibility with the ST'09 and the ST'11 GE task, the REL task training/development/test set division of the GENIA corpus abstracts matches that of the ST'09 data. The statistics of the corpus are presented in Table 7. We note that both in terms of training examples and the data available in the given development set, the number of examples of the PROTEIN-COMPONENT relation is more than twice that for SUBUNIT-COM-PLEX. Thus, at least for methods based on machine learning, we might generally expect to find higher extraction performance for the former relation.

Evaluation
Evaluation for all tasks is based on comparison of submissions from participants against gold standard data prepared in advance for each task. Performance measurement is instance-oriented -that is, each individual annotation (event or relation) is considered separatelyand based on the standard precision, recall and F-score metrics. Specifically, F 1 , the harmonic mean of precision and recall, is used to combine precision and recall and used as the primary ranking criterion in all tasks. The F 1 score is referred to as F-score for short throughout.
The primary evaluation criteria for determining whether a predicted event matches a gold event in the EPI and ID tasks are the same as in the BioNLP ST'09 and the GE task. These criteria relax exact matching in two aspects, incorporating approximate span matching and approximate recursive matching. Under approximate span matching, text-bound annotations (event triggers and ENTITY type entities) in a submission are considered to match a corresponding gold annotation if their span is contained within the expansion of the gold span by one word to both the left and the right. Under approximate recursive matching, events that refer to other events as arguments are considered to match if the Theme arguments of the recursively referred events match, that is, non-Theme arguments are ignored for recursively referred events. For an extended discussion of these evaluation criteria, we refer to their definition in the BioNLP ST'09 overview [11].
In addition to the primary evaluation criteria, we consider a new relaxed event evaluation criterion that we term single partial penalty. Under the primary criteria, when a predicted event matches a gold event in some of its arguments but lacks one or more arguments of the gold event, the submission is arguably given a double penalty: the predicted event is counted as a false positive (FP), and the gold event is counted as a false negative (FN). Under the single partial penalty evaluation criterion, predicted events that match a gold event in all their arguments but do not contain all the arguments of the gold event are not counted as FP, although the corresponding gold event still counts as FN (the "single penalty"). Analogously, gold events that partially match a predicted event are not counted as FN, although the corresponding predicted event with "extra" arguments counts as FP. This criterion is intended to provide a more nuanced view of performance for partially correctly predicted events.
The full ID and EPI tasks involve many partially independent challenges, requiring the extraction of all event arguments (both core and additional) as well as the detection of event modifications (NEGATION and SPECULATION). These aspects of event extraction are treated as separate subtasks in the BioNLP ST'09 and the GE task, where the identification of additional event arguments is subtask 2 and the detection of negated and speculated events subtask 3. While explicit subtasks are not defined for ID and EPI, both tasks specify in addition to the full task targets also minimal core extraction targets, consisting of events with only their core arguments and excluding event modifications. Results are then reported for each submission separately for evaluation against the full gold standard data (full task) and for evaluation where both the gold standard data and the submission events are reduced to only core arguments, event modifications are removed, and resulting duplicate events removed (core task). In terms of the subtask structure of the BioNLP ST'09 and the GE task, the core task is analogous to subtask 1 and the full task analogous to the combination of subtasks 1-3. The evaluation of the REL task relations parallels the criteria for event evaluation in the main tasks. The REL task also relaxes the equality criteria for matching textbound annotations: for a submission entity to match an entity in the gold reference annotation, it is sufficient that the span of the submitted entity (i.e. its start and end positions in text) is entirely contained within the span of the gold annotation. This corresponds largely to the approximate span matching criterion for events, although the REL criterion is slightly stricter in not involving testing against an extension of the gold entity span. Relation matching is exact: for a submitted relation to match a gold one, both its type and the related entities must match.
The statistical significance of the differences in system performance is evaluated using the approximate randomization method with 9,999 repetitions [87,88].

Participation
Final results to each of the EPI and ID main tasks were successfully submitted by seven participants and results for the REL supporting task by four participants. Table 8 summarizes the groups participating in one or more of these tasks, and Table 9 the features of the event and relation extraction systems. We note that, similarly to the ST'09 task, machine learning-based systems remain dominant overall, although there is considerable divergence in the specific methods applied. In addition to domain mainstays such as support vector machines and maximum entropy models, we find increased application of joint models [89][90][91] as opposed to pure pipeline systems [92][93][94]. Remarkably, the application of full parsing is adopted in all systems and the use of dependency-based representations of syntactic analyses in all but one. Further, there is significant uniformity in the specific choice of tools for syntactic analysis: the parser of Charniak and Johnson [95] with the biomedical domain model of McClosky [96] and conversion into the Stanford Dependency representation [97] is applied in six out of nine event extraction systems and all but one relation extraction system. These choices may be motivated in part by the success of systems using the tools in the previous shared task and the availability of the analyses as supporting resources [98].
Several participants compiled dictionaries of event trigger words and two dictionaries of hedge words from the data. The use of dictionary-based approaches instead of machine learning for these components may reflect challenges in training with the trigger and hedge annotations. Trigger annotation is not exhaustive as triggers are not included in gold data e.g. for event statements where the participants fall out of scope of the entity annotation, and negation and speculation are annotated without specific trigger/cue words, precluding direct training for modifier cue detection. Despite the existence of rich lexical resources in the domain, only two groups applied databases such as UMLS in their systems. The results indicate that such resources are not critical for success at the task, a somewhat surprising result that may merit further investigation. In addition to manually curated external resources, ones derived from large-scale unannotated data through unsupervised methods were used by three groups for event extraction and one for relation extraction. Table 10 summarizes the use of corpus resources other than the task training data in the main tasks. Despite the availability of PTM and DNA methylation resources other than those specifically introduced for the task and the PHOSPHORYLATION annotations included in the GE corpus [54], no participant chose to apply other corpora for training in the EPI task. With the exception of externally acquired unlabeled data, the EPI task results thus reflect a closed task setting in which only the given data is used for training. By contrast, in the ID task four teams -including the three top-ranking -used the GE task corpus as supplementary material. The findings indicate that the GE corpus, containing approximately three times as many event annotations as ID, is largely compatible with the ID annotations and can be beneficially combined with the smaller in-domain corpus (see detailed results below). This result is encouraging for future applications of the event extraction approach: as manual annotation requires considerable effort and time, the ability to use large existing annotated resources is important for the feasibility of adaptation of the approach to new domains.
While several participants made use of supporting syntactic analyses provided by the organizers [98], none applied the analyses for supporting tasks such as coreference or REL -at least in cases due to time constraints [99]. There may thus remain further opportunities for improvement through combinations of supporting analyses with main task extraction systems.
We find a remarkable number of similarities between the approaches also for the REL supporting task, with all four utilizing full parsing and a dependency representation of the syntactic analysis, and the three highestranking further using the same parser and specific dependency representation. While UTurku [92] and VIBGhent [100] further agree in the choice of Support Vector Machines for the extraction of relations, Con-cordU [99] and HCMUS [101] pursue approaches  building on dictionary-and rule-based extraction. Only the VIBGhent system makes use of resources external to those provided for the task, extracting specific semantic entity types from the GENIA corpus as well as inducing word similarities from a large unannotated corpus of PubMed abstracts. We note that only two teams, UTurku [93] and Con-cordU [100], predicted event modifications, and only UTurku predicted additional (non-core) event arguments (data not shown). The other five systems thus addressed only the core task. For the full task, this difference in approach is reflected in the substantial performance advantage for the UTurku system, which exhibits highest performance overall as well as for most individual event types. The results suggest that the ability to recover additional arguments is key to competitive performance on the EPI full task.

EPI primary evaluation results
Extraction performance for simple events taking only Theme and Site arguments is consistently higher than for other event types, with absolute F-score differences of over 10% points for many systems. Similar notable performance differences are seen between the addition events, for which ample training data was available, and the removal types for which data was limited (Size column in Table 11). This effect is particularly noticeable for DEPHOSPHORYLATION, DNA DEMETHYLATION and DEMETHYLATION, for which the clear majority of systems failed to predict any correct events.  The Δ f column gives the F-score difference of each core task result to the corresponding primary result.
Extraction performance for CATALYSIS events is very low despite a relatively large set of training examples, indicating that the extraction of nested event structures remains very challenging. This low performance may also be related to the fact that CATALYSIS events are often triggered by the same word as the catalysed modification (e.g. Figure 1b), requiring the assignment of multiple event labels to a single word in typical system architectures. Table 12 (right) summarizes the EPI core task results and and Figure 7 (right) illustrates the statistical significance of the differences. While all systems show notably higher performance than for the full task, high-ranking participants focusing on the core task gain most dramatically, with the FAUST system core task F-score essentially matching that of the top system (UTurku). For the core task, all participants achieve F-scores over 50% -a level of performance achieved by only a single system in the ST'09 task -and the top four participants average over 65% F-score. These results confirm that current event extraction technology is well applicable to the core PTM extraction task, even when the number of targeted event types is relatively high, and may be ready to address the challenges of exhaustive PTM extraction [9]. The best core tasks results, approaching 70% Fscore, are particularly encouraging as the level of performance is comparable to or better than state-of-the-art results for many reference resources for protein-protein interaction extraction (see e.g. [102]) using the simple untyped entity pair representation, a standard task that has been extensively studied in the domain. Table 13 presents the primary results of the ID task by event type, and Table 14 (left) summarizes these results. Evaluation of statistical significance showed that the differences in overall performance are significant for all pairs of systems (p < 0.05).

ID primary evaluation results
The full ID task requires the extraction of additional arguments and event modifications and involves multiple novel challenges from previously addressed event extraction tasks including a new subdomain, full-text documents, several new entity types and a new event category. Nevertheless, extraction performance for the top systems is comparable to the state-of-the-art results for the established BioNLP ST'09 task [103] as well as its repetition as the 2011 GE task [54], where the highest overall result for the primary evaluation criteria was also 56% F-score for the FAUST system [89]. This result is encouraging regarding the ability of the extraction approach and methods to generalize to new domains as well as their applicability specifically to texts on the molecular mechanisms of infectious diseases.
We note that there is substantial variation in the relative performance of systems for different entity types. For example, Stanford [89] has relatively low performance for simple events but achieves the highest result for PROCESS, while UTurku [92] results show roughly the reverse. This suggests further potential for improvement from system combinations.
The best performance for simple events and for PRO-CESS approaches or exceeds 70% F-score, arguably approaching a sufficient level for user-facing applications of the extraction technology. By contrast, BINDING and regulation events, found challenging in ST'09 and GE, remain problematic also in the ID task, with best overall performance below 50% F-score. As in the EPI task, only the UTurku and ConcordU [99] teams attempted to extract event modifications, with somewhat limited performance. The difficulty of correct extraction of event modifications is related in part to the recursive nature of the problem (similarly as for nested regulation events): to extract a modification correctly, the modified event must also be extracted correctly. Again as for EPI, only UTurku predicted any instances of secondary arguments. Thus, teams other than UTurku and ConcordU Figure 7 Statistical significance of EPI task results. Results for full task on the left, core task on the right. Enclosed areas identify sets of systems for which performance differences were not statistically significant; for systems not sharing an enclosed area, differences are significant (p < 0.05). addressed only the core task extraction targets. However, by contrast to the EPI results, lacking prediction of additional arguments does not preclude systems from competitive performance at the ID task.
With the exception of ConcordU, all systems clearly favor precision over recall (Table 14, left), in many cases having over 15% point higher precision than recall. This a a somewhat unexpected inversion as the ConcordU system is rule-based, an approach typically associated with high precision.
The five top-ranking systems participated also in the GE task [54], which involves a subset of the ID extraction targets. This allows additional perspective into the relative performance of the systems. While there is a 13% point spread in overall results for the top five systems here, in GE all these systems achieved F-scores ranging between 50-56%. The results for FAUST, UMass and Stanford were similar in both tasks, while the ConcordU result was 6% points higher for GE and the UTurku result over 10% points higher for GE, ranking third after FAUST and UMass. These results suggest that while the FAUST and UMass systems in particular have some systematic (e.g. architectural) advantage at both tasks, much of the performance difference observed here between the top three systems and those of ConcordU and UTurku is due to strengths or weaknesses specific to ID. Possible weaknesses may relate to the treatment of multiple core entity types (vs. only PROTEIN in GE), challenges related to nested entity annotations (not appearing in GE), or the new PROCESS type. A possible ID-specific strength of the three topranking systems is the use of GE data for training: Riedel and McCallum [91] report an estimated 7% point improvement and McClosky et al. [90] a 3% point improvement from use of this data; McGrath et al. [94] estimate a 1% point improvement from direct corpus combination. The integration strategies applied in training these systems could potentially be applied also with other systems, an experiment that could further clarify the relative strengths of the various systems. The topranking five systems all participated also in the EPI task, for which UTurku ranked first with FAUST having comparable performance for the core task. While this supports the conclusion that ID performance differences do not reflect a simple universal ranking of the systems, due to many substantial differences between the ID and EPI setups it is not straightforward to identify specific reasons for relative differences to performance at EPI. Table 14 (right) summarizes the ID core task results. Please note that these results differ (by up to 0.5% point The Size column gives the number of annotations of each type in the given data (training+development). Best result for each type shown in bold. in overall F-score) from those reported previously [31] due to the correction of an error in the original core task evaluation system. As for the primary ID results, all the differences in system performance for the core task are also statistically significant (p < 0.05). By contrast to the results for EPI, where differences of over 30% points were observed between full and core task results, there are only modest and largely consistent differences to the corresponding full task results for ID, reflecting in part the relative sparseness of additional arguments: in the training data, for example, only approximately 3% of instances of event types that can potentially take additional arguments had one or more additional arguments. While event modifications represent a further 4% of full task extraction targets not required for the core task, the overall low extraction performance for additional arguments and modifications limits the practical effect of these annotation categories on the performance difference between systems addressing only the core targets and those addressing the full task.
REL evaluation results Table 15 shows the results of the REL task. We find that the four systems diverge substantially in terms of overall performance, with all pairs of systems of neighboring ranks showing differences approaching or exceeding 10% points in F-score. While three of the systems notably favor precision over recall, VIBGhent shows a decided preference for recall, suggesting a different approach from UTurku in design details despite the similarities in relation extraction approach. The highestperforming system, UTurku, shows an F-score in the general range of state-of-the-art results in the main event extraction tasks, which could be taken as an indication that the reliability of REL task analyses created with presently available methods may not be high enough for direct use as a building block for the main tasks. However, the emphasis of the highest-scoring system on precision is encouraging for such applications: nearly 70% of the entity-relation pairs that the system predicts are correct. The two top-ranking systems show similar precision and recall results for the two relation types. The submission of HCMUS shows a decided advantage for PROTEIN-COMPONENT relation extraction as tentatively predicted from the relative numbers of training examples, but their rule-based approach suggests training data size is likely not the decisive factor. While the limited amount of data available prevents strong conclusions from being drawn, overall the lack of correlation between training data size and extraction performance suggests that performance may not be primarily limited by the size of the available training data. Table 16 (left) summarizes the full EPI task results with the addition of the single partial penalty criterion. The F-scores for the seven participants under this criterion are on average over 4% points higher than under the primary criteria, with the most substantial increases seen for high-ranking participants only addressing the core task: for example, the precision of the FAUST system [89] is nearly 30% higher under the relaxed criterion. The core evaluation results showed that the limited precision of these systems for the full task is not due to errors in core argument recognition but rather due to lack of recognition for secondary arguments. From an application-oriented perspective it could be argued that the high precision but low recall seen for the single partial penalty criterion is a more accurate representation of actual system performance than seen under the primary criteria, as the systems make few outright false predictions, even though they do lack much of the detail of the full annotation. Despite substantial effects on precision in many cases, the overall F-score difference to the main criteria remains limited due to the low recall of many systems. Table 16 (right) shows the results for the ID task under the single partial penalty criterion. The average difference to F-score for primary criteria is broadly comparable to that for EPI, at slightly less than 4% points. However, this change is due to a very different effect on precision and recall: while for EPI a strong increase in evaluated precision was observed, for ID the increase is more evenly distributed and in cases observable as an increase in recall. This suggests that while in EPI missing arguments were a key limiting factor for performance in most cases, for ID there are comparatively many cases where systems make the error of predicting extra arguments that are not found in otherwise matching gold annotations. It is also interesting to note that while core evaluation is by far the most permissive setting for EPI, for ID higher F-scores are achieved in the full task with the single partial penalty criterion. Evaluation using single partial penalty allows additional perspectives into both the absolute and the comparative performance of systems. The results also indicate possible directions for seeking more meaningful criteria for event extraction system evaluation. While there are no single "correct" answers, further basis for evaluating candidate answers can be provided through manual evaluation reflecting the perceived quality of system outputs, considered in the following.

Additional evaluation results
Finally, Table 17 summarizes the results for the EPI extra test set enriched for events that were relatively sparse in the training data; especially reverse reactions. While these results show the expected drop associated with sparse training data for some systems, surprisingly a reverse effect is seen for others, particularly clearly for UMass. This result is unexpected also as UMass performance for reverse reactions on the main test data was comparatively weak. The mixed results for this additional test data have not suggested to us any straightforward conclusions regarding the comparative strengths of the various approaches, and we will here refrain from speculation regarding possible explanations. Nevertheless, this additional test data remains available for further experiments and evaluation and can hopefully support detailed study of effective strategies for event extraction with limited training data.

System combination
The event extraction system analyses created by the participants represent a rich resource for further study of the task. To explore one opportunity presented by this data, we studied whether the outputs of various systems could be beneficially combined to further improve extraction performance. We performed a straightforward combination of the output of all systems participating in the EPI and ID main tasks using a strategy where the combination system only outputs an event if it is found in the output of at least n combined systems, where n is a parameter ranging between one and the total number of systems. For determining event equality for the combination, we applied the primary evaluation metrics (approximate span and recursive matching). As the majority of systems only addressed the core tasks, we measured the performance of system combinations only on the core extraction targets.
Table 18 (left) shows the results of the combination experiment for the core EPI task and Figure 8 (left) plots the performance of the seven submissions and the seven combinations. We find that for n (2, 3, 4) the combinations outperform the highest standalone result for the core task (68.86%) in terms of F-score, improving on the standalone result by over 3% points in F- Results for EPI and ID full tasks, primary evaluation criteria with single partial penalty. The Δ f columns give F-score differences to the corresponding primary results. Results using primary criteria. The Δ f column gives differences to the main test set results for the same criteria. Results for this data are not considered part of the primary evaluation for the task. score for n = 3, thus providing an approximately 10% relative decrease in error. Different choices of n in calculating the combination permit the choice of various points along a precisionrecall curve spanning nearly the full precision range, giving nearly 97% precision for n = 7. However, no combination permits for comparably high values of recall: even for n = 1, corresponding to the union of all submissions, recall remains below 80%. There thus remains a substantial number of events that none of the systems can retrieve. Table 18 (right) and Figure 8 (right) present the system combination results for the ID task. Also for ID the system combination outperforms the best standalone result (57.57% F-score) for n = 2 and n = 3. The advantage is somewhat less than for EPI, approximately 2.5% points or an approximately 6% decrease in error. The more limited benefit from system combination for ID may be explained in part by the greater variance in the performance of individual systems for the task.
We note that the system combination strategies explored here are comparatively simple and make no use of information such as the relative reliability of the different systems [11], which could be separately estimated e.g. using the development data. There is a possibility that the performance of system combinations could be further substantially improved through the use of advanced combination strategies.

Manual analysis
Evaluation against a fixed gold standard provides a stable, objective basis for the comparison of systems and, when feasible, is often the preferred choice for the study of new tasks such as those introduced in this study. A fixed, fully annotated gold standard has the further benefit of allowing the results of experiments performed after its original introduction to be directly compared against those of the original work on an even basis.
Nevertheless, comparison against gold standard annotation such as that applied in the BioNLP Shared Tasks also has its drawbacks when compared to alternatives such as direct evaluation of system outputs. One such drawback is the difficulty of assessing degrees of correctness for system outputs that differ from gold annotation. For any non-trivial representation aiming to capture (part of) the meaning of complex text there will almost unavoidably be cases where more than one set of annotations are at least acceptable; where the annotation task takes on aspects of translation from natural language into the semantic representation. Tightly specified annotation guidelines can (and in gold standard annotation arguably should) be used to rule out all but one candidate interpretation (cf. controlled language). However, from the perspective of many applications such constraints do not change the basic issue that declaring one out of several potentially acceptable analyses "correct" and others "incorrect" can lead to a distorted picture of system (or human annotator) performance.
To take a specific example, the annotation guidelines of the GE and ID tasks require each "link" in a stated chain of causation to be separately marked, so that e.g. A affects the regulation of B is marked with two REGU-LATION annotations (Figure 9 top). However, from the  perspective of many applications (e.g. pathway curation and semantic search), an annotation involving a single REGULATION event correctly identifying the direction of causality (Figure 9 middle) may be equally acceptable, or at least clearly preferable to an annotation where the direction is inverted (Figure 9 bottom). Yet in evaluation against the gold standard the latter two are both equally wrong.
Another drawback of the evaluation against a gold standard is that errors in gold annotation -unavoidable in any human product of this complexity -will tend to bias results against the systems, leading to underestimates of performance. The tendency of gold annotation errors to hurt rather than help systems in evaluation arises for any representation where there are many more ways to annotate wrongly than correctly, as the likelihood of an erroneous gold annotation matching an erroneous system output will be comparatively small.
To provide additional perspective to the performance of the systems participating in the EPI and ID main tasks, we performed an extensive manual analysis of system outputs, providing an experienced annotator with a random sample of the events predicted by each of the participating systems. To avoid possible bias, we blinded the inputs so that the annotator could not tell which system had created which annotation or whether the annotation had matched gold or not. To further assess the quality and stability of the original gold standard annotation, we additionally included a similarly blinded sample of annotations drawn from the gold standard. Specifically, for both EPI and ID we selected for evaluation a random sample of 100 events from each of the seven final submissions as well as the gold data, for a total of 800 events for each task. These 1,600 events were then blindly evaluated with instructions to mark each as one of correct, acceptable or incorrect. The first of these labels identifies an event as, informally, "as good as gold": a complete and accurate representation of the targeted aspects of the relevant biological claim with respect to the task annotation guidelines. The label acceptable specifies that an event is an acceptable representation of the relevant central biological statement, but either incomplete or at variance with the standard guidelines in some secondary aspect. The last label was used for any candidate for which the first two labels were not applicable. As for the system combination, we performed this evaluation for core task subsets as the majority of participants did not attempt the full task. Table 19 shows the results of the manual analysis for the EPI and ID tasks. We first note that the evaluation supports the estimates of high gold annotation quality: in total, only four of 200 blindly evaluated gold events were marked incorrect. The difference in the number of "accurate" judgments supports a view received also from annotators' informal impressions that the EPI task targets are very strictly defined, while those of the ID task require more interpretation of the guidelines.
The evaluation shows good correlation with the results of evaluation against gold: a ranking of systems by the number of events judged correct or acceptable produces no instances of disagreement with ranking by the number of matches with gold in the samples. However, despite high correlation, the manual evaluation results indicate systematic differences between the quality of system outputs as perceived in manual evaluation and the results of evaluation against gold. For EPI, the average precision of the systems over the 700 selected events is 74.3% from evaluation against gold while manual evaluation of the same events suggests 79.3% are correct (79.6% correct or acceptable). This effect is consistent also when examining the individual samples of 100 events for each system separately. Accepting (for the moment) the manual evaluation as "ground truth", this result indicates that approximately 20% of events appearing as false positive in evaluation against gold are in fact correct. Number of manually evaluated events judged correct (Corr.), acceptable (Acc.) and incorrect (Inc.) out of samples of 100, with number matching gold annotation (Match) and, for reference, core task precision (Prec.). Analysis performed independently for EPI and ID tasks.
For ID, while the average precision for the sample of 700 events is 49.2% in comparison against gold, the manual evaluation suggests that 64.7% of the events are correct. This discrepancy between the two evaluations is larger than that for EPI also in relative terms: with reference to the results of the manual evaluation, 30% of "false positive" events in gold evaluation are correct. Unlike in the EPI evaluation, a nontrivial amount of ID events were marked acceptable; the evaluation suggests that less than 30% of all event predictions are incorrect, a striking contrast to the over 50% estimate from comparison to gold.
Some of these differences are almost certainly due to events missing from gold. However, given the very high precision of gold events measured also in this evaluation, the lack of factors that could make errors of omission up to an order of magnitude more common than errors of commission in gold annotation, and the support from the relative number of "accurate" markings, we find it likely that the observed discrepancy is primarily due to cases that genuinely permit more than one acceptable analysis.
While the trend that this evaluation shows was expected and similar results have been reported previously (e.g. [25]), the specific estimates of the magnitude of the discrepancy between evaluation of event extraction outputs against a fixed gold standard and the perceived quality in manual evaluation is a novel result. This difference was found to be substantial in particular for ID, especially when taking into account cases that are acceptable although not entirely correct. This evaluation indicates that the quality of event extraction results perceived by users may be considerably higher than the primary shared task evaluation would suggest.

Conclusions
We have presented the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Epigenetics and Post-translational modifications (EPI) and Infectious Diseases (ID) and the supporting task on Entity Relations (REL).
The two main tasks were each based on newly annotated corpus resources developed for the purposes of the shared task. For the EPI task, we annotated 1,200 publication abstracts selected as a balanced sample of all PubMed citations regarding the selected event types. For ID, a corpus of 30 full-text publications on the twocomponent systems subdomain of infectious diseases was created in a collaboration of event annotation and domain experts, adapting and extending the BioNLP'09 Shared Task (ST'09) event representation to the domain. Data for the supporting task REL was created by extending previously introduced GENIA corpus annotations.
The EPI task results demonstrated that the core extraction target of identifying statements of 14 different gene/protein modification types can be reliably addressed by current event extraction methods, with two systems approaching 70% F-score for this task. We further demonstrated that a simple system combination can further improve on this level of performance, reducing error rate by approximately 10%. Challenges remain in detecting statements regarding the catalysis of modification events as well as in resolving the full detail of such events, a task attempted by only one EPI task participant and at which performance remains below 55% F-score.
For the ID task, despite the novel challenges of full papers, four new entity types, extension of event scopes and the introduction of a new event category for highlevel processes, the highest results achieved by the seven participating teams were comparable to the state-of-theart performance on the established ST'09 data, showing that the event extraction approach and present systems generalize well and demonstrating the feasibility of event extraction for the infectious diseases domain. Analysis of the results suggested further opportunities for improving extraction performance, and a system combination experiment demonstrated that even comparatively simple combination approaches can realize further benefits. Finally, manual evalutation of system outputs suggested that many events identified as false positives in evaluation against the gold standard data may nevertheless be acceptable from a user perspective.
Both the EPI and ID main tasks were designed from the perspective of supporting specific biocuration tasks. We believe successful EPI systems have potential to offer further support for existing protein modification database curation as well as for novel efforts targeting specifically epigenetics-related modifications. The resources, tools and methods developed in the ID task are currently being adopted in the development of the Pathosystems Resource Integration Center (PATRIC) [104]. Present and future advances in these tasks can thus assist biologists in efforts of substantial scientific and public health interest.
As was done after the BioNLP'09 shared task, we have made the data, tools and resources for all three task introduced in this paper open to all interested parties to encourage further study of event extraction. The tasks continue as an open shared challenges accessible through the shared task website, http://www.bionlp-st. org/. full contents of the supplement are available online at http://www. biomedcentral.com/bmcbioinformatics/supplements/13/S11. We would like to thank Yoshiro Okuda and Yo Shidahara of NalaPro Technologies for their efforts in producing the EPI task annotation and Yumiko Katsu-Kimura for contributions to the REL task annotation. This work was supported by Grant-in-Aid for Specially Promoted Research (MEXT, Japan) and funded in part with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN272200900040C, awarded to BWS Sobral. The UK National Centre for Text Mining is funded by UK Joint Information Systems Committee (JISC). Authors' contributions SP and TO conceived of the tasks. SP coordinated the ID and REL task organization, drafted the manuscript, and implemented evaluation and support tools. TO coordinated the EPI task organization and led the EPI, ID and REL annotation efforts. RR, DS, CM and CW contributed to the design of the ID task, RR created the ID task initial automatic annotation, and DS, CM, CW and TO created the ID task manual annotation. BS and SA participated in the design and coordination of the ID task and JT participated in the design and coordination of the overall shared task. All authors read and approved the final manuscript.