Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011

Pyysalo, Sampo; Ohta, Tomoko; Rak, Rafal; Sullivan, Dan; Mao, Chunhong; Wang, Chunxia; Sobral, Bruno; Tsujii, Jun'ichi; Ananiadou, Sophia

doi:10.1186/1471-2105-13-S11-S2

Volume 13 Supplement 11

Selected articles from the BioNLP Shared Task 2011

Proceedings
Open access
Published: 26 June 2012

Sampo Pyysalo^1,2,
Tomoko Ohta^3,
Rafal Rak^1,2,
Dan Sullivan^4,
Chunhong Mao^4,
Chunxia Wang^4,
Bruno Sobral^4,
Jun'ichi Tsujii^5 &
Sophia Ananiadou^1,2

BMC Bioinformatics volume 13, Article number: S2 (2012)

Abstract

We present the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI), and the supporting task on Entity Relations (REL). The two main tasks represent extensions of the event extraction model introduced in the BioNLP Shared Task 2009 (ST'09) to two new areas of biomedical scientific literature, each motivated by the needs of specific biocuration tasks. The ID task concerns the molecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria. The EPI task is dedicated to the extraction of statements regarding chemical modifications of DNA and proteins, with particular emphasis on changes relating to the epigenetic control of gene expression. By contrast to these two application-oriented main tasks, the REL task seeks to support extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems. Seven groups participated in each of the two main tasks and four groups in the supporting task. The participating systems indicated advances in the capability of event extraction methods and demonstrated generalization in many aspects: from abstracts to full texts, from previously considered subdomains to new ones, and from the ST'09 extraction targets to other entities and events. The highest performance achieved in the supporting task REL, 58% F-score, is broadly comparable with levels reported for other relation extraction tasks. For the ID task, the highest-performing system achieved 56% F-score, comparable to the state-of-the-art performance at the established ST'09 task. 
In the EPI task, the best result was 53% F-score for the full set of extraction targets and 69% F-score for a reduced set of core extraction targets, approaching a level of performance sufficient for user-facing applications. In this study, we extend previously reported results and perform further analyses of the outputs of the participating systems. We place specific emphasis on aspects of system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation. The manually annotated corpora, supporting resources, and evaluation tools for all tasks are available from http://www.bionlp-st.org and the tasks continue as open challenges for all interested parties.

Background

The biomedical scientific literature is growing at an exponential rate, far outstripping the capacity of individual researchers to process it fully in any but the narrowest of subfields. To address the challenges of information overload and to improve access to the wealth of knowledge in this literature, there have been substantial efforts over the previous 15 years to develop automatic methods for the analysis of medical and biomolecular scientific publications [1, 2].

Much of this work has focused on information extraction (IE) and text mining, applying natural language processing (NLP) methods to analyse domain texts, extract structured, computer-readable representations of key information, and compile extracted information into knowledge bases. Until recently, domain IE efforts concentrated primarily on foundational tasks, such as entity mention detection, and on the extraction of simple representations of entity associations, most typically the detection of mentions of names of proteins and the extraction of protein mention pairs representing protein-protein interactions.

However, in recent years there has been increasing interest in the application of more expressive representations in domain IE to address the requirements of tasks such as pathway curation, Gene Ontology term annotation, and semantic literature search [3, 4]. To support the development and evaluation of methods for such tasks, a number of recently introduced corpus resources have been manually annotated using event representations that capture structured associations of arbitrary numbers of participants in specific roles [5–10]. The community took a decisive step toward the introduction of practical tools capable of extracting information using such representations in the BioNLP Shared Task 2009 (BioNLP ST'09) [11, 51].

Shared tasks have been instrumental in the development of general domain IE technology by introducing new tasks, resources and evaluation standards [12, 13]. Also in the biomedical domain, shared tasks such as JNLPBA [14], LLL [15], TREC Genomics [16] and BioCreative [17–19] have played a central role in focusing the efforts of the community on new, timely tasks and challenges. The BioNLP ST'09, the first shared task in its series, sought to advance the state of the art in structured event extraction by providing a shared task definition and annotated data as well as evaluation criteria and tools for the task. The task met with enthusiastic response from the community: 24 groups participated in the task, proposing a variety of approaches for automatic event extraction. Interest in event extraction continued past the original shared task, whose data and setup have supported further advances in extraction methods and the introduction of automatically annotated literature-scale resources [20–26].

Although successful in introducing structured event representations to the general community and promoting the development of practically applicable methods for event extraction, the resources of the BioNLP ST'09 were somewhat limited in their scope. The task data was prepared on the basis of the GENIA corpus [6, 27], an annotated resource of publication abstracts in the domain of transcription factors in human blood cells, and the event types targeted in the task were chosen by relevance to its topics. These limitations raised the question of whether the findings of the shared task and the methods introduced to address the task can generalize beyond this narrow domain. To address such questions, generalization was chosen as the main theme of the second event in the series, the BioNLP Shared Task 2011 (BioNLP ST'11) [28]. This was emphasized in task selection and design and data preparation, targeting new domains, an extended set of extraction targets, and new text types, including full-text articles.

In this paper, we present three of the eight tasks in BioNLP ST'11, including two tasks that attracted the widest participation among newly introduced main tasks, EPI and ID. The Epigenetics and Post-translational Modifications (EPI) task focuses on events relating to epigenetic change and encompasses also common protein post-translational modifications, reactions that are critical for the control of gene expression and protein function. The Infectious Diseases (ID) task concerns the biomolecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria but as of yet incompletely understood. In addition to these two main tasks, we introduce the Entity Relations (REL) supporting task, which seeks to assist extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems whose analyses can then be used to support the recognition of various event extraction targets.

Each of these tasks follows the general design of the BioNLP ST'09, providing participants with an extensive, fully annotated corpus with manually curated examples of the extraction targets for method development and training, with evaluation of final submissions received from participants against a separate held-out test set prepared in similar fashion.

We extend the EPI, ID and REL task results previously reported in the BioNLP Shared Task 2011 workshop proceedings [29–32], in particular by performing further analyses of the outputs of the participating systems. We place specific emphasis on aspects of event extraction system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation.

In the following, we first briefly motivate each of the EPI, ID and REL tasks and introduce the task setting. We then describe the task data annotation and evaluation criteria before presenting the results of each task. Finally, we present an extended analysis of the outputs of the participating systems.

EPI task

The Epigenetics and Post-translational Modifications (EPI) task is an information extraction task focusing on events relating to epigenetic change, including DNA methylation and histone methylation and acetylation (see e.g. [33, 34]), as well as other common protein post-translational modifications (PTMs) [35]. PTMs are chemical modifications of the amino acid residues of proteins, and DNA methylation a parallel modification of the nucleotides on DNA. While these modifications are chemically simple reactions and can thus be straightforwardly represented in full detail, they have a crucial role in the regulation of gene expression and protein function: the modifications can alter the conformation of DNA or proteins and thus control their ability to associate with other molecules, making PTMs key steps in protein biosynthesis for introducing the full range of protein functions. For instance, protein phosphorylation - the attachment of phosphate - is a common mechanism for activating or inactivating enzymes by altering the conformation of protein active sites [36, 37], and protein ubiquitination - the post-translational attachment of the small protein ubiquitin - is the first step of a major mechanism for the destruction (breakdown) of many proteins [38].

Many of the PTMs targeted in the EPI task involve modification of histone, a core protein that forms an octameric complex that has a crucial role in packaging chromosomal DNA. The level of methylation and acetylation of histones controls the tightness of the chromatin structure, and only "unwound" chromatin exposes the gene packed around the histone core to the transcriptional machinery. Since histone modification is of substantial current interest in epigenetics, we designed aspects of the EPI task to capture the full detail in which histone modification events are stated in text. Finally, the DNA methylation of gene regulatory elements controls the expression of the gene by altering the affinity with which DNA-binding proteins (including transcription factors) bind, and highly methylated genes are not transcribed at all [39, 40]. DNA methylation can thus "switch off" genes in a way that is reversible through DNA demethylation.

The specificity with which protein modifications can be described in text makes them promising IE targets (see Figure 1), and there have been many studies of automatic extraction of PTMs from the scientific literature in support of modification database curation [41–44]. However, these have generally targeted only single PTM types such as phosphorylation or ubiquitination, frequently using highly customized rule-based systems that are not readily adaptable to other extraction targets. The BioNLP ST'09 involved the extraction of nine event types including one PTM type, PHOSPHORYLATION, which was found to be the single most reliably extracted event type in the task, with the best-performing system for the type achieving 83% F-score in its extraction [45]. The results suggest both that the event representation is well applicable to PTM extraction and that current extraction methods are capable of reliable PTM extraction. Many of the most successful systems participating in the ST'09 further involved general machine learning-based approaches, suggesting that their scope could be extended to PTM extraction more broadly. The EPI task follows up on these opportunities, introducing specific, strongly biologically motivated extraction targets that are expected to be feasible for high-accuracy event extraction, relevant to the needs of present-day molecular biology, and closely applicable to biomolecular database curation needs.

ID task

The Infectious Diseases (ID) task is an event extraction task focusing on the biomolecular mechanisms of infectious diseases. The task concentrates on the specific domain of two-component systems (TCSs, or two-component regulatory systems), a mechanism widely used by bacteria to sense and respond to the environment [46]. Typical TCSs consist of two proteins, a membrane-associated sensor kinase and a cytoplasmic response regulator. The sensor kinase monitors changes in the environment while the response regulator mediates an adaptive response, usually through differential expression of target genes [47]. TCSs have many functions, but those of particular interest for infectious disease researchers include virulence, response to antibiotics, quorum sensing, and bacterial cell attachment [48]. Not all TCS functions are well known: in some cases, TCSs are involved in metabolic processes that are difficult to precisely characterize [49]. TCSs are of interest also because drugs designed to disrupt TCSs may reduce the virulence of bacteria without killing them, thus avoiding the potential selective pressure of antibiotics lethal to some pathogenic bacteria [50]. Information extraction techniques may support better understanding of these fundamental systems by identifying and structuring the molecular processes underlying two-component signaling.

The ID task seeks to support efforts to build a systemic understanding of the molecular-level pathways relating to these mechanisms of infectious diseases by adapting the BioNLP ST'09 event extraction model to domain scientific publications. The adaptation of the model originally introduced to represent biomolecular events relating to transcription factors in human blood cells to a domain that centrally concerns both bacteria and their hosts involves a variety of novel challenges, such as events concerning whole organisms, the chemical environment of bacteria, prokaryote-specific concepts (e.g. regulons as units of gene regulation), as well as the effects of biomolecules on larger-scale processes involving hosts, such as virulence. In addition to supporting an application of significant public health interest, the ID task also provides opportunities to study the ability of event extraction technology to generalize in a number of aspects.

REL supporting task

The Entity Relations (REL) supporting task focuses on the extraction of specific binary relations between biomolecular entities. The motivation for the task draws in part from analysis of the results of the BioNLP ST'09, which suggested that events that involve coreference or entity relations represent particular challenges for extraction [51]. To help address these challenges and encourage modular extraction approaches, increased sharing of successful solutions, and an efficient division of labor, the two were separated into independent supporting tasks on Coreference (CO) [52, 53] and Entity Relations [32] for BioNLP ST'11. To allow participants in main tasks to benefit from successful approaches to the supporting tasks, the ST'11 was arranged in two distinct stages, with supporting tasks carried out before the main tasks.

Methods

In this section, we introduce the general representation of the IE tasks, the specific realization of this representation applied in each task, and the task evaluation criteria.

Representation

While the EPI, ID and REL tasks differ substantially in the specifics of their extraction targets, the three share the same basic representation of extracted information, an extension of the representation introduced for the BioNLP ST'09 [11] and applied also in the ST'11 GE task [53, 54]. In this section, we first present the shared aspects of this event representation, in particular the four annotation primitives entities, relations, events and event modifications, briefly illustrate the format in which these are stored in the task, and then describe the specific set of annotations involved in each of the tasks.

Entities

All of the tasks build on the basis of entity annotations, which capture mentions of entities of interest in text using a simple typed-span representation. Each entity annotation consists of a type (e.g. PROTEIN) and a (start, end) offset pair identifying the span of text containing the entity mention. The entity annotations thus mark contiguous sequences of characters. All entity annotations further follow the constraint that no two entities of the same type overlap in their spans, and that the spans of no two overlapping annotations cross. By contrast, entity annotation spans may be nested so that one span completely contains another. Figure 2a shows examples of entity annotation.
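The span constraints described above can be illustrated with a minimal Python sketch. The class and function names are hypothetical and are not part of the task resources, and the half-open (start inclusive, end exclusive) offset convention is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A typed span: an entity type plus (start, end) character offsets."""
    type: str
    start: int  # offset of the first character (assumed inclusive)
    end: int    # offset just past the last character (assumed exclusive)


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def nested(a: Entity, b: Entity) -> bool:
    """True if one span completely contains the other."""
    return (a.start <= b.start and b.end <= a.end) or \
           (b.start <= a.start and a.end <= b.end)


def crosses(a: Entity, b: Entity) -> bool:
    """True if the spans overlap without either containing the other."""
    return overlaps(a, b) and not nested(a, b)


def violates_constraints(a: Entity, b: Entity) -> bool:
    """True if the pair breaks either constraint stated in the text:
    same-type entities must not overlap, and no two spans may cross."""
    return crosses(a, b) or (a.type == b.type and overlaps(a, b))
```

Under these definitions, a PROTEIN span fully contained in an ENTITY span is permitted, while two partially overlapping spans of any type are not.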

Entity mention detection and normalization are arguably the most frequently studied IE-related tasks in the domain, and the target of numerous previous and ongoing shared tasks [14, 55–58]. Further, a wealth of systems addressing these tasks have been introduced (e.g. [59–62]). To focus on the novel aspects of event extraction, the BioNLP Shared Task series has adopted the general policy of providing task participants with manual "gold standard" annotation identifying the primary entities relevant to each task as a starting point for extraction, thus isolating effects of entity mention detection from IE performance.

The three tasks presented in this paper share the definitions of two entity types, PROTEIN and ENTITY, defined as in the BioNLP ST'09. Mentions of specific names of genes and gene products are annotated as PROTEIN in all tasks, with some task-specific exceptions to the precise scope of the annotation (see the sections on each task). Gold annotation for PROTEIN entities is provided to participants in all tasks. The generic type ENTITY, by contrast, is defined for marking additional entity annotations generated by participants, such as the specific protein domains or DNA regions involved in modification or binding events. Annotations of this type are only provided in training data, and must thus be detected for test data by systems addressing the full main tasks or the supporting task. The non-specific type ENTITY is selected in part to reduce the demands of this entity recognition component of the tasks by removing the need to differentiate between specific types.

Relations

Relations are typed binary associations of entities and may be either directed or undirected. While relations are only a target of extraction in the REL task, all of the three tasks involve a specific relation, EQUIV. This is a binary, symmetric, transitive relation that defines two entities to be equivalent [63]. The relation is used in the gold annotation to mark local aliases such as the full and abbreviated forms of a protein name as referring to the same real-world entity. In evaluation, references to any of a set of equivalent entities are treated identically. Figure 2b shows examples of EQUIV annotation.
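Because EQUIV is symmetric and transitive, a set of pairwise EQUIV annotations partitions the entities into equivalence classes. The following Python sketch shows one way such classes could be computed with a union-find structure and then used to treat equivalent references identically. It is an illustration only: the function names are hypothetical, and the official evaluation tools may implement this differently.

```python
from collections import defaultdict


def equivalence_classes(entity_ids, equiv_pairs):
    """Group entity IDs into classes under a symmetric, transitive relation."""
    parent = {e: e for e in entity_ids}

    def find(x):
        # Follow parent links to the class representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in equiv_pairs:
        # Merge the classes of a and b.
        parent[find(a)] = find(b)

    classes = defaultdict(set)
    for e in entity_ids:
        classes[find(e)].add(e)
    return list(classes.values())


def same_entity(a, b, classes):
    """True if a and b are identical or belong to the same equivalence class."""
    return a == b or any(a in c and b in c for c in classes)
```

For example, if T1 is marked equivalent to T2 and T2 to T3, a reference to T1 matches a gold reference to T3, even though no T1-T3 pair was annotated directly.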

In addition to the general EQUIV relation, the REL task defines directed part-of relations that are the targets of extraction in the supporting task, introduced in the section defining the task.

Events

Events are typed, n-ary associations of entities or other events, each identified as participating in a specific role. Events are bound to specific expressions in text (the event trigger or text binding) and are primary objects of annotation; that is, other annotations, such as event modifications or other events, may refer to them. Event triggers identify the word or words stating the occurrence of the event in text. Like entities, triggers are represented using a (start, end) offset pair. Event types (e.g. BINDING, ACETYLATION) are drawn from a fixed set separately defined for each task. Each event typically takes one or more arguments (participants in specific roles): for example, an ACETYLATION event may be defined as requiring a single Theme identifying the PROTEIN entity that is acetylated. Events may involve other events as participants, thus creating complex event structures. For example, a REGULATION event may have a BINDING event as its Theme, thus specifying a compound "regulation of binding" event. Figure 2c shows examples of event annotation.

The main tasks differ in the details of event arguments, but share the definition of the basic core argument roles Theme and Cause as well as the additional argument role Site. As the terms suggest, Theme identifies the participant or participants that undergo the primary effects of the event, and Cause a participant that causes the event to occur. Site identifies a specific part of another participant that is involved in the event, such as the modified residue in a PHOSPHORYLATION event. Event arguments may be specified as being either mandatory or optional, where mandatory arguments must be identified for an event to be extracted. Events typically take a mandatory Theme, reflecting a specificity constraint on the extracted information: while statements regarding e.g. the phosphorylation of specific proteins are targeted for extraction, statements regarding phosphorylation in general are not.

The event arguments vary by event type and task, and the specification of event types and arguments largely defines the differences between the different main tasks of the BioNLP ST'11.
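As a rough illustration of the event structures described in this section, the following Python sketch models events with role-labelled arguments that may themselves be events, and checks that mandatory arguments are present. The `MANDATORY` table is a hypothetical placeholder and does not reproduce the actual task definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """An event: a type, a trigger span, and (role, participant) arguments."""
    id: str
    type: str
    trigger: tuple  # (start, end) offsets of the trigger expression
    # Each argument pairs a role name with an entity ID or a nested Event.
    args: list = field(default_factory=list)


# Hypothetical mandatory roles per event type (not the official definitions).
MANDATORY = {
    "ACETYLATION": {"Theme"},
    "BINDING": {"Theme"},
    "REGULATION": {"Theme"},
}


def is_complete(event: Event) -> bool:
    """True if every mandatory role is filled, recursively for nested events."""
    roles = {role for role, _ in event.args}
    if not MANDATORY.get(event.type, set()) <= roles:
        return False
    return all(is_complete(p) for _, p in event.args if isinstance(p, Event))
```

A compound "regulation of binding" structure is then simply a REGULATION event whose Theme is a BINDING event; an event lacking its mandatory Theme, or containing such an event, would not be extracted under this check.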

Event modifications

Event modification annotations are used to specify further aspects of event statements beyond the core propositional content, for example identifying an event as being negated. Event modifications are represented as simple binary "flags" attached to events. Both of the main event extraction tasks EPI and ID follow the BioNLP ST'09 setting in defining two event modification extraction targets: NEGATION and SPECULATION. The former marks an event as being explicitly negated (e.g. H2A is not methylated) and the latter as stated in a speculative context (e.g. H2A may be methylated). An event may be simultaneously marked as both negated and speculated (e.g. H2A may not be methylated). Unlike in the representation used for events and the cue-scope model applied for negation and speculation annotation in e.g. the BioScope corpus and the CoNLL 2010 shared task [64, 65], no "trigger expressions" are marked for event modifications.
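The flag-based representation can be sketched in a few lines of Python. The class name and fields below are hypothetical; the point is only that each modification is a binary attribute of an event, with no associated text span.

```python
from dataclasses import dataclass


@dataclass
class EventModifications:
    """Binary flags attached to one event; no trigger span is stored."""
    event_id: str
    negated: bool = False      # NEGATION, e.g. "H2A is not methylated"
    speculated: bool = False   # SPECULATION, e.g. "H2A may be methylated"

    def labels(self) -> list:
        """Return the modification types set on the event."""
        out = []
        if self.negated:
            out.append("NEGATION")
        if self.speculated:
            out.append("SPECULATION")
        return out
```

An event stated as "H2A may not be methylated" would carry both flags.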

Format

The above presentation of the represented information content abstracts away the specific file format in which this information is stored in the task. While secondary to this information content, the specifics of the format may be of interest for assessing the technical requirements of the task; we thus include in Figure 3 an illustration of the applied file format. In brief, this is a standoff annotation format in which all references to the text are stored as offsets. Each annotation is given an ID that is used to refer to that annotation, and the ID assignment follows a simple scheme to assist in identifying annotation types (e.g. IDs beginning with "E" for events). For a detailed description of the format, we refer to the BioNLP ST'09 overview [51].
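To give a sense of the technical requirements, the following Python sketch parses a few standoff lines. The line layout is an assumption based on the general shape of BioNLP ST standoff files: tab-separated ID and body, "T" IDs for text-bound annotations, "E" for events, "M" for modifications, and "*" for EQUIV. It is not a reproduction of Figure 3, and the sample lines are invented for illustration; the format description in [51] is authoritative.

```python
def parse_standoff(lines):
    """Parse standoff annotation lines into a dict keyed by annotation kind."""
    doc = {"spans": {}, "events": {}, "modifications": {}, "equivs": []}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ann_id, body = line.split("\t", 1)
        if ann_id.startswith("T"):
            # Text-bound annotation: "TYPE START END<TAB>covered text"
            head, _, text = body.partition("\t")
            ann_type, start, end = head.split()
            doc["spans"][ann_id] = (ann_type, int(start), int(end), text)
        elif ann_id.startswith("E"):
            # Event: "TYPE:TRIGGER_ID ROLE:ARG_ID ..."
            first, *rest = body.split()
            ev_type, trigger = first.split(":")
            args = [tuple(a.split(":")) for a in rest]
            doc["events"][ann_id] = (ev_type, trigger, args)
        elif ann_id.startswith("M"):
            # Modification flag: "TYPE EVENT_ID"
            mod_type, target = body.split()
            doc["modifications"][ann_id] = (mod_type, target)
        elif ann_id == "*":
            # Equivalence: "Equiv ID ID ..."
            rel_type, *members = body.split()
            if rel_type == "Equiv":
                doc["equivs"].append(members)
    return doc


# Invented sample for the sentence "H2A is not methylated".
sample = [
    "T1\tProtein 0 3\tH2A",
    "T2\tMethylation 11 21\tmethylated",
    "E1\tMethylation:T2 Theme:T1",
    "M1\tNegation E1",
]
doc = parse_standoff(sample)
```

Because all references to the text are offsets, the annotations can be stored and exchanged separately from the text files they describe.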

EPI task setting

The EPI task focuses on the extraction of information on statements regarding change in the chemical modification state of proteins and DNA. The task involves the two generally applied entity types PROTEIN and ENTITY, where annotations of the PROTEIN type are provided as part of the input. By contrast to its standard entity definition, the EPI task introduces considerable novelty in the targeted events, involving a total of 14 event types and two participant roles not considered in other BioNLP ST'11 tasks. Table 1 summarizes the targeted event types and their arguments. In addition to the standard Theme, Cause and Site, EPI defines the task-specific arguments Sidechain and Contextgene.

Table 1 EPI event types and their arguments
