Extending the evaluation of Genia Event task toward knowledge base construction and comparison to Gene Regulation Ontology task

Kim, Jin-Dong; Kim, Jung-jae; Han, Xu; Rebholz-Schuhmann, Dietrich

doi:10.1186/1471-2105-16-S10-S3

Volume 16 Supplement 10

BioNLP Shared Task 2013: Part 1

Research
Open access
Published: 23 June 2015

BMC Bioinformatics volume 16, Article number: S3 (2015)

Abstract

Background

The third edition of the BioNLP Shared Task was held with the grand theme "knowledge base (KB) construction". The Genia Event (GE) task was re-designed and implemented in light of this theme. For its final report, the participating systems were evaluated from a perspective of annotation. To further explore the grand theme, we extended the evaluation from a perspective of KB construction. Also, the Gene Regulation Ontology (GRO) task was newly introduced in the third edition. The final evaluation of the participating systems resulted in relatively low performance, which was attributed to the large size and complex semantic representation of the ontology. To investigate potential benefits of resource exchange between the presumably similar tasks, we measured the overlap between the datasets of the two tasks, and tested whether the dataset for one task can be used to enhance performance on the other.

Results

We report an extended evaluation of all the participating systems in the GE task, incorporating a KB perspective. For the evaluation, the final submission of each participant was converted to RDF statements, and evaluated using 8 queries that were formulated in SPARQL. The results suggest that the evaluation may be concluded differently between the two different perspectives, annotation vs. KB. We also provide a comparison of the GE and GRO tasks by converting their datasets into each other's format. More than 90% of the GE data could be converted into the GRO task format, while only half of the GRO data could be mapped to the GE task format. The imbalance in conversion indicates that the GRO is a comprehensive extension of the GE task ontology. We further used the converted GRO data as additional training data for the GE task, which helped improve GE task participant system performance. However, the converted GE data did not help GRO task participants, due to overfitting and the ontology gap.

Background

The BioNLP Shared Task (BioNLP-ST) has been organized three times since 2009. The goal is to provide the community with shared resources for the development and evaluation of fine-grained information extraction (IE) systems, particularly for the domain of molecular biology and medicine. Each time, it was organized with a grand theme (a goal shared by all the tasks): introduction of the event extraction task, generalization, and knowledge base (KB) construction, for the 1st, 2nd, and 3rd editions, respectively [1–3].

Initially motivated by the Genia annotation [4], the tasks of BioNLP-ST are designed for intrinsic evaluation, with the hope of complementing more application-oriented tasks of extrinsic evaluation, e.g. the Protein-Protein Interaction (PPI) extraction task of BioCreative [5]. While an extrinsic evaluation measures the performance of a system in the context of a specific application, i.e. its utility in the entire application, an intrinsic evaluation focuses on measuring the performance of a system in isolation, independently from a specific application [6]. For example, while the BioCreative PPI task is explained in the context of database curation specifically, the potential application of BioNLP-ST tasks is often broadly explained.

Until its second edition, the participants in BioNLP-ST tasks were evaluated from a perspective of annotation: the annotation instances in the submitted and gold annotations were compared individually. For its third edition in 2013, BioNLP-ST attempted to broaden the scope of its potential applications to include KB construction, which is expected to be closer to the interest of general domain scientists, e.g. biologists, and set it as its grand theme.

Among the tasks organized in BioNLP-ST, the Genia Event (GE) task is the sole task that has continued from the beginning [7–9]. For its third edition, the grand theme, KB construction, was considered in the design and implementation of the task: the coreference task [10] was integrated for improved sensitivity of knowledge harvesting, and recently published full papers were added to the benchmark dataset [9]. However, the participating systems were still evaluated in the same way as before, in the context of corpus annotation, with little reflection of the grand theme.

The work presented in this paper addresses extension of the GE task evaluation, considering KB construction as a potential application of the task. Specifically, what is intended with the new evaluation is to measure the effect of abstracting out schematic differences in annotation, which is not a concern of domain scientists, but rather of annotation practitioners. An example of schematic differences in annotation is illustrated in Figures 1 and 2. The example text reads that the signaling cascade that involves the protein MyD88 induces the expression of the protein NFAT5. However, the interpretation is represented differently in annotations in the two figures. In Figure 1, the two words signaling and dependent are annotated as triggering two successive regulation events, whereas in Figure 2, the word dependent is not annotated as triggering a regulation event, but as connecting the protein MyD88 to the signaling cascade as a causal factor. To annotation practitioners, and those who are interested in automating the annotation, this is an important issue, as it is related to the consistency of the annotations. This is the perspective from which the original evaluation is performed. We call it annotation-oriented evaluation. Domain scientists, however, would not be interested in such a difference, and would want to avoid being affected by it during their use of a KB. This is the perspective of the new evaluation, which we call KB-oriented evaluation.

The paper reports the results of the KB-oriented evaluation on all of the final submissions to the GE task in 2013. It complements the overview paper for the task [9], which provides a general introduction to the task and reports the results of the annotation-oriented evaluation.

This paper also provides a comparison of the GE task to the Gene Regulation Ontology (GRO) task [11], which is to automatically annotate biomedical documents with the Gene Regulation Ontology [12]. GRO is a conceptual model of gene regulation. It includes 507 concepts, which are cross-linked to such standard ontologies as Gene Ontology and Sequence Ontology and are integrated into a deep hierarchical structure via is-a and part-of relations. It is much larger than the Genia ontology, and its concepts are generally more specific than the Genia ontology concepts used in the previous GE tasks.

The complex structure of the GRO enables us to evaluate participant systems at different abstraction/generalization levels. However, its large size and complex semantic representation make the event extraction based on the ontology highly challenging. One of the issues of the GRO task is that its dataset is small compared to the size of the ontology. In this paper, we test whether the conversion of the GE task dataset, whose ontology has a large overlap with the GRO, into the GRO task may help address this issue.

Methods

Representation

Since its first edition in 2009, the annotation of the BioNLP-ST has been provided in the so-called a* format. In addition to the a* format, in the third edition of the GE task, the datasets are also provided in a new format, which is motivated by the following two issues.

1. For ease of implementation: The a* format is actually a quite complex format for which to implement reader and writer modules: it has to handle three delimiter characters, tab, space, and colon, in a structural way, and has to handle coordination of arguments based on number suffixes. The complexity is an extra overhead that has nothing to do with the task itself. The burden of the implementation prevents the participants from concentrating on the task. Using a more standard format would let them spend more time on the task itself.

2. For flexible representation and retrieval of information: As the a* format requires any piece of information related to an event to be represented in an event-centric n-ary relationship, it is hard to represent partial information. For example, even if there is a system which is very good at extracting causal relationships, since the information cannot be represented without successfully extracting some other part, e.g. the theme, of the relevant event, there is no way to evaluate the potential of the system. Also, from a KB perspective, the n-ary relationship makes the representation unsuitable for inference, which is necessary to provide flexible access to the contents of a KB. For example, from the annotation in Figures 1 and 2, users may want to retrieve the information that the protein MyD88 is a causal factor of NFAT5 gene expression, regardless of how it is represented in annotation, which, however, requires inference over the explicit annotation. To fulfill a KB-oriented use case, a more inference-friendly representation would be beneficial.

To address these issues, we present a new format. It is a JSON application. Note that JSON is currently one of the most widely used standard data formats and most major programming languages already have public reader and writer modules for it. Being provided with the benchmark dataset in JSON, the developers do not need to implement reader and writer modules themselves. The new format is also designed to be relation-centric, and all the information is represented by binary relations, which are more elementary than events. The new format is thus flexible enough to represent various aspects of information, and it is inference-friendly.

For example, the annotation illustrated in Figure 3 is represented in the a* format as shown in Figure 4. The format is basically a variation of the CSV (comma-separated values) format with the tab character as the primary delimiter. In the format, the annotation statements, which are n-tuples, are in the second column, while the ID of each statement is in the first column. The statements which have their ID beginning with the prefix T are entity annotations, and they are in the form of triples, (entity-type, beginning-caret-offset, ending-caret-offset), which state that the text span between the beginning- and ending-caret-offsets denotes an entity of the entity-type. In the a* format, the elements of the triple are delimited by the space character. The statements which have their ID beginning with the prefix E are event annotations. They are in the form of typed n-tuples, (event-type:trigger-entity-id, arg1-type:arg1-entity-id, ...), representing the predicate-argument structure of an event. The order of the tuple, n, varies according to the number of arguments. Note that in the example, the event E1 has two arguments, E2 (Theme) and T1 (Cause). As the a* format is event-centric, the event is represented as one statement representing its predicate-argument structure, regardless of the number of arguments involved in it.
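To make the delimiter handling described above concrete, the following is a minimal Python sketch of a reader for a* statements. It is an illustration only, not the task's official reader: the class names, the function name parse_statement, and the sample lines (including their offsets and the trigger ID T3) are hypothetical, and the number-suffix coordination of arguments mentioned earlier is not handled.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    type: str
    begin: int
    end: int


class Event(NamedTuple):
    id: str
    type: str
    trigger: str
    args: list  # list of (role, target_id) pairs


def parse_statement(line: str):
    """Parse one a* line into an Entity (ID prefix T) or an Event (ID prefix E)."""
    columns = line.rstrip("\n").split("\t")  # tab: primary delimiter
    stmt_id, body = columns[0], columns[1]   # any further column is ignored here
    fields = body.split(" ")                 # space: delimits the tuple elements
    if stmt_id.startswith("T"):
        etype, begin, end = fields
        return Entity(stmt_id, etype, int(begin), int(end))
    if stmt_id.startswith("E"):
        # colon: separates a type (or argument role) from the ID it refers to
        pairs = [tuple(field.split(":", 1)) for field in fields if field]
        (etype, trigger), args = pairs[0], pairs[1:]
        return Event(stmt_id, etype, trigger, args)
    raise ValueError(f"unsupported statement ID: {stmt_id}")


# Hypothetical sample lines, loosely modelled on the example discussed above.
sample = [
    "T1\tProtein 0 4",
    "T3\tNegative_regulation 5 15",
    "E1\tNegative_regulation:T3 Theme:E2 Cause:T1",
]
parsed = [parse_statement(line) for line in sample]
```

Even this reduced reader needs three levels of splitting, which illustrates the implementation overhead that motivated the new format.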

Figure 5 shows the same annotation in the JSON format. As can be seen, the JSON format is more self-descriptive than the a* format. A denotation type of annotation, which is stored in the array denotations, states that a span of text "denotes" an object. The relation type of annotation, which is stored in the array relations, states that two objects, of which one is subject and the other is object, are related to each other via a predicate. Note that the JSON format is relation-centric: a relation is represented by a single statement. In the example, the two (binary) relations, R1 and R2, together with the denotation, E1, correspond to the event E1 in the a* format. Such a conversion from n-ary relation to n binary relations is a standard process in description logic [13]. As a relation can be represented individually, a partial piece of information about an event, e.g. the causal relation, R2, can be represented independently from the other relation, R1, of the event. Through inference over relations, implicit information also can become accessible. For example, by defining the relation themeOf as a transitive one, it becomes straightforward to access the information that the protein Sox6 negatively regulates the protein Epsilon Globin. Note that the denotation annotations also can be seen as a special case of relation annotations, with text spans as their subjects. For more detailed information on the JSON format, readers are referred to the web document [14].
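The conversion from an n-ary event to binary relations, and the effect of treating themeOf as transitive, can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the converter provided with the task: the key names (id, span, obj, subj, pred), the function names, the offsets, the type of E2, and the ID T2 for the second protein are all hypothetical choices made for the example.

```python
import json


def events_to_binary(events):
    """Split each n-ary event into one denotation plus one binary relation per argument."""
    denotations, relations = [], []
    for ev in events:
        # The event trigger becomes a denotation carrying the event ID and type.
        denotations.append({"id": ev["id"], "span": ev["span"], "obj": ev["type"]})
        for role, target in ev["args"]:
            relations.append({
                "id": f"R{len(relations) + 1}",
                "subj": target,
                "pred": role.lower() + "Of",  # e.g. Theme -> themeOf
                "obj": ev["id"],
            })
    return denotations, relations


def transitive_closure(pairs):
    """Naive fixed-point closure of a binary relation given as (subj, obj) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical events: E1 has a Theme (E2) and a Cause (T1); E2 has a Theme (T2).
events = [
    {"id": "E1", "type": "Negative_regulation", "span": {"begin": 5, "end": 15},
     "args": [("Theme", "E2"), ("Cause", "T1")]},
    {"id": "E2", "type": "Gene_expression", "span": {"begin": 20, "end": 30},
     "args": [("Theme", "T2")]},
]
denotations, relations = events_to_binary(events)
print(json.dumps({"denotations": denotations, "relations": relations}, indent=2))

# Treat themeOf as transitive: T2 themeOf E2 and E2 themeOf E1 imply T2 themeOf E1.
theme_of = {(r["subj"], r["obj"]) for r in relations if r["pred"] == "themeOf"}
inferred = transitive_closure(theme_of)
```

With the inferred pair (T2, E1) and the explicit causal relation from T1 to E1, a consumer can read off that the cause of E1 regulates the entity T2 without walking the nested event structure.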

Considered at an abstract level, the pros and cons of the two formats become clearer. The a* format is closer to the relational database (RDB) model, i.e. tables, and the JSON format is closer to the resource description framework (RDF) model, i.e. graphs. The former is often more efficient for optimized applications, e.g. event-centric processing, while the latter is more flexible and inference-friendly. The GE task supports both models by providing the benchmark dataset in both formats, and also a tool for converting between them.

KB construction

To evaluate the annotation submitted by the participants from the perspective of KB construction, we built a KB from the annotations of each participant. As the framework of the KB, we have chosen to use the Resource Description Framework (RDF) [15], a widely accepted knowledge representation framework recommended by the World Wide Web Consortium (W3C) [16].

As the vocabulary for RDF statements, some existing open vocabularies were considered [17, 18]. However, we found that they are focused on retaining provenance of information. As our purpose of constructing the KBs in this work is for evaluation of annotation, we have chosen to develop a minimal in-house vocabulary, which we call text annotation ontology (TAO). Then, we implemented a converter from the JSON format to RDF. In fact, the JSON format is very close to RDF, as each annotation statement is already a triple. Converting it to RDF is explicating its semantics using an RDF vocabulary.

Figure 6 shows an annotation example in RDF using TAO. As can be seen in lines 6, 9, 12, and 15, each denotation annotation introduces an entity which is of the type tao:Context_entity. As the name implies, an instance of the class tao:Context_entity is an entity defined in a specific context, denoted by a span of text. The predicate tao:denoted_by connects a context entity, as the subject, to a span of text, as the object. It is an inverse predicate of tao:denote. The URLs used for the span specification, e.g. http://pubannotation.org/docs/sourcedb/pmc/sourceid/1359074/divs/0/spans/0-4, are dereferenceable ones which are provided by the PubAnnotation service [19]. Each relation annotation is simply represented by a statement using one of the predicates in the GE task namespace, e.g. genia:themeOf or genia:causeOf.

Using the vocabulary, the annotation submitted by each of the 10 participants has been converted into an RDF graph, which then has been loaded into an RDF store.
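The conversion described in this subsection might be sketched in Python as below, emitting N-Triples lines from the denotations and relations arrays. This is a hedged illustration, not the converter used in this work: the namespace URIs for tao and genia are placeholders (the real ones are not given here), the entity URI scheme and the key names are hypothetical, and typing each entity with its annotated class is an assumption beyond what is described above. Only the span URL pattern, the class tao:Context_entity, and the predicates tao:denoted_by, genia:themeOf and genia:causeOf are taken from the text.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TAO = "http://example.org/tao/"      # placeholder, NOT the real TAO namespace
GENIA = "http://example.org/genia/"  # placeholder, NOT the real GE task namespace


def to_ntriples(annotation, doc_url, project_url):
    """Turn denotations and relations into a list of N-Triples lines."""

    def entity(local_id):
        # Hypothetical URI scheme for context entities.
        return f"{project_url}/{local_id}"

    def triple(subj, pred, obj):
        return f"<{subj}> <{pred}> <{obj}> ."

    lines = []
    for d in annotation["denotations"]:
        span_url = f"{doc_url}/spans/{d['span']['begin']}-{d['span']['end']}"
        lines.append(triple(entity(d["id"]), RDF_TYPE, TAO + "Context_entity"))
        # Assumption: the annotated class is also attached as an rdf:type.
        lines.append(triple(entity(d["id"]), RDF_TYPE, GENIA + d["obj"]))
        lines.append(triple(entity(d["id"]), TAO + "denoted_by", span_url))
    for r in annotation["relations"]:
        lines.append(triple(entity(r["subj"]), GENIA + r["pred"], entity(r["obj"])))
    return lines


# Hypothetical annotation; offsets and IDs are illustrative only.
annotation = {
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 4}, "obj": "Protein"},
        {"id": "E1", "span": {"begin": 5, "end": 15}, "obj": "Negative_regulation"},
    ],
    "relations": [
        {"id": "R2", "subj": "T1", "pred": "causeOf", "obj": "E1"},
    ],
}
doc_url = "http://pubannotation.org/docs/sourcedb/pmc/sourceid/1359074/divs/0"
ntriples = to_ntriples(annotation, doc_url, "http://example.org/annotations")
print("\n".join(ntriples))
```

Because each JSON statement is already a subject-predicate-object triple, the converter only has to attach URIs, which is the sense in which the conversion makes the semantics explicit.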

Queries

To evaluate the 10 KBs constructed from the final submissions, 8 queries were prepared in SPARQL, as shown in Table 1. The queries are designed to demonstrate the effect of abstracting out schematic differences in annotation. We consulted with biologists and bioinformaticians to prepare queries useful to domain scientists, within the scope of the GE task. However, this is not meant to be an extrinsic evaluation, and the set of queries is not comprehensive. Rather, it is meant to be an intrinsic evaluation, to highlight focused features of a system. Each of the prepared queries thus has its own goal, which is described below.

Table 1 Queries used for the KB evaluation.
