Extending the evaluation of Genia Event task toward knowledge base construction and comparison to Gene Regulation Ontology task

Background The third edition of the BioNLP Shared Task was held with the grand theme "knowledge base construction (KB)". The Genia Event (GE) task was re-designed and implemented in light of this theme. For its final report, the participating systems were evaluated from a perspective of annotation. To further explore the grand theme, we extended the evaluation from a perspective of KB construction. Also, the Gene Regulation Ontology (GRO) task was newly introduced in the third edition. The final evaluation of the participating systems resulted in relatively low performance. The reason was attributed to the large size and complex semantic representation of the ontology. To investigate potential benefits of resource exchange between the presumably similar tasks, we measured the overlap between the datasets of the two tasks, and tested whether the dataset for one task can be used to enhance performance on the other.

Results We report an extended evaluation on all the participating systems in the GE task, incorporating a KB perspective. For the evaluation, the final submission of each participant was converted to RDF statements, and evaluated using 8 queries that were formulated in SPARQL. The results suggest that the evaluation may be concluded differently between the two different perspectives, annotation vs. KB. We also provide a comparison of the GE and GRO tasks by converting their datasets into each other's format. More than 90% of the GE data could be converted into the GRO task format, while only half of the GRO data could be mapped to the GE task format. The imbalance in conversion indicates that the GRO is a comprehensive extension of the GE task ontology. We further used the converted GRO data as additional training data for the GE task, which helped improve GE task participant system performance. However, the converted GE data did not help GRO task participants, due to overfitting and the ontology gap.


Background
The BioNLP Shared Task (BioNLP-ST) has been organized three times since 2009. The goal is to provide the community with shared resources for the development and evaluation of fine-grained information extraction (IE) systems, particularly for the domain of molecular biology and medicine. Each time, it was organized with a grand theme (a goal shared by all the tasks): introduction of the event extraction task, generalization, and knowledge base (KB) construction, for the 1st, 2nd, and 3rd editions, respectively [1][2][3].
Initially motivated by the Genia annotation [4], the tasks of BioNLP-ST are designed for intrinsic evaluation, with the hope of complementing more application-oriented tasks of extrinsic evaluation, e.g. the Protein-Protein Interaction (PPI) extraction task of BioCreative [5]. While an extrinsic evaluation measures the performance of a system in the context of a specific application, i.e. its utility in the entire application, an intrinsic evaluation focuses on measuring the performance of a system in isolation, independently from a specific application [6]. For example, while the BioCreative PPI task is explained in the context of database curation specifically, the potential application of BioNLP-ST tasks is often broadly explained.
Until its second edition, the participants in BioNLP-ST tasks were evaluated from a perspective of annotation: The annotation instances in the submitted and gold annotations are individually compared for evaluation. For its third edition in 2013, BioNLP-ST attempted to broaden the scope of its potential applications to include KB construction, which is expected to be closer to the interest of general domain scientists, e.g. biologists, and set it as its grand theme.
Among the tasks organized in BioNLP-ST, the Genia Event (GE) task is the sole task that has continued from the beginning [7][8][9]. For its third edition, the grand theme (KB construction) was considered in the design and implementation of the task: the coreference task [10] was integrated, for improved sensitivity of knowledge harvesting, and recently published full papers were added to the benchmark dataset [9]. However, the evaluation of the participating systems was still carried out in the same way as previously, in the context of corpus annotation, without implementing the grand theme into it much.
The work presented in this paper addresses extension of the GE task evaluation, considering KB construction as a potential application of the task. Specifically, what is intended with the new evaluation is to measure the effect of abstracting out schematic differences in annotation, which is not a concern of domain scientists, but rather of annotation practitioners. An example of schematic differences in annotation is illustrated in Figures 1 and 2. The example text reads that the signaling cascade that involves the protein MyD88 induces the expression of the protein NFAT5. However, the interpretation is represented differently in annotations in the two figures. In Figure 1, the two words signaling and dependent are annotated as triggering two successive regulation events, whereas in Figure 2, the word dependent is not annotated as triggering a regulation event, but as connecting the protein MyD88 to the signaling cascade as a causal factor. To annotation practitioners, and those who are interested in automating the annotation, this is an important issue, as it is related to the consistency of the annotations. This is the perspective from which the original evaluation is performed. We call it annotation-oriented evaluation. Domain scientists, however, would not be interested in such a difference, and would want to avoid being affected by it during their use of a KB. This is the perspective of the new evaluation, which we call KB-oriented evaluation.
The paper reports the results of the KB-oriented evaluation on all of the final submissions to the GE task in 2013. It complements the overview paper for the task [9], which provides a general introduction to the task and reports the results of the annotation-oriented evaluation.
This paper also provides a comparison of the GE task to the Gene Regulation Ontology (GRO) task [11], which is to automatically annotate biomedical documents with the Gene Regulation Ontology [12]. GRO is a conceptual model of gene regulation. It includes 507 concepts, which are cross-linked to such standard ontologies as Gene Ontology and Sequence Ontology and are integrated into a deep hierarchical structure via is-a and part-of relations. It is much larger than the Genia ontology, and its concepts are generally more specific than the Genia ontology concepts used in the previous GE tasks.
The complex structure of the GRO enables us to evaluate participant systems at different abstraction/generalization levels. However, its large size and complex semantic representation make the event extraction based on the ontology highly challenging. One of the issues of the GRO task is that its dataset is small compared to the size of the ontology. In this paper, we test whether the conversion of the GE task dataset, whose ontology has a large overlap with the GRO, into the GRO task may help address this issue.

Representation
Since its first edition in 2009, the annotation of the BioNLP-ST has been provided in the so-called a* format. In addition to the a* format, in the third edition of the GE task, the datasets are also provided in a new format, which is motivated by the following two issues.
1. For ease of implementation: The a* format is actually quite a complex format for which to implement reader and writer modules: it has to handle three delimiter characters (tab, space, and colon) in a structural way, and has to handle coordination of arguments based on number suffixes. The complexity is an extra overhead that has nothing to do with the task itself. The burden of the implementation prevents the participants from concentrating on the task. Using a more standard format would let them spend more time on the task itself.
2. For flexible representation and retrieval of information: As the a* format requires any piece of information related to an event to be represented in an event-centric n-ary relationship, it is hard to represent partial information. For example, even if there is a system which is very good at extracting causal relationships, since the information cannot be represented without successfully extracting some other part, e.g. the theme, of the relevant event, there is no way to evaluate the potential of the system. Also, from a KB perspective, the n-ary relationship makes the representation unsuitable for inference, which is necessary to provide flexible access to the contents of a KB. For example, from the annotation in Figures 1 and 2, users may want to retrieve the information that the protein MyD88 is a causal factor of NFAT5 gene expression, regardless of how it is represented in annotation, which, however, requires inference over the explicit annotation. To fulfill a KB-oriented use case, a more inference-friendly representation would be beneficial.
To address these issues, we present a new format. It is a JSON application. Note that JSON is currently one of the most widely used standard data formats and most major programming languages already have public reader and writer modules for it. Being provided with the benchmark dataset in JSON, the developers do not need to implement reader and writer modules themselves. The new format is also designed to be relation-centric, and all the information is represented by binary relations, which are more elementary than events. The new format is thus flexible enough to represent various aspects of information, and it is inference-friendly.
For example, the annotation illustrated in Figure 3 is represented in the a* format as shown in Figure 4. The format is basically a variation of the CSV (comma-separated values) format with the tab character as the primary delimiter. In the format, the annotation statements, which are n-tuples, are in the second column, while the ID of each statement is in the first column. The statements which have their ID beginning with the prefix T are entity annotations, and they are in the form of triples, (entity-type, beginning-caret-offset, ending-caret-offset), which state that the text span between the beginning- and ending-caret-offsets denotes an entity of the entity-type. In the a* format, the elements of the triple are delimited by the space character. The statements which have their ID beginning with the prefix E are event annotations. They are in the form of typed n-tuples, (event-type:trigger-entity-id, arg1-type:arg1-entity-id, ...), representing the predicate-argument structure of an event. The arity of the tuple, n, varies according to the number of arguments. Note that in the example, the event E1 has two arguments, E2 (Theme) and T1 (Cause). As the a* format is event-centric, the event is represented as one statement representing its predicate-argument structure, regardless of the number of arguments involved in it. Figure 5 shows the same annotation in the JSON format. As can be seen, the JSON format is more self-descriptive than the a* format. A denotation type of annotation, which is stored in the array denotations, states that a span of text "denotes" an object. The relation type of annotation, which is stored in the array relations, states that two objects, of which one is subject and the other is object, are related to each other via a predicate. Note that the JSON format is relation-centric: a relation is represented by a single statement.
In the example, the two (binary) relations, R1 and R2, together with the denotation, E1, correspond to the event E1 in the a* format. Such a conversion from n-ary relation to n binary relations is a standard process in description logic [13]. As a relation can be represented individually, a partial piece of information about an event, e.g. the causal relation, R2, can be represented independently from the other relation, R1, of the event. Through inference over relations, implicit information also can become accessible. For example, by defining the relation themeOf as a transitive one, it becomes straightforward to access the information that the protein Sox6 negatively regulates the protein Epsilon Globin. Note that the denotation annotations also can be seen as a special case of relation annotations, with text spans as their subjects. For more detailed information on the JSON format, readers are referred to the web document [14].
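The decomposition of an event-centric n-ary statement into binary relations can be sketched in a few lines of Python. This is only an illustrative sketch, not the converter distributed with the task: the function name `a_star_to_relations`, the JSON key names (`span`, `obj`, `subj`, `pred`), the rule for deriving predicate names from argument roles, and the character offsets in the example are all assumptions made for illustration.

```python
import re

def a_star_to_relations(lines):
    """Turn a*-style lines into denotations plus binary relations (illustrative)."""
    spans, events = {}, []
    for line in lines:
        ann_id, body = line.rstrip("\n").split("\t")[:2]
        if ann_id.startswith("T"):
            # Entity statement: entity-type, begin offset, end offset (space-delimited).
            obj, begin, end = body.split(" ")
            spans[ann_id] = (obj, int(begin), int(end))
        elif ann_id.startswith("E"):
            # Event statement: type:trigger-id followed by role:argument-id pairs.
            events.append((ann_id, [pair.split(":") for pair in body.split(" ")]))

    # Trigger spans are re-used as the denotation of the event itself.
    trigger_ids = {pairs[0][1] for _, pairs in events}
    denotations = [
        {"id": tid, "span": {"begin": b, "end": e}, "obj": obj}
        for tid, (obj, b, e) in spans.items()
        if tid not in trigger_ids
    ]
    relations = []
    for event_id, pairs in events:
        event_type, trigger_id = pairs[0]
        _, begin, end = spans[trigger_id]
        denotations.append(
            {"id": event_id, "span": {"begin": begin, "end": end}, "obj": event_type}
        )
        for role, arg_id in pairs[1:]:
            # Assumed naming rule: "Theme" -> "themeOf", "Theme2" -> "themeOf".
            pred = re.sub(r"\d+$", "", role).lower() + "Of"
            relations.append(
                {"id": f"R{len(relations) + 1}", "subj": arg_id, "pred": pred, "obj": event_id}
            )
    return {"denotations": denotations, "relations": relations}

# Hypothetical a* lines modelled on the Sox6 example; the offsets are made up.
a_star_lines = [
    "T1\tProtein 0 4",
    "T2\tNegative_regulation 5 14",
    "T3\tProtein 15 29",
    "T4\tGene_expression 30 40",
    "E1\tNegative_regulation:T2 Theme:E2 Cause:T1",
    "E2\tGene_expression:T4 Theme:T3",
]
annotation = a_star_to_relations(a_star_lines)
for relation in annotation["relations"]:
    print(relation)
```

In this sketch the event E1 becomes one denotation plus the two relations R1 (themeOf) and R2 (causeOf), so the causal relation can be stored or evaluated even when the theme relation is missing.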
Considered at an abstract level, the pros and cons of the two formats become clearer. The a* format is closer to the relational database (RDB) model, i.e. tables, and the JSON format is closer to the resource description framework (RDF) model, i.e. graphs. The former is often more efficient for optimized applications, e.g. event-centric processing, while the latter is more flexible and inference-friendly. The GE task supports both models by providing the benchmark dataset in both formats, and also a tool for converting between them.

KB construction
To evaluate the annotation submitted by the participants from the perspective of KB construction, we built a KB from the annotations of each participant. As the framework of the KB, we have chosen to use the Resource Description Framework (RDF) [15], a widely accepted knowledge representation framework recommended by the World Wide Web Consortium (W3C) [16].
As the vocabulary for RDF statements, some existing open vocabularies were considered [17,18]. However, we found that they are focused on retaining provenance of information. As our purpose of constructing the KBs in this work is for evaluation of annotation, we have chosen to develop a minimal in-house vocabulary, which we call text annotation ontology (TAO). Then, we implemented a converter from the JSON format to RDF. In fact, the JSON format is very close to RDF, as each annotation statement is already a triple. Converting it to RDF amounts to explicating its semantics using an RDF vocabulary. Figure 6 shows an annotation example in RDF using TAO. As can be seen in lines 6, 9, 12, and 15, each denotation annotation introduces an entity which is of the type tao:Context_entity. As the name implies, an instance of the class tao:Context_entity is an entity defined in a specific context, denoted by a span of text. The predicate tao:denoted_by connects a context entity, as the subject, to a span of text, as the object. It is an inverse predicate of tao:denote. The URLs used for the span specification, e.g. http://pubannotation.org/docs/sourcedb/PMC/sourceid/1359074/divs/0/spans/0-4, are dereferenceable ones which are provided by the PubAnnotation service [19]. Each relation annotation is simply represented by a statement using one of the predicates in the GE task namespace, e.g., genia:themeOf or genia:causeOf.
Using the vocabulary, the annotation submitted by each of the 10 participants has been converted into an RDF graph, which then has been loaded into an RDF store.
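The step from relation-centric annotation to RDF statements can be illustrated with a small Python sketch that emits N-Triples lines. It is not the converter used for the evaluation. The namespace IRIs for `tao:` and `genia:`, the IRI prefix for entities, the JSON key names, the example offsets, and the extra type triple that links each entity to its Genia class are all assumptions; only the class and predicate names (Context_entity, denoted_by, themeOf) and the span URL pattern come from the text above.

```python
# Hypothetical namespace IRIs; the real TAO and Genia namespaces may differ.
TAO = "http://example.org/tao#"
GENIA = "http://example.org/genia#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def to_ntriples(annotation, doc_url, entity_prefix):
    """Serialize denotations and relations as N-Triples lines (illustrative)."""
    def node(local_id):
        return f"<{entity_prefix}{local_id}>"

    triples = []
    for d in annotation["denotations"]:
        span_url = f"{doc_url}/spans/{d['span']['begin']}-{d['span']['end']}"
        # Each denotation introduces a context entity ...
        triples.append(f"{node(d['id'])} <{RDF_TYPE}> <{TAO}Context_entity> .")
        # ... typed with its annotated class (assumed modelling choice) ...
        triples.append(f"{node(d['id'])} <{RDF_TYPE}> <{GENIA}{d['obj']}> .")
        # ... and anchored to the text span that denotes it.
        triples.append(f"{node(d['id'])} <{TAO}denoted_by> <{span_url}> .")
    for r in annotation["relations"]:
        # A relation annotation is already a triple; only the vocabulary is added.
        triples.append(f"{node(r['subj'])} <{GENIA}{r['pred']}> {node(r['obj'])} .")
    return triples

DOC_URL = "http://pubannotation.org/docs/sourcedb/PMC/sourceid/1359074/divs/0"
example = {
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 4}, "obj": "Protein"},
        {"id": "E1", "span": {"begin": 10, "end": 20}, "obj": "Gene_expression"},
    ],
    "relations": [{"id": "R1", "subj": "T1", "pred": "themeOf", "obj": "E1"}],
}
triples = to_ntriples(example, DOC_URL, "http://example.org/kb/")
print("\n".join(triples))
```

Because every denotation and relation maps to a fixed number of statements, the conversion is a direct re-encoding rather than an interpretation step.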

Queries
To evaluate the 10 KBs constructed from the final submissions, 8 queries were prepared in SPARQL, as shown in Table 1. The queries are designed to demonstrate the effect of abstracting out schematic differences in annotation. We consulted with biologists and bioinformaticians to prepare queries useful to domain scientists, within the scope of the GE task. However, this is not meant to be an extrinsic evaluation, and the set of queries is not comprehensive. Rather, it is meant to be an intrinsic evaluation, to highlight focused features of a system. Each of the prepared queries thus has its own goal, which is described below.
Note that a characteristic feature of BioNLP-ST annotation is that the annotation instances are all anchored to the text pieces that refer to them, and such annotation can be considered as a semantic index to specific context. As it is a strong feature of BioNLP-ST annotation, we assume that users of the KBs induced from such annotation would want to retrieve information together with the specific parts of literature that talk about it for evidence or for the full context of the information. The SPARQL queries in the table are constructed considering that.
The first three queries, Q1, Q2, and Q3, represent simple query needs: to find proteins in the events of a specific type, Gene_expression, Localization, and Binding, respectively. Note that in the queries, the proteins and the events are to be bound to the variables ?t1 and ?e1, respectively, and that for the reason discussed above, they are formulated to return the text spans, to be bound to the variable ?s1, that "denotes" the proteins.
The two queries Q3 and Q4 represent different levels of specificity of similar search needs: while Q3 is to search for single proteins in the context of binding, Q4 is to search for protein pairs binding to each other. Note that in the SPARQL construct of Q4, the two proteins bound to the variables ?t1 and ?t2 are themes of the event bound to ?e1, and they need to be different from each other (stated by the FILTER constraint).
The next two queries, Q5 and Q6, are prepared to compare search performance with and without inference: both are to search for protein pairs of which one regulates the other. However, while Q6 uses transitivity reasoning (indicated by the plus sign, '+') on the predicate genia:themeOf, Q5 does not. By using the transitivity inference, when it is known that A is a theme of B and B is a theme of C, it is assumed that A is also a theme of C. Figure 1 shows an annotation example which can be found by Q6 but cannot be found by Q5. The protein NFAT5 is a theme of the Gene_expression event represented by expression, and it is not retrieved by Q5. However, the Gene_expression event is a theme of the regulation event represented by signaling, which is again a theme of the regulation event represented by dependent. By the transitivity inference, NFAT5 is assumed to be a theme of the regulation event represented by dependent. Through the process, the protein MyD88 can be found as a regulatory factor of NFAT5. As discussed in the section Background, alternative annotation is possible. For example, Figure 2 shows a possible variation of the annotation: it does not capture the word dependent as a trigger for a regulation event, but instead it connects the protein MyD88 as a cause to the regulation event expressed by signaling. Note that such an annotation variation would not be important to the domain scientists who are potential users of the KBs, but it is rather a matter of annotation guidelines and consistency. Therefore, from a KB perspective, it would be a natural demand to abstract out such an annotation difference. The transitive inference in the SPARQL query Q6 implements such a demand.
The last two queries Q7 and Q8 represent more specific query needs to search for the causal factors of proteins in a specific type of event (Gene_expression in the queries), with and without transitivity inference.
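The effect of the transitive themeOf path can be imitated in plain Python. The sketch below is not the SPARQL of Table 1: the function names, the node labels, and the two small graphs (modelled on the descriptions of Figures 1 and 2, with MyD88 assumed to be the cause of the outermost regulation event) are illustrative assumptions.

```python
def theme_ancestors(node, theme_of):
    """Events reachable from node through one or more themeOf edges."""
    seen, stack = set(), list(theme_of.get(node, ()))
    while stack:
        event = stack.pop()
        if event not in seen:
            seen.add(event)
            stack.extend(theme_of.get(event, ()))
    return seen

def regulator_pairs(graph, transitive):
    """Find (regulator, target) protein pairs, with or without transitivity."""
    theme_of = {}
    for subj, pred, obj in graph["relations"]:
        if pred == "themeOf":
            theme_of.setdefault(subj, set()).add(obj)
    proteins = {n for n, t in graph["types"].items() if t == "Protein"}
    pairs = set()
    for cause, pred, event in graph["relations"]:
        if pred != "causeOf" or cause not in proteins:
            continue
        if graph["types"][event] != "Regulation":
            continue
        for target in proteins - {cause}:
            if transitive:
                reached = theme_ancestors(target, theme_of)   # like themeOf+
            else:
                reached = theme_of.get(target, set())         # direct themeOf only
            if event in reached:
                pairs.add((cause, target))
    return pairs

# Two annotation variants of the same sentence (assumed structure).
fig1 = {
    "types": {"MyD88": "Protein", "NFAT5": "Protein", "expression": "Gene_expression",
              "signaling": "Regulation", "dependent": "Regulation"},
    "relations": [("NFAT5", "themeOf", "expression"),
                  ("expression", "themeOf", "signaling"),
                  ("signaling", "themeOf", "dependent"),
                  ("MyD88", "causeOf", "dependent")],
}
fig2 = {
    "types": {"MyD88": "Protein", "NFAT5": "Protein", "expression": "Gene_expression",
              "signaling": "Regulation"},
    "relations": [("NFAT5", "themeOf", "expression"),
                  ("expression", "themeOf", "signaling"),
                  ("MyD88", "causeOf", "signaling")],
}
for name, graph in [("Figure 1", fig1), ("Figure 2", fig2)]:
    print(name, regulator_pairs(graph, transitive=False), regulator_pairs(graph, transitive=True))
```

With transitivity, both annotation variants yield the same pair (MyD88, NFAT5), while the non-transitive search finds nothing in either; this is the schematic difference that the paired queries are meant to abstract out.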

KB evaluation
Using the method described in the section KB construction, each final submission to the GE task is converted into RDF statements which are then stored in a graph. From the 10 submissions, 10 graphs are generated, each of which represents a KB to be constructed from the annotation in the corresponding submission. To represent a gold KB, a gold graph is generated from the gold annotation of the test data set, using the same method. The 11 graphs are then stored in an RDF store (specifically, Virtuoso Open Source Edition version 7.1.0 is used).
The SPARQL queries explained in the section Queries are submitted to the RDF store, and the results from each graph are compared to the results from the gold graph. The results are then evaluated in terms of recall, precision and F1-score. Table 2 shows the basic statistics of the GE and GRO datasets. The dataset of the GE task consists of full papers collected from PubMed Central [20], using the MeSH terms [21] NF-kappa B and transcription factors. The dataset of the GRO task consists of abstracts collected from PubMed [22], using a list of human transcription factors. It is thus expected that the subject domain of the two datasets would be close to each other, i.e. NF-kappa B transcription factors vs. human transcription factors, while the nature of the texts might be quite different, i.e. full papers vs. abstracts [23].
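The comparison of query results against the gold graph, described at the start of this section, can be sketched as a set comparison. This is a minimal sketch under the assumption that a returned row counts as correct only if it matches a gold row exactly; the function name `evaluate` and the example rows are hypothetical.

```python
def evaluate(system_rows, gold_rows):
    """Precision, recall and F1 of one query's result rows against gold rows."""
    system, gold = set(system_rows), set(gold_rows)
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical result rows: each row is the span bound to ?s1 by a query.
gold_rows = [("spans/0-4",), ("spans/31-36",), ("spans/58-63",), ("spans/90-97",)]
system_rows = [("spans/0-4",), ("spans/31-36",), ("spans/70-75",)]

precision, recall, f1 = evaluate(system_rows, gold_rows)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Since the rows are text spans rather than event identifiers, two systems that annotate the same fact with different event structures can still be credited for the same row.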

Comparison of the GE and GRO tasks
We tested whether the dataset of the GE task can be used to improve the performance of participant systems when it is converted into the GRO task format, and vice versa. We converted the datasets of the two tasks into each other's format to measure the overlap of the two tasks in terms of corpus annotations as well. This may help lead us to a unified framework for the shared tasks. We performed the conversion via equivalence mappings between the concepts of the two task ontologies. Table 3 shows the equivalence mappings used for the conversion. In fact, the GE'13 corpora also used the concepts of Deacetylation and Ubiquitination, but we ignored them since there is no correspondent in GRO for them. Note that we mapped Binding from Genia to BindingToProtein from GRO, because all participants of Binding events in the GE'13 corpora are proteins.
We converted the training data of the two tasks into each other's format according to the mappings in Table 3. When converting the GE data to the GRO task, the Genia concepts are replaced with the corresponding GRO concepts according to the table. When converting the GRO data to the GE task, not only the GRO concepts found in the table, but also those whose ancestors have mappings to the Genia concepts, are converted. For example, an instance of the GRO concept RegulationOfGeneExpression is converted to an instance of the Genia concept Regulation, since RegulationOfGeneExpression is a subconcept of the GRO concept RegulatoryProcess, which is equivalent to Regulation.
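The asymmetric conversion rule can be sketched as a lookup that walks up the GRO hierarchy. The dictionaries below are only a small excerpt assembled from the examples mentioned in the text, not the full Table 3; the function names, the single-parent simplification of the is-a hierarchy, and the concept name "Tissue" used as an unmapped example are assumptions.

```python
# Excerpt of equivalence mappings (GRO concept -> Genia concept), for illustration only.
GRO_TO_GENIA = {
    "RegulatoryProcess": "Regulation",
    "BindingToProtein": "Binding",
    "Transcription": "Transcription",
}
GENIA_TO_GRO = {genia: gro for gro, genia in GRO_TO_GENIA.items()}

# Excerpt of the GRO is-a hierarchy (child -> parent); one parent per concept is assumed.
GRO_PARENT = {
    "RegulationOfGeneExpression": "RegulatoryProcess",
    "TranscriptionOfGene": "Transcription",
}

def genia_to_gro(concept):
    """GE -> GRO: direct replacement via the equivalence table only."""
    return GENIA_TO_GRO.get(concept)

def gro_to_genia(concept):
    """GRO -> GE: use the concept itself or its nearest mapped ancestor."""
    while concept is not None:
        if concept in GRO_TO_GENIA:
            return GRO_TO_GENIA[concept]
        concept = GRO_PARENT.get(concept)
    return None  # no mapped ancestor: the instance is not convertible

print(gro_to_genia("RegulationOfGeneExpression"))
print(genia_to_gro("Binding"))
print(gro_to_genia("Tissue"))
```

The ancestor walk also shows why the GRO-to-GE direction loses specificity: a subconcept and its mapped ancestor end up as the same Genia concept.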
We used the converted data for increasing the training data set size of the two tasks, especially where the GRO task's participants suffered from the relatively low amount of training data. For example, we combined the GRO task training data and the conversion of the GE task training data and used them for training the TEES system [24] on the GRO task. We measured the system performance before and after the data conversion to see its effect and analyzed the results. We also tested the other conversion, and with another event extraction system as reported in the next section.

Results of KB-oriented evaluation
This section reports the results of the KB-oriented evaluation that has been carried out on the KBs induced from the final submissions to the GE task in 2013. For the purpose of comparison, the results of annotation-oriented evaluation on the four event types, Gene-expression, Localization, Binding, and REGULATION-ALL (which subsumes all the regulation types, Regulation, Positive- and Negative-regulation), are shown in Tables 4 and 5. Table 6 shows the results for the queries Q1 and Q2. Each row in the table shows the result from the KB induced from the annotation submitted by the team indicated by the label in the first column. Note that GS means the KB induced from the gold annotation. With these simple queries, which do not require abstraction or inference, the results are similar to the results from the annotation-oriented evaluation. There is a small difference in the number of true positives, which is because the annotation-oriented evaluation counts the number of "events", while the KB-oriented evaluation counts the number of proteins involved in the events. Table 7 shows how much the performance drops when pairs of binding proteins are required to be retrieved instead of single proteins. The performance drop is substantial, even considering that the complexity is expected to be quadratic: when the performance of finding single proteins is P, the performance of finding pairs is expected to be P × P, e.g., 59.12% × 59.12% = 34.95% vs. 28.19% in the case of TEES-2.1, and 63.20% × 63.20% = 39.94% vs. 19.35% in the case of HDS4NLP. This is because even after two proteins are correctly found to be connected to the same trigger indicating the event Binding, it still needs to be determined whether the two proteins are involved in a single binding event (collective parsing) or in two separate events (distributive parsing). Note that the terms collective and distributive parsing are inspired by the linguistic terms collective and distributive reading.
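The quadratic expectation quoted above is easy to reproduce. The following sketch recomputes P × P for the two systems and the gap to the observed pair-level scores; the function name is hypothetical and the input figures are the ones given in the text.

```python
def expected_pair_score(single_score_percent):
    """Expected pair-level score (in %) if both proteins are found independently."""
    p = single_score_percent / 100.0
    return round(p * p * 100.0, 2)

# (system, single-protein score in %, observed pair score in %), taken from the text.
reported = [("TEES-2.1", 59.12, 28.19), ("HDS4NLP", 63.20, 19.35)]

for system, single, observed in reported:
    expected = expected_pair_score(single)
    gap = round(expected - observed, 2)
    print(f"{system}: expected {expected}%, observed {observed}%, gap {gap} points")
```

The gap between the expected and observed values is what the collective vs. distributive parsing decision accounts for, and it is markedly larger for HDS4NLP than for TEES-2.1.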
The results indicate that the system HDS4NLP is generally good at extracting individual predicate-argument relations (see also Table 6), while TEES-2.1 does a much better job in determining collective vs. distributive parsing. Table 8 shows how much the performance changes when transitive inference is used to abstract out the schematic difference in annotations, which is a concern of annotation-oriented evaluation, but not of KB-oriented evaluation. Some systems exhibit much better performance in KB-oriented evaluation than in annotation-oriented evaluation, e.g., TEES-2.1 (20.81% -> 46.02%), NCBI (11.27% -> 29.81%), and NICTANLM (7.19% -> 27.12%). Note that regulation events in the GE task annotation are represented in a recursive manner, which may require more computation than that for simple type events. When the top two systems, EVEX and TEES-2.1, are compared, it may be said that EVEX is more optimized to the annotation-oriented evaluation, which is the original official evaluation of the GE task. Note that it is often the case for a retraining approach to be optimized to the objective function. It is notable that while in the context of automatic annotation, the performance with regulation-type events looks almost too poor to be useful, in the context of KB application, it is much more encouraging, though not sufficient. The observation in Table 9 is similar to that in Table 8, although the contrast is less dramatic.
The experimental results of the extended evaluation provide additional insight into the performance of the systems. For example, while EVEX is evaluated to perform best for production of gold annotation, TEES-2.1 is evaluated better when the application is KB construction.

Table 10 shows the statistics of the conversion using the equivalence correspondences. As shown in the table, most of the entities and events of the GE data are convertible to the GRO task, while many of the entities and events of the GRO data are not. Table 11 shows the most frequent GRO concepts that correspond to Genia concepts. Table 12 shows the most frequent GRO concepts that are not convertible to the GE task, including those that indicate where the gene regulation events take place (e.g. organism, tissue, cell), DNA without specific location of interest, protein domains, chemicals, quantitative changes without clear causal effects, and disease. We converted the training data of each task to the other task format and used the converted data as additional training data for the latter task. For example, we converted the GRO task training data to the GE task format, used it together with the original training data of the GE task to train the TEES system on the GE task, and evaluated the system against the GE task test data. We followed the same procedure for the GE->GRO conversion. We used the default settings of version 2.1.1 of the system [25].

Results of task dataset conversion
The GRO->GE conversion (i.e. using the converted GRO data as additional training data for the GE task) resulted in an increase of the performance from 38.2 F-score to 42.2 F-score. The conversion enhanced the system performance in most of the event classes, which may mean that the GE task requires more training data to saturate the system performance and that the class (or concept) distribution of the convertible data of the GRO task is not heavily biased in comparison with the GE task.
However, the GE->GRO conversion shows a slightly negative effect on the performance of the TEES system: the original performance of the system in terms of F-measure was 24.9%, while it shows 24.0% F-measure with the additional data from the GE task, a drop of 0.9 percentage points. Table 13 shows the performance change for some individual GRO concepts by the GE->GRO conversion. The first five concepts in the table are those whose instances are increased by the data conversion, but whose performance changes are below 5 percentage points. The last four concepts in the table are those whose performance has changed by more than 5 percentage points and for which the true positives identified by the system before the data conversion were above 10. Note that none of these four concepts obtains any new instances from the data conversion. As shown in the table, all four concepts show a performance drop.
This performance difference between the concepts that obtained more instances from the data conversion and those that did not may be due to the following factors: 1) The five GRO concepts populated with additional instances from the GE task are already highly populated in the original GRO corpora, and thus their performance is not affected much. The average number of instances for the populated event concepts in the GRO'13 training data is 48, while that of all event concepts is only 13 (see Table 14). 2) The data conversion increases the imbalance of instances among GRO concepts and causes overfitting and thereby performance drop. The average number of instances per event concept that are converted from the GE'13 training data is 287, which is larger than the overall average 13 (see Table 14). 3) The different styles of annotation between the two tasks [26] may lead to heterogeneity of the combined data and thereby to little synergistic gain.

[Table fragment: ProteinTargeting < ProteinTransport < Localization (12); Transcription (105); Transcription (83); TranscriptionOfGene < Transcription (22). The count of the last column is the count of the concept at the lowest level in each row.]

We also examined whether the differences between the two ontologies of the two tasks show specific impact on system performance. We considered two differences: 1) GRO differentiates between Gene and Protein, while the Genia ontology has only the Protein concept; and 2) GRO has more specific subconcepts of events than Genia (e.g. PositiveRegulationOfGeneExpression < PositiveRegulation). First, as shown in Table 11, 1,521 instances of the GRO concept "Protein" and 482 instances of the GRO concept "Gene" were converted to the Genia concept "Protein". If we only consider the GRO events at least one of whose participants is a Gene instance or a Protein instance, 570 events with a Protein instance and 116 events with a Gene instance were converted to the corresponding Genia events with Protein instances.
We assume that these ratios of 25% (entity) and 17% (event) of Gene-specific information in the GRO dataset would be similarly found in the GE dataset, since the two datasets have large overlaps. Second, as shown in Table 11, around 90% of the GRO event instances are mapped to the equivalent Genia concepts, while the rest belong to the GRO concepts that are more specific than the mapped Genia concepts. In short, the conversion of the GE data to the GRO format ignores the fact that 10%-17% of the event instances should be mapped to more specific concepts.
We compared the changes in errors (i.e. false positives, false negatives) of events that are related to the differences by the data conversion. Table 15 shows the number of errors in the events at least one of whose participants (i.e. agent, patient) is either a Gene or Protein instance before and after the data conversion. As shown in the table, it is not clear if the data conversion has a positive or negative effect on the system performance. Table 16 shows the number of errors in the events of frequent subconcepts of the GRO concepts that have equivalent Genia concepts (e.g. PositiveRegulation). This table also does not show any definite effect of the data conversion. These indefinite results may be due to the small ratios of the affected instances.
We also tested the effect of the GE->GRO conversion on a rule-based system [27]. (TEES is based on machine learning.) The rule-based system was developed for populating the E. coli transcription regulatory database RegulonDB, and its results can be used for the GRO task evaluation since the system represents all of the events for database population with the GRO. Note, however, that this experiment is preliminary since the system has not been seriously adapted for the GRO task. The rule-based system also shows a slight performance drop (from 16.3% to 15.2%). The highest performance gain was made for the concept ProteinCatabolism (from 0% to 33.3%; 7 true instances), which benefited from the data conversion (see Table 3), while the largest performance drop was seen for BindingOfTranscriptionFactor (from 19.4% to 6.5%; 25 true instances), which did not benefit from the data conversion.

Conclusions
The third edition of the BioNLP-ST was organized with the grand theme of knowledge base construction, in order to extend the potential applications of the tasks by more carefully considering the perspective of domain scientists. The paper presents an extended evaluation of the GE task, the KB-oriented evaluation, to better fulfill the grand theme. Experimental results suggest that the participating systems may be evaluated differently in different application contexts, annotation vs. KB. The paper also presents a comparison of the GE task to the GRO task toward a KB with richer semantics. The inter-task resource conversion was found useful only when the converted data did not bias the class distribution of the original data. In the case of the GE->GRO conversion, it could not improve either a machine learning-based system or a rule-based system.
As future work, the KB-oriented evaluation will be made publicly available as an automatic online service, so that the participants of the task can consider the aspect of KB-oriented evaluation during their system development. While the evaluation was carried out as an intrinsic evaluation, exploring its connection to a relevant extrinsic evaluation, e.g., Task 2 (Biomedical question answering over interlinked data) of Question Answering over Linked Data (QALD)-4 [28], should be beneficial. Also, the comparison, and eventually an integration, of the GE and GRO tasks will be explored toward information extraction (IE) for KB with richer semantics.