Overview of the gene regulation network and the bacteria biotope tasks in BioNLP'13 shared task
© Bossy et al.; licensee BioMed Central Ltd. 2015
Published: 23 June 2015
We present the two Bacteria Track tasks of BioNLP 2013 Shared Task (ST): Gene Regulation Network (GRN) and Bacteria Biotope (BB). These tasks were previously introduced in the 2011 BioNLP-ST Bacteria Track as Bacteria Gene Interaction (BI) and Bacteria Biotope (BB). The Bacteria Track was motivated by a need to develop specific BioNLP tools for fine-grained event extraction in bacteria biology. The 2013 tasks expand on the 2011 version by better addressing the biological knowledge modeling needs. New evaluation metrics were designed for the new goals. Moving beyond a list of gene interactions, the goal of the GRN task is to build a gene regulation network from the extracted gene interactions. BB'13 is dedicated to the extraction of bacteria biotopes, i.e. bacterial environmental information, as was BB'11. BB'13 extends the typology of BB'11 to a large diversity of biotopes, as defined by the OntoBiotope ontology. The detection of entities and events is tackled by distinct subtasks in order to measure the progress achieved by the participant systems since 2011.
This paper details the corpus preparations and the evaluation metrics, as well as summarizing and discussing the participant results. Five groups participated in each of the two tasks. The high diversity of the participant methods reflects the dynamism of the BioNLP research community.
The highest scores for the GRN and BB'13 tasks are similar to those obtained by the participants in 2011, despite the increase in difficulty. The high density of events in short text segments (multi-event extraction) was a difficult issue for the participating systems for both tasks. The analysis of the BB'13 results also shows that co-reference resolution and entity boundary detection remain major hindrances.
The evaluation results suggest new research directions for the improvement and development of Information Extraction for molecular and environmental biology. The Bacteria Track tasks remain publicly open; the BioNLP-ST website provides an online evaluation service, the reference corpora and the evaluation tools.
Motivation and related work
Large-scale experimental approaches in the field of biology shift the focus of researchers towards transversal questions that involve very diverse biological knowledge. The researcher needs new tools to deal with the growing number of relevant publications. The domain of text-mining for biology (BioNLP) develops automatic methods to assist the analysis of knowledge expressed in natural language scientific articles. Periodic shared tasks measure the progress of the community methods by formally comparing the method predictions to a reference annotation on test data [1, 2]. The goals of the shared tasks evolve with the advances in BioNLP, moving towards a better adaptation to the needs of biologists. It is reflected through the diversity of biology questions (e.g. regulation, disease, metabolism, and environment), types of documents (e.g. abstracts, papers, Web pages) and the related biologist research activity (e.g. knowledge curation, system modeling, and data normalization).
The third series of BioNLP Shared Task that took place in 2013 (BioNLP-ST'13) proposed six tasks under the Knowledge Base domain. BioNLP-ST'13 encourages the development of methods that improve the extraction of fine-grained complex events in systematic and concise ways. The common organization of BioNLP-ST'13 includes an official evaluation of the participant systems by an automatic comparison of their predictions on the test sets to reference data. The evaluation took place at a fixed date after a period for the training of methods using the reference corpora provided by the task organizers. The two Bacteria Track tasks were organized within this framework.
The creation of the BioNLP Bacteria Track in 2011 followed the LLL initiative in 2005. It was motivated by research questions brought up by bacteria research that encompass all levels of knowledge from the molecular to the physiological and phenotypic levels. This is exemplified by the GRN and BB tasks.
Gene Regulation Network
Gene regulation networks are key components in the understanding of cell processes and more generally of living organisms. More and more research efforts are being invested in the design of regulation networks for species of all kingdoms [6, 7]. Systems biology aims to integrate the knowledge from heterogeneous sources in consistent predictive models of gene regulation [8, 9]. Beyond experimental data, which is the main source to date, the abundance of regulation descriptions in literature has been a strong motivation for the development of dedicated Information Extraction (IE) methods [5, 10, 11]. The goal of the LLL task was the extraction of binary directed interactions between the agent protein and the target gene from which the regulation network can be derived in a straightforward way. The gene interaction information extraction task has posed a challenge for years due to the spread of information over large collections of scientific data, the complexity of the underlying biological phenomena and the linguistic diversity of the descriptions.
Previous studies showed that the use of a fine-grained biological model for the representation of the events facilitates the understanding and validation of the extracted knowledge by the biologists and its integration with other data sources. A fine-grained model was implemented in the Bacteria Gene Interaction (BI) task in 2011. It describes in detail interactions at the biological level and underlying cellular mechanisms at the molecular level. The activation of the transcription of a gene is an example of a biological interaction. The physical binding of a protein to a DNA site is an example of a molecular level phenomenon.
The BI model is formalized by biological entities and n-ary events between entities and events. The biological entities are mainly proteins and genes, their subparts (e.g. site), families (e.g. gene cluster) and aggregates (e.g. protein complex).
The GRN task goes one step beyond BI by making the design of the regulation network its primary goal. It has several benefits compared to the BI text-bound event extraction. Regulation networks are needed by biologists in order to enrich their biological models and to integrate text knowledge with other sources of knowledge. Because the knowledge extracted from text is directly bound to its sources, end-users can more easily assess its quality, compared to other bioinformatics methods such as transcriptomic profile screening. Moreover, the evaluation of regulation network quality better reflects the biologists' needs because it abstracts from text-mining peculiarities, such as the linguistic complexity of the text descriptions and information redundancy.
The GRN annotation model is built upon the BI model, and also includes inference rules to automatically deduce the regulation network from text-bound events. The inference rules provide continuity between event-based extractions and the regulation network, so that users can benefit from both types of knowledge.
The biological question behind the LLL and BI tasks is the cell process regulation network of the model bacterium Bacillus subtilis (Bs) with a focus on sporulation, one of the most studied developmental processes. This choice was motivated by the abundance of publicly available information in PubMed abstracts and the richness of the biological phenomena described in them.
The previous work on the Bacteria Biotope task (BB'11) has stressed the importance of microorganism environment information. The formal description of biotopes and their properties is an essential step for the study of interactions between the organisms and their environment. In particular, it is needed in order to correlate genetic specificity to environmental properties and to explain the adaptation of organisms to their habitats and their evolution. The application domains of this fundamental research are broad, from the health of humans, plants, and animals, to food processes including plant growth enhancement. Biotope descriptions are abundant in scientific documents, but they cannot be used as such for biological studies. Their form is extremely variable: the biotope descriptions may be very complex from a linguistic point of view, including many embedded biotope names and properties. A normalization of biotope descriptions using a reference is required for their comparison. The extraction of the relations between the organism and the biotope entities is also difficult to automate due to the abundance of entity mentions in short spans of texts. This motivated the organization of the Bacteria Biotope Task in 2011. The goal of BB'11 was the identification and categorization of bacterial habitat entities in natural language texts, their linking to their subparts (i.e. part-of event) and their linking to the bacteria that live there (i.e. localization event). The good results of the participants on BB'11 demonstrate the feasibility of this IE task.
Since 2011, the development of new sequencing techniques has had a major impact on the field of metagenomics. Metagenomics studies microorganism sequences in their environments, thus avoiding strain cultivation. The number of metagenomics studies has grown exponentially in the last few years. This has resulted in a considerable increase in the diversity of the microorganisms that can be studied. This is appealing for information extraction tools that can automatically analyze biotope descriptions of the microorganisms so that these biotopes and genes from different metagenomics experiments can be compared on a large scale.
The categorization of biotopes is a form of normalization that is necessary for the generalization of biotope observations. BB'11 defined seven broad biotope categories that were a priori considered as relevant for biological studies. Participant methods had to categorize the extracted biotope mentions according to these categories. The limited number of categories affects the ability of bioinformatics methods to find useful correlations between gene sequences and biotopes. This motivates the use of a large set of categories organized in a hierarchical structure. Moreover, earlier work has shown that the lexical information contained in ontologies can make the task easier.
OntoBiotope is an ontology of microorganism habitats. Its modeling principle and its lexicon reflect the usual biotope classification used by biologists to describe microorganism isolation sites (e.g. GenBank, GOLD, EnvO) [16–18]. OntoBiotope is developed and maintained by the Meta-omics of Microbial Ecosystems (MEM) network in which 30 microbiologists from INRA (French National Institute for Agricultural Research) from all fields of applied microbiology participate. The relevance of OntoBiotope terms has been evaluated through the PubMedBiotope semantic search engine. It identifies and categorizes biotopes in a collection of 600 000 PubMed abstracts by applying the ToMap method (Text to Ontology Mapping) to the OntoBiotope ontology. This suggests that the ontology is fully appropriate as a new fine-grained categorization plan for the BB'13 task.
The BB'11 corpus is a collection of encyclopedia-like web pages. They are comprehensible to non-biologists and they share many linguistic characteristics with scientific articles. The limited size of encyclopedic webpages, compared to full research articles, is appropriate for a first attempt at a novel task, while preserving the generalization of trained IE systems.
This section describes the corpora features, the event representation and the evaluation metrics for the two tasks. More details and examples can be found on the task website and the ACL BioNLP Shared Task articles that are devoted to the two tasks [22, 23].
Gene Regulation Network
Biological and molecular representation in GRN
Molecular mechanisms are frequently detailed in the literature on bacteria and are very useful to determine the nature of the regulation. Moreover, the events involve not only proteins and genes, but also parts of them, families, complexes, or even cascades of nested events.
The molecular level of the annotation model represents the role of the promoter of the regulated gene and the binding of the protein on the promoter as illustrated by Example 2 of Figure 2. The model defines a Master of Promoter event that relates the binding protein to the promoter and more generally the Bind to event relates the binding protein to any site. The model also defines a Promoter of event that relates the promoter to its gene and more generally the Site of event relates any binding site to its DNA region (gene or promoter).
Inference rules derive the Interaction arcs of the network and their types from these molecular low-level events. Inference is done in two steps, (i) inference of biological annotations from the molecular annotations, and (ii) inference of the network arcs from the biological annotations. Example 2 of Figure 2 illustrates the inference of molecular to biological annotations. Example 3 shows the result of the inference of the network arc Interaction: Binding.
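The two-step inference can be illustrated with a minimal Python sketch. It is not the official rule set: it assumes a single hypothetical rule that chains a Bind to event, a Site of event and a Promoter of event into a network arc labeled Binding. The `Event` type, the event kind strings and the entity names (`proteinA`, `site1`, `promoterB`, `geneB`) are invented for the illustration.

```python
from typing import NamedTuple


class Event(NamedTuple):
    kind: str    # hypothetical kind names: "Bind_to", "Site_of", "Promoter_of"
    source: str  # first argument (protein, site or promoter)
    target: str  # second argument (site, DNA region or gene)


def infer_binding_arcs(events, genes):
    """Chain molecular events into (agent, gene, "Binding") network arcs."""
    site_of = {}      # site -> DNA region (gene or promoter)
    promoter_of = {}  # promoter -> gene
    for e in events:
        if e.kind == "Site_of":
            site_of[e.source] = e.target
        elif e.kind == "Promoter_of":
            promoter_of[e.source] = e.target

    arcs = set()
    for e in events:
        if e.kind != "Bind_to":
            continue
        # Step (i): lift the binding from the site to its DNA region.
        region = site_of.get(e.target, e.target)
        # If the region is a promoter, follow it to the gene it belongs to.
        gene = promoter_of.get(region, region)
        # Step (ii): emit a network arc only when the chain ends on a gene.
        if gene in genes:
            arcs.add((e.source, gene, "Binding"))
    return arcs


events = [
    Event("Bind_to", "proteinA", "site1"),
    Event("Site_of", "site1", "promoterB"),
    Event("Promoter_of", "promoterB", "geneB"),
]
arcs = infer_binding_arcs(events, genes={"geneB"})
print(arcs)
```

Keeping the molecular events and the derived arcs as separate data structures mirrors the idea that the network is deduced from the text-bound annotations rather than annotated directly.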
The annotation model and the inference rules that produce the regulation network from the text annotations have been formally specified. The specifications were made available on the GRN BioNLP-ST'13 webpage, together with a tool that checks the predicted events against the annotation model and infers the network from these predictions.
Information extraction challenges in the GRN task
The GRN corpus has been designed by following the BioNLP annotation standard. It was selected from PubMed abstracts on Bacillus subtilis transcription. Altogether, the information represented by the corpus had to be sufficient to build a regulation network centered on the sporulation of Bs.
The annotation model is based on the Bacteria Genic Interaction (BI) model proposed in BioNLP-ST 2011. The manual annotation of the whole GRN corpus confirmed that it captures all of the descriptions of genic interactions without ambiguity. The regulation network that has been inferred from the annotations has been checked against state of the art knowledge [24, 25] with a focus on the sporulation of Bacillus subtilis. Its formal validity was checked by applying the inference rules to the corpus annotations.
Unlike most task corpora, the GRN corpus is a set of sentences isolated from PubMed abstracts. This is done for two reasons. First, isolated sentences provide all the regulation network information. Second, the prediction of the correct relations among the entities in the sentences is challenging, as previously demonstrated by the LLL and Bacteria Genic Interaction (BI) tasks. This challenge is the result of a high number of entities (almost 5 per sentence on average), their diversity (11 types) and the diversity of the events (15 types). The sentences are provided with the gold entities (genes, proteins, promoters, etc.) and their text span, allowing the participant methods to focus on relation extraction.
The GRN corpus was split into the training, development and test sets, ensuring that the distribution of event types and entities in the training and development sets was representative of the test set. The molecular level annotations account for 60% of the annotations, and the biological level interactions for 40%. At the biological level, Transcription and Regulation interactions combined account for half of the interactions.
Figures of the GRN corpus.
Prediction evaluation metrics
The evaluation must assess the quality of the predicted gene regulation network with regard to the knowledge contained in the corpus. The reference network has been inferred from the manual corpus annotations using the inference rules detailed above. Therefore, the evaluation compares the predicted networks to the reference network.
The evaluation accepts predicted networks in two possible formats. In the first, the submission includes the prediction of text-bound events from which the predicted network is inferred. Alternatively, regulation network submissions without any text annotations are also valid. These formats allow for a greater diversity of prediction methods. They may or may not use the low-level annotations and the inference rules. They can even make use of information from external sources in order to build a network prediction.
Two kinds of errors are noteworthy. The inversion error reverses the direction of an interaction by confusing the agent and the target roles. It is quite detrimental for the design of a systems biology model because the inversion can cause negative side effects that are costly to recover. The substitution error occurs when an arc is correctly predicted, but its label is incorrect. From the target application point of view, the cost of a posteriori recovery of a substitution error is equivalent to the correction of a false positive (FP) or a false negative (FN). However, an evaluation framework without substitution errors counts these errors twice: as a false negative and as a false positive. In the case of the F-score, calculated on the basis of FP and FN errors exclusively, both recall and precision account for each substitution. Since the F-score is the harmonic mean of recall and precision, it penalizes the substitution twice, overestimating the deviation from the reference. These standard metrics are therefore inadequate for GRN. Instead, we use the Slot Error Rate (SER). The SER is related to the Levenshtein distance and to the Word Error Rate (WER) that is widely used in speech recognition evaluations.
$$\mathrm{SER} = \frac{S + D + I}{N}$$

where S is the number of substitutions (mismatches), D the number of deletions (false negatives), I the number of insertions (false positives) and N the number of items (arcs) in the reference. The SER indicates the proportion of errors in a prediction in comparison to the reference. The lower the SER, the better the prediction. A SER equal to zero means that the prediction is perfect. However, the SER is unbounded since the number of insertions is unbounded. By design, the SER requires an analysis that isolates the substitution errors and allots them the same weight as insertions and deletions.
The GRN evaluation algorithm handles the arc prediction errors for all of the pairs of genes mentioned in the test set as shown in Figure 3. The number of errors for the whole graph is therefore the sum of errors for each individual pair.
Recall, precision and F-score are also computed:

$$\mathrm{Recall} = \frac{M}{N}, \qquad \mathrm{Precision} = \frac{M}{P}, \qquad F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where M is the number of correct matches and P the number of predicted arcs. However, the F-score should not be used as an absolute indicator of the performance of a prediction. The ranking of the participant systems according to either the SER or F-score may show discrepancies depending on the proportion of substitutions amongst errors. If two predictions exhibit comparable SER scores, one may have a significantly lower F-score if it contains more substitutions than the other.
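The difference between the two measures can be made concrete with a small Python sketch. It assumes a simple per-pair error analysis that is not taken from the official evaluation tool: for each directed gene pair, arcs with identical labels are matches, leftover reference and predicted arcs are paired off as substitutions, and what remains counts as deletions or insertions. The gene names and both predictions are invented.

```python
from collections import Counter


def count_errors(reference, predicted):
    """Return (matches, substitutions, deletions, insertions) over all gene pairs."""
    ref_by_pair, pred_by_pair = {}, {}
    for agent, target, label in reference:
        ref_by_pair.setdefault((agent, target), Counter())[label] += 1
    for agent, target, label in predicted:
        pred_by_pair.setdefault((agent, target), Counter())[label] += 1

    m = s = d = i = 0
    for pair in set(ref_by_pair) | set(pred_by_pair):
        ref = ref_by_pair.get(pair, Counter())
        pred = pred_by_pair.get(pair, Counter())
        matched = sum((ref & pred).values())     # same pair, same label
        ref_left = sum(ref.values()) - matched   # unmatched reference arcs
        pred_left = sum(pred.values()) - matched  # unmatched predicted arcs
        subs = min(ref_left, pred_left)          # assumed pairing rule
        m += matched
        s += subs
        d += ref_left - subs
        i += pred_left - subs
    return m, s, d, i


def ser(reference, predicted):
    _, s, d, i = count_errors(reference, predicted)
    return (s + d + i) / len(reference)


def f_score(reference, predicted):
    m, _, _, _ = count_errors(reference, predicted)
    if m == 0:
        return 0.0
    recall, precision = m / len(reference), m / len(predicted)
    return 2 * precision * recall / (precision + recall)


reference = [("A", "B", "Activation"), ("A", "C", "Inhibition"),
             ("B", "C", "Binding"), ("C", "D", "Transcription")]
# Two correct arcs and two arcs with a wrong label (substitutions).
with_substitutions = [("A", "B", "Activation"), ("A", "C", "Activation"),
                      ("B", "C", "Transcription"), ("C", "D", "Transcription")]
# Two correct arcs and two missing arcs (deletions).
with_deletions = [("A", "B", "Activation"), ("C", "D", "Transcription")]

for name, pred in [("substitutions", with_substitutions), ("deletions", with_deletions)]:
    print(name, ser(reference, pred), round(f_score(reference, pred), 3))
```

Both toy predictions make two errors and obtain the same SER of 0.5, but the one whose errors are substitutions gets the lower F-score (0.5 against about 0.667), because each substitution lowers both recall and precision.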
The SER measure used in GRN is thus aware of the recovery cost with respect to the application need. Its computation breaks down the prediction errors in a way that meets the expectations of the target biology community.
The predicted gene regulation networks are used for different purposes, whose requirements vary in terms of accuracy. We distinguished two main usages for which we conducted two complementary alternative evaluations. First, Systems Biology applications require very high quality and manually curated models. The predicted network cannot be used directly as is; rather, it facilitates bibliographic searches by pointing out relevant sections in the literature. In this context, the important information is contained in the topology of the network rather than in the exact categorization of the regulations. We therefore specify the shape evaluation by removing the regulation types, both in the reference and the predicted networks. Multiple arcs between gene pairs are reduced to one. Thus, in the shape evaluation, there is either no arc or there is one arc between two genes. Furthermore, there are no longer any substitution errors. There is no objection to the use of F-score for the shape evaluation and the F-score ranking is the same as the SER ranking.
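A minimal Python sketch of the shape evaluation follows. It assumes that arcs are given as (agent, target, label) triples and that arc direction is kept when labels are removed; the function names and the toy networks are invented for the illustration.

```python
def to_shape(arcs):
    """Drop regulation types; multiple arcs between a gene pair collapse to one."""
    return {(agent, target) for agent, target, _label in arcs}


def shape_scores(reference, predicted):
    """Return (SER, F-score) of the untyped networks."""
    ref, pred = to_shape(reference), to_shape(predicted)
    matches = len(ref & pred)
    deletions = len(ref - pred)   # reference arcs that were missed
    insertions = len(pred - ref)  # predicted arcs absent from the reference
    # Without labels there are no substitutions, so SER = (D + I) / N.
    ser = (deletions + insertions) / len(ref)
    if matches == 0:
        return ser, 0.0
    recall, precision = matches / len(ref), matches / len(pred)
    return ser, 2 * precision * recall / (precision + recall)


reference = [("A", "B", "Activation"), ("A", "B", "Binding"), ("B", "C", "Inhibition")]
predicted = [("A", "B", "Inhibition"), ("C", "D", "Regulation")]
print(to_shape(reference))
print(shape_scores(reference, predicted))
```

In the toy example the wrongly typed arc from A to B becomes a correct match once types are removed, which is why shape scores are expected to be at least as good as typed scores.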
The second alternate evaluation focuses on gene regulations of effect types. Effect regulations indicate the functional influence of an agent on a target gene. As for pathway models, the main expectation for a regulation network is a graph with arcs labeled with effect types, i.e. the generic Regulation type and the Activation, Inhibition, Requirement types. We thus designed the effect evaluation framework by removing all mechanism regulation types (i.e. Binding and Transcription). If there is a single mechanism type arc between two genes, then this arc is relabeled as a generic Regulation. If there are one or two effect type arcs along with a mechanism arc, then the mechanism arc is removed (it is in fact relabeled as a Regulation arc and it becomes redundant with the effect arcs). Since there are different types of effects, substitution errors may occur and therefore the F-score remains inaccurate and SER is preferred.
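The relabeling rules of the effect evaluation can be sketched in Python as follows. The sketch assumes (agent, target, label) triples and directed gene pairs, and it generalizes the two cases described above: mechanism arcs are dropped whenever an effect arc exists for the same pair, otherwise they are replaced by a single generic Regulation arc. The function name and the toy network are invented.

```python
MECHANISM_TYPES = {"Binding", "Transcription"}


def to_effect(arcs):
    """Rewrite a typed network so that it only contains effect-type arcs."""
    labels_by_pair = {}
    for agent, target, label in arcs:
        labels_by_pair.setdefault((agent, target), set()).add(label)

    effect_arcs = set()
    for (agent, target), labels in labels_by_pair.items():
        effects = labels - MECHANISM_TYPES
        if not effects:
            # Only mechanism arcs: relabel them as one generic Regulation arc.
            effects = {"Regulation"}
        # Otherwise the mechanism arcs are redundant and simply dropped.
        for label in effects:
            effect_arcs.add((agent, target, label))
    return effect_arcs


network = [
    ("A", "B", "Binding"),                               # mechanism only
    ("A", "C", "Transcription"), ("A", "C", "Activation"),  # mechanism + effect
    ("B", "C", "Inhibition"),                            # effect only
]
print(sorted(to_effect(network)))
```

Applying the same rewriting to the reference and to the prediction before computing the SER keeps the comparison consistent between the two networks.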
These three evaluation settings: the official shared task evaluation, the network shape evaluation and the evaluation of effect regulations, provide useful means to assess the participant method results from several different perspectives.
The BB task aims to extract text-bound entities and events. It involves three kinds of entities: bacteria, geographical places and other habitats. The last two are defined as bacteria biotopes. All entities have to be detected, meaning that their character position in the document must be predicted. The entity spans may be discontinuous. This extension of BB'11 allows us to represent entity strings that are segmented into non-contiguous parts. For example, the animal intestine entity in the text animal and human intestine is discontinuous and overlaps with the human intestine entity. Only habitats need to be categorized. Habitat categorization is characterized as the assignment of relevant concepts of the OntoBiotope ontology. The version of OntoBiotope that is used for the BB task defines 1,756 concepts in a hierarchical structure. The deepest point of the ontology contains ten levels.
The two BB task events (Localization and PartOf) are binary events. Localization links bacteria entities to their biotope entities (geographical and habitats). Many habitat entities are physically embedded such as organs in hosts, or substances in containers. This information is particularly important in the case of hosts where the interaction of the bacteria with an organ strongly depends on the host itself. Therefore, the role of the PartOf event is to link sub-parts of living organisms that are bacterial hosts, to the living organisms themselves.
The novel goal of entity categorization with a large ontology deserved a specific sub-task. The goal of sub-task 1 is the detection and categorization of habitat entities.
The goal of sub-task 2 is the extraction of the Localization and PartOf events between entities of the three types. The annotations of all candidate entities are provided to the participants. By separating this sub-task, the measure of the event extraction quality is independent of the measure of the entity extraction quality.
Finally, the goal of the third sub-task is the complete task, including the detection of entities of their three types without their categorization and the extraction of the two types of events.
Figures of the Bacteria Biotope corpus annotation.
The morphological similarity between ontology concepts and the entities to be tagged can be used to facilitate the categorization. We found that 60% of BB corpus habitat entities have forms different from the concept or synonym that they should be tagged with. A straightforward and naive strategy consists of a direct match of ontology habitat entries to the test text after lemmatization. It yields a high Slot Error Rate (SER) of 0.74, which makes it a weak baseline. The participant scores range from 0.46 to 0.66 SER, significantly better than the baseline. This shows that category assignment is a non-trivial problem.
The repetition rate of the entity occurrences and of their categories is an important factor to take into account when designing the prediction method. A quarter of the habitat entities occur more than once in BB'13, which is a significant proportion. It represents half of the total number of habitat occurrences. Only a small number of repeated entities (112 occurrences) belong to several different ontology categories. Consequently, the propagation of the most likely category annotations of a given entity to all its occurrences is a first-line strategy.
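The propagation strategy can be sketched in a few lines of Python. This is an illustration under assumptions, not a participant system: mentions are compared by lowercased surface form, the most frequent category seen for a form is assigned to all of its occurrences, and the category identifiers (`cat_soil`, `cat_field`) are invented.

```python
from collections import Counter, defaultdict


def propagate_categories(mentions):
    """mentions: list of (surface_form, category or None), in document order.

    Returns one category per mention, using the most frequent category
    observed for the same (lowercased) surface form.
    """
    votes = defaultdict(Counter)
    for form, category in mentions:
        if category is not None:
            votes[form.lower()][category] += 1

    propagated = []
    for form, category in mentions:
        counts = votes.get(form.lower())
        # Mentions of forms that were never categorized stay unchanged.
        propagated.append(counts.most_common(1)[0][0] if counts else category)
    return propagated


mentions = [
    ("soil", "cat_soil"),
    ("Soil", None),         # uncategorized occurrence of a known form
    ("soil", "cat_soil"),
    ("soil", "cat_field"),  # minority category for the same form
    ("gut", None),          # form that was never categorized
]
print(propagate_categories(mentions))
```

The minority category is overwritten in this sketch, which is only acceptable because, as noted above, few repeated entities belong to several different categories.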
Distribution of discontinuous annotations in BB'13 corpus.
The PartOf and Localization events account for one quarter and three quarters of the events, respectively. The frequency of PartOf events in the corpus is far from negligible. The scores obtained by BB'11 participant methods on the two event types were similar, with a slightly lower score for PartOf.
Distribution of the event arguments in the text (percentage of intra-sentence and intra-paragraph events).
Human annotators frequently cannot choose a single bacterium name occurrence as the valid argument of a Localization event, to the exclusion of all other occurrences of the name. In this case, the annotator attaches an equivalence set of relevant bacteria occurrences to the event. The prediction of any of the members of the equivalence set is considered equal, and therefore evaluated as valid.
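The following Python sketch shows how equivalence sets could be handled when counting valid Localization predictions. The data layout is an assumption: a predicted event is a (bacterium id, biotope id) pair, a reference event is a (set of equivalent bacterium ids, biotope id) pair, and each reference event can validate at most one prediction. The identifiers are invented.

```python
def count_valid_events(predicted, reference):
    """Count predicted events that match a not yet matched reference event."""
    remaining = list(reference)
    valid = 0
    for bacterium, biotope in predicted:
        for index, (equivalent_bacteria, ref_biotope) in enumerate(remaining):
            # Any member of the equivalence set is an acceptable argument.
            if biotope == ref_biotope and bacterium in equivalent_bacteria:
                valid += 1
                del remaining[index]  # a reference event is used only once
                break
    return valid


reference = [({"T1", "T4", "T7"}, "T2"), ({"T9"}, "T3")]
predicted = [
    ("T4", "T2"),  # valid: T4 belongs to the equivalence set
    ("T7", "T2"),  # same reference event again, not counted twice
    ("T5", "T3"),  # wrong bacterium occurrence
]
print(count_valid_events(predicted, reference))
```

Consuming each reference event once is a design choice of this sketch; it prevents a system from inflating its score by predicting every member of an equivalence set.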
Rate of candidate arguments that belong to an event (Localization or PartOf), given as the percentage of entities involved in events.
The preparation of the BB'13 corpus was completed following a three-step annotation process, for which we used the AlvisAE Annotation Editor. First, the AlvisNLP pipeline automatically pre-annotated the entities and their categories to speed up the manual annotation process. Next, eight biologists and computer scientists performed a double-blind manual annotation after a training period. They followed detailed guidelines that limited the number of different interpretations of the task goals. Finally, the conflict resolution phase resulted in the final annotation. This final phase was done using the AlvisAE conflict detection tool. The annotators also used a Forum and a Wiki to debate guideline interpretations and to record their decisions. The guidelines were revised accordingly. A revision tracking tool identified and displayed the annotations that should be checked because they were potentially affected by subsequent revisions. The annotators achieved a consensus annotation by debating the conflicts.
Automatic pre-annotation scores by AlvisNLP (detection and categorization; categorization with reference entities).
The ToMap categorization method is very efficient (F-measure = 0.90) when applied to the reference entities. The F-measure decreases by 24 points when ToMap is applied to predicted entities (F-measure = 0.66). A detailed analysis of the errors showed that the entity detection errors are in general deletion errors (429), rather than insertion errors (31) or incorrect boundaries. In other words, the BioYaTeA term analysis method is accurate for the prediction of entity boundaries. On the other hand, the ToMap method filters out too many correct entities when assigning categories. Conversely, only a few of the incorrect entities are preserved and ToMap generally assigns the correct category to the remaining entities. In terms of time gain for the annotators, the pre-annotation method significantly facilitates the entity categorization, which was the main issue. On the other hand, the missed entities had to be manually identified by the annotators.
Prediction evaluation metrics
The BB'11 metrics were used when possible for the sake of comparison of the BB'13 results to previous ones. The novelty of sub-task 1 required the design of new suitable metrics.
Measure for entity detection and categorization
where N is the number of reference entities and P the number of predicted entities. The prediction of imprecise entity boundaries and approximate categories should be counted as partial errors and not as full errors since an approximation of the biotope reference is better than no prediction at all. S represents the sum of the similarities S between reference entities and their corresponding partial matching entity. For the same reasons that were given above for GRN, the SER measure is more appropriate here than the F-measure, since the F-measure overestimates the partial match errors.
In the SER formula, Insertions represent false positives i.e. predicted entities that do not overlap with any reference entity. Deletions represent false negatives, i.e. reference entities that do not overlap with any predicted entity. Substitutions are inversely proportional to the similarity between the predictions and the references that partially overlap.
The similarity between the predicted entity and the reference entity depends on two criteria: the similarity of their entity boundaries $S_e$ and the similarity of their categories $S_c$.
In BB'11, the similarity $S_e$ of entity pairs was measured as the ratio between the size of the overlapping text segments and the size of the two merged text segments. $S_e$ is equal to 1 (maximum) if the two entities are equal and tends to be zero for barely overlapping segments. Formally, it is a variant of the Jaccard index applied to segments. The analysis of the BB'11 participant results demonstrated its significance. We extended this measure to take into account the discontinuity of entity spans.
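A minimal Python sketch of a Jaccard-style boundary similarity that accepts discontinuous spans is given below. The exact extension used by the task is not detailed here, so the sketch assumes the simplest one: each entity is a list of (start, end) character fragments with exclusive end offsets, and the Jaccard index is computed on the sets of covered character positions. The offsets refer to the example string animal and human intestine.

```python
def char_offsets(fragments):
    """Set of character positions covered by a list of (start, end) fragments."""
    offsets = set()
    for start, end in fragments:  # end offset is exclusive
        offsets.update(range(start, end))
    return offsets


def boundary_similarity(ref_fragments, pred_fragments):
    """Jaccard index between the character positions of two entities."""
    ref, pred = char_offsets(ref_fragments), char_offsets(pred_fragments)
    union = ref | pred
    return len(ref & pred) / len(union) if union else 0.0


# Offsets in "animal and human intestine":
# "animal" = (0, 6), "human" = (11, 16), "intestine" = (17, 26)
animal_intestine = [(0, 6), (17, 26)]  # discontinuous reference entity
human_intestine = [(11, 26)]           # overlapping contiguous entity
whole_phrase = [(0, 26)]               # prediction with too wide boundaries

print(boundary_similarity(animal_intestine, animal_intestine))
print(boundary_similarity(animal_intestine, whole_phrase))
print(boundary_similarity(animal_intestine, human_intestine))
```

Working on sets of character positions keeps the measure symmetric and makes contiguous entities a special case with a single fragment.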
In the same way as for entity boundaries, approximate predictions of the entity categories should not be counted as full errors. Moreover, the evaluation should favor ancestor predictions over sibling predictions, since the prediction of an overly general category remains correct, even though it is less precise. Additionally, the prediction information wealth should decrease faster than the number of nodes on the path from the prediction to the reference. The Wang semantic similarity fits these requirements. It has been successfully applied in previous work to compute semantic distances in ontologies.
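The behavior of the Wang semantic similarity can be checked on a toy hierarchy with the Python sketch below. It follows the usual definition of the measure: each concept contributes 1 to itself and a contribution multiplied by a weight w at every step up to an ancestor, and the similarity of two concepts is the sum of the contributions on their shared ancestors divided by the sum of all their contributions. The weight used by the task is not stated in this section, so w = 0.8 is only an illustrative default, and the small hierarchy is invented and is not part of OntoBiotope.

```python
def s_values(concept, parents, w):
    """Contribution of each ancestor of `concept` (and of the concept itself)."""
    values = {concept: 1.0}
    stack = [concept]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            contribution = w * values[node]
            # Keep the best contribution when several paths reach a parent.
            if contribution > values.get(parent, 0.0):
                values[parent] = contribution
                stack.append(parent)
    return values


def wang_similarity(a, b, parents, w=0.8):
    """Wang semantic similarity between concepts a and b in a DAG."""
    sa, sb = s_values(a, parents, w), s_values(b, parents, w)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


# Hypothetical hierarchy: child -> list of parents.
parents = {
    "cow": ["mammal"],
    "sheep": ["mammal"],
    "mammal": ["animal"],
    "animal": ["habitat"],
}

print(round(wang_similarity("cow", "mammal", parents), 3))  # ancestor prediction
print(round(wang_similarity("cow", "sheep", parents), 3))   # sibling prediction
```

On this toy hierarchy, predicting the ancestor mammal for the reference cow scores higher than predicting the sibling sheep, which is the first requirement listed above.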
Measure for event prediction
The evaluation of sub-tasks 2 and 3 results measures the quality of the event predictions. Additionally, the evaluation of sub-task 3 measures the quality of the arguments of the correct events. These goals can be formalized as a categorization problem of all the pairs of entities for which recall, precision and F1 measures are appropriate. We then used the same setting as for BB'11. The accuracy of the biotope argument is measured by $S_e$ as in sub-task 1, whereas the accuracy of the bacteria argument is measured using a strict equality.
Bacteria Track results
Five teams participated in BB, five teams in GRN and two teams participated in both. We analyzed the relevance of the corpus and evaluation metrics, the method strengths and their relevance for biological applications with respect to the results. The comparison of the 2013 results to the previous series that took place two years ago shows the evolution of community methods on challenging information extraction tasks.
Gene Regulation Network
Official scores of the GRN task.
The distinction between the three types of errors (deletion, insertion and substitution) allows us to better qualify the strengths and weaknesses of the predictions. Except for IRISA, all predictions have roughly the same error profile: a high number of missed arcs and a relatively low number of substitutions and incorrect insertions of arcs. This is reflected by the low recall that ranges from 13% to 34%, while the precision ranges from 44% to 68%.
The IRISA submission shows a much more balanced profile with nearly the same number of substitutions, deletions and insertions. This submission is also bolder, since its network contains more than twice as many arcs (91) as the others (the second largest, by the University of Ljubljana, has 44 arcs).
The results of the shape evaluation (without regulation types) are much more optimistic, even though the relative ranking of submissions remains unchanged, with the notable exception of IRISA (from fourth to first) with a 75% F-score and an impressive gain of 0.40 SER. Regardless of their ranking, all predictions yield a high precision score: the highest is 88% for the University of Ljubljana team and the lowest is 74% for IRISA, whose main strength compared to the others is recall.
We conclude that systems are better at predicting regulations, but less accurate at typing them. The TEES-2.1, EVEX and University of Ljubljana submissions show the smallest SER gains (+0.12, +0.12 and +0.13 respectively) suggesting that these systems are slightly more accurate in choosing regulation types.
In the effect evaluation (without mechanism types, Binding and Transcription), the ranking of submissions remains unchanged with a nominal SER change for all. We conclude that the prediction of mechanism labels is quite accurate for all systems and that the most challenging aspect of the GRN task is the determination of effect labels (Activation, Inhibition, and Requirement), which is most important for biological applications.
[Table: Participation in the Bacteria Biotope task, by sub-task (sub-tasks 1, 2 and 3).]
Habitat detection and categorization (sub-task 1)
Our analysis of the sub-task 1 results focuses on how the detection of entities and their categorization interact. We are also interested in the way in which the task metrics can help evaluate the method relevance for real-life applications.
[Table: Sub-task 1 results in BB'13: (a) official results; (b) habitat detection; (c) category assignment with relaxed habitat boundaries.]
The new categorization challenge has been successfully completed despite its novelty. As highlighted in the Background section, there is a strong need for methods capable of normalizing biotope mentions, without necessarily extracting the exact entity text. These good results on sub-task 1 are thus very promising with respect to the application needs of the bioinformatics domain.
[Table: Sub-task 1 entity detection results, with relaxed boundaries.]
However, there is a need in bioinformatics for methods capable of detecting biotope mentions without high boundary accuracy. Fast reading of relevant scientific documents is a significant example of such an application. This result, in fact, meets the core of the application needs, which is promising for future developments.
Event extraction (sub-task 2)
[Table: Official results of sub-task 2 of the BB'13 task.]
[Table: Results of sub-task 2 measured on intra-sentence events, with the difference from the full-task scores.]
The methods of IRISA and Boun are slightly affected by the intra-sentence evaluation, since their scores decrease by 3 and 7 points, respectively. Different reasons can be suspected. The IRISA method is based on a word-based language model that does not use sentence boundaries. The method may be sensitive to the length of the argument context, which is shorter and less discriminant on average in single sentences than in the general case. This could explain the precision decrease for intra-sentence extraction (-8 points), which may be due to over-generalization. The Boun method selects the first mention of bacteria in a given paragraph as the bacteria argument of any Localization event in the paragraph. The intra-sentence evaluation focuses on local events, where this strategy fails more frequently, which explains the decrease in precision (-11 points). The impact of intra-sentence event extraction compared to the regular task thus differs widely depending on the participant methods. Linguistic strategies such as anaphora resolution for linking entity references are critical for certain methods, such as syntactic dependency-based approaches, but potentially not useful for sequence-based learning.
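The first-mention strategy attributed to the Boun method can be pictured with the short sketch below. It is only an illustration of the heuristic as described in this paragraph, not the participants' code; the tuple layout, the entity type names and the example mentions are assumptions made for the example.

```python
def first_mention_localizations(mentions):
    """Pair every habitat with the first bacteria mention of a paragraph.

    `mentions` is a list of (start_offset, entity_type, text) tuples
    (a hypothetical layout chosen for this sketch).
    """
    ordered = sorted(mentions)
    bacteria = [m for m in ordered if m[1] == "Bacteria"]
    if not bacteria:
        return []
    anchor = bacteria[0]
    return [(anchor[2], m[2]) for m in ordered if m[1] == "Habitat"]


# Hypothetical paragraph with two bacteria and two habitat mentions.
paragraph_mentions = [
    (0, "Bacteria", "Bacillus subtilis"),
    (30, "Habitat", "soil"),
    (60, "Bacteria", "Escherichia coli"),
    (85, "Habitat", "gut"),
]

events = first_mention_localizations(paragraph_mentions)
print(events)
```

Every habitat, including the one that follows the second bacterium, is attached to the first bacterium of the paragraph. When the second habitat actually belongs to the nearer bacterium, the heuristic produces a false positive, which is consistent with the lower precision on local events discussed above.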
Entity detection and event extraction (sub-task 3)
[Table: Scores on sub-task 3 of the BB'13 task.]
[Table: Scores with relaxed biotope boundaries.]
[Table: Scores with relaxed bacteria boundaries.]
[Table: Scores on biotope detection in sub-task 3 of the BB'13 task.]
[Table: Scores on intra-sentence Localization extraction in sub-task 3 with relaxed entity boundaries.]
Not surprisingly, the best scores are achieved by both methods when both bacteria and biotope boundaries are relaxed and only intra-sentence event extraction is measured (Table 15).
This detailed analysis of the evaluation measures thus provides valuable suggestions and ideas regarding the issues that need to be addressed. Entity boundary detection (especially for bacteria) and inter-sentence events seem to be the main hindrances. The first challenge could be easier to address than the second, given the high scores obtained in BB'11.
Conclusions and Discussion
The Bacteria Track is motivated by the evolution of the biology field. The biology research domain is undergoing a shift in terms of the scale of experiments and their analysis. High throughput experiments provide comprehensive data over a large range of species. Whole cell models and cross-species metagenomics studies are conceivable provided experimental data is accurately linked with knowledge corpora contained in the literature. The goal of the Bacteria Track tasks is to demonstrate that the BioNLP community is well-grounded to accompany the progress of Microbiology research.
The GRN task targets biological processes and whole cell models, whereas BB targets ecological information for a large spectrum of bacteria species. The GRN and BB task definitions and evaluation procedures are tailored to the biology knowledge modeling goals. The two corpora provide a benchmark for BioNLP IE systems that aim to measure their ability to build relevant knowledge bases. The results of participant systems on the Shared Task provide invaluable insight into their strengths and limits, from which a number of conclusions can be drawn regarding the most promising research trends. For both tasks we have proposed the Slot Error Rate (SER) as a relevant evaluation measure. The SER has effectively allowed us to discriminate the performance of the submitted predictions. In particular, it contributed to a better characterization of the strengths of each submission by distinguishing three types of errors: Insertions (false positives), Deletions (false negatives) and Substitutions (mismatches).
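The three error types can be made concrete with a small sketch of the SER, computed here as (S + D + I) / N with N the number of reference slots, following the usual definition of the measure. The matching rule (arcs keyed by an agent/target pair, with the regulation type as the slot value) is an assumption made for this example and may differ from the official evaluation tools; the gene names are hypothetical.

```python
def slot_error_rate(reference, predicted):
    """Return (SER, substitutions, deletions, insertions).

    `reference` and `predicted` map (agent, target) arcs to a type label.
    """
    substitutions = sum(
        1 for arc, label in predicted.items()
        if arc in reference and reference[arc] != label
    )
    insertions = sum(1 for arc in predicted if arc not in reference)
    deletions = sum(1 for arc in reference if arc not in predicted)
    ser = (substitutions + deletions + insertions) / len(reference)
    return ser, substitutions, deletions, insertions


# Hypothetical networks; gene names are placeholders.
reference_net = {
    ("geneA", "geneB"): "Activation",
    ("geneA", "geneC"): "Inhibition",
    ("geneB", "geneD"): "Activation",
    ("geneC", "geneD"): "Inhibition",
    ("geneD", "geneE"): "Activation",
}
predicted_net = {
    ("geneA", "geneB"): "Activation",   # correct
    ("geneA", "geneC"): "Activation",   # substitution (wrong type)
    ("geneD", "geneE"): "Activation",   # correct
    ("geneD", "geneA"): "Inhibition",   # insertion (arc not in reference)
}

typed = slot_error_rate(reference_net, predicted_net)

# Ignoring the types: every arc gets the same label, so no substitution remains.
untyped = slot_error_rate(
    {arc: "Regulation" for arc in reference_net},
    {arc: "Regulation" for arc in predicted_net},
)
print(typed, untyped)
```

The second call mirrors an evaluation that disregards regulation types: the substitution disappears and only deletions and insertions contribute to the error rate.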
Gene Regulation Network task
The goal of the GRN task is to present systems biologists with a regulation network, rather than a set of text-bound events. Participant systems are able to produce results that biologists can immediately grasp and evaluate. Methods based on low-level events achieved this using the three levels of annotation and the domain-expert inference rules. The production of knowledge from text and domain rules is a new IE paradigm that differs from the usual BioNLP information extraction framework. The analysis of the submissions showed that participants predicted mainly the low-level events that were involved in the most formalized inference rules. The other low-level events were rarely predicted or not predicted at all. More focused and distinct processing for the extraction of elementary events and the design of the abstract regulation network should improve the quality of the results. In the future, additional formalized inference rules should also allow systems to focus on the extraction of elementary events and improve the quality of inferred regulation networks.
The annotation model of the GRN corpus has characteristics in common with the GE and GRO corpus models [42, 43], which also address molecular biology questions. A unified shared model for the tasks should make the generalization of the participant systems easier for all three tasks. The EVEX submission was a promising attempt towards this goal. Moreover, the GRO annotation model relies heavily on the Gene Ontology, which could be used for the convergence of the three models.
Bacteria Biotopes task
The results on the BB sub-task 1 (entity recognition and categorization) are very promising with respect to the novelty of the goal. The evaluation score combines boundary and categorization accuracy in a single measure. We have shown that incorrect boundaries negatively impact the categorization and are thus penalized twice. An even more application-centered evaluation metric might reduce the impact of boundary accuracy. The results on the BB sub-tasks 2 and 3 (event extraction with or without gold entities) are below the scores on similar extraction tasks that contain only a few event and entity types. After error analysis of the predictions, we indicated plausible means of improvement. In particular, relevant approaches for the prediction of bacteria names could be applied. The long-distance events between bacteria and their biotopes deserve a specific treatment.
As was done after previous BioNLP shared tasks, the data, evaluation services and resources for the two tasks were made available. The test answers are not public in order to ensure that the comparison of future results remains possible.
This work has been partially supported by the Quaero program funded by OSEO, the French state agency for innovation, by the Program Investing in the Future (Grant n°ANR-10-IDEX-0003-02) funded by the French National Research Agency, and by the INRA MEM OntoBiotope Network.
The publication costs of this article were funded by the Institut National de la Recherche Agronomique (INRA, France).
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 10, 2015: BioNLP Shared Task 2013: Part 1. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S10.
- Hirschman L, Yeh A, Blaschke C, Valencia A: Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics. 2005, 6 (Suppl 1): S1. 10.1186/1471-2105-6-S1-S1.
- Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J: Extracting bio-molecular events from literature - The BioNLP'09 Shared Task. Computational Intelligence. 2011, 27 (4): 513-540. 10.1111/j.1467-8640.2011.00398.x.
- Nédellec C, Bossy R, Kim JD, Kim JJ, Ohta T, Pyysalo S, Zweigenbaum P: Overview of BioNLP Shared Task 2013. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 1-7.
- Bossy R, Jourde J, Manine AP, Veber P, Alphonse E, van de Guchte M, Bessières P, Nédellec C: BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics. 2012, 13 (Suppl 11): S3. 10.1186/1471-2105-13-S11-S3.
- Nédellec C: Learning Language in Logic - Genic Interaction Extraction Challenge. Proceedings of the Learning Language in Logic 2005 Workshop at the International Conference on Machine Learning. 2005, 31-37.
- Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muñiz-Rascado L, García-Sotelo JS, et al: RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Research. 2013, 41 (Database issue): D203-D213.
- Yilmaz A, Mejia-Guerra MK, Kurz K, Liang X, Welch L, Grotewold E: AGRIS: the Arabidopsis Gene Regulatory Information Server, an update. Nucleic Acids Research. 2011, 39 (Database issue): D1118-D1122.
- de Jong H: Modeling and simulation of genetic regulatory systems: a literature review. J Computational Biology. 2002, 9 (1): 67-103. 10.1089/10665270252833208.
- Kitano H: Computational systems biology. Nature. 2002, 420 (6912): 206-210. 10.1038/nature01254.
- Oda K, Kim JD, Ohta T, Okanohara D, Matsuzaki T, Tateisi Y, Tsujii J: New challenges for text mining: mapping between text and manually curated pathways. BMC Bioinformatics. 2008, 9 (Suppl 3): S5. 10.1186/1471-2105-9-S3-S5.
- Rodríguez-Penagos C, Salgado H, Martínez-Flores I, Collado-Vides J: Automatic reconstruction of a bacterial regulatory network using Natural Language Processing. BMC Bioinformatics. 2007, 8: 293. 10.1186/1471-2105-8-293.
- Manine AP, Alphonse E, Bessières P: Learning ontological rules to extract multiple relations of genic interactions from text. Intl J Medical Informatics. 2009, 78 (12): e31-e38. 10.1016/j.ijmedinf.2009.03.005.
- Herrgård MJ, Covert MW, Palsson BØ: Reconstruction of microbial transcriptional regulatory networks. Current Opinion in Biotechnology. 2004, 15 (1): 70-77. 10.1016/j.copbio.2003.11.002.
- Ratkovic Z, Golik W, Warnier P: Event extraction of bacteria biotopes: a knowledge-intensive NLP-based approach. BMC Bioinformatics. 2012, 13 (Suppl 11): S8. 10.1186/1471-2105-13-S11-S8.
- OntoBiotope. [http://genome.jouy.inra.fr/bibliome/MEM-OntoBiotope/]
- GenBank. [http://www.ncbi.nlm.nih.gov/genbank]
- GOLD. [http://www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi]
- EnvO. [http://environmentontology.org]
- PubMedBiotope. [http://bibliome.jouy.inra.fr/demo/ontobiotope/alvisir2/webapi/search]
- Golik W, Warnier P, Nédellec C: Corpus-based extension of termino-ontology by linguistic analysis: a use case in biomedical event extraction. WS 2 Workshop Extended Abstracts, 9th International Conference on Terminology and Artificial Intelligence. 2011, 37-39.
- BioNLP-ST'13. [https://sites.google.com/site/bionlpst2013/]
- Bossy R, Bessières P, Nédellec C: BioNLP Shared Task 2013 - an overview of the Gene Regulation Network task. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 153-160.
- Bossy R, Golik W, Ratkovic Z, Bessières P, Nédellec C: BioNLP Shared Task 2013 - an overview of the bacteria biotope task. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, Association for Computational Linguistics (ACL), 74-82.
- Sierro N, Makita Y, de Hoon M, Nakai K: DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Research. 2008, 36 (Database issue): D93-D96.
- Michna RH, Commichau FM, Tödter D, Zschiedrich CP, Stülke J: SubtiWiki - a database for the model organism Bacillus subtilis that links pathway, interaction and expression information. Nucleic Acids Research. 2014, 42 (Database issue): D692-D698.
- Higgins D, Dworkin J: Recent progress in Bacillus subtilis sporulation. FEMS Microbiol Rev. 2012, 36 (1): 131-148. 10.1111/j.1574-6976.2011.00310.x.
- Makhoul J, Kubala F, Schwartz R, Weischedel R: Performance measures for information extraction. Proceedings of DARPA Broadcast News Workshop. 1999, 249-252.
- Galibert O, Quintard L, Rosset S, Zweigenbaum P, Nédellec C, Aubin S, et al: Named and specific entity detection in varied data: The Quæro Named Entity baseline evaluation. Proceedings of the International Conference on Language Resources and Evaluation. 2010, 4353-4358.
- Papazian F, Bossy R, Nédellec C: AlvisAE: a collaborative Web text annotation editor for knowledge acquisition. LAW VI '12 Proceedings of the Sixth Linguistic Annotation Workshop. 2012, 149-152.
- Golik W, Bossy R, Ratkovic Z, Nédellec C: Improving term extraction with linguistic analysis in the biomedical domain. Research in Computing Science. 2013, 70: 157-172.
- Aronson AR, Lang FM: An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association. 2010, 17 (3): 229-236. 10.1136/jamia.2009.002733.
- Wang J, Du Z, Payattakool R, Yu P, Chen CF: A New Method to Measure the Semantic Similarity of GO Terms. Bioinformatics. 2007, 23 (10): 1274-1281. 10.1093/bioinformatics/btm087.
- Bettembourg C, Diot C, Burgun A, Dameron O: GO2PUB: Querying PubMed with semantic expansion of gene ontology terms. J Biomed Semantics. 2012, 3 (1): 7. 10.1186/2041-1480-3-7.
- Hakala K, Van Landeghem S, Salakoski T, Van de Peer Y, Ginter F: EVEX in ST'13: Application of a large-scale text mining resource to event extraction and network construction. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 26-34.
- Provoost T, Moens MF: Detecting Relations in the Gene Regulation Network. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 135-138.
- Claveau V: IRISA participation to BioNLP-ST 2013: lazy-learning and information retrieval for information extraction tasks. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 188-196.
- Žitnik S, Žitnik M, Zupan B, Bajec M: Extracting Gene Regulation Networks Using Linear-Chain Conditional Random Fields and Rules. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 178-187.
- Björne J, Salakoski T: TEES 2.1: Automated Annotation Scheme Learning in the BioNLP 2013 Shared Task. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 16-25.
- Grouin C: Building A Contrasting Taxa Extractor for Relation Identification from Assertions: BIOlogical Taxonomy & Ontology Phrase Extraction System. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 144-152.
- Bannour S, Audibert L, Soldano H: Ontology-based semantic annotation: an automatic hybrid rule-based method. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 139-143.
- Karadeniz I, Özgür A: Bacteria Biotope Detection, Ontology-based Normalization, and Relation Extraction using Syntactic Rules. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 170-177.
- Kim JD, Wang Y, Yasunori Y: The Genia Event Extraction Shared Task, 2013 Edition - Overview. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 8-15.
- Kim JJ, Han X, Lee V, Rebholz-Schuhmann D: GRO Task: Populating the Gene Regulation Ontology with events and relations. Proceedings of the BioNLP Shared Task 2013 Workshop. 2013, 50-57.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.