Overview of the gene regulation network and the bacteria biotope tasks in BioNLP'13 shared task

Bossy, Robert; Golik, Wiktoria; Ratkovic, Zorana; Valsamou, Dialekti; Bessières, Philippe; Nédellec, Claire

doi:10.1186/1471-2105-16-S10-S1

Volume 16 Supplement 10

BioNLP Shared Task 2013: Part 1

Research
Open access
Published: 23 June 2015

BMC Bioinformatics volume 16, Article number: S1 (2015)

Abstract

Background

We present the two Bacteria Track tasks of the BioNLP 2013 Shared Task (ST): Gene Regulation Network (GRN) and Bacteria Biotope (BB). These tasks were previously introduced in the 2011 BioNLP-ST Bacteria Track as Bacteria Gene Interaction (BI) and Bacteria Biotope (BB). The Bacteria Track was motivated by a need to develop specific BioNLP tools for fine-grained event extraction in bacteria biology. The 2013 tasks expand on the 2011 version by better addressing the biological knowledge modeling needs. New evaluation metrics were designed for the new goals. Moving beyond a list of gene interactions, the goal of the GRN task is to build a gene regulation network from the extracted gene interactions. BB'13, like BB'11, is dedicated to the extraction of bacteria biotopes, i.e. bacterial environmental information. BB'13 extends the typology of BB'11 to a large diversity of biotopes, as defined by the OntoBiotope ontology. The detection of entities and events is tackled by distinct subtasks in order to measure the progress achieved by the participant systems since 2011.

Results

This paper details the corpus preparations and the evaluation metrics, and summarizes and discusses the participant results. Five groups participated in each of the two tasks. The high diversity of the participant methods reflects the dynamism of the BioNLP research community.

The highest scores for the GRN and BB'13 tasks are similar to those obtained by the participants in 2011, despite the increase in difficulty. The high density of events in short text segments (multi-event extraction) was difficult for the participating systems in both tasks. The analysis of the BB'13 results also shows that co-reference resolution and entity boundary detection remain major hindrances.

Conclusion

The evaluation results suggest new research directions for the improvement and development of Information Extraction for molecular and environmental biology. The Bacteria Track tasks remain publicly open; the BioNLP-ST website provides an online evaluation service, the reference corpora and the evaluation tools.

Background

Motivation and related work

Large-scale experimental approaches in the field of biology shift the focus of researchers towards transversal questions that involve very diverse biological knowledge. Researchers need new tools to deal with the growing number of relevant publications. The domain of text-mining for biology (BioNLP) develops automatic methods to assist the analysis of knowledge expressed in natural language scientific articles. Periodic shared tasks measure the progress of the community methods by formally comparing the method predictions to a reference annotation on test data [1, 2]. The goals of the shared tasks evolve with the advances in BioNLP, moving towards a better adaptation to the needs of biologists. This is reflected in the diversity of biology questions (e.g. regulation, disease, metabolism, and environment), types of documents (e.g. abstracts, papers, Web pages) and the related biologist research activity (e.g. knowledge curation, system modeling, and data normalization).

The third edition of the BioNLP Shared Task, which took place in 2013 (BioNLP-ST'13), proposed six tasks under the Knowledge Base domain. BioNLP-ST'13 encourages the development of methods that improve the extraction of fine-grained complex events in systematic and concise ways [3]. The common organization of BioNLP-ST'13 includes an official evaluation of the participant systems by an automatic comparison of their predictions on the test sets to reference data. The evaluation took place at a fixed date, after a training period during which participants developed their methods on the reference corpora provided by the task organizers. The two Bacteria Track tasks were organized within this framework.

The creation of the BioNLP Bacteria Track in 2011 [4] followed the LLL initiative in 2005 [5]. It was motivated by research questions brought up by bacteria research that encompass all levels of knowledge from the molecular to the physiological and phenotypic levels. This is exemplified by the GRN and BB tasks.

Gene Regulation Network

Gene regulation networks are key components in the understanding of cell processes and more generally of living organisms. More and more research efforts are being invested in the design of regulation networks for species of all kingdoms [6, 7]. Systems biology aims to integrate the knowledge from heterogeneous sources in consistent predictive models of gene regulation [8, 9]. Beyond experimental data, which is the main source to date, the abundance of regulation descriptions in literature has been a strong motivation for the development of dedicated Information Extraction (IE) methods [5, 10, 11]. The goal of the LLL task was the extraction of binary directed interactions between the agent protein and the target gene from which the regulation network can be derived in a straightforward way. The gene interaction information extraction task has posed a challenge for years due to the spread of information over large collections of scientific data, the complexity of the underlying biological phenomena and the linguistic diversity of the descriptions.

[2] and [12] showed that the use of a fine-grained biological model for the representation of the events facilitates the understanding and validation of the extracted knowledge by the biologists and its integration with other data sources. A fine-grained model was implemented in the Bacteria Gene Interaction (BI) task in 2011. It describes in detail interactions at the biological level and underlying cellular mechanisms at the molecular level [4]. The activation of the transcription of a gene is an example of a biological interaction. The physical binding of a protein to a DNA site is an example of a molecular level phenomenon.

The BI model is formalized by biological entities and n-ary events between entities and events. The biological entities are mainly proteins and genes, their subparts (e.g. site), families (e.g. gene cluster) and aggregates (e.g. protein complex).

The GRN task goes one step beyond BI by making the design of the regulation network its primary goal. It has several benefits compared to the BI text-bound event extraction. Regulation networks are needed by biologists in order to enrich their biological models and to integrate text knowledge with other sources of knowledge. Because the knowledge extracted from text is directly bound to its sources, end-users can assess its quality more easily than that of knowledge produced by other bioinformatics methods such as transcriptomic profile screening [13]. Moreover, the evaluation of regulation network quality better reflects biologists' needs because it abstracts from text-mining peculiarities, such as the linguistic complexity of the text descriptions and information redundancy.

The GRN annotation model is built upon the BI model, and also includes inference rules to automatically deduce the regulation network from text-bound events. The inference rules provide continuity between event-based extractions and the regulation network, so that both types of knowledge can be exploited.

The biological question behind the LLL and BI tasks is the cell process regulation network of the model bacterium Bacillus subtilis (Bs) with a focus on sporulation, one of the most studied developmental processes. This choice was motivated by the abundance of publicly available information in PubMed abstracts and the richness of the biological phenomena described in them.

Bacteria Biotope

The previous work on the Bacteria Biotope task (BB'11) has stressed the importance of microorganism environment information. The formal description of biotopes and their properties is an essential step for the study of interactions between the organisms and their environment. In particular, it is needed in order to correlate genetic specificity to environmental properties and to explain the adaptation of organisms to their habitats and their evolution. The application domains of this fundamental research are broad, from the health of humans, plants, and animals, to food processes including plant growth enhancement [4]. Biotope descriptions are abundant in scientific documents, but they cannot be used as such for biological studies. Their form is extremely variable: the biotope descriptions may be very complex from a linguistic point of view, including many embedded biotope names and properties. A normalization of biotope descriptions using a reference is required for their comparison. The extraction of the relations between the organism and the biotope entities is also difficult to automate due to the abundance of entity mentions in short spans of text. This motivated the organization of the Bacteria Biotope Task in 2011. The goal of BB'11 was the identification and categorization of bacterial habitat entities in natural language texts, their linking to their subparts (i.e. part-of event) and their linking to the bacteria that live there (i.e. localization event). The good results of the participants on BB'11 demonstrate the feasibility of this IE task.

Since 2011, the development of new sequencing techniques has had a major impact on the field of metagenomics. Metagenomics studies microorganism sequences in their environments, thus avoiding strain cultivation. The number of metagenomics studies has grown exponentially in the last few years. This has resulted in a considerable increase in the diversity of the microorganisms that can be studied. This creates a need for information extraction tools that can automatically analyze the biotope descriptions of microorganisms, so that biotopes and genes from different metagenomics experiments can be compared on a large scale.

The categorization of biotopes is a form of normalization that is necessary for the generalization of biotope observations. BB'11 defined seven broad biotope categories that were a priori considered as relevant for biological studies. Participant methods had to categorize the extracted biotope mentions according to these categories. The limited number of categories affects the ability of bioinformatics methods to find useful correlations between gene sequences and biotopes. This motivates the use of a large set of categories organized in a hierarchical structure. Moreover, [14] has shown that the lexical information contained in ontologies can make the task easier.

OntoBiotope is an ontology of microorganism habitats [15]. Its modeling principle and its lexicon reflect the usual biotope classification used by biologists to describe microorganism isolation sites (e.g. GenBank, GOLD, EnvO) [16–18]. OntoBiotope is developed and maintained by the Meta-omics of Microbial Ecosystems (MEM) network in which 30 microbiologists from INRA (French National Institute for Agricultural Research) from all fields of applied microbiology participate. The relevance of OntoBiotope terms has been evaluated through the PubMedBiotope semantic search engine [19]. It identifies and categorizes biotopes in a collection of 600 000 PubMed abstracts by applying the ToMap method (Text to Ontology Mapping) [20] to the OntoBiotope ontology. This suggests that the ontology is fully appropriate as a new fine-grained categorization plan for the BB'13 task.

The BB'11 corpus is a collection of encyclopedia-like web pages. They are comprehensible to non-biologists and share many linguistic characteristics with scientific articles. The limited size of encyclopedic web pages, compared to full research articles, is appropriate for a first attempt at a novel task, while preserving the generalization of trained IE systems.

Methods

This section describes the corpora features, the event representation and the evaluation metrics for the two tasks. More details and examples can be found on the task website [21] and the ACL BioNLP Shared Task articles that are devoted to the two tasks [22, 23].

Gene Regulation Network

Biological and molecular representation in GRN

The goal of the GRN task is the extraction of a regulation network from text. The network is represented by a directed graph where the nodes are the genes and the arcs represent the interactions between them. Biological studies qualify the kind of interaction between biological entities according to the effect of the agent on the target, or to the mechanism by which the agent regulates the target. We defined six interaction types for the GRN regulation network representing the whole range of effect and mechanism regulation types [22]. The effect can be either Activation (positive regulation), Inhibition (negative regulation), or Requirement (the agent is necessary). Additionally, regulations can either be direct, which means that the agent protein physically interacts with the target gene, or indirect, which means that the regulation may be the result of a cascade of interactions. Direct regulations are particularly informative; the literature describes direct regulations at the molecular mechanism level as either physical Binding of a protein to DNA, or as the effect of a protein on gene Transcription as represented in Figure 1.
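The network representation described above can be pictured as a small data structure. The following Python sketch is only an illustration and is not part of the task tools: the class and method names are assumptions, and geneA and geneB are placeholder names. The arc labels are the five types named above plus the generic Regulation type that the text later refers to as Interaction:Regulation.

```python
from dataclasses import dataclass
from enum import Enum


class InteractionType(Enum):
    """Arc labels: the five types named above plus the generic Regulation type."""
    REGULATION = "Regulation"
    ACTIVATION = "Activation"
    INHIBITION = "Inhibition"
    REQUIREMENT = "Requirement"
    BINDING = "Binding"
    TRANSCRIPTION = "Transcription"


@dataclass(frozen=True)
class Arc:
    """A directed, typed arc from an agent gene to a target gene."""
    agent: str
    target: str
    kind: InteractionType


class RegulationNetwork:
    """Directed graph whose nodes are genes and whose arcs are typed interactions."""

    def __init__(self):
        self.genes = set()
        self.arcs = set()

    def add_arc(self, agent, target, kind):
        # Adding an arc also registers both of its end points as nodes.
        self.genes.update((agent, target))
        self.arcs.add(Arc(agent, target, kind))

    def targets_of(self, agent):
        """Return the sorted (target, type) pairs regulated by the given agent."""
        return sorted((a.target, a.kind.value) for a in self.arcs if a.agent == agent)


# geneA and geneB are placeholder names, not genes taken from the GRN corpus.
network = RegulationNetwork()
network.add_arc("geneA", "geneB", InteractionType.TRANSCRIPTION)
network.add_arc("geneB", "geneA", InteractionType.INHIBITION)
print(network.targets_of("geneA"))
```

Storing the arcs in a set means that the same typed arc between the same two genes is kept only once, which matches the idea of a network as opposed to a list of text-bound mentions.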

Molecular mechanisms are frequently detailed in the literature on bacteria and are very useful to determine the nature of the regulation. Moreover, the events involve not only proteins and genes, but also parts of them, families, complexes, or even cascades of nested events.

The six types of gene interactions are thus not sufficient to represent the whole complexity of interaction descriptions: a more comprehensive annotation model is necessary. It should allow the biologists to annotate the corpus by ensuring that entities, relations and events map easily to text elements and have an unequivocal biological interpretation. It must also be flexible enough to contend with linguistic phenomena like ellipsis and metonymy. The GRN annotation model meets these requirements. It accurately represents biological concepts and phenomena and is suited for text annotations, from the most generic indirect interactions (e.g. "X inhibits the expression of Y"), to the most detailed descriptions of physical interactions. The text annotation model comprises two levels: the biological level and the molecular level. The biological level includes (1) the same interaction type events as the regulation network, (2) transcription events (Transcription by and Transcription from) and (3) regulon membership events (Member/Master of Regulon). The interaction type events will be denoted by Interaction:type in the following, as in Interaction:Regulation. Network inference rules automatically and directly infer network arcs and nodes from text annotations. In cases where the interaction arguments are not genes or proteins, but nested events, the events are reduced to the participating genes. Example 1 of Figure 2 illustrates a nested event and Example 3 shows the resulting Regulation arc.
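The reduction of nested events to their participating genes can be sketched as follows. This is a simplified illustration under stated assumptions, not the formal rule set of [22]: the Event class, the function names and the rule "collect every gene reachable inside a nested argument" are hypothetical, and geneX and geneY are placeholders standing for X and Y in "X inhibits the expression of Y".

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Event:
    """A text-bound event whose arguments are entity names or nested events."""
    kind: str
    agent: Union[str, "Event", None]
    target: Union[str, "Event", None]


def genes_in(argument):
    """Return the genes reachable from an argument, descending into nested events."""
    if argument is None:
        return []
    if isinstance(argument, str):
        return [argument]
    return genes_in(argument.agent) + genes_in(argument.target)


def infer_arcs(event):
    """Reduce both arguments to genes and emit one typed arc per (agent, target) pair."""
    return {
        (agent, target, event.kind)
        for agent in genes_in(event.agent)
        for target in genes_in(event.target)
    }


# "X inhibits the expression of Y": the target of the inhibition is itself an event.
expression_of_y = Event("Transcription", agent=None, target="geneY")
inhibition = Event("Inhibition", agent="geneX", target=expression_of_y)
arcs = infer_arcs(inhibition)
print(arcs)
```

In this sketch the arc keeps the type of the outer event, and the nested transcription event contributes only its gene.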

The molecular level of the annotation model represents the role of the promoter of the regulated gene and the binding of the protein on the promoter as illustrated by Example 2 of Figure 2. The model defines a Master of Promoter event that relates the binding protein to the promoter and more generally the Bind to event relates the binding protein to any site. The model also defines a Promoter of event that relates the promoter to its gene and more generally the Site of event relates any binding site to its DNA region (gene or promoter).

Inference rules derive the Interaction arcs of the network and their types from these low-level molecular events. Inference is done in two steps: (i) inference of biological annotations from the molecular annotations, and (ii) inference of the network arcs from the biological annotations. Example 2 of Figure 2 illustrates the inference of biological annotations from molecular annotations. Example 3 shows the result of the inference of the network arc Interaction:Binding.
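The two inference steps can be sketched in Python as a composition of relations. The single rule used here, which chains a Bind to event with a Site of event to obtain a Binding interaction, is an assumption made for illustration; the actual rules are specified in [22]. The tuple layout, the function names and the entity names proteinP, siteS and geneG are all hypothetical.

```python
def infer_biological(molecular_events):
    """Step (i): join Bind to(protein, site) with Site of(site, region)."""
    regions_of_site = {}
    for kind, source, dest in molecular_events:
        if kind == "Site of":
            regions_of_site.setdefault(source, []).append(dest)
    biological = []
    for kind, source, dest in molecular_events:
        if kind == "Bind to":
            for region in regions_of_site.get(dest, []):
                biological.append(("Interaction:Binding", source, region))
    return biological


def infer_network_arcs(biological_events):
    """Step (ii): turn Interaction:type events into (agent, target, type) arcs."""
    return {
        (agent, target, kind.split(":", 1)[1])
        for kind, agent, target in biological_events
        if kind.startswith("Interaction:")
    }


# Placeholder molecular annotations: a protein binds a site that belongs to a gene.
molecular = [
    ("Bind to", "proteinP", "siteS"),
    ("Site of", "siteS", "geneG"),
]
biological = infer_biological(molecular)
arcs = infer_network_arcs(biological)
print(biological)
print(arcs)
```

The sketch stops at one level of chaining. A site located on a promoter would need a further join with the Promoter of event to reach the gene, which is left out here for brevity.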

[22] formally specifies the annotation model and the inference rules that produce the regulation network from the text annotations. The specifications were made available on the GRN BioNLP-ST'13 webpage, together with a tool for checking the predicted events against the annotation model and for inferring the network from these predictions.

Information extraction challenges in the GRN task

The GRN corpus was designed following the BioNLP annotation standard [3]. It was selected from PubMed abstracts on Bacillus subtilis transcription. Altogether, the information represented by the corpus had to be sufficient to build a regulation network centered on the sporulation of Bs.

The annotation model is based on the Bacteria Genic Interaction (BI) model proposed in BioNLP-ST 2011. The manual annotation of the whole GRN corpus confirmed that it captures all of the descriptions of genic interactions without ambiguity. The regulation network inferred from the annotations was checked against state-of-the-art knowledge [24, 25] with a focus on the sporulation of Bacillus subtilis [26]. Its formal validity was checked by applying the inference rules to the corpus annotations.

Unlike most task corpora, the GRN corpus is a set of sentences isolated from PubMed abstracts. This was done for two reasons. First, isolated sentences provide all the regulation network information. Second, the prediction of the correct relations among the entities within the sentences is challenging in itself, as previously demonstrated by the LLL and Bacteria Genic Interaction (BI) tasks. This challenge is the result of a high number of entities (almost 5 per sentence on average), their diversity (11 types) and the diversity of the events (15). The sentences are provided with the gold entities (genes, proteins, promoters, etc.) and their text span, allowing the participant methods to focus on relation extraction.

The GRN corpus was split into the training, development and test sets, ensuring that the distribution of event types and entities in the training and development sets was representative of the test set. The molecular level annotations account for 60% of the annotations, and the biological level interactions for 40%. At the biological level, Transcription and Regulation interactions combined account for half of the interactions.

The small number of arcs in the network compared to the number of events is due to two factors (Table 1). First, some regulations are repeated in the text and represented by a single arc in the network. Second, some of the network regulations are inferred from several molecular events.

Table 1 Figures of the GRN corpus.
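The first factor behind the small number of arcs, the repetition of the same regulation in several sentences, can be illustrated with a short Python sketch. The data below are placeholders and not figures from Table 1, and the function name and tuple layout are assumptions. Grouping events by arc also keeps the list of source sentences for each arc, in line with the idea that extracted knowledge stays bound to its textual sources.

```python
from collections import defaultdict


def group_by_arc(events):
    """Map each distinct (agent, target, type) arc to the sentences that support it."""
    evidence = defaultdict(list)
    for agent, target, kind, sentence_id in events:
        evidence[(agent, target, kind)].append(sentence_id)
    return dict(evidence)


# Placeholder events: the same activation is stated in two different sentences.
events = [
    ("geneA", "geneB", "Activation", "s1"),
    ("geneA", "geneB", "Activation", "s7"),
    ("geneC", "geneB", "Inhibition", "s3"),
]
evidence = group_by_arc(events)
print(len(events), "events,", len(evidence), "arcs")
```

Here three events collapse into two arcs, and the arc from geneA to geneB is supported by two sentences.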
