BIOSMILE: A semantic role labeling system for biomedical verbs using a maximum-entropy model with automatically generated template features

Tsai, Richard Tzong-Han; Chou, Wen-Chi; Su, Ying-Shan; Lin, Yu-Chun; Sung, Cheng-Lung; Dai, Hong-Jie; Yeh, Irene Tzu-Hsuan; Ku, Wei; Sung, Ting-Yi; Hsu, Wen-Lian

doi:10.1186/1471-2105-8-325

Research article
Open access
Published: 01 September 2007

BIOSMILE: A semantic role labeling system for biomedical verbs using a maximum-entropy model with automatically generated template features

Richard Tzong-Han Tsai¹,
Wen-Chi Chou¹,
Ying-Shan Su^1,2,
Yu-Chun Lin¹,
Cheng-Lung Sung¹,
Hong-Jie Dai¹,
Irene Tzu-Hsuan Yeh^1,3,
Wei Ku¹,
Ting-Yi Sung¹ &
…
Wen-Lian Hsu¹

BMC Bioinformatics volume 8, Article number: 325 (2007) Cite this article

20k Accesses
40 Citations
Metrics details

Abstract

Background

Bioinformatics tools for automatic processing of biomedical literature are invaluable for both the design and interpretation of large-scale experiments. Many information extraction (IE) systems that incorporate natural language processing (NLP) techniques have thus been developed for use in the biomedical field. A key IE task in this field is the extraction of biomedical relations, such as protein-protein and gene-disease interactions. However, most biomedical relation extraction systems usually ignore adverbial and prepositional phrases and words identifying location, manner, timing, and condition, which are essential for describing biomedical relations. Semantic role labeling (SRL) is a natural language processing technique that identifies the semantic roles of these words or phrases in sentences and expresses them as predicate-argument structures. We construct a biomedical SRL system called BIOSMILE that uses a maximum entropy (ME) machine-learning model to extract biomedical relations. BIOSMILE is trained on BioProp, our semi-automatic, annotated biomedical proposition bank. Currently, we are focusing on 30 biomedical verbs that are frequently used or considered important for describing molecular events.

Results

To evaluate the performance of BIOSMILE, we conducted two experiments to (1) compare the performance of SRL systems trained on newswire and biomedical corpora; and (2) examine the effects of using biomedical-specific features. The experimental results show that using BioProp improves the F-score of the SRL system by 21.45% over an SRL system that uses a newswire corpus. It is noteworthy that adding automatically generated template features improves the overall F-score by a further 0.52%. Specifically, ArgM-LOC, ArgM-MNR, and Arg2 achieve statistically significant performance improvements of 3.33%, 2.27%, and 1.44%, respectively.

Conclusion

We demonstrate the necessity of using a biomedical proposition bank for training SRL systems in the biomedical domain. Besides the different characteristics of biomedical and newswire sentences, factors such as cross-domain framesets and verb usage variations also influence the performance of SRL systems. For argument classification, we find that NE (named entity) features indicating if the target node matches with NEs are not effective, since NEs may match with a node of the parsing tree that does not have semantic role labels in the training set. We therefore incorporate templates composed of specific words, NE types, and POS tags into the SRL system. As a result, the classification accuracy for adjunct arguments, which is especially important for biomedical SRL, is improved significantly.

Background

The volume of biomedical literature available on the World Wide Web has experienced unprecedented growth in recent years. Processing literature automatically by using bioinformatics tools can be invaluable for both the design and interpretation of large-scale experiments. For this reason, many information extraction (IE) systems that incorporate natural language processing (NLP) techniques have been developed for use in the biomedical field. A key IE task in this field is the extraction of relations between named entities, such as protein-protein and gene-disease interactions.

Many biomedical relation-extraction systems use either cooccurrence statistics or sentence-level methods for relation extraction. Cooccurrence-based approaches extract biomedical relations by first tagging biomedical names and verbs in a text using dictionaries, and then identify cooccurrences of specific names and verbs in phrases, sentences, paragraphs, or abstracts. A variety of statistical tests, such as pointwise mutual information (PMI), the chi-square (x²), and the log-likelihood ratio (LLR) [1], have been used to decide whether a relation expressed by cooccurrences between a given pair really exists [2–6]. Sentence-level methods, on the other hand, usually consider only pairs of entities mentioned in the same sentence [7–9]. To detect and identify a relation, these systems generally use lexico-semantic clues inferred from the sentence context of the entity targets.

When extracting relations from complex natural language texts, both of the above approaches suffer from the same limitation in that they only consider the main relation targets and the verbs linking them. In other words, they frequently ignore phrases describing location, manner, timing, condition, and extent; however, in the biomedical field, these modifying phrases are especially important. Biological processes can be divided into temporal or spatial molecular events, for example activation of a specific protein in a specific cell or inhibition of a gene by a protein at a particular time. Having comprehensive information about when, where and how these events occur is essential for identifying the exact functions of proteins and the sequence of biochemical reactions. Detecting the extra modifying information in natural language texts requires semantic analysis tools.

Semantic role labelling (SRL), also called shallow semantic parsing [10], is a popular semantic analysis technique. In SRL, sentences are represented by one or more predicate-argument structures (PAS), also known as propositions [11]. Each PAS is composed of a predicate (e.g., a verb) and several arguments (e.g., noun phrases) that have different semantic roles, including main arguments such as an agent¹ and a patient², as well as adjunct arguments, such as time, manner, and location. Here, the term argument refers to a syntactic constituent of the sentence related to the predicate; and the term semantic role refers to the semantic relationship between a predicate (e.g., a verb) and an argument (e.g., a noun phrase) of a sentence. For example, in Figure 1, the sentence "IL4 and IL13 receptors activate STAT6, STAT3, and STAT5 proteins in the human B cells" describes a molecular activation process. It can be represented by a PAS in which "activate" is the predicate, "IL4 and IL13 receptors" comprise the agent, "STAT6, STAT3, and STAT5 proteins" comprise the patient, and "in the human B cells" is the location. Thus, the agent, patient, and location are the arguments of the predicate.

Since SRL identifies the predicate and arguments of a PAS, it is also called predicate argument recognition [12]. In the natural language processing field, SRL has been implemented as a fully automatic process that can be operated by computer programs [13]. Given a sentence, the SRL task executes two steps: predicate identification and argument recognition. The first step can be achieved by using a part-of-speech (POS) tagger with some filtering rules. Then, the second step recognizes all arguments, including grouping words into arguments and classifying the arguments into semantic role categories. Some studies refer to these two sub-steps as argument identification and argument classification, respectively [14, 15]. In the second step, it is often difficult to determine the word boundaries and semantic roles of an argument as they depend on many factors, such as the argument's position, the predicate's voice (active or passive) and the sense (usage).

SRL has been applied to information extraction because it can transform different types of surface texts that describe events into PAS'. In the newswire domain, Morarescu et al. [16] showed that, by incorporating semantic role information into an IE system, the F-score of the system can be improved by 15% (from 67% to 82%). This finding motivated us to investigate whether SRL could also facilitate information extraction in the biomedical field. In fact, most of the top open-domain SRL systems use machine-learning-based approaches [17–19]. However, at present, there is no large-scale machine-learning-based biomedical SRL system because of the lack of a sufficiently large annotated corpus for training.

In this paper, we propose an SRL system for the biomedical domain called BIOSMILE (BIOmedical SeMantIc roLe labEler). An annotated corpus and a PAS standard are essential for the construction of a biomedical SRL system. Considering our purpose is to train a machine learning SRL system, we use PropBank [20] and follow its annotation guidelines. Since PropBank must be annotated on a corpus containing full-parsing information (like a treebank, which is a collection of full parsing trees), we use the GENIA corpus, which includes 500 abstracts with full-parsing information. To evaluate SRL for use in the biomedical domain, we started with thirty verbs, which were selected because of their high frequency or important usage in describing molecular events. We employed a semi-automatic strategy using our previously created newswire SRL system SMILE (SeMantIc roLe labEler) [19] to tag a corpus derived from the GENIA corpus, and then asked human annotators with a background in molecular biology to verify the automatically tagged results. The resulting annotated corpus is called BioProp [21]. Lastly, we trained a biomedical version of SMILE on BioProp to construct an SRL system called BIOSMILE for the biomedical domain. To improve BIOSMILE's performance on adjunct arguments, which are phrases indicating the time, location, or manner of an event, we further exploit automatically generated patterns.

The corpus construction process is explained in the Background section, and the construction of our biomedical SRL system is described in the Methods section.

Corpus selection

To construct BioProp, a biomedical proposition bank, we adopted GENIA [22] as the underlying corpus. It is a collection of 2,000 MEDLINE abstracts selected from the search results for queries using the keywords "human", "blood cells", or "transcription factors". GENIA is often used as a biomedical text mining test bed [23]. In its officially released version, it is annotated with various levels of linguistic information, such as parts-of-speech, named entities, and conjunctions. In the summer of 2005, Tateisi [24] published full parsing information for the corpus that basically follows the Penn Treebank II (PTB) annotation scheme [25] encoded in XML. The GENIA corpus annotated with full parsing information is called GENIA Treebank (GTB). Currently, GTB is a beta version containing 500 abstracts.

Verb selection

As noted earlier, we chose thirty verbs because of their high frequency or important usage in describing molecular events. To select the verbs, we calculated the frequency of each verb based on its occurrence in GENIA, our underlying corpus, rather than in MEDLINE. It is noteworthy that some verbs that occur frequently in MEDLINE are rarely found in GENIA. Since we focus on molecular events, only sentences containing protein or gene names are used to calculate a verb's frequency. We listed verbs according to their frequency and removed generally used verbs such as is, have, show, use, do, and suggest. We then selected the verbs with highest frequencies and added some verbs of biological importance. The thirty verbs with their characteristics and frequency of occurrence in BioProp are listed in Table 1.

Table 1 The thirty selected verbs

BIOSMILE: A semantic role labeling system for biomedical verbs using a maximum-entropy model with automatically generated template features

Abstract

Background

Results

Conclusion

Background

Corpus selection

Verb selection

PAS standard – Proposition Bank

Framesets of biomedical verbs

Annotation of BioProp

Inter-annotator agreement

Related work

Results and discussion

Datasets

SMILE and BIOSMILE

Experiment design

Experiment 1: improvement by using biomedical proposition bank

Experiment 2: the effect of using biomedical-specific features

Evaluation metrics

Results

Experiment 1

Experiment 2

Discussion

The influence of sense change on the biomedical and newswire domains

Why NE features are not effective

Performance gained using template features

Conclusion

Methods

Formulation of semantic role labeling

Maximum entropy model

Baseline features

Named entity features

Biomedical template features

Template generation (TG) and filtering

Algorithm 1: Template generation

Applying generated templates

Appendix

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Additional information

Authors' contributions

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us