Semi-automatic conversion of BioProp semantic annotation to PASBio annotation
© Tsai et al. 2008
Published: 12 December 2008
Semantic role labeling (SRL) is an important text analysis technique. In SRL, sentences are represented by one or more predicate-argument structures (PAS). Each PAS is composed of a predicate (verb) and several arguments (noun phrases, adverbial phrases, etc.) with different semantic roles, including main arguments (agent or patient) as well as adjunct arguments (time, manner, or location). PropBank is the most widely used PAS corpus and annotation format in the newswire domain. In the biomedical field, however, more detailed and restrictive PAS annotation formats such as PASBio are popular. Unfortunately, due to the lack of an annotated PASBio corpus, no publicly available machine-learning (ML) based SRL systems based on PASBio have been developed. In previous work, we constructed a biomedical corpus based on the PropBank standard called BioProp, on which we developed an ML-based SRL system, BIOSMILE. In this paper, we aim to build a system to convert BIOSMILE's BioProp annotation output to PASBio annotation. Our system consists of BIOSMILE in combination with a BioProp-PASBio rule-based converter, and an additional semi-automatic rule generator.
Our first experiment evaluated our rule-based converter's performance independently of BIOSMILE's performance. The converter achieved an F-score of 85.29%. The second experiment evaluated the combined system (BIOSMILE + rule-based converter), which achieved an F-score of 69.08% for PASBio's 29 verbs.
Our approach allows PAS conversion between BioProp and PASBio annotation using BIOSMILE alongside our newly developed semi-automatic rule generator and rule-based converter. Our system can match the performance of other state-of-the-art domain-specific ML-based SRL systems and can be easily customized for PASBio application development.
The amount of biomedical literature available online continues to grow rapidly today, creating a need for automatic processing using bioinformatics tools. Many information extraction (IE) systems incorporating natural language processing (NLP) techniques have been developed for use in the biomedical field. A key IE task in this field is the extraction of relations between named entities (NEs), such as protein-protein and gene-disease interactions.
An important preliminary task in SRL is to define the set of possible semantic roles for each verb sense, referred to as a roleset. A roleset can be paired with a set of syntactic frames that shows all the acceptable syntactic expressions of those roles; this pairing is called a frameset. In 2000, the Proposition Bank project (PropBank) published a guide, PropBank I [4, 5], which defined a format for PAS annotation. Alongside PropBank I, the project also released a corpus of PAS's for 3,325 verbs in the newswire domain to facilitate ML-based SRL system development. The semantic arguments of individual verbs in the PropBank I annotation are numbered from 0. For a specific verb, Arg0 is usually the argument corresponding to the agent, while Arg1 usually corresponds to the patient. However, higher-numbered arguments, which account for about 10% of all arguments, have no consistent role definitions. In addition to numbered arguments, there are also ArgMs, which annotate modifiers. (Detailed descriptions of all semantic role argument categories can be found in Additional file 1.) The semi-regular and flexible assignment of numbered arguments to semantic roles found in PropBank I facilitates formulation of the SRL task as a classification problem for machine-learning (ML) based systems. That is, given a phrase, the sentence containing it, and the predicate, a system must classify the phrase's semantic role with respect to the predicate.
Table 1: Frameset of the verb "delete" in PropBank I and PASBio
//mutation, alternative splicing//
thing being removed
entity being removed
//exon, gene, chromosomal region, cell//
As you can see in Table 1, the agent is defined as "entity removing", and the patient is defined as "thing being removed" in PropBank I. However, in certain biomedical events, a developer might want to limit the agent to being a certain causal mechanism such as a mutation or alternative splicing and the patient to being an "exon, gene, chromosomal region, [or] cell".
An alternative to PropBank, the PASBio project provides more detailed and restrictive framesets for 29 biomedical verbs. The well-known biomedical text mining researchers Cohen and Hunter have found the PASBio annotation viable for representing the PAS's of biomedical verbs. Several applications have been developed based on PASBio or following its spirit. For example, Shah et al. used the frameset definitions of PASBio to construct semantic patterns that can extract information about tissue-specific gene expression from biomedical literature. Later, Shah and Bork applied this approach to construct the LSAT (Literature Support for Alternative Transcripts) database system. Kogan et al. followed the PASBio annotation to build a domain-specific set of PAS's for the medical domain, which successfully extended PASBio to clinical texts. All these systems mainly use handcrafted rules to identify and classify arguments into semantic roles.
Unfortunately, due to the lack of an annotated corpus and inconsistent definitions between specific numbered arguments, no publicly available ML-based SRL systems based on the PASBio standard have been developed.
To be able to apply ML to the biomedical SRL problem, we constructed a biomedical-domain-specific proposition bank based on the more consistent PropBank I annotation format. The project, BioProp, defined roles for 30 common biomedical verbs and provided an annotated corpus on which we developed an ML-based SRL system, BIOSMILE. This work was expanded upon with the release of our web-based search application, BIOSMILE web search, in February 2008.
In this paper, we aim to build a bridge between BioProp and PASBio to facilitate PASBio-based SRL system development. Using our system, one will first be able to roughly classify arguments' semantic roles according to BioProp, and then translate the PAS's into PASBio annotation using a rule-based converter.
The approaches applied in this work include: (1) named entity tagging, (2) semantic role labeling following BioProp's annotation format, and (3) rule-based conversion from BioProp to PASBio annotation.
According to our observations, some BioProp arguments are equivalent to PASBio arguments only under certain conditions, usually defined as the presence of a certain named entity (NE) in a certain argument. For example, Arg1 of the verb "express" must be a gene or gene product in PASBio. Therefore, it is necessary to first tag all NEs in the sentences. To do this, we employ our previously developed NE recognition software, NERBio [16, 17], to tag five NE types: protein, DNA, RNA, cell line, and cell type. We use a dictionary to find other NE types, such as exon and intron.
Before conversion to the PASBio annotation format, a fundamental step is to identify the PAS's of each sentence and annotate them using the BioProp format. Here, we briefly introduce how we constructed the BioProp-based SRL system, BIOSMILE, used for this task.
The first step was to construct a training corpus. In our previous work, Chou et al., we annotated PAS's in GENIA's corpus of full parse trees, the GENIA Treebank (GTB), using PropBank I framesets. We then defined and added framesets for biomedical verbs to fit specific usages in biomedical literature. However, all the new and modified framesets still conform strictly to the PropBank annotation format. A total of 2,304 PAS's were annotated for 49 biomedical verbs.
The second step was to formulate the SRL problem as an ML-based sentence tagging problem. The basic units of a sentence can be words, phrases, or constituents (nodes on a full parse tree). Punyakanok et al. have shown that constituent-by-constituent (C-by-C, or node-by-node) tagging is the best formulation for the SRL problem; therefore, we adopted this formulation.
Finally, we constructed a biomedical full parser, based on the Charniak parser with GTB as its training data, which can automatically generate parse trees for sentences. Its performance is reported in Additional file 1.
Using BioProp as the training corpus, the C-by-C formulation, and the parse trees generated by our biomedical full parser, we then constructed our SRL system, BIOSMILE, following the maximum entropy ML model. Details of the features used in our SRL system can be found in our previous work.
There are two main differences between BioProp and PASBio frameset annotations: (1) PASBio developers usually define framesets to represent specific biological events. Therefore, for each argument, it is necessary to include information in addition to its semantic role, such as whether the argument should be a specific NE or contain specific keywords. (2) The order of arguments for a given verb sense in a BioProp frameset may not match that in the corresponding PASBio frameset. To deal with these two differences, we build conversion rules verb by verb using our semi-automatic rule-generation tool; each rule describes the conditions under which a mapping is valid. The algorithm used by the rule generator compares corresponding framesets for a given verb sense, checks each argument in the PASBio frameset, and tries to find an argument in the BioProp frameset that has the same semantic role under a set of conditions. When a match is found, the algorithm creates a link between the two frameset arguments, which includes a description of the required conditions (NEs and keywords).
Each conversion rule consists of two elements: a predicate and one or more transformations. The predicate is the target verb. The first part of each transformation is the condition, which specifies the criteria that the arguments should satisfy. These criteria are defined as the composition of one or more logical predicates, which are joined by logical operators such as AND and OR. The two most common logical predicates are ContainsNE(ne) and ContainsKeywords(kw). The former is true if the argument contains at least one instance of the NE type ne. The latter is true if the argument contains at least one specified keyword kw. If there are no conditions for a transformation, this part can be omitted.
The second part is the mapping between a BioProp argument and a PASBio argument. The mapping consists of three elements: the source argument, an arrow "→", and the destination argument. For example, the transformation in Figure 3 defines a mapping from ArgM-LOC to Arg3. All the arguments that are not defined in the transformation source field are dropped.
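The rule structure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the converter's actual implementation: the names Argument, Transformation, ConversionRule and convert are hypothetical, the example verb and arguments are made up for illustration, and keyword matching is assumed to be a case-insensitive substring test.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Argument:
    """A labeled argument phrase (hypothetical structure)."""
    role: str            # e.g. "Arg1" or "ArgM-LOC"
    text: str            # the words of the argument
    ne_types: Set[str]   # NE types tagged inside the argument


Condition = Callable[[Argument], bool]


def contains_ne(ne: str) -> Condition:
    """True if the argument contains at least one NE of type `ne`."""
    return lambda arg: ne in arg.ne_types


def contains_keywords(*keywords: str) -> Condition:
    """True if the argument contains at least one of the keywords
    (assumption: case-insensitive substring match)."""
    return lambda arg: any(kw.lower() in arg.text.lower() for kw in keywords)


@dataclass
class Transformation:
    source: str                             # BioProp argument label
    destination: str                        # PASBio argument label
    condition: Optional[Condition] = None   # None means "no condition"


@dataclass
class ConversionRule:
    predicate: str                          # the target verb
    transformations: List[Transformation]


def convert(rule: ConversionRule, verb: str, args: List[Argument]) -> List[Argument]:
    """Relabel BioProp arguments as PASBio arguments.
    Arguments with no applicable transformation are dropped."""
    if verb != rule.predicate:
        return []
    converted = []
    for arg in args:
        for t in rule.transformations:
            if t.source == arg.role and (t.condition is None or t.condition(arg)):
                converted.append(Argument(t.destination, arg.text, arg.ne_types))
                break
    return converted


# Illustrative rule: Arg1 is kept only if it contains a protein,
# and ArgM-LOC is mapped to Arg3 unconditionally.
rule = ConversionRule("express", [
    Transformation("Arg1", "Arg1", contains_ne("protein")),
    Transformation("ArgM-LOC", "Arg3"),
])
```

Composite conditions (AND, OR) could be built by combining these callables, for example `lambda a: c1(a) and c2(a)`.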
As shown in Figure 3, the condition of the transformation "ARG1 → ARG1" is ContainsNE("protein"), meaning that the mapping ARG1 → ARG1 holds only if ARG1 contains at least one protein. For a case in which arguments match, such as that in Figure 3, the conversion rules can be automatically generated as follows:
1. For each argument pair (argument_B, argument_P), if the argument phrase does not contain any recognized NEs, a simple rule is generated in the argument's "Rule Candidates" field: argument_B → argument_P
2. If the argument contains recognized NE types (NE_type), they become the conditions imposed on the argument, and the following rule type is generated: ContainsNE(NE_type) ? argument_B → argument_P
Users can modify the generated rules by editing the "Rule Candidates" field.
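The two generation steps above can be sketched as a small function that emits a rule candidate as text. The function name generate_rule_candidate is hypothetical, and joining several NE types with AND is an assumption; the text only states that recognized NE types become conditions on the argument.

```python
from typing import Set


def generate_rule_candidate(bioprop_role: str, pasbio_role: str,
                            ne_types: Set[str]) -> str:
    """Build the text of a rule candidate for one matched argument pair."""
    mapping = f"{bioprop_role} → {pasbio_role}"
    if not ne_types:
        # Step 1: no recognized NEs, so the rule is an unconditional mapping.
        return mapping
    # Step 2: each recognized NE type becomes a ContainsNE condition.
    # Assumption: multiple NE types are combined with AND.
    condition = " AND ".join(f'ContainsNE("{ne}")' for ne in sorted(ne_types))
    return f"{condition} ? {mapping}"


print(generate_rule_candidate("Arg0", "Arg0", set()))
print(generate_rule_candidate("Arg1", "Arg1", {"protein"}))
```

Emitting rules as editable text mirrors the "Rule Candidates" field, where the user can loosen or tighten a generated condition by hand.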
A simplified bracket form for the parse tree shown in Figure 4, with some internal bracket divisions omitted for clarity: (NP (NP (Two equally abundant mRNAs for il8ra)) (,) (NP (2.0 and 2.4 kilobases in length))).
Each constituent and its daughters are enclosed with brackets. If we replace constituent words in the phrase with a wildcard symbol "(.*)", the above bracket form becomes:
(NP (NP (.*)) (.*) (NP (.*)))
We can then use the bracket form as a pattern to match parse trees with the same structures.
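As a rough illustration of this step, the sketch below turns a bracket form into a wildcard pattern and tests another bracket form against it. It is a simplification under stated assumptions: the function names are hypothetical, the words of a phrase are assumed to sit in the innermost bracket groups, and matching is done with a regular expression rather than on real parse tree objects.

```python
import re

# An innermost bracket group: parentheses with no nested parentheses inside.
LEAF = re.compile(r"\([^()]*\)")


def to_wildcard_pattern(bracket_form: str) -> str:
    """Replace the words of every innermost constituent with the wildcard (.*)."""
    return LEAF.sub("(.*)", bracket_form)


def matches(pattern: str, bracket_form: str) -> bool:
    """True if `bracket_form` has the same structure as `pattern`.
    Each (.*) in the pattern stands for one innermost group of words."""
    literal_parts = [re.escape(part) for part in pattern.split("(.*)")]
    regex = r"\([^()]*\)".join(literal_parts)
    return re.fullmatch(regex, bracket_form) is not None


form = ("(NP (NP (Two equally abundant mRNAs for il8ra)) (,) "
        "(NP (2.0 and 2.4 kilobases in length)))")
pattern = to_wildcard_pattern(form)
print(pattern)
print(matches(pattern, "(NP (NP (Two transcripts)) (,) (NP (1.0 kb)))"))
```

Restricting each wildcard to a single innermost group keeps the match structural: a phrase with a different nesting of constituents does not match even if its words are similar.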
To make these patterns more precise, we can add restrictions on the phrase constituents, such as limiting their semantic roles, head words, and the head words' UPENN POS tags. To restrict a constituent's semantic role, one inserts a hyphen followed by the semantic role after the constituent type. For example, (NP) might become (NP-Arg1). The head word can be defined as the most important word in a constituent, and we identify it using Collins' rule-based method. Head words of constituents are marked with an at sign (@) followed by the head word, e.g. (NP@kilobase). The UPENN POS of the head word is placed directly after it, separated by a forward slash, e.g. (NP@kilobase/NNS). If we combine the above examples, we obtain the pattern "(NP-Arg1@mRNA/NNS (NP@mRNA/NNS (.*)) (NP@kilobase/NNS (.*)))", where the outer NP must be Arg1, and the inner NPs' head words must be "mRNA" and "kilobase", each with the POS "NNS".
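The label notation can be illustrated with a small parser that splits a constituent label into its type, semantic role, head word and POS, and checks whether a parse tree node satisfies a pattern label. This is a hedged sketch: the names NodeLabel, parse_label and satisfies are hypothetical, and the regular expression assumes constituent types contain only letters.

```python
import re
from typing import NamedTuple, Optional


class NodeLabel(NamedTuple):
    ctype: str            # constituent type, e.g. "NP"
    role: Optional[str]   # semantic role after "-", e.g. "Arg1"
    head: Optional[str]   # head word after "@", e.g. "mRNA"
    pos: Optional[str]    # head word POS after "/", e.g. "NNS"


LABEL = re.compile(
    r"^(?P<ctype>[A-Za-z]+)"      # constituent type (assumed letters only)
    r"(?:-(?P<role>[^@/]+))?"     # optional -Role
    r"(?:@(?P<head>[^/]+))?"      # optional @head
    r"(?:/(?P<pos>.+))?$"         # optional /POS
)


def parse_label(label: str) -> NodeLabel:
    m = LABEL.match(label)
    if m is None:
        raise ValueError(f"not a valid constituent label: {label!r}")
    return NodeLabel(m["ctype"], m["role"], m["head"], m["pos"])


def satisfies(pattern: NodeLabel, node: NodeLabel) -> bool:
    """Every restriction stated in the pattern must hold for the node;
    fields the pattern leaves out (None) are unrestricted."""
    return all(p is None or p == n for p, n in zip(pattern, node))


node = parse_label("NP-Arg1@mRNA/NNS")
print(node)
print(satisfies(parse_label("NP-Arg1"), node))
```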
In our notation, a rule will appear as follows:
BracketFormPattern(x) ? C_0 → argument_0, C_1 → argument_1, ..., C_i → argument_i, ..., C_k → argument_k;
"BracketFormPattern" is a logical predicate which means the source argument, argument_s, must match the bracket form pattern x for the transformations "C_i → argument_i" to occur, where C_i is any constituent of a source argument annotated by PASBio.
In the example in Figure 4 for the verb "express", "ARG1" in the BioProp column does not directly match any one PASBio argument, but instead overlaps two arguments, Arg1 and Arg2. The rule-generation algorithm first generates two bracket forms for the unmatched noun phrase "Two equally abundant mRNAs for il8ra, 2.0 and 2.4 kilobases in length", one for the "BioProp" column and the other for the "PASBio" column:
(NP-Arg1@mRNA/NNS (NP@mRNA/NNS (.*)) (.*) (NP@kilobase/NNS (.*)))
(NP@mRNA/NNS (NP-Arg1@mRNA/NNS (.*)) (.*) (NP-Arg2@kilobase/NNS (.*)))
Then, the first bracket form is merged with the second one as follows:
(NP-Arg1@mRNA/NNS (NP-C_0@mRNA/NNS (.*)) (.*) (NP-C_1@kilobase/NNS (.*)))
As can be seen in the merged bracket form, all the PASBio constituents annotated with semantic roles are represented by variables C_i. For example, Arg1 becomes C_0.
Finally, the following three rules are automatically generated in the "Rule Candidates" field:
BracketFormPattern("(NP-Arg1 (NP-C_0 (.*)) (.*) (NP-C_1 (.*)))") ? C_0 → Arg1, C_1 → Arg2
BracketFormPattern("(NP-Arg1@mRNA/ (NP-C_0@mRNA/ (.*)) (.*) (NP-C_1@kilobase/ (.*)))") ? C_0 → Arg1, C_1 → Arg2
BracketFormPattern("(NP-Arg1@/NNS (NP-C_0@/NNS (.*)) (.*) (NP-C_1@/NNS (.*)))") ? C_0 → Arg1, C_1 → Arg2
The first rule is the loosest, only considering the parse tree structure and SRL tags. The second also considers the head word, and the third adds POS information as well. The user can check these rule candidates, and remove or modify the inappropriate ones.
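One way to derive rule candidates of different strictness from a single merged bracket form is to selectively blank out the head word and POS annotations. The sketch below is an assumption about how this could be done, not the rule generator's actual code: the function relax and its flags are hypothetical, and the merged form is written with C_0 and C_1 for the constituent variables.

```python
import re

# Matches the "@head/POS" part of a constituent label.
ANNOTATION = re.compile(r"@([^/\s()]*)/([^\s()]*)")


def relax(pattern: str, keep_head: bool, keep_pos: bool) -> str:
    """Drop head word and/or POS restrictions from a bracket form pattern."""
    def replace(match: re.Match) -> str:
        if not keep_head and not keep_pos:
            return ""  # structure and SRL tags only
        head = match.group(1) if keep_head else ""
        pos = match.group(2) if keep_pos else ""
        return f"@{head}/{pos}"
    return ANNOTATION.sub(replace, pattern)


merged = ("(NP-Arg1@mRNA/NNS (NP-C_0@mRNA/NNS (.*)) (.*) "
          "(NP-C_1@kilobase/NNS (.*)))")

candidates = [
    relax(merged, keep_head=False, keep_pos=False),
    relax(merged, keep_head=True, keep_pos=False),
    relax(merged, keep_head=False, keep_pos=True),
]
for candidate in candidates:
    print(f'BracketFormPattern("{candidate}") ? C_0 → Arg1, C_1 → Arg2')
```

Generating several candidates from one merged form lets the user pick the level of restriction that generalizes best, instead of writing each variant by hand.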
The frameset of the verb "express" in BioProp and PASBio
causer of expression
named entity being expressed
//gene or gene products//
property of the existing named entity [Arg1]
location referring to organelle, cell or tissue
The training data of our SRL system, BIOSMILE, is an extended version of BioProp. A total of 2,304 PAS's were annotated for 49 biomedical verbs. To evaluate BIOSMILE, the rule-based converter and the combined system, our in-lab biologists re-annotated the 313 annotated sentences available on PASBio's website according to the BioProp annotation format. The dataset from PASBio's website is hereafter referred to as PASBioP, and the PASBioP dataset annotated using the BioProp format is referred to as PASBioB.
For SRL and conversion evaluation, the official CoNLL-2004 SRL evaluation script was used.
We followed the same experimental procedure that we used in our previous work to evaluate BIOSMILE performance on the extended BioProp dataset, details about which can be found in Additional file 1. The average results were an F-score of 72.67%, a precision of 81.72% and a recall of 65.42%.
To evaluate the actual performance on arbitrary sentences and verbs, we used PASBioB as an extra test set. BIOSMILE achieved an overall F-score of 67.31%, a precision of 76.28% and a recall of 60.22%. (More detailed performance data for each argument type can be found in Additional file 1.) The drop in BIOSMILE's performance on PASBioB may be caused by the following factor: even though BioProp contains all PASBio verbs, it contains very few PAS's for some verbs, which likely decreases the accuracy of ML-based SRL on those verbs. For example, there is only one PAS for "splice" and two for "begin".
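The F-scores reported above for BIOSMILE are consistent with the balanced F-score, the harmonic mean of precision and recall. The short check below assumes that this is the measure being reported; the function name f_score is ours.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Extended BioProp dataset: precision 81.72%, recall 65.42%
print(round(f_score(81.72, 65.42), 2))  # 72.67

# PASBioB test set: precision 76.28%, recall 60.22%
print(round(f_score(76.28, 60.22), 2))  # 67.31
```

In both settings precision is markedly higher than recall, so the F-score is pulled toward the lower recall figure.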
We conducted two experiments: the first tested the BioProp-PASBio converter independently of BIOSMILE's SRL performance, and the second evaluated the performance of the combined system. For both, 3-fold cross validation (CV) was applied, which involved partitioning the PASBioP dataset into three subsets. A single subset is retained as the test data, and the remaining two subsets are used as training data for generating conversion rules. The CV process is repeated three times, with each subset being used as the test set exactly once.
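The cross-validation procedure described above can be sketched as follows. The round-robin assignment of sentences to folds is an assumption made for illustration, since the partitioning method is not specified, and the function name k_fold_splits is hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 3) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (training, test) pairs so that each fold is the test set once."""
    # Assumption: items are dealt into folds round-robin.
    folds = [list(items[i::k]) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# 313 sentence identifiers stand in for the PASBioP sentences.
sentences = list(range(313))
for train, test in k_fold_splits(sentences, k=3):
    # Conversion rules would be generated from `train` and evaluated on `test`.
    print(len(train), len(test))
```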
Rule-based converter performance (on PASBioP)
Combined system performance
After examining the PAS's which were not labeled correctly in the experiments, we have concluded that the following two factors affected conversion performance most strongly:
In cases where one BioProp argument can be divided into two or more PASBio arguments, our rules may be insufficient to disambiguate if NEs or keywords are absent. Consider the following example annotated by our system, in which the BioProp and PASBio labels of each argument are separated by a forward slash:
... [protein extracts from the transfected COS cells Arg0/Arg0 ] [inhibited V ] [both the C alpha and C beta isoforms of the PKA catalytic subunit with equal efficacy Arg1/Arg2 ].
The frameset of the verb "inhibit" in PASBio and BioProp
the entity being inhibited by agent to get binding
the action or property being inhibited
We can see that PASBio defines both Arg1 and Arg2 as the objects being inhibited, but Arg1 is further constrained to being the entity bound by the agent. BioProp, which has no Arg2 definition, does not make this distinction. The automatically generated conversion rule for Arg1, therefore, will have the constraint ContainsKeywords("binding"). However, as the above example lacks any references to binding that would describe which entity "gets binding", the system converts to Arg2 instead of Arg1. In this case, simple NE-/keyword-based rules cannot distinguish Arg1 from Arg2.
According to our analysis, 3.83% of the PAS's in the PASBioP dataset suffered from this problem, especially PAS's for verbs such as decrease, delete, inhibit, lose, mutate, transcribe and truncate.
Coordination ambiguity in the full parse information is another factor that affects conversion performance.
Figure 5 shows two possible full parse structures for the following sentence:
The phrase "inhibit NK-cell-mediated cytotoxicity" can be coordinated with three different phrases, each with a different meaning. This syntactic ambiguity is referred to as "coordination ambiguity" and is a major problem in parsing. As shown in Figure 5(a), our full parser coordinates the verb phrase "express cell-surface receptors of the ... class I peptides" with the verb phrase "inhibit NK-cell-mediated cytotoxicity." Therefore, BIOSMILE tags the noun phrase "NK cells" as "Arg0" for the verb "inhibit." However, in the gold standard annotation, the PASBio developers annotate "cell-surface receptors of ... superfamilies" as "Arg0" for the verb "inhibit". The parse tree for PASBio's annotation is illustrated in Figure 5(b). It coordinates the verb phrase "recognize MHC class I peptides" with the verb phrase "inhibit NK-cell-mediated cytotoxicity." Although both of these parse trees were initially generated by our parser, in the end it chose the incorrect one, Figure 5(a), because, based on the training data, that one appeared to have the highest probability. In such cases it is impossible to distinguish the correct choice using syntactic parsing. Our results show that 1.92% of the PAS's in the PASBioP dataset suffered from this problem.
In this paper we have demonstrated the feasibility of converting between BioProp and PASBio annotation, which will hopefully facilitate and inspire further PASBio applications. Our approach has involved the use of our previous SRL system, BIOSMILE, as well as the development of two new tools, a semi-automatic rule generator and a BioProp-PASBio converter. Our rule-generation tool can save considerable human effort by automatically generating conversion rules which only need fine tuning to be usable. Our BioProp-PASBio converter achieves a high F-score (85.29%) using the gold-standard BioProp dataset. Our combined system (BIOSMILE + rule-based converter) achieves an F-score of 69.08% for PASBio's 29 verbs. This performance is close to that of state-of-the-art ML-based SRL systems in other specific domains.
This research was supported in part by the National Science Council under grants NSC 97-2218-E-155-001 and NSC96-2752-E-001-001-PAE, and by the thematic program of Academia Sinica under grant AS95ASIA02.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 12, 2008: Asia Pacific Bioinformatics Network (APBioNet) Seventh International Conference on Bioinformatics (InCoB2008). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S12.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.