- Research article
- Open Access
A resource-saving collective approach to biomedical semantic role labeling
- Richard Tzong-Han Tsai1 and
- Po-Ting Lai2, 3
https://doi.org/10.1186/1471-2105-15-160
© Tsai and Lai; licensee BioMed Central Ltd. 2014
- Received: 4 September 2013
- Accepted: 22 April 2014
- Published: 27 May 2014
Abstract
Background
Biomedical semantic role labeling (BioSRL) is a natural language processing technique that identifies the semantic roles of the words or phrases in sentences describing biological processes and expresses them as predicate-argument structures (PAS’s). Currently, a major problem in BioSRL is that most systems label every node in a full parse tree independently, even though the labels of some nodes are dependent on one another. In general SRL, collective approaches based on the Markov logic network (MLN) model have been successful in dealing with this problem. However, in BioSRL such an approach has not been attempted because it would require more training data to recognize the more specialized and diverse terms found in biomedical literature, increasing training time and computational complexity.
Results
We first constructed a collective BioSRL system based on MLN. This system, called collective BIOSMILE (CBIOSMILE), is trained on the BioProp corpus. To reduce the resources used in BioSRL training, we employ a tree-pruning filter to remove unlikely nodes from the parse tree and four argument candidate identifiers to retain candidate nodes in the tree. Nodes not recognized by any candidate identifier are discarded. The pruned annotated parse trees are used to train a resource-saving MLN-based system, which is referred to as resource-saving collective BIOSMILE (RCBIOSMILE). Our experimental results show that our proposed CBIOSMILE system outperforms BIOSMILE, which is the top BioSRL system. Furthermore, our proposed RCBIOSMILE maintains the same level of accuracy as CBIOSMILE using 92% less memory and 57% less training time.
Conclusions
This greatly improved efficiency makes RCBIOSMILE potentially suitable for training on much larger BioSRL corpora over more biomedical domains. Compared to real-world biomedical corpora, BioProp is relatively small, containing only 445 MEDLINE abstracts and 30 event triggers. It is not large enough for practical applications, such as pathway construction. We consider it of primary importance to pursue SRL training on large corpora in the future.
Keywords
- Noun Phrase
- First Order Logic
- Parse Tree
- Semantic Role
- Argument Type
Background
Biomedical semantic role labeling (BioSRL)
An example parse tree annotated with semantic roles.
Given a sentence, SRL proceeds in two steps: predicate identification and argument recognition. The first step can be achieved by using a part-of-speech (POS) tagger with some filtering rules. The second step then recognizes all arguments, which involves grouping words into arguments and classifying those arguments into semantic role categories; some studies refer to these two sub-steps as argument identification and argument classification, respectively [2, 3]. In the second step, it is often difficult to determine the word boundaries and semantic roles of an argument because they depend on many factors, such as the argument's position, the predicate's voice (active or passive), and the predicate's sense (usage). The second step can be formulated as a sentence tagging problem. A sentence can be represented by a sequence of words, a sequence of phrases, or a parse tree; the corresponding basic units are words, phrases, and constituents (nodes of a full parse tree), respectively. Hacioglu et al. [4] showed that tagging phrase-by-phrase (P-by-P) is better than word-by-word (W-by-W). However, Punyakanok et al. [3] showed that constituent-by-constituent (C-by-C, or node-by-node) tagging is better than P-by-P. Based on Punyakanok et al.’s findings, BIOSMILE [5] also adopted C-by-C tagging for BioSRL and achieved an accuracy close to that of top general SRL systems.
C-by-C approaches can be called “discrete approaches” because they label each constituent/node independently and do not consider dependencies among constituents/nodes. For example, the constraint that a parent node and any of its children cannot both be labeled with semantic roles is exactly such a dependency. In SRL, collective approaches, which label several or all nodes simultaneously, have been proposed and outperform discrete approaches [6]. The Markov logic network (MLN) model is a representative collective approach: it offers the flexibility to model dependencies with first-order-logic formulae.
In this paper, we explore the collective approach to BioSRL by building an MLN-based system. Despite the convenience of modeling dependencies and the high accuracy of MLN, we have observed that it requires more memory and longer training times on a large corpus. This is an obstacle to applying MLN to BioSRL, which requires a large amount of training data to cover the wide variety of specialized biomedical subdomains. To reduce the resources used in BioSRL training, we employ a tree-pruning filter to remove unlikely nodes from the parse tree and four argument candidate identifiers to retain candidate nodes in the tree. Nodes not recognized by any candidate identifier are discarded. The pruned annotated parse trees are used to train a resource-saving MLN-based system, which is referred to as resource-saving collective BIOSMILE (RCBIOSMILE).
Methods
Resource-saving preprocessing
To reduce the resources used in BioSRL training, we employ a tree-pruning filter to remove unlikely nodes from the parse tree and four argument candidate identifiers to retain candidate nodes in the tree. Nodes not recognized by any candidate identifier are discarded. In the following subsections, we describe these components.
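Before describing the individual components, the sketch below illustrates the overall control flow of this preprocessing step: a node is kept only if no pruning rule removes it and at least one candidate identifier proposes a role for it. The node representation, the rule and identifier signatures, and the function names are illustrative placeholders, not the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class Node:
    """A constituent node of a full parse tree (simplified placeholder)."""
    node_id: int
    constituent_type: str
    span: str
    children: List["Node"] = field(default_factory=list)

def prune_and_filter(nodes: List[Node],
                     pruning_rules: List[Callable[[Node], bool]],
                     candidate_identifiers: List[Callable[[Node], Set[str]]]) -> List[Node]:
    """Apply the tree-pruning filter, then keep only nodes for which at least
    one candidate identifier proposes a semantic-role label."""
    kept = []
    for node in nodes:
        # Tree pruning filter: drop the node if any pruning rule matches it.
        if any(rule(node) for rule in pruning_rules):
            continue
        # Candidate identifiers: union of the roles proposed for this node.
        proposed: Set[str] = set()
        for identify in candidate_identifiers:
            proposed |= identify(node)
        if proposed:
            kept.append(node)
    return kept
```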
Tree pruning filter (TPF)
Examples of tree pruning. a. Pruning nodes using pruning rule 1. b. Pruning nodes using rules 2, 3, and 4.
Association rule candidate identifier (ARCI)
Rules containing several predicates are effective for identifying argument candidates. For example, if a node i starts with “in” and ends with “cell”, then it is very likely to be a location argument (ARGM-LOC). This rule can be translated into the following first-order logic formula:
Rule 1: firstword(i, "in") ∧ lastword(i, "cell") ⇒ role(p, i, "ARGM-LOC")
We can see that this rule is composed of three predicates: firstword(i, "in"), lastword(i, "cell"), and role(p, i, "ARGM-LOC").
However, compiling the set of rules is labor-intensive. To automatically generate a rule, the main task is to decide which predicates must be included. We select several basic predicate types listed as follows.
The pool of predicates:
- Node type
- First word and last word stem
- POS of the first word and last word
- POS of the verb predicate
- Voice of the verb predicate
- Before or after the verb predicate
- Semantic role
- The verb predicate
- Syntactic path from the verb predicate
These features are selected from basic features used in the BIOSMILE [5] system. We formulate the rule-generation problem as a problem of association rule mining [7].
Examples of association rule mining. a. T3 efficiently induced erythroid differentiation in these cells, thus overcoming the v-erbA-mediated differentiation arrest. b. In contrast mRNA representing pAT 591/EGR2 was not expressed in these cells.
The features used in BIOSMILE

Feature group | Feature |
---|---|
Basic features | Verb predicate – the verb predicate lemma |
 | Path – the syntactic path through the parse tree from the constituent being classified to the verb predicate |
 | Constituent type (CT) |
 | Position – whether the phrase is located before or after the verb predicate |
 | Voice – passive if the verb predicate has POS tag VBN and either its chunk is not a VP or it is preceded by a form of "to be" or "to get" within its chunk; otherwise active |
 | Head word – calculated using the head word table described by Collins (1999) |
 | Head POS – the POS of the head word |
 | Sub-categorization – the phrase structure rule that expands the verb predicate's parent node in the parse tree |
 | First and last word (FW and LW) and their POS tags |
 | Level – the level in the parse tree |
Verb predicate features | Verb predicate's verb class |
 | Verb predicate POS tag |
 | Verb predicate frequency |
 | Verb predicate's context POS |
 | Number of verb predicates |
Full parsing features | Parent, left sibling, and right sibling paths, constituent types, positions, head words, and head POS tags |
 | Head of prepositional phrase (PP) parent – if the parent is a PP, the head of this PP is also used as a feature |
Combination features | Verb predicate distance combination |
 | Verb predicate phrase type combination |
 | Head word and verb predicate combination |
 | Voice position combination |
Others | Syntactic frame of verb predicate/NP |
 | Headword suffixes of lengths 2, 3, and 4 |
 | Number of words in the phrase |
 | Context words and POS tags |
From transactions composed of these predicates (examples are listed below), association rules such as Rule 1 can be generated.
The generated association rules are employed to identify argument candidates. Nodes not matching any rule are discarded.
The transactions extracted from the sentences in Figure 4. The full names of the abbreviations can be found in Table 1.
FW(T3), LW(T3), CT(NP), PATH(NP > S < VP < VBD), verb_predicate(induce), ROLE(ARG0)
FW(efficiently), LW(efficiently), CT(ADVP), PATH(ADVP > S < VP < VBD), verb_predicate(induce), ROLE(ARGM-MNR)
FW(erythroid), LW(differentiation), CT(NP), PATH(VBD > VP < NP), verb_predicate(induce), ROLE(ARG1)
FW(in), LW(cell), CT(PP), PATH(VBD > VP > S < PP), verb_predicate(induce), ROLE(ARGM-LOC)
FW(thus), LW(thus), CT(RB), PATH(VBD > VP < RB), verb_predicate(induce), ROLE(ARGM-DIS)
FW(overcoming), LW(arrest), CT(S), PATH(VBD > VP < S), verb_predicate(induce), ROLE(ARGM-ADV)
FW(in), LW(contrast), CT(PP), PATH(PP > S < VP < VP < VBN), verb_predicate(express), ROLE(ARGM-DIS)
FW(mRNA), LW(591/egr2), CT(NP), PATH(NP > S < VP < VBN), verb_predicate(express), ROLE(ARG1)
FW(not), LW(not), CT(RB), PATH(RB > VP < VBD), verb_predicate(express), ROLE(ARGM-NEG)
FW(in), LW(cell), CT(PP), PATH(VBD > VP < S < PP), verb_predicate(express), ROLE(ARGM-LOC)
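To make the rule-generation step concrete, the sketch below mines one- and two-item association rules of the form features ⇒ ROLE from transactions like those listed above. It is a toy Apriori-style illustration rather than the authors' implementation; the support and confidence thresholds and the two-item limit on antecedents are assumptions chosen for brevity.

```python
from collections import Counter
from itertools import combinations

# Each transaction is a set of items; the ROLE(...) item is the consequent.
transactions = [
    {"FW(in)", "LW(cell)", "CT(PP)", "ROLE(ARGM-LOC)"},
    {"FW(in)", "LW(contrast)", "CT(PP)", "ROLE(ARGM-DIS)"},
    {"FW(in)", "LW(cell)", "CT(PP)", "ROLE(ARGM-LOC)"},
    {"FW(not)", "LW(not)", "CT(RB)", "ROLE(ARGM-NEG)"},
]

def mine_rules(transactions, min_support=2, min_confidence=0.8, max_items=2):
    """Return rules (antecedent items) -> role with enough support and confidence."""
    rules = []
    antecedent_counts = Counter()
    joint_counts = Counter()
    for t in transactions:
        roles = {item for item in t if item.startswith("ROLE(")}
        features = sorted(t - roles)
        for k in range(1, max_items + 1):
            for combo in combinations(features, k):
                antecedent_counts[combo] += 1
                for role in roles:
                    joint_counts[(combo, role)] += 1
    for (combo, role), joint in joint_counts.items():
        confidence = joint / antecedent_counts[combo]
        if joint >= min_support and confidence >= min_confidence:
            rules.append((combo, role, joint, confidence))
    return rules

for antecedent, role, support, confidence in mine_rules(transactions):
    print(" AND ".join(antecedent), "=>", role,
          f"(support={support}, confidence={confidence:.2f})")
```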
Word-based candidate identifier (WCI)
Some types of arguments can be identified by checking whether their spans exactly match specific words, in combination with other conditions. We compiled lists of words corresponding to three types of arguments. These argument types are:
The word list for identifying ARGM-DIS, ARGM-MOD, ARGM-NEG
Type | Word list |
---|---|
ARGM-DIS | Additionally, also, altogether, as well as, but also, by contrast, conversely, even, finally, further, furthermore, hence, however, importantly, in addition, in conclusion, in contrast, in fact, in parallel, in part, in particular, in sum, in summary, in this regard, in turn, indeed, instead, interestingly, likewise, moreover, nevertheless, no longer, nonetheless, not only, on the contrary, on the other hand, probably, rather, still, surprisingly, then, thereby, therefore, thus, whereas |
ARGM-NEG | Not, n’t, never, no longer |
ARGM-MOD | Can, could, may, might, shall, should, would, will, going (to), have (to), use (to) |
Modal Argument (ARGM-MOD) and Negation Argument (ARGM-NEG): If a node’s span appears right before a verb predicate and can be found in the word list for either ARGM-MOD or ARGM-NEG, it is regarded as an ARGM-MOD or ARGM-NEG candidate, respectively.
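A minimal sketch of this word-based check is shown below, using abbreviated copies of the ARGM-MOD and ARGM-NEG word lists above; the position test (the node's span immediately preceding the verb predicate) is reduced here to a simple token-index comparison for illustration.

```python
ARGM_MOD_WORDS = {"can", "could", "may", "might", "shall", "should",
                  "would", "will", "going to", "have to", "use to"}
ARGM_NEG_WORDS = {"not", "n't", "never", "no longer"}

def word_based_candidates(node_span: str, node_end: int, predicate_start: int):
    """Return ARGM-MOD / ARGM-NEG candidate labels for a node whose span
    ends immediately before the verb predicate (a simplified position test)."""
    labels = set()
    if node_end + 1 != predicate_start:      # must directly precede the predicate
        return labels
    span = node_span.lower()
    if span in ARGM_MOD_WORDS:
        labels.add("ARGM-MOD")
    if span in ARGM_NEG_WORDS:
        labels.add("ARGM-NEG")
    return labels

# Example: "... may [induce] ..." where "may" ends at token 3 and the
# verb predicate starts at token 4.
print(word_based_candidates("may", node_end=3, predicate_start=4))   # {'ARGM-MOD'}
print(word_based_candidates("not", node_end=3, predicate_start=4))   # {'ARGM-NEG'}
```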
Pattern-based candidate identifier (PCI)
Extent and temporal marker patterns
Pattern type | Regular expression |
---|---|
Extent | \b(\d+%|fold|extent|(greater|less) than \d+)\b |
Temporal | \b(year|month|week|day|hour|minute|min|second|sec|(\d+|one|several)[\-]?(wk|hr|h))s?\b |
Temporal Marker (ARGM-TMP): The temporal marker indicates when an action takes place. Like extent markers, temporal markers are usually siblings of the verb-predicate node and are identified in the same manner. The identifier finds ARGM-TMP candidates by checking whether the subtrees rooted at siblings of the verb-predicate node contain any nodes whose spans match the temporal marker pattern. In addition, temporal markers sometimes appear at the beginning of a sentence. Therefore, the identifier also checks whether any nodes whose spans start with the first word of the sentence match the temporal marker pattern (shown in Table 3). Such nodes are also considered ARGM-TMP candidates.
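The sketch below applies cleaned-up versions of the regular expressions in Table 3 to candidate spans. The sibling subtrees of the verb-predicate node are flattened to a plain list of spans, and the extent label is written as ARGM-EXT (the standard PropBank tag for extent markers); both simplifications are assumptions made for illustration.

```python
import re

# Cleaned-up versions of the extent and temporal patterns from Table 3.
EXTENT_PATTERN = re.compile(r"\b(\d+%|fold|extent|(greater|less) than \d+)\b")
TEMPORAL_PATTERN = re.compile(
    r"\b(year|month|week|day|hour|minute|min|second|sec|"
    r"(\d+|one|several)[\-]?(wk|hr|h))s?\b")

def pattern_based_candidates(sibling_spans, sentence_initial_spans):
    """Mark sibling-of-predicate spans matching the extent/temporal patterns,
    plus sentence-initial spans matching the temporal pattern."""
    candidates = []
    for span in sibling_spans:
        if EXTENT_PATTERN.search(span):
            candidates.append((span, "ARGM-EXT"))
        if TEMPORAL_PATTERN.search(span):
            candidates.append((span, "ARGM-TMP"))
    for span in sentence_initial_spans:
        if TEMPORAL_PATTERN.search(span):
            candidates.append((span, "ARGM-TMP"))
    return candidates

print(pattern_based_candidates(
    sibling_spans=["by 2-fold", "for 24 hours"],
    sentence_initial_spans=["After 3 days of culture"]))
```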
Parse-tree-based candidate identifier (PTCI)
If a node n is not identified as a candidate by the above components, this module will check if the path from the verb-predicate node to n is equal to any path from the verb-predicate node to an argument node m in the training set. If yes, n will be treated as a candidate of m’s argument type.
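In other words, the module keeps a lookup from the predicate-to-argument syntactic paths seen in training to the argument types observed at their ends, and consults it for otherwise unidentified nodes. A minimal sketch under that reading, reusing the path notation of the transaction table above:

```python
from collections import defaultdict

def build_path_table(training_arguments):
    """Map each predicate-to-argument syntactic path seen in training
    to the set of argument types observed at the end of that path."""
    path_table = defaultdict(set)
    for path, arg_type in training_arguments:
        path_table[path].add(arg_type)
    return path_table

def parse_tree_based_candidates(path_table, node_path):
    """Return the candidate argument types for a node whose path from the
    verb predicate exactly matches a path seen in the training set."""
    return path_table.get(node_path, set())

# Paths reuse the notation of the transaction table above.
table = build_path_table([
    ("VBD>VP>S<PP", "ARGM-LOC"),
    ("NP>S<VP<VBD", "ARG0"),
])
print(parse_tree_based_candidates(table, "VBD>VP>S<PP"))  # {'ARGM-LOC'}
print(parse_tree_based_candidates(table, "VBD>VP<NP"))    # set()
```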
Markov logic
First-order logic
MLN combines first order logic (FOL) and Markov networks. In FOL, formulae consist of four types of symbols: constants, variables, functions, and predicates. Constant symbols represent objects in a specific domain (e.g., Annie, Bob, Cathy, etc.). Variable symbols range over the objects in the domain. Function symbols (e.g. MotherOf) represent mappings from tuples of objects to objects. Predicates represent relationships among objects (e.g. Friends), or attributes of objects (e.g. Smokes). Constants and variables may belong to specific types. An atom is a predicate symbol applied to a list of arguments, which may be constants or variables. A ground atom is an atom whose arguments are all constants. A world is an assignment of truth values to all possible ground atoms. A knowledge base (KB) is a partial specification of a world; each atom in it is true, false, or unknown.
Markov networks
A Markov network represents the joint distribution of a set of variables X = {X_1, …, X_n} ∈ 𝒳 as a product of factors: P(X = x) = (1/Z) ∏_k f_k(x_k), where each factor f_k is a non-negative function of a subset of the variables x_k, and Z is the normalization constant. The distribution is usually represented equivalently in log-linear form: P(X = x) = (1/Z) exp(∑_i w_i g_i(x)), where the features g_i(x) are arbitrary functions of (a subset of) the variables' states and the w_i are the corresponding weights.
Markov logic networks
An MLN is a set of weighted first-order formulae. Together with a set of constants representing objects in the domain, it defines a Markov network with one variable per ground atom and one feature per ground formula. The probability distribution over possible worlds is given by P(X = x) = (1/Z) exp(∑_{i∈F} w_i ∑_{j∈G_i} g_j(x)), where Z is the partition function, F is the set of all first-order formulae in the MLN, w_i is the weight of the i-th formula, G_i is the set of groundings of the i-th first-order formula, and g_j(x) = 1 if the j-th ground formula is true and g_j(x) = 0 otherwise. Markov logic enables us to compactly represent complex models in non-i.i.d. domains. General algorithms for inference and learning in Markov logic are discussed in Richardson and Domingos [9]. We use the 1-best MIRA online learning method [10] to learn the formula weights, and we employ cutting-plane inference [11] with integer linear programming as its base solver, both at test time and during MIRA online learning. As mentioned earlier, to avoid ambiguity between FOL predicates and SRL predicates, we refer to the latter as “verb predicates”.
Formulae
Local formulae (L)
As shown in Table 1, local formulae are derived from the features used in the SRL systems [2, 12–14] based on the maximum entropy (ME) model and support vector machine (SVM) model. We used these features in BIOSMILE [5], and we have transformed them into formulae here to employ them in our MLN model.
In these formulae, w denotes the headword of node i. If the “+” symbol appears before a variable, it indicates that each distinct value of that variable has its own weight.
Collective formulae (C)
Collective classification is a methodology that simultaneously classifies related instances. It can improve classification accuracy over non-collective methods when instances are interrelated [15–17]. MLN performs well in many collective classification tasks such as entity linking [18–20], coreference resolution [21, 22] and biomedical event extraction [23]. In node-by-node SRL, related instances are nodes having linguistic dependencies.
There are two main types of linguistic dependencies in SRL: tree dependency and path dependency. They have been shown to be effective in improving the consistency of SRL results [3]. Nodes with tree dependencies and path dependencies can be treated as tree collectives and path collectives, respectively. In MLN-based SRL, collectives can be implemented with collective formulae that model dependencies among nodes.
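As a procedural illustration of what the tree and path collectives enforce, the check below tests two constraints that the Discussion section attributes to the collective formulae: each core argument type may be assigned at most once per verb predicate, and an argument node may not dominate another argument node on the same path. This restates the constraints in plain code; it is not the weighted first-order formulae actually used in the MLN.

```python
from collections import Counter

def violates_collective_constraints(labeled_nodes, ancestors):
    """labeled_nodes: {node_id: role}; ancestors: {node_id: set of ancestor ids}.
    Return True if the joint labeling breaks a path or tree constraint."""
    # Path constraint: an argument node may not dominate another argument node.
    for node in labeled_nodes:
        if ancestors.get(node, set()) & labeled_nodes.keys():
            return True
    # Tree constraint: at most one node per core argument type (ARG0-ARG5).
    core_counts = Counter(r for r in labeled_nodes.values()
                          if r.startswith("ARG") and not r.startswith("ARGM"))
    return any(count > 1 for count in core_counts.values())

# Two nodes both labeled ARG0 -> violation of the tree constraint.
print(violates_collective_constraints(
    {1: "ARG0", 2: "ARG0"}, ancestors={1: set(), 2: set()}))        # True
# An argument node dominating another argument node -> path violation.
print(violates_collective_constraints(
    {1: "ARG0", 3: "ARGM-LOC"}, ancestors={1: set(), 3: {1}}))      # True
print(violates_collective_constraints(
    {1: "ARG0", 3: "ARGM-LOC"}, ancestors={1: set(), 3: set()}))    # False
```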
Candidate identification formulae (CI)
In our resource-saving preprocessing step, candidate identifiers recognize the most likely semantic roles for each node. The information can be transformed into formulae to improve the accuracy of MLN inference. For a node i retained as a candidate node by our resource-saving preprocessing, an observed predicate candidate_role(p, i, r) is added to our MLN-based inference system.
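Concretely, the output of the resource-saving preprocessing can be serialized as observed ground atoms and fed to the MLN together with the other observed predicates. A small sketch of that step; the textual atom syntax follows the convention used in this paper, not the input format of any particular MLN engine.

```python
def candidate_role_atoms(predicate_id, candidates):
    """candidates: {node_id: set of candidate argument types} produced by the
    resource-saving preprocessing. Emit one observed candidate_role atom per
    (node, role) pair; nodes absent from the mapping were pruned entirely."""
    atoms = []
    for node_id, roles in sorted(candidates.items()):
        for role in sorted(roles):
            atoms.append(f'candidate_role({predicate_id}, {node_id}, "{role}")')
    return atoms

print(candidate_role_atoms(0, {4: {"ARGM-LOC"}, 7: {"ARGM-TMP", "ARGM-ADV"}}))
```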
Results
Dataset
The statistics of the BioProp corpus

Category | Item | Number |
---|---|---|
Role | Core argument types | 11 |
 | Adjunctive argument types | 21 |
Feature | Constituent types | 17 |
 | Unique words | 5258 |
 | Part-of-speech tags | 34 |
Other | Verb predicate types | 30 |
 | Abstracts with PAS’s | 445 |
 | Sentences with PAS’s | 1622 |
 | Propositions | 1962 |
Core arguments (ARGX, R-ARGX, and C-ARGX) play the main semantic roles in a PAS. ARGX arguments (ARG0–ARG5) are the arguments most essential to a given verb predicate. C-ARGX represents multi-node arguments: a node labelled C-ARGX is assumed to be a continuation of the closest node to its left labelled ARGX. A node labelled R-ARGX is assumed to be a relative pronoun referring to the closest node to its left labelled ARGX. Adjunctive arguments (ARGM-X) play semantic roles such as location, manner, time, and extent in a PAS.
Evaluation metric
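The result tables below report precision (P), recall (R), and F-score (F). Assuming the standard definitions over predicted and gold-standard arguments (a sketch; the exact matching criterion, e.g. requiring both the span and the role to match, is not restated here):

```latex
P = \frac{\#\,\text{correctly labeled arguments}}{\#\,\text{predicted arguments}}, \qquad
R = \frac{\#\,\text{correctly labeled arguments}}{\#\,\text{gold-standard arguments}}, \qquad
F = \frac{2PR}{P + R}
```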
t-test
In order to evaluate performance under unbiased conditions, we apply a two-sample paired t-test, which is defined as follows:
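For the thirty paired F-scores described below, the statistic takes the standard paired form (a sketch of the usual definition; the authors' exact formulation may differ in detail):

```latex
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}
```

where d_i is the difference in F-score between the two compared systems on the i-th test set, n = 30, and the degrees of freedom are n − 1 = 29.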
If the resulting t-score is less than or equal to 1.67, with 29 degrees of freedom at the 95% confidence level, the null hypothesis is accepted; otherwise, it is rejected.
To obtain the average F-scores and their deviations required for the t-test, we randomly sampled thirty training sets (g_1, …, g_30) and thirty test sets (d_1, …, d_30) from the 445 abstracts. Each training set contains 365 abstracts and each test set contains 89 abstracts. We trained the model on g_i and tested it on d_i. Afterwards, we summed the scores over all thirty test sets and calculated the averages for performance comparison.
Configuration settings
Configuration settings and argument-wide SRL performance
Configuration | Feature/Model/Preprocessing | ARGX | ARGM | Overall ARG | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RP | L | MLN | C | CI | P | R | F | ΔF | P | R | F | ΔF | P | R | F | ΔF | |
BIOSMILE | ✓ | ✓ | ✓ | 91.59 | 85.48 | 88.43 | - | 81.36 | 67.50 | 73.79 | - | 88.72 | 80.00 | 84.14 | - | ||
CBIOSMILE | ✓ | ✓ | ✓ | 90.44 | 89.19 | 89.81 | +1.38* | 81.12 | 67.45 | 73.66 | −0.13 | 87.93 | 82.57 | 85.16 | +1.02* | ||
RCBIOSMILE: TPF | TPF | ✓ | ✓ | ✓ | 90.47 | 89.22 | 89.84 | +1.41* | 81.16 | 65.99 | 72.79 | −1 | 88.00 | 82.14 | 84.97 | +0.83 | |
RCBIOSMILE: ARCI | ARCI | ✓ | ✓ | ✓ | ARCI | 90.76 | 88.55 | 89.64 | +1.21* | 82.18 | 66.65 | 73.6 | −0.19 | 86.6 | 81.87 | 85.04 | +0.9* |
RCBIOSMILE: WCI | WCI | ✓ | ✓ | ✓ | WCI | 90.4 | 89.24 | 89.85 | +1.24* | 80.77 | 68.34 | 74.04 | +0.25 | 87.77 | 82.92 | 85.27 | +1.13* |
RCBIOSMILE: PCI | PCI | ✓ | ✓ | ✓ | PCI | 90.38 | 89.24 | 89.8 | +1.37* | 77.59 | 70.91 | 74.1 | +0.31 | 86.68 | 83.65 | 85.14 | +1* |
RCBIOSMILE: PTCI | PTCI | ✓ | ✓ | ✓ | PTCI | 90.4 | 89.08 | 89.74 | +1.31* | 77.29 | 70.98 | 74 | +0.21 | 86.6 | 83.56 | 85.05 | +0.91* |
RCBIOSMILE | All | ✓ | ✓ | ✓ | All | 90.86 | 89.34 | 90.10 | +1.67* | 78.14 | 70.14 | 73.93 | +0.14 | 87.23 | 83.49 | 85.32 | +1.18* |
Extraction performance
PAS-wide SRL performance
Configuration | ARGX | Overall ARG | ||||||
---|---|---|---|---|---|---|---|---|
P | R | F | ΔF | P | R | F | ΔF | |
1. BIOSMILE | 73.35 | 71.95 | 72.64 | - | 53.84 | 53.58 | 53.71 | - |
2. CBIOSMILE | 79.43 | 79.33 | 79.38 | +6.74* | 58.31 | 58.31 | 58.31 | +4.60* |
3. RCBIOSMILE | 80.01 | 79.57 | 79.79 | +7.15* | 59.53 | 59.46 | 59.49 | +4.78* |
In the PAS-wide evaluation (Table 6), CBIOSMILE improves on BIOSMILE by 4.60% (F-score) overall, with an even larger advantage on core arguments (6.74%). RCBIOSMILE widens the overall improvement over BIOSMILE to 4.78%, with a gain of 7.15% on core arguments.
Resource savings
We first examine the effects of the tree pruning filter. On average, rules 1, 2, 3, and 4 remove 10%, 36%, 5%, and 13% of all nodes, respectively. Applying all four tree pruning rules filters out 43% of all nodes.
The cost of time and memory
Configuration | Training time (per iter.) | Test time (per inst.) | Maximum used memory |
---|---|---|---|
1. BIOSMILE | 20 s | 60 ms | 1092 MB |
2. CBIOSMILE | 137 s | 71 ms | 1092 MB |
3. RCBIOSMILE | 25.4 s | 43 ms | 102 MB |
Discussion
Advantages of (R)CBIOSMILE over BIOSMILE
CBIOSMILE and RCBIOSMILE excel at correcting two error types, (1) duplicate arguments and (2) overlapping arguments. An example of a duplicate argument would be:
Partial amino acid sequences obtained from purified EBF were used to isolate [cDNA clones]ARG0, [which]R-ARG0 [by multiple criteria]ARGM-MNR [encode]Verb-Predicate [EBF]ARG1
Here BIOSMILE, working node by node, labels both “cDNA clones” and “multiple criteria” as ARG0: the former because it appears before “which”, and the latter because it is the noun phrase nearest to the verb predicate “encode”. (R)CBIOSMILE avoids this error because the tree collective formulae limit the maximum number of any core argument type to one. Therefore, it labels only the node with the highest likelihood of being ARG0.
Similarly, (R)CBIOSMILE can avoid overlapping errors such as:
Isolation of [a rel-related human cDNA]ARG0 [that]R-ARG0 [potentially]ARGM-MNR [encodes]Verb-Predicate [the 65-kD subunit of NF-kappa B] (published erratum appears in Science 1991 Oct 4; 254(5028): 11)
In this example, BIOSMILE incorrectly labels the two overlapping nodes “Isolation of a rel-related human cDNA” and “a rel-related human cDNA” as ARG0. (R)CBIOSMILE does not make such errors because the path collective formulae assert that only one node on a path of the parse tree can be an argument.
Advantages of RCBIOSMILE over CBIOSMILE
Argument-wide SRL performance for each argument type
Type | Description | BIOSMILE | CBIOSMILE | RCBIOSMILE |
---|---|---|---|---|
ARG0 | Main arguments whose definitions depend on the corresponding predicate | 90.20% | 91.85% | 92.24% |
ARG1 | | 88.43% | 89.89% | 90.07% |
ARG2 | | 82.85% | 83.00% | 82.90% |
ARGM-ADV | Adverbials: syntactic elements which clearly modify the event structure of the verb in question but do not fall under any of the argument types above | 52.41% | 51.24% | 53.48% |
ARGM-DIS | Discourse markers: markers which connect a sentence to a preceding sentence | 78.51% | 75.73% | 73.68% |
ARGM-LOC | Locative modifiers: indicate where some action takes place | 72.15% | 71.99% | 71.44% |
ARGM-MNR | Manner adverbs: specify how an action is performed | 78.09% | 79.18% | 79.54% |
ARGM-MOD | Modals: will, may, can, must, shall, might, should, could, would; phrasal modals such as "going (to)", "have (to)" and "used (to)" are also included | 96.21% | 96.63% | 97.28% |
ARGM-NEG | Negation: used for elements such as "not", "n't", "never", "no longer" and other negation markers | 96.65% | 95.98% | 96.52% |
ARGM-TMP | Temporal markers show when an action took place | 56.52% | 56.30% | 57.99% |
Overall | | 84.14% | 85.16% | 85.32% |
[Although lymphokine genes are coordinately regulated upon antigen stimulation]ARGM-ADV, [they]ARG1 are [regulated]Verb-Predicate [by the mechanisms common to all as well as those which are unique to each gene]ARG0.
The parse-tree-based candidate identifier employed by RCBIOSMILE recognizes the node whose span is “Although lymphokine genes are coordinately regulated upon antigen stimulation” as a candidate of ARGM-ADV or ARGM-TMP, and RCBIOSMILE adds the predicates “candidate_role(p, i, ARGM-ADV)” and “candidate_role(p, i, ARGM-TMP)” into the inference system. Therefore, this node can be successfully labeled as ARGM-ADV. Because CBIOSMILE does not use the candidate identifier, it has difficulty labeling such long-span nodes with the correct argument types.
Disadvantages of (R) CBIOSMILE
RCBIOSMILE and CBIOSMILE, which use collective learning, outperform BIOSMILE in most argument types. Surprisingly, they perform worse than BIOSMILE in ARGM-DIS and ARGM-ADV. We believe that this may be because ARGM-DIS can be easily recognized by position information (ARGM-DIS usually appears at the beginning of a sentence), and duplication and overlapping errors (which collective approaches excel at handling) seldom occur in ARGM-ADV.
Pruning errors of RCBIOSMILE
We observed that RCBIOSMILE cannot recognize some arguments that can be recognized by BIOSMILE and CBIOSMILE. After analysis, we found that such errors are caused by RCBIOSMILE’s pattern-based pruning step, as in the following sentence:
However, the profound T cell deficit of nude mice indicates that [the thymus]ARG0 is by far the most potent site for [inducing]Verb-Predicate [the expansion]ARG1 [per se]ARGM-MNR, even if other sites can induce some response acquisition.
In this example, the syntactic path from the verb-predicate node to the node spanning “the thymus” does not match any verb-predicate-to-ARG0 path seen in the training set, so the node is pruned. As a result, RCBIOSMILE predicts a nearer noun phrase, “the most potent site”, as ARG0.
Related work
Biomedical semantic role labeling corpus
PASBio [26] is the first PAS standard used in the biomedical field, but it does not provide an SRL corpus. GREC [27] is an information extraction corpus focusing on gene regulation events; however, it does not provide Treebank-style SRL annotations [28]. BioProp, created by Chou et al. [24], is the only corpus that provides SRL annotations directly on syntactic trees. BioProp selects the 30 most frequent or significant verbs found in the biomedical literature and defines a standard for biomedical PAS’s. Furthermore, in accordance with the style of PropBank [29], which annotates PAS’s on the Penn Treebank (PTB) [28], BioProp annotates its PAS’s on the GENIA Treebank (GTB) beta version [30]. GTB is a collection of 500 MEDLINE abstracts selected from search results for the keywords human, blood cells, and transcription factors, bracketed in the Penn Treebank style.
Biomedical semantic role labeling system
Most semantic role labeling systems follow the pipeline method, which includes predicate identification, argument identification, and argument classification. In recent years, however, several studies have shown that collective learning methods can outperform the traditional pipeline. Riedel and Meza-Ruiz [11] use Markov logic to learn these stages collectively for SRL. To the best of our knowledge, however, no existing SRL system in the biomedical field uses MLN. Dahlmeier and Ng [31] use domain adaptation approaches to improve SRL in the biomedical field. Bethard et al. [32] treat SRL as a token-by-token labeling problem and focus on SRL for protein transport predicates. BIOSMILE is a biomedical SRL system based on the maximum entropy (ME) model; it focuses on 30 frequently appearing or important verbs in the biomedical literature and is trained on BioProp.
Conclusions
Currently, a major problem in BioSRL is that most systems label every node in a full parse tree independently, even though the labels of some nodes are dependent on one another. In general SRL, collective approaches based on the Markov logic network (MLN) model have been successful in dealing with this problem. In this paper, we explore the collective approach to BioSRL by building an MLN-based system.
Despite the convenience of modeling dependencies and the high accuracy of MLN, we have observed that it requires more memory and longer training times on a large corpus. This is an obstacle to applying MLN to BioSRL, which requires a large amount of training data to cover the wide variety of specialized biomedical subdomains. To reduce resource usage, we designed a pattern-based method to prune parse-tree nodes that may not have semantic roles. This method is applied to the parse trees in BioProp. To minimize the efforts of domain experts in manual pattern compilation, we developed an automatic pattern generation approach. The pruned annotated parse trees are used to train a resource-saving MLN-based system, which is referred to as resource-saving collective BIOSMILE (RCBIOSMILE).
Our experimental results show that our proposed CBIOSMILE system outperforms BIOSMILE, which is the top BioSRL system. Furthermore, our proposed RCBIOSMILE maintains the same level of accuracy as CBIOSMILE using 92% less memory and 57% less training time. This greatly improved efficiency makes RCBIOSMILE potentially suitable for training on much larger BioSRL corpora over more biomedical domains. Compared to real-world biomedical corpora, BioProp is relatively small, containing only 445 MEDLINE abstracts and 30 event triggers. It is not large enough for practical applications, such as pathway construction. We consider it of primary importance to pursue SRL training on large corpora in the future.
Declarations
Acknowledgments
This research was supported in part by the National Science Council under grant NSC 98-2221-E-155-060-MY3. We especially thank the BMC editors and reviewers for their valuable comments, which helped us improve the quality of the paper.
Authors’ Affiliations
References
- Cohen KB, Hunter L: A critical review of PASBio's argument structures for biomedical verbs. BMC Bioinforma. 2006, 7 (S-3): S5.
- Xue N: Proceedings of EMNLP 2004. Calibrating features for semantic role labeling. 2004, 88-94.
- Punyakanok V, Roth D, Yih W-t, Zimak D: Proceedings of COLING-04. Semantic role labeling via integer linear programming inference. 2004, 1346-1352.
- Hacioglu K, Pradhan S, Ward W, Martin J, Jurafsky D: CoNLL 2004 Shared Task. Semantic Role Labeling by Tagging Syntactic Chunks. 2004.
- Tsai R, Chou W-C, Su Y-S, Lin Y-C, Sung C-L, Dai H-J, Yeh I, Ku W, Sung T-Y, Hsu W-L: BIOSMILE: a semantic role labeling system for biomedical verbs using a maximum-entropy model with automatically generated template features. BMC Bioinforma. 2007, 8 (1): 325. doi:10.1186/1471-2105-8-325.
- Riedel S: Proceedings of UAI 2008. Improving the accuracy and efficiency of MAP inference for Markov logic. 2008, Helsinki: AUAI Press.
- Agrawal R, Imielinski T, Swami A: Mining association rules between sets of items in large databases. SIGMOD Rec. 1993, 22 (2): 207-216. doi:10.1145/170036.170072.
- Agrawal R, Srikant R: Proceedings of the 20th International Conference on Very Large Data Bases. Fast Algorithms for Mining Association Rules in Large Databases. 1994, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc, 487-499.
- Richardson M, Domingos P: Markov logic networks. Mach Learn. 2006, 62 (1): 107-136.
- Crammer K, Singer Y: Ultraconservative online algorithms for multiclass problems. J Mach Learn Res. 2003, 3: 951-991.
- Riedel S, Meza-Ruiz I: Proceedings of the Twelfth Conference on Computational Natural Language Learning; Manchester, United Kingdom. Collective semantic role labelling with Markov logic. 2008, Stroudsburg, PA, USA: Association for Computational Linguistics, 193-197.
- Gildea D, Jurafsky D: Automatic labeling of semantic roles. Comput Linguist. 2002, 28 (3): 245-288. doi:10.1162/089120102760275983.
- Pradhan S, Hacioglu K, Krugler V, Ward W, Martin JH, Jurafsky D: Support vector learning for semantic argument classification. Mach Learn. 2005, 60 (1-3): 11-39.
- Surdeanu M, Harabagiu S, Williams J, Aarseth P: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1; Sapporo, Japan. Using predicate-argument structures for information extraction. 2003, Stroudsburg, PA, USA: Association for Computational Linguistics, 8-15.
- Neville J, Jensen D: Proc AAAI-2000 Workshop on Learning Statistical Models from Relational Data. Iterative classification in relational data. 2000, 13-20.
- Taskar B, Abbeel P, Koller D: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence; Alberta, Canada. Discriminative probabilistic models for relational data. 2002, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc, 485-492.
- Getoor L: Advanced Methods for Knowledge Discovery from Complex Data. Link-based Classification. 2005, London: Springer, 189-207.
- Dai H-J, Tsai R, Hsu W-L: Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP); Chiang Mai, Thailand. Entity Disambiguation Using a Markov-Logic Network. 2011, 846-855.
- Dai H-J, Wu JC-Y, Tsai RT-H: Collective instance-level gene normalization on the IGN corpus. PLoS One. 2013, 8 (11): e79517. doi:10.1371/journal.pone.0079517.
- Dai H-J, Tsai RT-H, Hsu WL: Joint learning of entity linking constraints using a Markov-logic network. IJCLCLP. 2014, 19 (1): 11-32.
- Dai H-J, Chen C-Y, Wu C-Y, Lai P-T, Tsai RT-H, Hsu W-L: Coreference resolution of medical concepts in discharge summaries by exploiting contextual information. J Am Med Inform Assoc. 2012, 19 (5): 888-896. doi:10.1136/amiajnl-2012-000808.
- Dai H-J, Chang YC, Tsai RT-H, Hsu WL: Integration of gene normalization stages and co-reference resolution using a Markov logic network. Bioinformatics. 2011, 27 (18): 2586-2594.
- Poon H, Vanderwende L: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics; Los Angeles, California. Joint inference for knowledge extraction from biomedical literature. 2010, Stroudsburg, PA, USA: Association for Computational Linguistics, 813-821.
- Chou W-C, Tsai RT-H, Su Y-S, Ku W, Sung T-Y, Hsu W-L: Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006; Sydney, Australia. A semi-automatic method for annotating a biomedical proposition bank. 2006, Stroudsburg, PA, USA: Association for Computational Linguistics, 5-12.
- Carreras X, Marquez L: Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL '05). Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. 2005.
- Wattarujeekrit T, Shah P, Collier N: PASBio: predicate-argument structures for event extraction in molecular biology. BMC Bioinforma. 2004, 5 (1): 155. doi:10.1186/1471-2105-5-155.
- Thompson P, Iqbal S, McNaught J, Ananiadou S: Construction of an annotated corpus to support biomedical information extraction. BMC Bioinforma. 2009, 10 (1): 349. doi:10.1186/1471-2105-10-349.
- Bies A, Ferguson M, Katz K, MacIntyre R, Tredinnick V, Kim G, Marcinkiewicz MA, Schasberger B: Technical report. Bracketing Guidelines for Treebank II Style Penn Treebank Project. 1995, Philadelphia, Pennsylvania, USA: University of Pennsylvania.
- Kingsbury P, Palmer M: Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002). From Treebank to PropBank. 2002.
- Kim JD, Ohta T, Tateisi Y, Tsujii J: GENIA corpus – a semantically annotated corpus for bio-textmining. Bioinformatics. 2003, 19 (suppl 1): i180-i182. doi:10.1093/bioinformatics/btg1023.
- Dahlmeier D, Ng HT: Domain adaptation for semantic role labeling in the biomedical domain. Bioinformatics. 2010, 26 (8): 1098-1104. doi:10.1093/bioinformatics/btq075.
- Bethard S, Lu Z, Martin J, Hunter L: Semantic role labeling for protein transport predicates. BMC Bioinforma. 2008, 9: 277. doi:10.1186/1471-2105-9-277.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.