A generalizable NLP framework for fast development of pattern-based biomedical relation extraction systems
© Peng et al.; licensee BioMed Central Ltd. 2014
Received: 9 July 2013
Accepted: 15 August 2014
Published: 23 August 2014
Text mining is increasingly used in the biomedical domain because of its ability to automatically gather information from large numbers of scientific articles. One important task in biomedical text mining is relation extraction, which aims to identify designated relations among biological entities reported in the literature. A relation extraction system achieving high performance is expensive to develop because of the substantial time and effort required for its design and implementation. Here, we report a novel framework to facilitate the development of a pattern-based biomedical relation extraction system. It has several unique design features: (1) leveraging syntactic variations possible in a language and automatically generating extraction patterns in a systematic manner, (2) applying sentence simplification to improve the coverage of extraction patterns, and (3) identifying referential relations between a syntactic argument of a predicate and the actual target expected in the relation extraction task.
A relation extraction system derived using the proposed framework achieved overall F-scores of 72.66% for the Simple events and 55.57% for the Binding events on the BioNLP-ST 2011 GE test set, comparing favorably with the top performing systems that participated in the BioNLP-ST 2011 GE task. We obtained similar results on the BioNLP-ST 2013 GE test set (80.07% and 60.58%, respectively). We conducted additional experiments on the training and development sets to provide a more detailed analysis of the system and its individual modules. This analysis indicates that without increasing the number of patterns, simplification and referential relation linking play a key role in the effective extraction of biomedical relations.
In this paper, we present a novel framework for fast development of relation extraction systems. The framework requires only a list of triggers as input, and does not need information from an annotated corpus. Thus, we reduce the involvement of domain experts, who would otherwise have to provide manual annotations and help with the design of hand-crafted patterns. We demonstrate how our framework is used to develop a system that achieves state-of-the-art performance on a public benchmark corpus.
Due to the continued growth of biomedical publications, it has become very difficult for scientists to keep up with the new findings reported in the literature. As a consequence, we have observed an increase in the effort spent on automatically extracting information from research literature and developing biomedical text mining tools.
This paper aims to address the relation extraction task, which identifies selected types of relationships among entities (e.g., proteins) reported in text.
Approaches to the relation extraction task can be categorized into two major classes: (1) machine learning-based approaches and (2) pattern-based approaches. Machine learning-based approaches are data-driven and can derive models from a set of annotated data [1–7]. The use of machine learning methods can be quite effective, but the performance of the resulting systems depends on the quality and the amount of annotated data. For example, large annotated corpora have become available for the protein-protein interaction relation task, reflecting a general community-wide interest. But this situation does not always hold for relations of different scientific interest, because preparing annotated corpora is generally time consuming and expensive, and it also requires domain expertise and significant effort to ensure accuracy and consistency. In contrast, pattern-based approaches do not require annotated data to train a system. However, they do require domain experts to be closely involved in the design and implementation of the system to capture the patterns used for extracting the necessary information. Some systems rely on extraction patterns defined at the surface textual level or based on outputs from a shallow parser [9–12]. Others use deep parsers with hand-crafted patterns [13–17]. As found in OpenDMAP, a semantic grammar may be utilized with text literals, syntactic constituents, semantic types of entities, and hyponymy. In all cases, rigid extraction patterns are manually encoded in the systems. Owing to rigid patterns, pattern-based approaches usually achieve high precision but are often cited for low recall. While it is feasible to manually identify and implement high quality patterns to achieve good precision, it is often impractical to exhaustively encode all the patterns necessary for high recall in this manner.
Our work enables the fast development of pattern-based systems, while mitigating some of these concerns. We aim to reduce the involvement of domain experts and their manual annotation, and to attain high precision and recall.
Our approach starts by identifying a list of trigger words for the target relation (e.g., “associate” for the binding relation) and their corresponding trigger specifications (e.g., the number and type of arguments expected for each trigger). Given this information, we make use of linguistic principles to derive variations of lexico-syntactic patterns in a systematic manner. These patterns are matched with the input text in order to extract target relations.
To improve the applicability of the generated patterns, we incorporate two additional design features. The first is the use of text simplification. This allows us to design a small set of lexico-syntactic patterns to match simple sentence constructs, rather than try to account for all complex syntactic constructs by generating an exhaustively large number of patterns. Second, the framework exploits referential relations. With this method, two phrases referring to the same entity (e.g., coreference relation) or in a particular relation (e.g., meronymy relation, also known as part-of relation) are detected in text, and links are established between them. These links can be used when seeking the most appropriate phrase referring to the target entity and, hence, allow for extraction of target entities beyond lexico-syntactic patterns.
The proposed approach is based on properties of the language, rather than on task-specific knowledge. Therefore, it is generalizable to different trigger words and potentially applicable to many different types of information targeted in biomedical relation extraction tasks.
We acknowledge several studies underlying our proposed framework. The automated pattern generation employed in this study shares the fundamental assumptions of certain linguistic theories, such as Lexicalized Tree Adjoining Grammar (LTAG), Head-Driven Phrase Structure Grammar, and Lexical Functional Grammar. In particular, we believe that the concept underlying our method is similar to that of LTAG. The paradigm of inferring patterns exploited in our method shares ideas with [22–30], but we focus on a specific set of patterns pertaining to the expression of biomedical relations.
Simplifying a sentence as a prerequisite for biomedical information extraction was studied in the past [9, 11, 31–34]. The use of meronymy and its opposite holonymy, among other relationships found in biomedical ontologies, has been discussed previously. Some of these relations were later considered in biomedical information extraction systems in order to improve their performance [36–38]. We use these relations and paradigms in conjunction with two additional referential relationships of our own: coreference and hyponymy. We integrate them in our framework and examine their utility for biomedical relation extraction.
To evaluate the framework, we test it by producing an extraction system for six relations that were part of the BioNLP-ST 2011 and 2013 GE tasks. We show that by just taking the specification of trigger words (root word only), we produce a relation extraction system with results that compare favorably with state-of-the-art results on this corpus. We further show that we can achieve good precision and recall with the patterns generated from the trigger, and that simplification and referential relation linking can increase the recall without compromising the precision.
A. Architecture overview
In the above example, Line 1 shows the trigger word, “phosphorylate” in this case. Line 2 indicates that it is the trigger for the relation “phosphorylation”. Line 3 specifies that the trigger syntactically chooses two noun phrases, designated as NP0 and NP1. Lines 4–5 add selectional restrictions, by requiring NP0 to be a gene or gene product (GGP) and NP1 to be either a GGP or a protein part. Lines 6–7 show that if NP0 and NP1 can be extracted, and if both NP0 and NP1 meet the above constraints, then the framework will assign their semantic roles of agent and theme, respectively.
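The specification described above can be sketched as a small data structure. This is only an illustration under our own assumptions: the class names `ArgumentSpec` and `TriggerSpec`, the field names, and the type labels `"GGP"` and `"protein_part"` are hypothetical and do not reproduce the framework's actual input format.

```python
from dataclasses import dataclass, field


@dataclass
class ArgumentSpec:
    """Selectional restriction and semantic role for one argument slot."""
    allowed_types: frozenset
    role: str


@dataclass
class TriggerSpec:
    """Trigger root, target relation, frame, and per-argument constraints."""
    trigger: str
    relation: str
    frame: tuple
    arguments: dict = field(default_factory=dict)

    def role_for(self, slot, entity_type):
        """Return the role if the entity type satisfies the slot's restriction, else None."""
        spec = self.arguments.get(slot)
        if spec is not None and entity_type in spec.allowed_types:
            return spec.role
        return None


# Hypothetical encoding of the "phosphorylate" specification described in the text.
PHOSPHORYLATE = TriggerSpec(
    trigger="phosphorylate",
    relation="phosphorylation",
    frame=("NP0", "NP1"),
    arguments={
        "NP0": ArgumentSpec(frozenset({"GGP"}), "agent"),
        "NP1": ArgumentSpec(frozenset({"GGP", "protein_part"}), "theme"),
    },
)
```

With this encoding, `PHOSPHORYLATE.role_for("NP1", "protein_part")` yields the theme role, whereas a protein part in the NP0 slot is rejected.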
Now consider the following example sentence:
(2) The c-Jun amino-terminal kinase phosphorylates NFAT4.
From (2), we will extract “the c-Jun amino-terminal kinase” as the agent and “NFAT4” as the theme of the phosphorylation relation. This extraction is the result of matching the text fragment with a pattern that is partly derived from the trigger specification. This pattern should not only capture the general syntactic form of a clause involving a transitive verb in the active voice, but also capture the selection restrictions imposed by the word “phosphorylates” and the arguments. Thus, this pattern contains information described in two places: (1) the lexical trigger that specifies the arguments, the selection restrictions on the arguments, and the roles they play, and (2) the syntactic constraints corresponding to different constructs (in this example, the active clause). We call the former “trigger specification”, and the latter “pattern templates”. Actual lexico-syntactic patterns are obtained by merging the trigger specifications and pattern templates. As we shall see later (section B), the combination of these two is mediated by the frame that is mentioned in the trigger specification.
We now briefly discuss four modules of the system framework: Pattern generation, Pattern matching, Sentence simplification, and Referential relation linking (Figure 1).
The Pattern generation (section C) module uses trigger specifications and predefined pattern templates to derive lexico-syntactic patterns for each trigger word. The Pattern matching (section D) module then matches fragments of text with lexico-syntactic patterns, and extracts the textual expressions in the trigger and argument positions. In order to match the patterns more effectively, the Sentence simplification (section E) module is used to preprocess the input text. It aims to ensure that the lexico-syntactic patterns generated in the previous step can be matched even in complex sentences. Finally, the Referential relation linking (section F) module links arguments identified by the pattern matching module with the target entities they refer to, where applicable. This step enables the system to find relations between more appropriate entities than the ones referred to by textual expressions in the argument position.
In addition to the above four system modules, there are two external modules. One is the Parsing module, which is used by the pattern matching step. The other is the Entity typing module, which assigns semantic types or categories to noun phrases. Both are found to be useful to enhance the precision of the relation extraction task [12, 18, 39].
B. Trigger specification
Trigger specifications are used to locate triggers and arguments in text for target relations. To make it easier to specify triggers, we ask users to provide the trigger’s root, which is the primary lexical unit of a word. From the root morpheme, we can derive other forms of triggers using our previous work. For example, from “phosphorylate” we derive “phosphorylates”, “phosphorylated”, “phosphorylation”, etc. In general, we generate different possible forms of triggers and confirm whether they are used in the literature. In a few cases, we ask the users for this confirmation. This generation is based on well-known English inflection rules, and the resulting forms can be used to match the appropriate trigger template.
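As a rough illustration of deriving trigger forms from a root and confirming them against observed usage, consider the sketch below. The suffix rules are deliberately naive and are our own assumption, not the rules of the previous work mentioned above; the function names `derive_forms` and `confirm_forms` and the vocabulary set are hypothetical.

```python
def derive_forms(root):
    """Generate candidate inflected and nominalized forms of a verb root with naive suffix rules."""
    # Drop a final silent "e" before vowel-initial suffixes ("phosphorylate" -> "phosphorylat").
    stem = root[:-1] if root.endswith("e") else root
    third_person = root + ("es" if root.endswith(("s", "x", "z", "ch", "sh")) else "s")
    return {
        root,
        third_person,
        stem + "ed",
        stem + "ing",
        stem + "ion",  # plain "-ion" guess; often wrong, hence the confirmation step
    }


def confirm_forms(candidates, observed_vocabulary):
    """Keep only the candidate forms that are actually attested in the literature."""
    return sorted(form for form in candidates if form in observed_vocabulary)
```

Overgenerated candidates such as “binded” are discarded by the confirmation step, which stands in for the check against the literature described above.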
Two trigger specifications may share the same trigger word and yet represent different types of relations, such as gene expression and transcription. The gene expression relation requires its theme (NP1) to be a gene, whereas the transcription relation requires its theme (NP0) to be an RNA. These examples show that argument types in the trigger specification are essential for the framework to achieve high precision, because they enforce the selection restrictions on arguments.
C. Pattern generation
Provided with a trigger specification, we use the “frame” to associate a trigger with a set of pattern templates to derive lexico-syntactic patterns. In the following subsections, we will define frames and pattern templates, and then discuss how they can be combined to generate lexico-syntactic patterns.
C.1 Frames
A frame is a set of pattern templates sharing the same syntactic nature of the constituents that are likely to be associated with the trigger. It specifies the arguments of the trigger. We found that the most frequent frame in biomedical documents is:
(5) Frame:NP0/NP1
We distinguish NP0 and NP1, because semantically they play different roles and have different types in the trigger specification, and syntactically they represent different grammatical constituents. The above frame may be realized by the standard active form “NP0 V NP1”, where V is a verb, and NP0 and NP1 appear at the left and right of the verb, respectively.
Relations can be semantically “directional” or “non-directional”. For example, phosphorylation is a directional relation, but binding is non-directional. This is because “A binds B” and “B binds A” may be used to specify the same relation between “A” and “B”, but “A phosphorylates B” and “B phosphorylates A” represent two different relations. If a relation is directional (or non-directional), we would expect all triggers for that relation to have the same property. In our framework, we use an additional binary constraint “〈direction〉 = directional/non-directional” in the trigger specification to distinguish non-directional relations from directional ones, because the trigger specification is currently the only place where users interact with the framework. To generate appropriate patterns, this directionality constraint causes an appropriately defined frame to be chosen: the non-directional frame differs from the directional one by allowing for the swapping of the agent and the theme.
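The semantic difference between the two kinds of relations can be illustrated with a small sketch. It is not how the framework selects frames; it only shows, under our own assumptions, that swapping the two arguments preserves a non-directional relation but changes a directional one. The function name `relation_key` is hypothetical.

```python
def relation_key(relation, first, second, directional):
    """Return a comparison key; argument order matters only for directional relations."""
    if directional:
        return (relation, first, second)
    # For non-directional relations, sort the arguments so that both orders coincide.
    return (relation, *sorted((first, second)))
```

Here `relation_key("binding", "A", "B", False)` and `relation_key("binding", "B", "A", False)` are equal, while the corresponding keys for phosphorylation with `directional=True` differ.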
C.2 Pattern templates
A pattern template is specified by a sequence of words or phrases β1,…,βn, followed by a set of constraints. Each constraint assigns a value for one of the βi features.
To reduce the number of pattern templates, we limit pattern templates to capture one argument at a time. So the pattern templates will capture pairs <trigger, NPi >. After templates are instantiated and arguments are extracted, we combine pairs if they have the same trigger. Thus we can extract relations with multiple arguments. We believe that considering one argument at a time is more flexible and manageable, because such pairs can be applied independently, while constraints on combinations can cover many different relations.
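The combination step can be sketched as follows, assuming that each extracted pair is represented as a triple of trigger identifier, role, and argument. The representation and the function name `combine_pairs` are our own illustrative choices; a trigger is identified here by its word and token position so that two occurrences of the same word are kept apart.

```python
from collections import defaultdict


def combine_pairs(pairs):
    """Merge (trigger_id, role, argument) triples that share a trigger into one relation."""
    relations = defaultdict(dict)
    for trigger_id, role, argument in pairs:
        # Collect every argument found for this role of this trigger occurrence.
        relations[trigger_id].setdefault(role, []).append(argument)
    return dict(relations)
```

For example, the pairs extracted separately for the agent and the theme of one occurrence of “phosphorylates” end up in a single relation with both roles filled.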
We further categorize pattern templates into two groups: one with explicit argument, and one with null argument. We will discuss pattern templates for argument realization in the next section, and then introduce methods to generate lexico-syntactic patterns. Lastly, we will discuss pattern templates with null argument.
C.3 Pattern templates for argument realization
Argument realization, a central topic in linguistics, is the study of the possible syntactic expressions of the arguments of a verb. In this study, we extend argument realization to nominal and adjectival triggers derived from verbs as well.
We use NP1 in pattern templates (7) and (8) in contrast to NP0 in template (6), because their roles are different. For example, in trigger specification of (1), NP0 is always the agent and NP1 is always the theme. Furthermore, in combination with the constraints expressed within the trigger specification, the use of template (6) will succeed only if NP0 is GGP, whereas the use of template (8) will succeed even if NP1 is a protein part.
C.4 Generation of patterns
The pattern generation module automatically creates lexico-syntactic patterns given a list of trigger specifications and frames.
To associate pattern templates with frames, verb type information is used. For example, one constraint in English is that only transitive verbs can be passivized. Therefore Frame:NP0/NP1 contains templates (6), (7), and (8), but Frame:NP0 contains template (6) only. Given the trigger specification of (1) for “phosphorylate” having Frame:NP0/NP1, we will automatically generate lexico-syntactic patterns like “NP0 phosphorylates”, “NP1 is phosphorylated”, etc.
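The generation step can be sketched as instantiating the templates of a frame with the forms of a trigger. The three surface templates below are hypothetical stand-ins for templates (6) to (8): the framework's actual templates are defined over parse trees, and the names, string format, and inflection rules used here are our own assumptions.

```python
# Hypothetical surface templates: name -> (argument slot captured, surface form).
TEMPLATES = {
    "active_subject": ("NP0", "{slot} {third_person}"),
    "active_object": ("NP1", "{third_person} {slot}"),
    "passive_subject": ("NP1", "{slot} is {participle}"),
}

# A frame lists the templates it licenses; only the transitive frame allows the passive.
FRAMES = {
    ("NP0", "NP1"): ["active_subject", "active_object", "passive_subject"],
    ("NP0",): ["active_subject"],
}


def generate_patterns(root, frame):
    """Instantiate every template of the frame with inflected forms of the trigger root."""
    stem = root[:-1] if root.endswith("e") else root
    forms = {"third_person": root + "s", "participle": stem + "ed"}
    patterns = []
    for name in FRAMES[frame]:
        slot, template = TEMPLATES[name]
        patterns.append((name, slot, template.format(slot=slot, **forms)))
    return patterns
```

Calling `generate_patterns("phosphorylate", ("NP0", "NP1"))` produces, among others, the strings “NP0 phosphorylates” and “NP1 is phosphorylated”, each of which captures a single argument slot.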
The automatic generation procedure is similar to the concept of LTAG. In LTAG, a tree family associates a verb lexeme with a given subcategorization. The subcategorization includes a set of grammatical structures that represent all the possible lexico-syntactic variations for that verb lexeme. So grammatical structures can be obtained by combining lexical rules and syntactic transformations. Compared with LTAG, the “frame” in our approach is essentially, but not exactly, a subcategorization in LTAG. The trigger specifications are similar to tree families in LTAG, which associate a trigger lexeme with a given frame. In addition, we also consider verb nominalization and adjectivization.
C.5 Pattern templates with null argument
Both (a) and (b) are grammatically correct and express the same underlying idea in (15) and (16), but we tend to write (a) rather than (b) as a shorthand. Such null argument structure is similar to the null complement anaphora (deep anaphora) in modern syntactic theory and the implicit argument in semantic theory. For the relation extraction task, we observe that the elided argument may be found as its antecedent and determined by another trigger that selects it. Our framework recovers such arguments as part of the relation extraction process, by applying the null argument pattern templates. It should be noted that such elliptical constructions can appear in various positions of a sentence, e.g., at the beginning (15a) or at the end (16a). These templates always rely on the whole sentence construct and are therefore cumbersome to express. We designed some pattern templates to match sentences like (15a) and (16a). Whether there exists a more general and clearer way to express these types of pattern templates needs to be further explored.
D. Pattern matching
Consider the text fragment where “JNK” and “NFAT4” have already been tagged as gene or gene product.
(17) JNK phosphorylates NFAT4
This fragment is captured by the generated lexico-syntactic patterns derived from Frame:NP0/NP1 and the trigger specification “phosphorylate” in (1). The next step is to extract the actual agent and theme. Specifically, pattern template (7) matches the word “phosphorylates” as a trigger, and “NFAT4” as NP1. The trigger specification (1) checks NP1’s type, which is GGP, and assigns it the role of theme. Therefore, we get <trigger: phosphorylates, theme: NFAT4>. Similarly, by using pattern template (6) we can extract <trigger: phosphorylates, agent: JNK>.
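The matching of fragment (17) can be illustrated with a simplified sketch. It replaces the tree-based patterns of the framework with plain regular expressions over the surface string, assumes single-token noun phrases, and uses a lookup table as a stand-in for an entity typing module; all names are hypothetical.

```python
import re

# Stand-in for an entity typing module (assumed output for this fragment).
ENTITY_TYPES = {"JNK": "GGP", "NFAT4": "GGP"}

# Allowed types and role per argument slot, mirroring the "phosphorylate" specification.
CONSTRAINTS = {
    "NP0": ({"GGP"}, "agent"),
    "NP1": ({"GGP", "protein_part"}, "theme"),
}

# One argument per pattern: each pattern captures the trigger and a single slot.
PATTERNS = [
    ("NP0", re.compile(r"(?P<arg>\S+) (?P<trigger>phosphorylates)\b")),
    ("NP1", re.compile(r"\b(?P<trigger>phosphorylates) (?P<arg>\S+)")),
]


def match_fragment(text):
    """Return (trigger, argument, role) triples whose argument passes the type check."""
    pairs = []
    for slot, pattern in PATTERNS:
        for match in pattern.finditer(text):
            allowed_types, role = CONSTRAINTS[slot]
            if ENTITY_TYPES.get(match.group("arg")) in allowed_types:
                pairs.append((match.group("trigger"), match.group("arg"), role))
    return pairs
```

On “JNK phosphorylates NFAT4” this yields one pair for the agent and one for the theme; an untyped word in the subject position produces no agent pair because the type check fails.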
For pattern matching, we would like to mention two issues. First, as illustrated above, the pattern matching engine must be able to check that the types of NPs are consistent with those mentioned in the trigger specifications. For this purpose, any method that assigns types to noun phrases or named entities, such as BioNex or Genia tagger, can be employed. In our evaluation, we have used the BioNex tool.
The above examples belong to pattern templates (7) and (8), respectively. However, none of them can be directly matched because of the complex way in which the predicates are expressed. This construction of consecutive verb groups makes basic pattern matching extremely laborious, because of the many variations they can introduce. In this framework, we would like to avoid constructing complex pattern templates, thereby reducing the burden of system development. We notice that (1) syntactically, such consecutive verb groups form a dependent-auxiliary construction: dependent-auxiliary + main-verb, and (2) semantically, the “agent” and “theme” are always related to the last main-verb, rather than the auxiliary. Thus, we match the consecutive verb group as a whole, then choose the last verb as the head of the whole sequence.
The most frequent adjuncts that are likely to be skipped are the adverbial adjuncts, e.g., “abundantly” and “in the granulocyte fraction of human peripheral blood cells” in (19a). In addition, adjective-nominal adjuncts are also skipped, e.g., “Abundant” and “in human peripheral granurocytes” in (19b).
E. Sentence simplification
So far, we have discussed how arguments can be extracted by matching patterns. But even with a large number of patterns automatically generated in the proposed manner, the recall of the resulting system is still low because sentence constructions and writing styles vary considerably in actual text, and the number of variations to be considered is overwhelmingly high. For example, consider sentence (20):
(20) Active Raf-1 phosphorylates and activates the mitogen-activated protein (MAP) kinase/extracellular signal-regulated kinase kinase 1 (MEK1), which in turn phosphorylates and activates the MAP kinases/ extracellular signal regulated kinases, ERK1 and ERK2. (PMID 8557975)
This and many other instances that we observed in biomedical research articles motivated us to separate the various structures of a sentence first, and then match patterns to the simplified sentences.
Complex constructs, e.g., coordinations and relative clauses, pose a challenge for state-of-the-art full parsers. However, even if these constructs can be detected correctly using full parsers, new patterns are still needed to skip parts of a construct (e.g., skipping conjuncts in a coordination or skipping relative clauses). When using a dependency parser, more collapsed rules involving prepositions, conjuncts, as well as information about the referent of relative clauses are used to get direct dependencies between content words. Both cases will increase the complexity of patterns and, thus, increase the pattern encoding effort.
Alternatively, in this framework we introduce sentence simplification as a preprocessing module. Given an input sentence, this module outputs a set of simplified sentences, thereby concealing the syntactic complexity from the pattern matching step.
E.1 Complex constructs for simplification
In this section, we will describe syntactic constructs that the preprocessing module simplifies. For further details of our sentence simplifier, iSimp, we refer the reader to the iSimp publication.
For a coordination, the original sentence can be split into multiple ones, each containing one conjunct.
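Splitting a coordination can be sketched as follows, assuming that the boundaries of the coordination (the shared text before it, the conjuncts, and the shared text after it) have already been detected by a simplifier. The function name `split_coordination` and its argument layout are our own illustrative choices.

```python
def split_coordination(prefix, conjuncts, suffix):
    """Produce one simplified sentence per conjunct, keeping the shared context."""
    return [
        " ".join(part for part in (prefix, conjunct, suffix) if part)
        for conjunct in conjuncts
    ]
```

Applied to the verb coordination at the start of sentence (20), with “Active Raf-1” as the shared subject and “MEK1” as a shortened shared object, this gives “Active Raf-1 phosphorylates MEK1” and “Active Raf-1 activates MEK1”.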
There are two types of relative clauses that frequently appear in biomedical text: full relative clauses and reduced relative clauses. Full relative clauses (23a) are introduced by relative pronouns, such as “which”, “who”, and “that”. Reduced relative clauses (23b) and (23c) start with a gerund or past participle and have no overt subject. A sentence containing a relative clause can be simplified into two sentences: the original sentence without the relative clause, and another that consists of the referred noun phrase as the subject followed by the relative clause.
Appositions can be detected by searching for two noun phrases separated by a comma, when they are not part of a coordination. In addition, because one noun phrase (appositive) normally renames or describes the other, it usually begins with a determiner or a number (as shown above). Appositions can be simplified into two sentences: one with the referred noun phrase and the other with the appositive.
When simplifying parenthesized elements, an additional sentence is created that contains only the parenthesized elements, without the preceding noun phrase.
E.2 Dealing with attachment ambiguities
Examples in (26) refer to relative clause attachment ambiguities, where there is a complex NP of the type “NP1 prep NP2” followed by a relative clause. In such cases, it is unclear whether to attach the relative clause to the first noun phrase (NP1) or the second one (NP2). Other kinds of attachment ambiguity include PP-attachment, e.g., “NP1 and NP2 PP” (27), and the attachment involving coordination, e.g., “Adj NP1 and NP2” (28). Solving the attachment problem is important in sentence simplification, but we believe it is not a purely syntactic problem. Semantic information is also necessary to make a decision. Therefore, in this study, we produce alternative attachments as candidates while simplifying sentences, and leave the decision to the pattern matching module where type information is available.
F. Referential relation linking
By using patterns and sentence simplification, the system can detect textual expressions in the argument position. Sometimes, the referred entity is mentioned somewhere else in the text. Consider example (29). The system can extract the binding relation <trigger: dimerized, theme: the protein> from (29), but the actual target entity is “c-Fox”. To link these phrases, we developed patterns to extract referential relations.
(29) The stability of c-Fox was decreased when the protein was dimerized with phosphorylated c-Jun.
F.1 Referential relations
Referential relation patterns are designed to extract the relationship of one nominal phrase to another, when one provides the necessary information to interpret the other. By utilizing referential relations, an extraction system is able to identify an actual target entity beyond the initially extracted arguments.
Co-referential relations (or co-references) occur when multiple expressions refer to the same referent. For instance, in the previous example, “the protein” and “c-Fox” both refer to the same object. In a co-referential relation, the anaphoric reference can be a pronoun or definite noun phrase, and its antecedent can be the actual name of protein or gene. In this study, co-referential relations are not extracted, except for the case of a relative pronoun, because we consider their detection as a separate and independent task from pattern-based extraction.
Part-whole relations are useful when an argument extracted for a trigger comprises a part of the target entity. For example:
(30) Both Eomes and Runx3 bind at the Prf1 locus.
For biomedical information extraction, this framework focuses on relations between protein parts and a protein, e.g., a residue in a protein. Such part-whole relations in example (30) can be captured by patterns like “NP_whole contains NP_part” or by the existence of keywords like “locus”, “promoter”, and “domain”.
Member-collection relations are useful in linking a generic reference to a group of entities that are specified in other places in text. For example:
(31) expression of adhesion molecules including integrin alpha, L-selectin, ICAM-3, and H-CAM
The above example illustrates that the generic reference “adhesion molecules” can be extracted as an argument of the trigger “expression”. Meanwhile, specific referred entities include “integrin alpha”, “L-selectin”, “ICAM-3” and “H-CAM”. We consider patterns like “NP, such as NP (, NP)*” to identify this type of relation.
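A surface-level approximation of such a pattern is sketched below. It assumes that the input fragment begins with the collection noun phrase (that is, noun phrase chunking has already been done) and that “such as” and “including” are the only cue phrases; the regular expression and the function name `member_collection` are our own assumptions rather than the framework's patterns.

```python
import re

# Collection NP, an optional comma, a cue phrase, then the list of members.
MEMBER_PATTERN = re.compile(
    r"(?P<collection>[\w\- ]+?),? (?:such as|including) (?P<members>.+)"
)


def member_collection(fragment):
    """Return (member, collection) pairs found in an NP-level fragment."""
    match = MEMBER_PATTERN.search(fragment)
    if match is None:
        return []
    # Split the member list on commas (optionally followed by "and") or on a bare "and".
    members = re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group("members"))
    collection = match.group("collection")
    return [(member.strip(), collection) for member in members if member.strip()]
```

For the noun phrase in (31), starting at “adhesion molecules”, the sketch links each of the four listed proteins to the collection “adhesion molecules”.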
To identify hyponymy relations, we look for fragments containing keywords such as “acts as” or “is identified as”, which are similar to those used in earlier work. Moreover, the apposition construct can also hold a hyponymy relation between the appositive and the referred noun phrase.
F.2 Linking entities through referential relations
This example contains one transcription relation. Our goal is to extract its trigger and argument, namely <trigger: transcribed, theme: tumor necrosis factor alpha>, which are highlighted in the sentence. We assume “tumor necrosis factor alpha” is typed as a gene.
Given the trigger “transcribe” and using pattern template (8) as discussed earlier, we can extract <trigger: transcribed, theme: the earliest genes>. But “the earliest genes” is not a named entity (this can be discovered by using a named entity recognition tool). In addition, we extract one member-collection relation <member: one, collection: the earliest genes> and one hyponymy relation <hyponym: tumor necrosis factor alpha, hypernym: one>. The first relation enables us to infer <trigger: transcribed, theme: one>, since the collection of genes (“the earliest genes”) is “transcribed” and, hence, one of its members can be “transcribed” as well. Then, the latter relation allows us to state that “tumor necrosis factor alpha” is the “one” in this context and hence to conclude <trigger: transcribed, theme: tumor necrosis factor alpha>.
The algorithm for the linking is as follows. First we collect all referential relations in the document. Then we use the patterns to get instances for a trigger. If the instance’s argument is not an informative reference, we recursively search for all of its references in the detected referential relations. If an appropriate reference of an entity is found, we link it to the trigger, by creating a new pair <trigger, referred entity >. This search procedure ends when we exhaust all possibilities. As a result, more than one pair may be created and all pairs are proposed.
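The search described above can be sketched as a recursive traversal of the collected referential relations. The dictionary layout (each phrase mapped to the phrases it can be resolved to), the `is_informative` predicate, and the function name `link_targets` are assumptions made for illustration only.

```python
def link_targets(argument, references, is_informative, seen=None):
    """Return every informative entity reachable from an argument via referential relations."""
    seen = set() if seen is None else seen
    if argument in seen:  # guard against cycles in the referential relations
        return []
    seen.add(argument)
    if is_informative(argument):
        return [argument]
    targets = []
    for referent in references.get(argument, []):
        targets.extend(link_targets(referent, references, is_informative, seen))
    return targets


# Referential relations from the transcription example: collection -> member, hypernym -> hyponym.
REFERENCES = {
    "the earliest genes": ["one"],
    "one": ["tumor necrosis factor alpha"],
}
NAMED_ENTITIES = {"tumor necrosis factor alpha"}

linked = [
    ("transcribed", target)
    for target in link_targets("the earliest genes", REFERENCES, NAMED_ENTITIES.__contains__)
]
```

In this example `linked` contains the single pair of “transcribed” and “tumor necrosis factor alpha”. If several informative entities are reachable, all resulting pairs are proposed, in line with the description above.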
G. Evaluation design
Our framework is designed to extract a variety of relations. For the evaluation of our framework, we need test sets containing different types of relations. Furthermore, the data set should include trigger annotations needed to automatically generate patterns. We chose to use the corpora of BioNLP-ST (Shared Tasks) 2011 and 2013 GE tasks, which included several event extraction subtasks [51, 52].
G.1 BioNLP-ST GE task
The BioNLP-ST GE task series aims to extract various events from biomedical text. The first shared task workshop was held in 2009, and the most recent one in 2013. In this study, both the 2011 and 2013 corpora are used for the evaluation. We will refer to them as “GE 2011” and “GE 2013” hereafter.
In the GE 2011 task, evaluation results were reported on the (W)hole, (A)bstract, and (F)ull paper collections, respectively. The abstract collection contains paper abstracts, the full text collection contains full papers, and the whole collection contains both abstracts and full text. Following the same setting, we also report our results on W, A, and F. The GE 2011 corpus covers nine types of events: Gene_expression, Transcription, Localization, Protein_catabolism, Phosphorylation, Binding, Regulation, Positive_regulation, and Negative_regulation. Among these, we focused on events with simple entities as themes. Thus, Regulation and its subtypes were removed because their themes could be other events with other triggers. As a result, only the first six types of events were evaluated. The first five were collectively called “Simple events”. In the GE 2013 corpus, we consider the same events as well.
G.2 Trigger selection
Trigger prefixes and suffixes (one table row per line):
-ion, over-, co-, non-, re-
-ion, non-, co-
-tion, -tional, -tionally
-sis, -tic, -tically
-ion, under-, hyper-
-ation, co-, re-
-ion, re-, trans-
G.3 Evaluation measurement
An extracted event is considered to match a gold event if:
(1) the event types are the same;
(2) the triggers are the same; and
(3) the arguments are the same.
“Same” triggers and arguments means that “the given text span is equivalent to a gold span if it is entirely contained within an extension of the gold span by one word both to the left and to the right.” For example, if (a1,b1) is the given span and (a2,b2) is the gold span extended by one word on each side, they are the same iff a1≥a2 and b1≤b2.
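The span comparison can be written as a short check. The sketch below assumes that spans are given as inclusive (start, end) word indices, which simplifies the extension by one word; the official evaluation may operate on character offsets, and the function name `spans_match` is hypothetical.

```python
def spans_match(given, gold, extension=1):
    """Check whether the given span lies within the gold span extended by `extension` words."""
    a1, b1 = given
    a2, b2 = gold[0] - extension, gold[1] + extension  # extended gold span
    return a1 >= a2 and b1 <= b2
```

For a gold span covering word 4 only, the given span (3, 4) is accepted because it stays within one word of the gold span, whereas (2, 4) is rejected.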
G.4 System implementation
This section describes one implementation of the framework.
The raw text was parsed by the Charniak-Johnson parser using David McClosky’s biomedical model. We chose the Charniak-Johnson parser because it made comparison with existing systems convenient [54, 55]. Other constituent parsers would also work with little integration effort.
We consider typing a critical component of the framework. For example, (1) for relations like phosphorylation, the theme needs to be a noun phrase of type protein or protein part; (2) for triggers like “associate”, the binding relation should not be extracted if its themes are not proteins or protein parts; and (3) for triggers like “express” and “detect”, the theme’s type must be gene or mRNA, and the relation is then gene expression or transcription, respectively. This implementation of the framework uses a modified version of BioNex, which was developed from ideas in earlier work and is used in RLIMS-P. BioNex can detect the semantic types of entities referred to by nouns or noun phrases, such as protein, gene, chemical, or their parts. Type detection considers the head nouns and their suffixes and compares them with a predefined list for each type.
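To make the idea of head-noun and suffix based typing concrete, the following Python sketch shows one way such a lookup could work. It is not BioNex: the lexicons, the suffix lists, the head-finding heuristic, and the names `detect_type` and `theme_is_valid` are all hypothetical and chosen only for illustration.

```python
from typing import Optional

# Hypothetical, tiny lexicons for illustration only; these are NOT the BioNex lists.
HEAD_NOUNS = {
    "protein": {"protein", "kinase", "receptor", "factor"},
    "gene": {"gene", "promoter", "allele"},
    "mRNA": {"mrna", "transcript"},
    "protein part": {"domain", "residue", "motif"},
}
SUFFIXES = {
    "protein": ("ase", "in"),
    "chemical": ("ol", "ate"),
}

# Hypothetical constraint table: which theme types a relation accepts.
ALLOWED_THEME_TYPES = {
    "Phosphorylation": {"protein", "protein part"},
    "Binding": {"protein", "protein part"},
}


def head_noun(noun_phrase: str) -> str:
    """Approximate the head as the last token before any 'of' phrase (assumption)."""
    tokens = noun_phrase.lower().split()
    if "of" in tokens:
        tokens = tokens[: tokens.index("of")]
    return tokens[-1] if tokens else ""


def detect_type(noun_phrase: str) -> Optional[str]:
    """Look up the head noun first, then fall back to its suffix."""
    head = head_noun(noun_phrase)
    for sem_type, nouns in HEAD_NOUNS.items():
        if head in nouns:
            return sem_type
    for sem_type, suffixes in SUFFIXES.items():
        if head.endswith(suffixes):
            return sem_type
    return None


def theme_is_valid(relation: str, noun_phrase: str) -> bool:
    """Accept a theme only if its detected type is allowed for the relation."""
    return detect_type(noun_phrase) in ALLOWED_THEME_TYPES.get(relation, set())


print(detect_type("the SH2 domain of Stat3"))
print(theme_is_valid("Phosphorylation", "the nucleus"))
```

Checking the head-noun list before the suffix list lets an explicit entry such as “domain” override a coincidental suffix match; a real system would need much larger lists and a proper head finder.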
Patterns were generated and matched against the parse tree using tree regular expressions. The pattern templates were therefore also written as tree regular expressions; 26 pattern templates were created. To extract the predicate in a consecutive verb group (e.g., “bind” in “is known to bind”), we started from the verb phrase subtree and repeatedly followed its rightmost child. When the last verb phrase in the group was found, we picked its head.
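The search for the predicate of a consecutive verb group can be illustrated with a small Python sketch. It does not use tree regular expressions; instead it walks a parse tree encoded as nested lists of the form `[label, child, ...]`, which is an assumed representation. Stepping through an `S` node that wraps a single infinitival `VP`, and taking the first verb-tagged child as the head, are likewise simplifying assumptions.

```python
from typing import List, Optional

# Assumed tree encoding: [label, child, child, ...]; a preterminal is [POS, word].
Tree = List


def label(node: Tree) -> str:
    return node[0]


def children(node: Tree) -> list:
    return node[1:]


def last_verb_phrase(vp: Tree) -> Tree:
    """Follow rightmost VP children down to the last VP of the verb group."""
    current = vp
    while True:
        rightmost = children(current)[-1]
        if isinstance(rightmost, str):
            return current
        # Step through an S that only wraps an infinitival VP ("known [S [VP to bind]]").
        if label(rightmost) == "S" and len(children(rightmost)) == 1:
            rightmost = children(rightmost)[0]
        if isinstance(rightmost, str) or label(rightmost) != "VP":
            return current
        current = rightmost


def head_verb(vp: Tree) -> Optional[str]:
    """Return the word of the first verb-tagged child of the last VP in the group."""
    for child in children(last_verb_phrase(vp)):
        if not isinstance(child, str) and label(child).startswith("VB"):
            return child[1]
    return None


# "is known to bind to DNA"
tree = ["VP", ["VBZ", "is"],
        ["VP", ["VBN", "known"],
         ["S", ["VP", ["TO", "to"],
                ["VP", ["VB", "bind"],
                 ["PP", ["TO", "to"], ["NP", ["NN", "DNA"]]]]]]]]
print(head_verb(tree))
```

The loop stops as soon as the rightmost child is no longer a verb phrase, so the prepositional phrase “to DNA” ends the descent and the verb of the enclosing phrase is returned.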
For the simplification task, we used iSimp, a sentence simplifier specifically created for biomedical text. Currently, iSimp can detect six types of simplification constructs: coordination, relative clause, apposition, introductory phrase, subordinate clause, and parenthetical element. It uses shallow parsing and state transition networks to detect these constructs. The detection is based on chunks (noun phrases, verb groups, and prepositional phrases), and from these, iSimp generates simplified sentences. iSimp also handles nested constructs. For an in-depth description of this process, we refer the reader to the iSimp publication.
The discussion above describes an implementation of the framework. To evaluate the framework using the BioNLP-ST GE data, we implemented a relation extraction system for the six events in these data sets. The system is obtained from this implementation by specifying the triggers, which were chosen from the trigger words marked for the six events in the GE 2011 training set. In particular, we chose only frequently occurring verbal trigger words. Note that the trigger specifications require only the base form of these verbal triggers (e.g., “phosphorylate” and “interact”). Because this set of triggers is limited in subcategorization variety, the triggers fall into a handful of predefined trigger specifications. As a result, we were able to complete the trigger specification for these words quickly.
This relation extraction system is available as a web service at http://research.bioinformatics.udel.edu/ixtractr. Unlike the evaluations conducted in this paper, the web service does not take text with gene mentions already marked as input. Instead, we integrated an in-house module to detect gene mentions. Because this module accepts only PMIDs as input rather than full text, the current web service supports only PMID input as well.
Results and discussion
A. Results on GE 2011 corpus
[Table: Statistics of the data sets after modification]
[Table: Evaluation results on the whole, abstract, and full paper collections from the training set of BioNLP-ST 2011 GE task (subset with selected triggers)]
[Table: Evaluation results on the whole, abstract, and full paper collections from the development set of BioNLP-ST 2011 GE task (subset with selected triggers)]
[Table: Comparative results of subset events with selected triggers on the whole, abstract, and full paper collections from the training set of BioNLP-ST 2011 GE task (using simplification and referential relations)]
[Table: Comparative results of subset events with selected triggers on the whole, abstract, and full paper collections from the development set of BioNLP-ST 2011 GE task (using sentence simplification; using sentence simplification and referential relations)]
[Table: Results on the (W)hole, (A)bstract, and (F)ull paper collections from the testing set of BioNLP-ST 2011 GE task 1]
We would like to note that although Tables 3, 4, 5, 6, and 7 show the results on different partitions of the 2011 data sets, the system remains unchanged because the trigger word list (extracted from the training set) remains the same.
B. Analysis of false positives and negatives on GE 2011 corpus
We randomly chose 50 false positive (FP) cases and 180 false negative (FN) cases (30 for each event type) from the training set of the GE 2011 corpus to analyze the reasons for failure. We identified two major types of errors.
B.1 Parsing errors
B.2 Missing pattern templates
Another case of false negatives is due to the trigger word being a noun but not the head of the noun phrase. For example, our pattern templates could be applied for fragments “transcription of NP” and “expression of NP” but could not be applied for fragments “ of NP2” or “ of NP2”. We impose such a constraint in order to maintain a high precision. The analysis showed, however, that we could generalize the constraints in the future with some effort, especially in deciding on the words that can head the NPs.
Similarly, we need to generalize null argument structures further. For example, consider the fragment
(33) targets c-Fos for degradation
We have a pattern template using “via” but not “for”. There are a few other cases, where null argument pattern templates could have been applied, but these new templates need to be further checked.
C. Results and analysis on GE 2013 corpus
[Table: Evaluation results from the training, development, and testing sets of BioNLP-ST 2013 GE task 1]
For the trigger “phosphorylation” in the third sentence, the author neglected to mention the theme because it can be inferred from the context: (1) the previous sentence also mentions this “BMP-6 induced phosphorylation”, but its theme is the general term “Smad”, and (2) the actual proteins “Smad1/5/8” are clearly specified in the first sentence. As a result, to infer the theme of the trigger “phosphorylation” in the third sentence, we need not only syntactic information but also discourse-level processing.
Note that the system used in this evaluation is the same as the one used on the GE 2011 task. No changes were made to accommodate any differences between the GE 2011 and 2013 corpora. The focus of this framework is on the patterns, and hence almost all processing is syntax-based. While some of our earlier work on relation extraction has integrated discourse-level processing with syntax-based patterns, the integration of such discourse-level processing is beyond the scope of this work. However, examples such as the one above suggest that discourse-level processing may be important for extraction from full-length articles. We intend to investigate incorporating generalized discourse-level processing into our framework in the future, so that it can be useful for full-text based extraction.
In this work, we have designed a framework for the development of biomedical relation extraction systems. The framework requires as input only a list of triggers and their specifications to retrieve relations of interest. It utilizes linguistic generalizations that speed up the development process by proposing various lexico-syntactic patterns, and that improve performance, particularly recall, by making use of sentence simplification and referential relations.
To evaluate the framework, we developed a relation extraction system using general resources; the only aspect specific to the evaluation was the selection of trigger words that appear in the corpus. Apart from the specification of triggers, all other components (parser, typing system, simplification, and pattern matching system) are general-purpose systems that already existed. The fact that only the specification of the triggers is required from domain experts, together with the fact that no training set is required, meets our goal for the framework: the ability to create effective relation extraction systems for new relations for which resources (e.g., an annotated corpus or database) are not publicly available.
We evaluated the framework by producing a relation extraction system and testing it on the BioNLP-ST 2011 and 2013 GE tasks. The system achieved F-scores of 68.48% on the GE 2011 test set and 74.44% on the GE 2013 test set. Our analysis shows that we can achieve high precision and good recall with the range of patterns automatically generated from triggers, and that simplification and referential relation linking increase the recall while maintaining the precision.
In the future, we would like to extend the framework in two ways. So far, we only considered the triggers that are verbs and their derived forms. Next, we would like to account for triggers that are primarily nouns or adjectives. Also, we would like to extend the framework to take complex entities (e.g., relations themselves) as arguments rather than just simple entities (e.g., genes or proteins).
We are developing systems for additional relations. In general, it is a challenging task to identify all the triggers for a relation and to complete their specifications. This study demonstrates a generalizable relation extraction framework that can be quickly implemented for new relations, initially focusing on a few triggers that appear frequently. While this does not account for the long tail of less frequent triggers, our framework allows additional trigger specifications to be added with little impact on the existing trigger list. Thus, as new triggers are found, they can be integrated into the system. Using the framework and this approach, we have developed a system for miRNA-target extraction. Preliminary evaluation on an in-house corpus of 200 abstracts shows an F-score of over 90% for the system (manuscript in preparation). We would like to use the experience gained in developing this and other relation extraction systems to design a process involving user interaction in generating trigger specifications for new relations. In general, the specification of a trigger needs both domain knowledge and linguistic knowledge. The domain expert will be able to suggest the trigger words for a relation, whereas linguistic knowledge will be more useful in preparing the trigger specifications of subcategorization, thematic roles, etc.
In our framework, we already have a predefined set of subcategorization frames and thematic roles that can be utilized in the specifications. This can be used to engage the user in an interactive process. At the beginning, the users, who are domain experts, will provide a list of trigger words. The process will then derive various forms of the triggers using linguistic knowledge and ask the users to choose among them. If necessary, the process will use these triggers to generate simple examples so that the users can confirm which predefined specification should be associated with each trigger. The whole process will communicate with users interactively, which we expect will further speed up the development of new relation extraction systems.
a Simple Event includes Phosphorylation as well, same as in the BioNLP-ST 2011 GE Task 1.
Research reported in this manuscript was supported by the National Library of Medicine of the National Institutes of Health under award number G08LM010720. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This material is also based upon work supported by the National Science Foundation under Grant No. DBI-1062520. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We thank the BioNLP Shared Task organizers for making the annotated corpora publicly available. We thank Dr. Catalina O. Tudor for useful discussion of and comments on this manuscript.
- Vlachos A, Craven M: Biomedical event extraction from abstracts and full papers using search-based structured prediction. BMC Bioinformatics. 2012, 13 (Suppl 11): S5. 10.1186/1471-2105-13-S11-S5.
- Riedel S, McClosky D, Surdeanu M, McCallum A, Manning CD: Model combination for event extraction in BioNLP 2011. Proceedings of the BioNLP Shared Task 2011 Workshop. 2011, Portland, Oregon: Association for Computational Linguistics, 51-55.
- Björne J, Salakoski T: Generalizing biomedical event extraction. Proceedings of the BioNLP Shared Task 2011 Workshop. 2011, Portland, Oregon: Association for Computational Linguistics, 183-191.
- Bui QC, Katrenko S, Sloot PM: A hybrid approach to extract protein–protein interactions. Bioinformatics. 2011, 27 (2): 259-265. 10.1093/bioinformatics/btq620.
- Kim S, Yoon J, Yang J, Park S: Walk-weighted subsequence kernels for protein–protein interaction extraction. BMC Bioinformatics. 2010, 11: 107. 10.1186/1471-2105-11-107.
- Miwa M, Sætre R, Miyao Y, Tsujii J: Protein–protein interaction extraction by leveraging multiple kernels and parsers. Int J Med Inform. 2009, 78 (12): e39. 10.1016/j.ijmedinf.2009.04.010.
- Airola A, Pyysalo S, Björne J, Pahikkala T, Ginter F, Salakoski T: All-paths graph kernel for protein–protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics. 2008, 9 (Suppl 11): S2. 10.1186/1471-2105-9-S11-S2.
- Pyysalo S, Airola A, Heimonen J, Björne J, Ginter F, Salakoski T: Comparative analysis of five protein–protein interaction corpora. BMC Bioinformatics. 2008, 9 (Suppl 3): S6. 10.1186/1471-2105-9-S3-S6.
- Tudor CO, Vijay-Shanker K: RankPref: Ranking sentences describing relations between biomedical entities with an application. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. 2012, Montreal, Canada: Association for Computational Linguistics, 163-171.
- Cohen KB, Verspoor K, Johnson HL, Roeder C, Ogren PV, Baumgartner WA, White E, Tipney H, Hunter L: High-precision biological event extraction: effects of system and of data. Comput Intell. 2011, 27 (4): 681-701. 10.1111/j.1467-8640.2011.00405.x.
- Hakenberg J, Leaman R, Ha Vo N, Jonnalagadda S, Sullivan R, Miller C, Tari L, Baral C, Gonzalez G: Efficient extraction of protein–protein interactions from full-text articles. IEEE/ACM Trans Comput Biol Bioinform (TCBB). 2010, 7 (3): 481-494.
- Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, Wu CH: Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics. 2005, 21 (11): 2759-2765. 10.1093/bioinformatics/bti390.
- Kilicoglu H, Bergler S: Adapting a general semantic interpretation approach to biological event extraction. Proceedings of the BioNLP Shared Task 2011 Workshop. 2011, Portland, Oregon: Association for Computational Linguistics, 173-182.
- Quirk C, Choudhury P, Gamon M, Vanderwende L: MSR-NLP entry in BioNLP shared task 2011. Proceedings of the BioNLP Shared Task 2011 Workshop. 2011, Portland, Oregon: Association for Computational Linguistics, 155-163.
- Kim J, Rebholz-Schuhmann D: Improving the extraction of complex regulatory events from scientific text by using ontology-based inference. J Biomed Semantics. 2011, 2 (Suppl 5): S3. 10.1186/2041-1480-2-S5-S3.
- Fundel K, Küffner R, Zimmer R: RelEx – relation extraction using dependency parse trees. Bioinformatics. 2007, 23 (3): 365-371. 10.1093/bioinformatics/btl616.
- Rinaldi F, Schneider G, Kaljurand K, Hess M, Andronis C, Konstandi O, Persidis A: Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach. Artif Intell Med. 2007, 39 (2): 127-136. 10.1016/j.artmed.2006.08.005.
- Hunter L, Lu Z, Firby J, Baumgartner WA, Johnson HL, Ogren PV, Cohen KB: OpenDMAP: an open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics. 2008, 9: 78. 10.1186/1471-2105-9-78.
- Schabes Y: Stochastic lexicalized tree-adjoining grammars. Proceedings of the 14th conference on Computational linguistics-Volume 2. 1992, Nantes, France: Association for Computational Linguistics, 425-432.
- Pollard C, Sag IA: Head-driven phrase structure grammar. 1994, Chicago: University of Chicago Press.
- Bresnan J: Lexical-functional syntax. 2001, Hoboken: Wiley-Blackwell.
- Kipper K, Korhonen A, Ryant N, Palmer M: Extending VerbNet with novel verb classes. Proceedings of LREC; Genova, Italy, Volume 2006. 2006, 1-1.
- Chen J, Vijay-Shanker K: Automated extraction of TAGs from the Penn Treebank. New Developments in Parsing Technology, Volume 23. 2005, New York: Springer, 73-89.
- The XTAG Research Group: A lexicalized tree adjoining grammar for English. Tech. rep., Technical Report IRCS-01-03, IRCS, University of Pennsylvania 2001.
- Levin B: English verb classes and alternations: a preliminary investigation. 1993, Chicago: University of Chicago Press.
- Dolbey AE: BioFrameNet: a FrameNet extension to the domain of molecular biology. PhD thesis. University of California: Berkeley; 2009.
- Rebholz-Schuhmann D, Jimeno-Yepes A, Arregui M, Kirsch H: Measuring prediction capacity of individual verbs for the identification of protein interactions. J Biomed Inform. 2010, 43 (2): 200-207. 10.1016/j.jbi.2009.09.007.
- Lippincott T, Rimell L, Verspoor K, Korhonen A: Approaches to verb subcategorization for biomedicine. J Biomed Inform. 2013, 46 (2): 212-227. 10.1016/j.jbi.2012.12.001.
- Rimell L, Lippincott T, Verspoor K, Johnson HL, Korhonen A: Acquisition and evaluation of verb subcategorization resources for biomedicine. J Biomed Inform. 2013, 46 (2): 228-237. 10.1016/j.jbi.2013.01.001.
- EvidenceFinder. http://labs.europepmc.org/evf
- Jonnalagadda S, Gonzalez G: BioSimplify: an open source sentence simplification engine to improve recall in automatic biomedical information extraction. AMIA Annual Symposium Proceedings; Washington, DC, Volume 2010. 2010, American Medical Informatics Association, 351-351.
- Miwa M, Saetre R, Miyao Y, Tsujii J: Entity-focused sentence simplification for relation extraction. Proceedings of the 23rd International Conference on Computational Linguistics. 2010, Beijing, China, 788-796.
- Peng Y, Tudor CO, Torii M, Wu CH, Vijay-Shanker K: iSimp: A sentence simplification system for biomedical text. IEEE International Conference on Bioinformatics and Biomedicine (BIBM2012). 2012, Philadelphia, PA, 211-216.
- Ogren PV: Coordination resolution in biomedical texts. PhD thesis. University of Colorado at Boulder; 2011.
- Jimeno-Yepes A, Jiménez-Ruiz E, Berlanga-Llavori R, Rebholz-Schuhmann D: Reuse of terminological resources for efficient ontological engineering in life sciences. BMC Bioinformatics. 2009, 10 (Suppl 10): S4. 10.1186/1471-2105-10-S10-S4.
- Van Landeghem S, Björne J, Abeel T, De Baets B, Salakoski T, Van de Peer YZ: Semantically linking molecular entities in literature through entity relationships. BMC Bioinformatics. 2012, 13 (Suppl 11): S6. 10.1186/1471-2105-13-S11-S6.
- Miwa M, Thompson P, Ananiadou S: Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics. 2012, 28 (13): 1759-1765. 10.1093/bioinformatics/bts237.
- Van Landeghem S, Pyysalo S, Ohta T, Van de Peer Y: Integration of static relations to enhance event extraction from text. Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. 2010, Uppsala, Sweden: Association for Computational Linguistics, 144-152.
- Narayanaswamy M, Ravikumar K, Vijay-Shanker K: A biological named entity recognizer. Proceedings of the Pacific Symposium on Biocomputing. 2003, Kauai, Hawaii, 427-427.
- Miller JE, Torii M, Vijay-Shanker K: Building domain-specific taggers without annotated (domain) data. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 2007, Prague, Czech Republic, 1103-1111.
- Levin B, Hovav MR: Argument realization. 2005, Cambridge, UK: Cambridge University Press.
- Smith NA: Ellipsis happens, and deletion is how. Univ Md Working Papers Linguist. 2001, 11: 176-191.
- Gerber M, Chai JY: Semantic role labeling of implicit arguments for nominal predicates. Comput Linguist. 2012, 38 (4): 755-798. 10.1162/COLI_a_00110.
- Tsuruoka Y, Tsujii J: Bidirectional inference with the easiest-first strategy for tagging sequence data. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing 2005. 2005, Vancouver, Canada, 467-474.
- De Marneffe MC, Manning CD: Stanford typed dependencies manual. 2008, [http://nlp.stanford.edu/software/dependenciesmanual.pdf].
- Huddleston R, Pullum GK: The Cambridge grammar of the English language. 2002, Cambridge, UK: Cambridge University Press.
- Siddharthan A: Syntactic simplification and text cohesion. University of Cambridge 2003.
- Hartmann RRK, Stork FC: Dictionary of language and linguistics. 1972, New York: Wiley.
- Hearst MA: Automatic acquisition of hyponyms from large text corpora. 1992, Nantes, France: Association for Computational Linguistics.
- Snow R, Jurafsky D, Ng AY: Learning syntactic patterns for automatic hypernym discovery. Adv Neural Inform Process Syst. 2004, 17: 1297-1304.
- Kim JD, Nguyen N, Wang Y, Tsujii J, Takagi T, Yonezawa A: The genia event and protein coreference tasks of the BioNLP shared task 2011. BMC Bioinformatics. 2012, 13 (Suppl 11): S1. 10.1186/1471-2105-13-S11-S1.
- Kim JD, Yue W, Yamamoto Y: The Genia event extraction shared task, 2013 edition - Overview. Proceedings of the Workshop on BioNLP Shared Task 2013. 2013, Sofia, Bulgaria, 20-27.
- Stenetorp P, Topić G, Pyysalo S, Ohta T, Kim JD, Tsujii J: BioNLP shared task 2011: Supporting resources. Proceedings of the Workshop on BioNLP Shared Task 2011. 2011, Portland, Oregon, 112-120.
- McClosky D: Any domain parsing: automatic domain adaptation for natural language parsing. PhD thesis. Department of Computer Science, Brown University 2009.
- Tateisi Y, Yakushiji A, Ohta T, Tsujii J: Syntax annotation for the GENIA corpus. Proceedings of the Workshop on the 1st International Joint Conference on Natural Language Processing (IJCNLP). Volume 5. 2005, Jeju Island, Korea, 222-227.
- Levy R, Andrew G: Tregex and Tsurgeon: tools for querying and manipulating tree data structures. Proceedings of the Fifth International Conference on Language Resources and Evaluation. 2006, Genoa, Italy, 2231-2234.
- Lappin S, Leass HJ: An algorithm for pronominal anaphora resolution. Comput Linguist. 1994, 20 (4): 535-561.
- Qiu L, yen Kan M, seng Chua T: A public reference implementation of the rap anaphora resolution algorithm. Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). 2004, Lisbon, Portugal, 291-294.
- BioNLP-ST 2013 GE task results. http://bionlp-st.dbcls.jp/GE/2013/results
- Narayanaswamy M, Ravikumar K, Vijay-Shanker K: Beyond the clause: extraction of phosphorylation information from medline abstracts. Bioinformatics. 2005, 21 (suppl 1): i319-i327. 10.1093/bioinformatics/bti1011.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.