Sortal anaphora resolution to enhance relation extraction from biomedical literature
© Kilicoglu et al. 2016
Received: 4 November 2015
Accepted: 1 April 2016
Published: 14 April 2016
Entity coreference is common in biomedical literature and it can affect text understanding systems that rely on accurate identification of named entities, such as relation extraction and automatic summarization. Coreference resolution is a foundational yet challenging natural language processing task which, if performed successfully, is likely to enhance such systems significantly. In this paper, we propose a semantically oriented, rule-based method to resolve sortal anaphora, a specific type of coreference that forms the majority of coreference instances in biomedical literature. The method addresses all entity types and relies on linguistic components of SemRep, a broad-coverage biomedical relation extraction system. It has been incorporated into SemRep, extending its core semantic interpretation capability from sentence level to discourse level.
We evaluated our sortal anaphora resolution method in several ways. The first evaluation specifically focused on sortal anaphora relations. Our methodology achieved a F1 score of 59.6 on the test portion of a manually annotated corpus of 320 Medline abstracts, a 4-fold improvement over the baseline method. Investigating the impact of sortal anaphora resolution on relation extraction, we found that the overall effect was positive, with 50 % of the changes involving uninformative relations being replaced by more specific and informative ones, while 35 % of the changes had no effect, and only 15 % were negative. We estimate that anaphora resolution results in changes in about 1.5 % of approximately 82 million semantic relations extracted from the entire PubMed.
Our results demonstrate that a heavily semantic approach to sortal anaphora resolution is largely effective for biomedical literature. Our evaluation and error analysis highlight some areas for further improvements, such as coordination processing and intra-sentential antecedent selection.
Pulmonary arterial hypertension (PAH) is a rare and progressive disease of the pulmonary arterial circulation ….
There are currently 3 classes of drugs approved for the treatment of PAH: prostacyclin analogues, endothelin receptor antagonists, and phosphodiesterase type 5 inhibitors. …
Although definitive evidence will require randomized and properly controlled long-term trials, the current evidence supports the long-term use of these drugs for the treatment of patients with PAH.
Endothelin receptor antagonists-TREATS-PAH
Phosphodiesterase type 5 inhibitors-TREATS-PAH
In doing so, the system would also be able to move beyond sentence level processing to discourse level processing, bringing us closer to discourse understanding, the ultimate goal in NLP.
Several types of coreference are often distinguished. For example, anaphora is a coreference relation in which a coreferential mention (anaphor), such as these drugs above, refers to a previously mentioned entity (antecedent) in text. Cataphora refers to a relation in which the coreferential expression (cataphor) refers to an entity subsequent to the expression in text (consequent). Broader views of coreference also consider relation types such as bridging and appositive. Different types of coreference can be indicated with mentions of varying types. For example, a major type of anaphora (pronominal anaphora) is indicated by pronouns, such as it, their, itself. In Example (1), the anaphor these drugs is a demonstrative noun phrase, therefore the anaphora relation can be referred to as nominal anaphora. Nominal anaphora is sometimes also referred to as sortal anaphora since such anaphors carry semantic type (sort) information, in contrast to pronominal expressions. For instance, in Example (1), the antecedents of these drugs can only be drug or drug class instances. In the studies focusing on coreference resolution in biomedical literature, sortal anaphors have attracted most attention, since they occur more frequently than other types. Castaño and Pustejovsky  found that approximately 60 % of anaphora instances in their corpus of MEDLINE abstracts were sortal. This was confirmed by Gasperin and Briscoe , who found that the majority of anaphora instances involved definite and demonstrative noun phrases in their corpus of full-text articles about Drosophila melanogaster.
SemRep semantic interpreter
The antiviral agent amantadine has been used to manage Parkinson’s disease or levodopa-induced dyskinesias for nearly 5 decades.
SemRep processing relies on the UMLS SPECIALIST Lexicon , MedPost part-of-speech tagger , and underspecified syntactic analysis, and it is supported by MetaMap  for normalizing noun phrases to UMLS Metathesaurus concepts. Entrez Gene  serves as a supplementary source to the UMLS Metathesaurus with respect to gene/protein terms. Indicator rules are used to map lexical and syntactic phenomena to predicates. Indicators include lexical categories, such as verbs, nominalizations, and prepositions, and syntactic constructions, such as appositives or modifier-head structure in the simple noun phrase. For instance, in Example (2), the ISA predicate is indicated by the fact that its arguments are in a restrictive appositive construction ([antiviral agent ][amantadine]), while TREATS is lexically indicated by the verb manage. Using an ontology engineering approach , SemRep has been extended to domains that are outside the scope of the UMLS, such as disaster information management and public health. It has also been the basis for the Semantic MEDLINE web application  and SemMedDB, a PubMed-scale repository of semantic predications .
In the current study, our goal has been to extend semantic interpretation capabilities of SemRep through anaphora resolution. Based on the observed prominence of sortal anaphora in biomedical literature [2, 3], we focused specifically on sortal anaphora resolution. Our rule-based methodology has a linguistic orientation and makes heavy use of UMLS semantic knowledge. A major contribution of our work is that our approach is not restricted to certain entity types, in contrast to other biomedical coreference resolution studies that focus on certain types of biomedical entities (e.g., chemicals, genes, cells , gene/proteins [13, 14]). Our study is also distinct in its focus on the impact of anaphora resolution on relation extraction at a large scale.
Evaluation of sortal anaphora resolution on our annotated corpus
Partial evaluation of sortal anaphora resolution on the Protein Coreference Dataset used in the BioNLP 2011 shared task 
Evaluation of the impact of anaphora resolution on SemRep predications, for which we compared SemRep results with and without anaphora resolution on a separate set of 300 sentences
Estimation of the quantitative effect of anaphora resolution at a larger scale, for which we compared the number of predications and relation types extracted by SemRep using anaphora resolution with that extracted without anaphora resolution on 1 million MEDLINE citations
The results show that our semantic approach is effective in recognizing sortal anaphora relations and that its incorporation into SemRep allows it to replace generic and uninformative relations with more specific and informative ones. We have incorporated our resolution approach into SemRep, making it an option in semantic processing. The annotated corpus of MEDLINE citations is available at http://skr3.nlm.nih.gov/SortalAnaphora/.
Pioneering work in coreference resolution in general English focused on the interaction of pronominal anaphora with syntactic structure and discourse constraints [16–18]. Availability of corpora annotated for coreference, such as MUC7 , led to the prominence of supervised learning approaches for this task [20–22]. More recently, Haghighi and Klein  presented a deterministic algorithm that relies on syntactic, semantic, and discourse constraints and demonstrated good performance on several corpora. Lee et al.  extended this approach to propose a sieve architecture, which applies a set of deterministic coreference models (i.e., sieves) one at a time from highest to lowest precision, each sieve using the output of the previous one. Sieves include various string matching algorithms as well as speaker identification and pronoun resolution models. Their approach yielded state-of-the-art performance on the OntoNotes corpus , the current standard for evaluating coreference resolution systems for general English. The sieve architecture has been made part of the Stanford CoreNLP toolkit  and has been extended for multilingual coreference resolution by systems participating in the CoNLL 2012 Shared Task . In addition to such end-to-end coreference resolution approaches, much effort has also been devoted to specific coreference resolution subtasks, such as recognizing non-referential mentions (e.g., pleonastic it) [27, 28] and anaphoricity detection (i.e., determining whether a mention is anaphoric or not) [29, 30].
In the biomedical domain, most coreference resolution research has involved biomedical literature. Castaño and Pustejovsky  focused on pronominal and sortal anaphora resolution of bio-entities, using semantic information from UMLS. They achieved 73.8 % F1 score on a small set of MEDLINE abstracts. Their algorithm is based on scoring potential antecedents according to their compatibility with the anaphor (e.g., number agreement). The candidate with the highest score is taken as the antecedent. A similar approach is taken by Kim et al. , who additionally investigated the role of Centering Theory , syntactic parallelism between the anaphor and the antecedent, coordinate noun phrases, and appositive constructions. Their approach yielded a F1 score of 63 % on a different set of MEDLINE abstracts. Yang et al.  and Torii and Vijay-Shanker , on the other hand, used supervised machine learning techniques for anaphora resolution. Taking a noun phrase clustering approach and casting the problem as a binary classification task, Yang et al.  achieved an F1 score of 81.7 % on a small set of MEDLINE abstracts. Focusing on sortal anaphora only, Torii and Vijay-Shanker  reported 71.6 % precision and 77 % recall in cross-validation experiments. Diverging from this line of research that focused on anaphora resolution in MEDLINE abstracts, Gasperin and Briscoe  annotated a corpus of five full-text molecular biology articles with sortal anaphora as well as with types of domain-specific associative coreference, such as homology and related biotype (e.g., the relationship between a gene and its product, a protein). Their annotation also included set-membership relations. The Bayesian probabilistic model they used achieved an F1 score of 57 % for coreference, although it performed poorly on associative relations. In line with the observation that coreference resolution could improve event extraction pipelines, a supporting task was proposed in the BioNLP 2011 shared task on biological event extraction . A corpus of MEDLINE citations annotated for sortal and pronominal anaphora were provided to participants (BioNLP Protein Coreference Dataset). The best system was an adaptation of an existing coreference resolution system for newswire text and achieved an F1 score of 34 % , a significant performance loss from its performance on news text. Using Stanford CoreNLP sieve-based coreference resolution , Choi et al.  also obtained very poor results, confirming a trend of performance degradation of systems developed for the general domain. Underutilization of semantic information seems to be a factor for this trend . Conversely, using domain-specific semantic information, Nguyen et al.  achieved an F1 score of 62.4 % on the same corpus. With a hybrid approach, D’Souza and Ng  reported an F1 score of 67.4 %. Improvements of varying degrees in event/relation extraction have been reported with the incorporation of coreference resolution [36–39]. A common feature of all these studies is that they focus on a pairwise relation representation of coreference and on specific entity types. In contrast, in the CRAFT corpus , full coreference chains are annotated in the spirit of the OntoNotes corpus, and all semantic types are considered (drugs, diseases, etc.). In addition, full-text articles are annotated, rather than abstracts. We are not aware of any resolution studies based on this recent corpus.
In the biomedical domain, coreference resolution has also been addressed in clinical narratives, drug labels, and consumer health texts. The 2011 i2b2/VA shared task was concerned with coreference resolution in clinical reports . Training and evalution corpora annotated with coreference mention clusters were provided. Entity types considered for coreference included problem, person, test, treatment, and anatomical site. Rule-based, supervised learning, and hybrid approaches were proposed; a supervised learning approach which incorporated world knowledge and document structure  obtained the best results in one corpus, while a rule-based system  performed best in the other. Segura-Bedmar et al.  developed a corpus of drug interaction documents annotated with anaphora (DrugNer-AR) and obtained an F1 score of 76 % using Centering Theory constraints, semantic knowledge obtained with MetaMap , and drug class information for resolution. Névéol and Lu  improved the specificity of SemRep predications by using simple anaphora resolution heuristics in consumer medication texts and MeSH scope notes. Focusing on consumer health questions, Kilicoglu et al.  incorporated resolution of anaphora and ellipsis (a specific type of coreference characterized by the absence of one of the referents) to their question frame extraction pipeline and reported an 18 point improvement in F1 score in this task thanks to anaphora resolution.
In this section, we first discuss our data and the annotation study. Next, we describe the algorithm that we developed for anaphora resolution. We conclude the section by describing our evaluation of the algorithm.
Data and annotation
In order to develop, refine, and evaluate a sortal anaphora resolution module, we annotated a corpus of 320 MEDLINE citations with pairwise anaphora relations. Since we aimed at a general approach that takes into account all semantic types and consequently supports SemRep, we collected MEDLINE abstracts on a range of topics, including molecular biology and clinical medicine. Most molecular biology citations were previously used for evaluating some specific aspect of SemRep, such as nominalization processing , or were annotated for SemRep benchmarking . Citations on clinical medicine were identified by issuing a SemMedDB query for the predicate types TREATS/PREVENTS and PROCESS_OF in the period from 2011 to 2013 and retrieving a random subset of the query results.
One hundred forty-nine citations were double-annotated by two of the authors of this paper (GR, MF) to develop and refine annotation guidelines as well as to calculate inter-annotator agreement. Once a satisfactory inter-annotator agreement was achieved, the rest of the corpus (171 citations) was annotated by one of the authors only (GR). We used the double-annotated portion of the corpus for training and refining the algorithm, and the other portion for testing. The corpus was pre-annotated with entities extracted by SemRep (using the default UMLS 2006AA release) to assist the annotators and simplify the task. Annotators were instructed to annotate the named entities missed by SemRep, if relevant for the sortal anaphora annotation task; however, we did not require them to annotate a specific semantic type for these entities and simply used SPAN as a generic type. For example, from the phrase The flamenco gene, SemRep only extracts a concept for the phrase head gene (the extracted concept is Genes), as there is no specific concept for flamenco gene in the UMLS. This is clearly an inadequate mapping for the phrase. Therefore, the full phrase was annotated as SPAN since it acts as the antecedent in an anaphora relation.
In the first phase of annotation, each annotator annotated 5 abstracts to familiarize themselves with the task. They discussed their annotations with the primary author, who adjudicated their differences. In the next step, each annotator independently annotated batches of approximately 50 abstracts at a time. After each batch, we calculated inter-annotator agreement to assess their progress, and the annotators reconciled their differences to create the gold standard reference for the batch. We calculated inter-annotator agreement for both the anaphoric mentions and the anaphora relations. As the inter-annotator agreement measure, we used the F1 score of one set of the annotations, with the other set taken as the gold standard, a measure often used for inter-annotator agreement in biomedical relation annotation [49, 51]. It has been shown that κ statistic , more typically used to calculate inter-annotator agreement, approximates F1 score in cases that lack a well-defined number of negative instances, which makes chance agreement close to zero . After a satisfactory inter-annotator agreement was reached, one of the annotators (GR) annotated the rest of the corpus (171 citations) on her own.
The algorithm consists of two main phases: anaphor detection and anaphor-antecedent linking. The first phase is concerned with recognizing the noun phrases that are sortal anaphors and marking them as such. The second phase of the algorithm inspects these sortal anaphors and attempts to link them to their corresponding antecedents. Both phases of the algorithm presuppose a variety of linguistic information (lexical, morphological, syntactic, and semantic), made available by the core machinery of SemRep. Lexical and morphological information include individual tokens, their lemmas, part-of-speech tags, and inflection status (i.e., whether singular or plural) provided, to a large extent, by the SPECIALIST Lexicon . Syntactic information includes noun phrases as well as their heads and modifiers, identified with a shallow syntactic parser. Some syntactic constructions, such as appositives and coordinate noun phrases, are relevant to anaphora resolution and are identified as well. Semantic information is provided by MetaMap and includes mappings from noun phrases to UMLS Metathesaurus concepts with their CUIs and semantic types. The algorithm also relies on taxonomic relations encoded in the UMLS Metathesaurus, such as the one between Amantadine and Antiviral Agents. Such relations are already extracted as part of SemRep’s hypernymy processing (i.e., ISA relations) . Anaphora resolution requires that the entire previous discourse, not just the sentence with the anaphor, be available to identify antecedents. To facilitate this, we extended SemRep to take into account all linguistic information from previous sentences in addition to the current sentence, taking the first step toward discourse-level processing.
The noun phrase is in an appositive construction with the noun phrase that immediately follows it. For example, the definite noun phrase the gene in …the gene, BRCA1, … is not anaphoric.
The noun phrase has a modifier that is mapped to the UMLS separately from the head; in other words, the modifier is an example of a rigid designator . This precludes the Src family from being considered a potential anaphor, since Src is a rigid designator. A similar condition applies to noun phrases which are followed by a prepositional phrase cued by of. For example, the symptoms in the symptoms of lupus erythematosus is ruled out as an anaphor.
The noun phrase is cataphoric. We distinguish cataphoric phrases as those that contain the word following, as in the following signs.
The number feature of the noun phrase head is incompatible with that of the determiner. For example, in the fragment Both short-term dynamic psychotherapy and cognitive therapy have a place …, Both short-term dynamic psychotherapy is chunked as an individual noun phrase and this constraint rules it out as a potential anaphor, since the determiner both is plural and the head psychotherapy is singular. This step is applied mainly to address a shortcoming of noun phrase chunking, even though the number agreement principle between the head and the determiner is general.
The head of the noun phrase is not associated with a UMLS Metathesaurus concept. These are excluded due to lack of semantic information to use in subsequent steps.
In the SemRep pipeline, anaphor-antecedent linking is performed before indicator rules and argument identification rules are applied to generate semantic predications, so that argument identification rules can take the results of anaphora resolution into account when determining the arguments of predicates.
As preparation for this phase, we combine the linguistic analyses from the sentences prior to the sentence containing the sortal anaphor under consideration, including coordination information. Anaphora resolution needs to take into consideration the entire discourse preceding the anaphor, especially in the context of MEDLINE abstracts, which are often relatively short.
The next step in anaphor-antecedent linking is selection of antecedents consonant with the sortal anaphor. To select these antecedents, we process the noun phrases (including coordinate noun phrases) that precede the sortal anaphor. Two consonance criteria are applied: semantic consonance and number agreement.
The UMLS concept corresponding to A is an ancestor of the UMLS concept corresponding to B AND they are not in a meronymic (part-whole) relationship (Taxonomy constraint)
B belongs to a UMLS semantic group , which has as one of its associated headwords the headword of A (Headword constraint)
A and B have the same headword but map to different UMLS Metathesaurus concepts and the number of tokens in B is greater than that in A (Shared Headword constraint)
The Taxonomy constraint is similar to the definition of hypernymy in Rindflesch and Fiszman . Meronymy is assumed between A and B if their UMLS concepts both belong to Anatomy semantic group and they do not have Cell semantic type; in other words, if they both correspond to biological units higher than the cell. This constraint is necessary since the UMLS concept hierarchy encodes meronymic as well as taxonomic relationships. While meronymy may be useful for associative coreference , we did not find it useful for sortal anaphora. The Taxonomy constraint predicts semantic compatibility between cetirizine and the drug, while it finds that right ventricle and heart are incompatible, since the relationship between them is one of meronymy.
For the Headword constraint, we developed a headword list for several semantic groups based on our training set. For example, Disorder headwords include condition, ailment, abnormality, and problem, while the Therapeutic Modality headwords include medication, intervention, and agent. Such word lists are useful to compensate for the fact that UMLS concepts corresponding to such general terms are often not in the expected taxonomic relation with specific instances of these semantic classes. The Headword constraint predicts compatibility between the illness and Immune reconstitution inflammatory syndrome, which are not in a taxonomic relationship in the UMLS, for example. On the other hand, the Shared Headword constraint predicts compatibility between the reaction and anaphylactoid reaction, because reaction and anaphylactoid reaction are mapped to different UMLS Metathesaurus concepts and they share the same headword. Finally, we stipulate that neither the sortal anaphor nor the antecedent candidate belong to the semantic group Concept, which includes semantic types such as Idea or Concept, Conceptual Entity, and Functional Concept, and is too broad and heterogenous to be useful in anaphora resolution. For candidates that are coordinate noun phrases, the semantic consonance constraints are applied between the sortal anaphor and each of the conjuncts in the coordinate noun phrase.
The other consonance measure, number agreement, is a commonly used feature in coreference resolution. Our implementation uses the number feature provided by the SPECIALIST Lexicon. The sortal anaphor and an antecedent candidate are taken as compatible with respect to number if their heads agree on this feature (i.e., if both are plural or both are singular). The number feature for unknown words is taken as singular. Number agreement also takes into account coordinate noun phrases: a plural sortal anaphor is taken to be consonant with an antecedent candidate that is a coordinate noun phrase.
If there are antecedent candidates in the same sentence, the one closest to the anaphor is taken as the antecedent.
Else, we move to the closest preceding sentence with compatible antecedent candidates and the leftmost compatible candidate in that sentence is chosen as the antecedent.
These steps seek to predict discourse salience of entities, in a sense similar to prediction of the preferred center in Centering Theory .
Integrating anaphora resolution with relation generation
After generating anaphora links with the steps outlined above, SemRep attempts to use these links in relation generation, if appropriate. For an anaphora link to be used in relation generation, we require that the sortal anaphor noun phrase serve as the subject or object argument of a predicate. In such cases, rather than using the sortal anaphor in the predication, we simply substitute it with its antecedent(s) as the relevant argument(s). The rest of the relation generation procedure remains the same. In cases where sortal anaphor does not serve as an argument of a predicate, the corresponding anaphora link simply remains unused. For instance, in Example (1), the anaphor these drugs was recognized as the subject of the predicate treatment, which indicates a TREATS relation. With no anaphora resolution, relation generation would simply generate the predication Drugs-TREATS-PAH. With anaphora resolution, the subject argument in this predication (Drugs) is replaced by the UMLS concepts corresponding to the antecedents (Prostacyclin analogues, Endothelin receptor antagonists, and Phosphodiesterase type 5 inhibitors), resulting in three informative predications instead of the less informative Drugs-TREATS-PAH.
In this study, we evaluated both anaphora resolution and its contribution to relation extraction. Additionally, we assessed the quantitative impact of anaphora resolution on the PubMed scale repository of biomedical relations supported by SemRep, SemMedDB .
As a baseline method for anaphora resolution, we considered a noun phrase containing one of the determiners or adjectives of interest (see Anaphor detection section) as a sortal anaphor and took the closest preceding noun phrase whose head word and number match those of the sortal anaphor as its antecedent. This baseline method is a more informed one than the one used by Segura-Bedmar et al.  and Kilicoglu and Demner-Fushman , who simply considered the closest preceding noun phrase as the antecedent.
The character offsets of its arguments (anaphor and antecedent) overlap with those of a relation in the reference standard and the semantic types of the arguments match (i.e., approximate match).
OR one or both of its arguments are subsumed by a span annotation in the reference standard (i.e., no explicit semantic type matching is required).
OR the concept corresponding to the antecedent matches that of the antecedent in the relation in the reference standard (i.e., no antecedent character offset overlap is required).
An adult male bullmastiff dog was treated for paraparesis and ataxia due to discospondylitis and disc herniation. At this time, the dog had a nonhealing ulcer between the pads of the left hindfoot.
In this example, the first phrase An adult male bullmastiff dog was annotated as the antecedent (with generic SPAN type), which cannot be fully mapped to a UMLS concept. The algorithm identifies this noun phrase as the antecedent; however, it uses the concept corresponding to its head dog as the argument of the anaphora. Using the second evaluation criterion, we consider such cases true positives.
A radioaerosol technique was used to assess the effects on mucus clearance of 14 days treatment with formoterol or tiotropium , as well as single doses of these drugs. RESULTS: The 4 h whole lung retention of radioaerosol was significantly higher after 14 days treatment with tiotropium (P = 0.016), but not after 14 days treatment with formoterol . However, patients bronchodilated after 14 days treatment with both drugs , so that the deposited radioaerosol may have had an increased distance to travel in order to be cleared by mucociliary action.
In this example, the anaphor both drugs refers to the drugs tiotropium and formoterol. Following annotation guidelines, the annotators annotated the closest mentions to the anaphor in the sentence preceding the one with the anaphor. The algorithm, on the other hand, identified as the antecedent the coordinate noun phrase formoterol or tiotropium in the first sentence. At the ontological semantic level, this is equivalent to the annotated antecedents, although the corresponding mentions do not overlap with those in the reference standard. Using the third criterion, we consider such cases true positives, as well.
For comparison with other anaphora resolution approaches and corpora, we also evaluated our approach against the BioNLP Protein Coreference Dataset , the most widely used coreference resolution corpus focusing on biomedical literature. We limited our evaluation on this dataset to sortal anaphora instances and did not consider the cases of pronominal anaphora. Sortal anaphora instances constitute 12.5 % of all the coreference relations in this corpus (n =69). These instances were identified by removing the anaphora relations indicated by pronominal anaphors from the dataset.
To assess the contribution of anaphora resolution to relation extraction, we processed with SemRep a set of 1 million MEDLINE citations that included abstracts, dated from April 2014 to June 2015. Two sets of output were generated: one set was generated without anaphora resolution and the other with anaphora resolution. We report results concerning the quantitative impact of anaphora resolution on this set. From the 1 million citation set, we also selected 300 sentences, for which anaphora resolution resulted in additional semantic predications. One of the authors (GR) manually examined the predications in these sets and evaluated their correctness. In the absence of a predication reference standard for these sentences, we only calculated precision. In this study, we recognize that categorizing predications as simply true positive or false positive does not adequately elucidate the contribution of anaphora resolution to relation extraction, because in some cases, anaphora resolution increases the specificity and informativeness of an existing predication, rather than generating a new additional predication. For instance, in Example (1), Drugs-TREATS-PHA is a somewhat uninformative (but not incorrect) predication and, if anaphora resolution succeeds, it would be substituted by the predication Prostacyclin analogues-TREATS-PAH, which is correct and more informative. To accommodate such positive changes, we marked predications like Drugs-TREATS-PHA as partially correct in this evaluation.
Results and discussion
In this section, we present results pertaining to inter-annotator agreement as well as the evaluation results and discuss these results in detail. We conclude the section by providing an error analysis.
Inter-annotator agreement computed using F1 score
The inter-annotator agreement showed a clear improvement trend over three iterations, indicating that with sufficient guidelines and practice, good agreement can be achieved for this task. Discussing the annotation differences and reconciling them before moving on to the next batch also seem beneficial.
The average numbers of anaphoric mentions and relations are higher in the test set than in the training set. In creating the training set, we did not filter citations based on the presence of anaphoric expressions (e.g., the gene). We did, however, perform this filtering step in the test set, which led to a higher proportion of citations with anaphora relations (14.6 % in the training set vs. 21.5 % in the test set).
Member counts in set-membership relations
It should also be noted that more than one SPAN annotation was created per citation (1.49 on average), which indicates that the antecedents often involve entities that do not map to UMLS concepts in a straightforward manner (such as An adult male bullmastiff dog discussed above). We also found that approximately 85 % of all anaphora relations were inter-sentential, showing that coreference resolution is highly important for discourse-level text understanding.
Anaphora resolution evaluation
Anaphora resolution algorithm
Ablation study results
Shared Headword constraint
The anaphoricity filter improved performance significantly (4 percentage points), in contrast to previous studies in which such filtering often resulted in poorer results . Among the semantic compatibility measures, the Taxonomy constraint had the greatest impact on performance (an improvement of more than 37 points). The effects of the Headword and Shared Headword constraints were much smaller (1.6 and 0.5 points respectively). The Number agreement yielded a noticeable, positive improvement (about 5.5 points). These results show that a strong semantic constraint coupled with number agreement can successfully identify antecedent candidates. The effect of removing the set-membership recognition component was close to that of removing the Taxonomy constraint, lowering the F1 score to half, another indication of the importance of set-membership anaphora in biomedical literature.
Performance on intra- vs. inter-sentential anaphora
Anaphora resolution evaluation on the BioNLP protein coreference dataset (development portion)
D’Souza and Ng 
Effect on semantic interpretation
Effect of anaphora resolution on semantic interpretation
Partially correct → True positive
False positive → False positive
Partially correct → False positive
True positive → False positive
Partially correct → Partially correct
True positive → True positive
The cyst fluids were shown to be a rich source for acidic glycoproteins . The study of these proteins can potentially lead to the identification of more effective biomarkers for ovarian cancer.
Although the anaphora relation between these proteins and acidic glycoproteins is captured correctly by the algorithm, the resulting predication in Example (5c) is a precision error, because the original predication that it is based on (Example (5b)) is incorrect, due to misidentification of the subject as proteins, instead of the correct subject biomarkers.
NASH is a distinct entity from NAFLD, and is characterized by the presence of inflammation with hepatocytes damage, with or without fibrosis. While several therapeutic strategies have been proposed to improve this condition, the present review aims to discuss nonmedicinal interventions used to reduce liver involvement or to prevent the disease altogether.
Without anaphora resolution, SemRep generates the predication in Example (6b), which, while correct, is uninformative, and therefore, considered partially correct. Because the acronym NASH cannot be resolved to nonalcoholic steatohepatitis, anaphora resolution algorithm does not recognize it as an antecedent candidate, identifying inflammation instead as the antecedent for the anaphora the disease. This leads to the incorrect predication in Example (6c).
Meanwhile, three genes ( API5, AIFM1, and NFkappaB1 ) showed changes of expression in the hippocampus of Ts65Dn mice compared with normal mice…However, some well-known genes related to cell apoptosis, such as the caspase family, Bcl-2, Bad, Bid, Fas, and TNF , did not show changes in expression levels. The genes we found which were differentially expressed in the hippocampus of Ts65Dn mice may be closely related to cell apoptosis.
∗BAD gene-PART_OF-Entire hippocampus
∗TNF gene-PART_OF-Entire hippocampus
∗FAS gene-PART_OF-Entire hippocampus
∗BID gene-PART_OF-Entire hippocampus
Overall SemRep precision with and without anaphora resolution
Partially correct predications ignored
Enhanced with anaphora resolution
Partially correct predications awarded 0.5 points
Enhanced with anaphora resolution
Effect of sortal anaphora resolution at PubMed scale
The rates of change for the top 10 predicate types in 1 million Medline abstracts
Three major error types were caused by coordination processing, anaphoricity filtering and semantic consonance measures. While coordination processing errors predominantly led to recall errors (82 %), precision vs. recall errors due to anaphoricity filtering and semantic constraints were more evenly distributed (57 % and 63 % are recall errors, respectively). Among the less prevalent error types, number constraint errors were almost exclusively recall errors (95 %), while salience rules and UMLS mappings caused a higher number of precision errors (both 59 %).
Therefore, data that support the long-term therapeutic benefits of these long-term PAH therapies are limited and derived primarily from uncontrolled, observational studies …
this paper …provides an overview of clinically available phosphodiesterase inhibitors and discusses tadalafil in relationship to sildenafil ( Revatio(R)), the first phosphodiesterase inhibitor approved for PAH.
As discussed above, removing the anaphoricity filter lowered the F1 score by 4 %, lowering precision while increasing recall, showing that its overall effect was positive, although a more nuanced constraint could further improve results.
…progressive focal glomerulosclerosis in the Imai rats is associated with oxidative stress, inflammation , and impaired Nrf2 activation. These abnormalities are accompanied by activation of …
After fasting glucose was optimized by insulin glargine, nateglinide or acarbose was initiated and then crossed over after second wash out period. … Both drugs …
The percentage of patients discontinued treatment within 12 months was 41.4 % for chlorpromazine, 39.5 % for sulpiride, 36.7 % for clozapine, 40.2 % for risperidone, 39.6 % for olanzapine, 46.9 % for quetiapine, and 40.2 % for aripiprazole , a nonsignificant difference (p=0.717); there were no significant differences among these seven treatments on discontinuation due to relapse, …
From January 1993 to March 1998, 268 patients were randomized to adjuvant chemotherapy (135 patients) or surgery alone (133 patients). All patients underwent gastrectomy with D2 or greater lymph node dissection. The chemotherapy regimen consisted …
On the other hand, lack of a full UMLS mapping for impaired Nrf2 activation in Example (10) above led to an error in coordination processing, which also relies on such information, and it resulted in a negative effect on anaphora resolution.
Stereotypies and orobuccolingual dyskinesias are the most frequently observed tardive disorders, particularly in the elderly population exposed to neuroleptics , …. The development of these disorders is dependent on the potency of the drug , duration of exposure, and …
We report a case of pulmonary infection caused by a rare Nocardia species, Nocardia beijingensis , in a 48-year-old man who received multiple immunosuppressive therapy after renal transplantation. This pathogen was isolated from a bronchoscopic protected specimen brush ….
In this example, the salience-based selection prefers Nocardia species over Nocardia beijingensis as antecedent because it is the leftmost compatible antecedent candidate in the sentence. This leads to both a precision and a recall error.
Among the substances which are commonly used are ion channel modulators (e.g. pregabalin, gabapentin, carbamazepine). The aim of this study was to investigate the use of these drugs in clinical practice in a larger patient cohort.
These errors can be considered soft errors, since the extracted anaphora relations can still be useful for the downstream relation generation step.
We presented a general, linguistically-oriented methodology that relies heavily on UMLS semantic knowledge to recognize sortal anaphora relations in biomedical literature. In contrast to previous studies on this topic, we did not focus on specific entity types. Our semantic approach resulted in a 4-fold increase in F1 score over the baseline. The methodology has been incorporated into a general biomedical semantic relation extraction tool, SemRep, and we showed that its overall effect on relation extraction is positive. Since SemRep supports a literature-based biomedical knowledge management tool, Semantic Medline , and a PubMed-scale repository of semantic relations, SemMedDB , which in turn support tasks such as literature-based discovery  and question answering , we believe that enhancing SemRep with anaphora resolution will benefit such downstream applications. While our study focused on MEDLINE citations, the methodology makes few assumptions regarding the type of input text. Therefore, we believe it would be largely applicable to full-text articles, although discourse-based constraints (e.g., the distance between the anaphor and the antecedent candidate) would probably need to be taken into account. With relatively short length of MEDLINE citations, such constraints were not needed. Anaphora resolution is made available as an option in the web-based SemRep tool2. The annotated corpus used for training and evaluation is publicly available at http://skr3.nlm.nih.gov/SortalAnaphora. A UMLS license is required.
While the overall effect of anaphora resolution is positive and renders more informative predications, the evaluation and error analysis revealed areas for potential improvement. These include a more nuanced approach to salience-based best antecedent selection for intra-sentential anaphora relations, an improved method to detect rigid designators, and an improved coordination processing, which would enhance not only anaphora resolution but also the core semantic interpretation capability of SemRep. Future work also involves pronominal anaphora resolution, which we have not attempted in this study due to its relative sparsity in biomedical literature.
1 The asterisk (∗) indicates an incorrect predication.
The authors thank Dongwook Shin for his assistance with the annotation tool. This work was supported by the intramural research program at the U.S. National Library of Medicine, National Institutes of Health.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Zheng J, Chapman WW, Crowley RS, Savova GK. Coreference resolution: A review of general methodologies and applications in the clinical domain. J Biomed Inform. 2011; 44(6):1113–22.View ArticlePubMedPubMed CentralGoogle Scholar
- Castaño J, Zhang J, Pustejovsky J. Anaphora resolution in biomedical literature. In: Proc International Symposium on Reference Resolution for NLP. Alicante, Spain: University of Alicante: 2002.Google Scholar
- Gasperin C, Briscoe T. Statistical anaphora resolution in biomedical texts. In: Proceedings of COLING 2008. Stroudsburg, PA, USA: Association of Computational Linguistics: 2008. p. 257–264.Google Scholar
- Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003; 36(6):462–77.View ArticlePubMedGoogle Scholar
- Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004; 32(Database issue):267–70.View ArticleGoogle Scholar
- McCray AT, Srinivasan S, Browne AC. Lexical methods for managing variation in biomedical terminologies. In: Proceedings of the 18th Annual Symposium on Computer Applications in Medical Care: 1994. p. 235–9.Google Scholar
- Smith LH, Rindflesch TC, Wilbur WJ. MedPost: a part-of-speech tagger for biomedical text. Bioinformatics. 2004; 20(14):2320–1.View ArticlePubMedGoogle Scholar
- Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010; 17(3):229–36.View ArticlePubMedPubMed CentralGoogle Scholar
- Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D. The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Baltimore, MD, USA: Association of Computational Linguistics: 2014. p. 55–60.Google Scholar
- Rosemblat G, Shin D, Kilicoglu H, Sneiderman C, Rindflesch TC. A methodology for extending domain coverage in SemRep. J Biomed Inform. 2011; 46(6):1099–107.View ArticleGoogle Scholar
- Kilicoglu H, Fiszman M, Rodriguez A, Shin D, Ripple A, Rindflesch T. In: (Salakoski T, Schuhmann DR, Pyysalo S, editors.)Semantic, MEDLINE: A Web Application to Manage the Results of PubMed Searches. Turku, Finland: Turku Centre for Computer Science (TUCS); 2008, pp. 69–76.Google Scholar
- Kilicoglu H, Shin D, Fiszman M, Rosemblat G, Rindflesch TC. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics. 2012; 28(23):3158–60.View ArticlePubMedPubMed CentralGoogle Scholar
- Kim JJ, Park JC. BioAR: Anaphora Resolution for Relating Protein Names to Proteome Database Entries. In: ACL 2004: Workshop on Reference Resolution and its Applications. Barcelona, Spain: Association of Computational Linguistics: 2004. p. 79–86.Google Scholar
- Nguyen NLT, Kim JD, Miwa M, Matsuzaki T, Tsujii J. Improving protein coreference resolution by simple semantic classification. BMC Bioinformatics. 2012; 13:304.View ArticlePubMedPubMed CentralGoogle Scholar
- Kim JD, Nguyen N, Wang Y, Tsujii J, Takagi T, Yonezawa A. The Genia event and protein coreference tasks of the BioNLP shared task 2011. BMC Bioinformatics. 2012; 13(Suppl 11):S1.View ArticlePubMedPubMed CentralGoogle Scholar
- Hobbs JR. Resolving pronoun references. Lingua. 1978;44:311–38. Reprinted in Grosz et al; 1986.Google Scholar
- Lappin S, Leass HJ. An algorithm for pronominal anaphora resolution. Comput Linguist. 1994; 20(4):535–61.Google Scholar
- Grosz BJ, Weinstein S, Joshi AK. Centering: a framework for modeling the local coherence of discourse. Comput Linguist. 1995; 21(2):203–25.Google Scholar
- Hirschman L, Chinchor N. Appendix F: MUC-7 Coreference Task Definition (version 3.0). In: 7th Message Understanding Conference (MUC-7). Fairfax, VA: 1998.Google Scholar
- Soon WM, Ng HT, Lim DCY. A machine learning approach to coreference resolution of noun phrases. Comput Linguist. 2001; 27(4):521–44.View ArticleGoogle Scholar
- Ng V, Cardie C. Improving Machine Learning Approaches to Coreference Resolution. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Morristown, NJ, USA: Association of Computational Linguistics: 2002. p. 104–11.Google Scholar
- Rahman A, Ng V. Supervised Models for Coreference Resolution. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2. Stroudsburg, PA, USA: Association of Computational Linguistics: 2009. p. 968–77.Google Scholar
- Haghighi A, Klein D. Simple Coreference Resolution with Rich Syntactic and Semantic Features. Singapore: Association for Computational Linguistics; 2009, pp. 1152–61.View ArticleGoogle Scholar
- Lee H, Chang A, Peirsman Y, Chambers N, Surdeanu M, Jurafsky D. Deterministic Coreference Resolution Based on Entity-centric, Precision-ranked Rules. Comput Linguist. 2013; 39(4):885–916.View ArticleGoogle Scholar
- Hovy E, Marcus M, Palmer M, Ramshaw L, Weischedel R. OntoNotes: The 90 % Solution. In: Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. New York City, NY, USA: Association of Computational Linguistics: 2006. p. 57–60.Google Scholar
- Pradhan S, Moschitti A, Xue N, Uryupina O, Zhang Y. CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. In: Joint Conference on EMNLP and CoNLL - Shared Task. Jeju, Korea: Association of Computational Linguistics: 2012. p. 1–40.Google Scholar
- Bergsma S, Yarowsky D. NADA: A Robust System for Non-Referential Pronoun Detection. In: Proceedings of DAARC. Berlin Heidelberg, Germany: Springer: 2011. p. 12–23.Google Scholar
- Weissenbacher D, Nazarenko A. A bayesian classifier for the recognition of the impersonal occurrences of the ‘it’ pronoun. In: Discourse Anaphora and Anaphor Resolution Colloquium. Portugal: Discourse Anaphora and Anaphor Resolution Colloquium: May 2007. p. 145–150.Google Scholar
- Ng V, Cardie C. Identifying Anaphoric and Non-Anaphoric Noun Phrases to Improve Coreference Resolution. In: COLING 2002: The 19th International Conference on Computational Linguistics. Stroudsburg, PA, USA: Association of Computational Linguistics: 2002. p. 1–7.Google Scholar
- Poesio M, Alexandrov-Kabadjov M, Vieira R, Goulart R, Uryupina O. Does Discourse-new Detection Help Definite Description Resolution? In: Sixth International Workshop on Computational Semantics: 2005. p. 236–46.Google Scholar
- Yang X, Su J, Zhou G, Tan CL. An NP-Cluster Based Approach to Coreference Resolution. In: Proceedings of COLING’04. Morristown, NJ, USA: Association of Computational Linguistics: 2004. p. 226–32.Google Scholar
- Torii M, Vijay-Shanker K. Sortal Anaphora Resolution in Medline Abstracts. Computational Intelligence. 2007; 23(1):15–27.View ArticleGoogle Scholar
- Kim Y, Riloff E, Gilbert N. The Taming of Reconcile As a Biomedical Coreference Resolver. In: Proceedings of the BioNLP Shared Task 2011 Workshop. Portland, OR, USA: Association of Computational Linguistics: 2011. p. 89–93.Google Scholar
- Choi M, Verspoor K, Zobel J. Analysis of Coreference Relations in the Biomedical Literature. In: Proceedings of the Australasian Language Technology Association Workshop 2014. Melbourne, Australia: Australasian Language Technology Association: 2014. p. 134–8.Google Scholar
- D’Souza J, Ng V. Anaphora Resolution in Biomedical Literature: A Hybrid Approach. In: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. Orlando, FL, USA: ACM: 2012. p. 113–22.Google Scholar
- Yoshikawa K, Riedel S, Hirao T, Asahara M, Matsumoto Y. Coreference Based Event-Argument Relation Extraction on Biomedical Text. J Biomed Semant. 2011; 2(Suppl 5):S6.View ArticleGoogle Scholar
- Miwa M, Thompson P, Ananiadou S. Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics. 2012; 28(13):1759–65.View ArticlePubMedPubMed CentralGoogle Scholar
- Kilicoglu H, Bergler S. Biological event composition. BMC Bioinformatics. 2012; 13(Suppl 11):S7.View ArticlePubMedPubMed CentralGoogle Scholar
- Lavergne T, Grouin C, Zweigenbaum P. The contribution of co-reference resolution to supervised relation detection between bacteria and biotopes entities. BMC Bioinformatics. 2015; 16(Suppl 10):S6.View ArticlePubMedPubMed CentralGoogle Scholar
- Cohen KB, Lanfranchi A, Corvey W, Baumgartner WA, Roeder C, Ogren PV, et al. Annotation of all coreference in biomedical text: Guideline selection and adaptation. In: Proceedings of BioTxtM 2010: 2nd workshop on building and evaluating resources for biomedical text mining. Valletta, Malta: ELRA: 2010. p. 37–41.Google Scholar
- Uzuner Ö, Bodnari A, Shen S, Forbush T, Pestian J, South BR. Evaluating the state of the art in coreference resolution for electronic medical records. JAMIA. 2012; 19(5):786–91.PubMedPubMed CentralGoogle Scholar
- Xu Y, Liu J, Wu J, Wang Y, Tu Z, Sun J, et al. A classification approach to coreference in discharge summaries: 2011 i2b2 challenge. JAMIA. 2012; 19(5):897–905.PubMedPubMed CentralGoogle Scholar
- Glinos D. A search based method for clinical text coreference resolution. In: Proceedings of the 2011 i2b2/VA/Cincinnati Workshop on Challenges in Natural Language Processing for Clinical Data: 2011.Google Scholar
- Segura-Bedmar I, Crespo M, de Pablo-Sánchez C, Martínez P. Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents. BMC Bioinformatics. 2010; 11(Suppl 2):S1.View ArticlePubMedPubMed CentralGoogle Scholar
- Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc (JAMIA). 2010; 17(3):229–36.View ArticleGoogle Scholar
- Névéol A, Lu Z. In: (Veinot TC, Ümit V Çatalyürek, Luo G, Andrade H, Smalheiser NR, editors.)Automatic integration of drug indications from multiple health resources. Arlington, VA, USA: ACM; 2010, pp. 666–73.Google Scholar
- Kilicoglu H, Fiszman M, Demner-Fushman D. Interpreting Consumer Health Questions: The Role of Anaphora and Ellipsis. In: Proceedings of the 2013 Workshop on Biomedical Natural Language Processing. Sofia, Bulgaria: Association of Computational Linguistics: 2013. p. 54–62.Google Scholar
- Kilicoglu H, Fiszman M, Rosemblat G, Marimpietri S, Rindflesch T. Arguments of Nominals in Semantic Interpretation of Biomedical Text. In: Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. Uppsala, Sweden: Association of Computational Linguistics: 2010. p. 46–54.Google Scholar
- Kilicoglu H, Rosemblat G, Fiszman M, Rindflesch T. Constructing a semantic predication gold standard from the biomedical literature. BMC Bioinformatics. 2011; 12(1):486+.View ArticlePubMedPubMed CentralGoogle Scholar
- Stenetorp P, Pyysalo S, Topić G, Ohta T, Ananiadou S, Tsujii J. brat: a Web-based Tool for NLP-Assisted Text Annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France: Association of Computational Linguistics: 2012. p. 102–7.Google Scholar
- Thompson P, Iqbal SA, McNaught J, Ananiadou S. Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics. 2009; 10:349.View ArticlePubMedPubMed CentralGoogle Scholar
- Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960; 20(1):37.View ArticleGoogle Scholar
- Hripscak G, Rothschild AS. Agreement, the F-measure, and reliability in information retrieval. JAMIA. 2005; 12(3):296–8.Google Scholar
- McCray AT, Burgun A, Bodenreider O. Aggregating UMLS semantic types for reducing conceptual complexity. Proc Medinfo. 2001; 10(pt 1):216–20.Google Scholar
- Kilicoglu H, Demner-Fushman D. Coreference Resolution for Structured Drug Product Labels. In: Proceedings of the 2014 Workshop on Biomedical Natural Language Processing. Baltimore, MD, USA: Association of Computational Linguistics: 2014. p. 45–53.Google Scholar
- Miller CM, Rindflesch TC, Fiszman M, Hristovski D, Shin D, Rosemblat G, et al. A closed literature-based discovery technique finds a mechanistic link between hypogonadism and diminished sleep quality in aging men. Sleep. 2012; 35(2):279–85.PubMedPubMed CentralGoogle Scholar
- Hristovski D, Dinevski D, Kastrin A, Rindflesch TC. Biomedical question answering using semantic relations. BMC Bioinformatics. 2015; 16(1):6+.View ArticlePubMedPubMed CentralGoogle Scholar