Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents
© Segura-Bedmar et al. 2010
Published: 16 April 2010
Drug-drug interactions are frequently reported in the increasing amount of biomedical literature. Information Extraction (IE) techniques have been devised as a useful instrument to manage this knowledge. Nevertheless, IE at the sentence level has a limited effect because of the frequent references to previous entities in the discourse, a phenomenon known as 'anaphora'. DrugNerAR, a drug anaphora resolution system, is presented to address the problem of co-referring expressions in pharmacological literature. This development is part of a larger and innovative study on automatic drug-drug interaction extraction.
The system uses a set of linguistic rules drawn from Centering Theory over the analysis provided by a biomedical syntactic parser. Semantic information provided by the Unified Medical Language System (UMLS) is also integrated in order to improve the recognition and the resolution of nominal drug anaphors. In addition, a corpus has been developed in order to analyze the phenomena and evaluate the current approach. Each possible case of anaphoric expression was examined to determine the most effective way of resolution.
An F-score of 0.76 in anaphora resolution was achieved, significantly outperforming the baseline by almost 73%. This ad-hoc baseline was developed to check the results, as there is no previous work on anaphora resolution in pharmacological documents. The obtained results resemble those found in semantically related domains.
The present approach shows very promising results in the challenge of accounting for anaphoric expressions in pharmacological texts. DrugNerAR obtains results similar to those of other approaches dealing with anaphora resolution in the biomedical domain but, unlike these approaches, it focuses on documents reflecting drug interactions. Centering Theory has proved effective at selecting antecedents in anaphora resolution. A key component in the success of this framework is the analysis provided by the MMTx program and the DrugNer system, which makes it possible to deal with the complexity of pharmacological language. It is expected that the positive results of the resolver will increase the performance of our future drug-drug interaction extraction system.
A drug-drug interaction occurs when one drug influences the level or activity of another drug. Drug-drug interactions are common adverse drug reactions and, unfortunately, they are a frequent cause of death in hospitals. Several published drug safety issues have shown that adverse effects of drugs may be detected too late, when millions of patients have already been exposed. Drug-drug interactions therefore have an important impact on patient safety, both because they can be quite dangerous and because of their relatively high incidence among certain population groups such as geriatric or polydrug patients. In addition, drug interactions account for 16.6% of adverse drug reactions causing hospitalization, so they directly increase health care costs.
There are different resources that describe drug interactions (for example, the DRUG-REAX System or the drug interaction appendix of the British National Formulary), but unfortunately there is a lack of consistency in the inclusion and grading of drug interactions across them, and they rarely include the whole range of drug interactions reported in the medical literature. Therefore, the development of automatic methods for collecting, maintaining and interpreting this information is crucial to achieve a real improvement in their early detection. Natural Language Processing can help reduce the time spent by health care professionals on reviewing the literature.
Although beta-adrenergic blockers or calcium channel blockers and digoxin may be useful in combination to control atrial fibrillation, their additive effects on AV node conduction can result in advanced or complete heart block.
In addition triamterene, metformin and amiloride should be co-administered with care as they might increase dofetilide levels.
Heuristic approaches integrate different knowledge sources, such as gender and number agreement, syntactic patterns or semantic information, to obtain a plausible list of candidates [8–10]. The major drawback of these approaches is that constructing the domain knowledge base necessary for resolving the anaphors is very labor-intensive and time-consuming.
Machine learning approaches compute the most likely candidate based on previous examples. These approaches avoid the knowledge-engineering problem of heuristic approaches; however, they usually run into the data sparseness problem of language modeling, so they require a large amount of training data [11, 12].
In the biomedical domain, the lack of available corpora meant that early approaches were mostly based on heuristics. In this sense, Castano et al. present a method for resolving anaphoric expressions for candidates taken from MedLine articles and abstracts. By defining a different range of resolution scope for each type of anaphoric expression, it uses different morphological, syntactic and semantic features such as number or semantic type agreement (UMLS typing-based system), longest common subsequence for similarity among candidate antecedents and coercion-type matching (most suitable agent/patient linguistic role according to the verb) from the most frequent bio-relevant verbs in MedLine. Each possible antecedent of a given anaphor is given a cumulative score according to the significance of its linguistic features, and the one with the best salience measure is chosen. The overall result is a 73.8% F-score over a corpus of 46 MedLine abstracts annotated by a domain expert.
Lin et al. also apply this scoring technique, but they restrict the types of nominal anaphoric expressions taken into account, enrich the syntactic features with new values and apply coercion-type matching as before, using the Genia corpus. The overall results are a 92% F-score for pronominal anaphora and 78% for nominal anaphora on 32 MedLine abstracts (MedStract). This approach is improved by Liang and Lin, who use new resources like WordNet or PubMed for finding semantic relationships among concepts not found in UMLS. They extend the MedStract corpus with 100 MedLine abstracts, obtaining an 87.43% F-score for pronominal anaphora and 80.61% for nominal anaphora.
Anaphora resolution applied to the field of protein interactions can be found in Kim et al., who present BioAR, an anaphora resolution system integrated into a larger protein-protein interaction extraction study. It identifies antecedents of pronouns by applying patterns for parallelism and centering theory. Nominal phrase anaphors are resolved according to the highest salience score, using features similar to those in [13, 20]. Experimental results are 75% precision and 56.3% recall in pronoun resolution, and 75% precision and 52.2% recall in definite noun phrase resolution, over 120 unseen biological interactions extracted by the BioIE system.
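Precision and recall figures such as those above are usually combined into a single balanced F-score so that systems can be compared. The following is a minimal sketch, assuming the standard harmonic-mean definition; the function name `f_score` is illustrative and not taken from any of the cited systems.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Pronoun resolution figures reported for BioAR: 75% precision, 56.3% recall.
pronoun_f = f_score(0.75, 0.563)
print(round(pronoun_f, 2))  # 0.64
```

The value obtained for pronoun resolution agrees with the F=0.64 listed for this system in the summary of approaches.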
Likewise, in  the impact of anaphora resolution on the result of a protein interaction extraction system is analyzed by using the Guitar system  over the 20 full texts and abstracts of the Medstract corpus and three articles taken from the Journal of Biological Chemistry. From the 402 protein-protein interactions in the corpus, only 20 were conveyed by an anaphoric expression. Results show 70% recall in anaphora resolution in abstracts and 52.65% in full texts. No data about precision are available. Results suggest small improvements in protein extraction.
Regarding machine learning approaches to anaphora resolution in biomedical documents, Nguyen and Kim carry out a comparative study with three different corpora: MUC and ACE, covering the news domain, and Genia for biomedical documents. They build a machine learning-based pronoun resolver using a Maximum Entropy ranker model that selects the most likely antecedent from a set of candidates by using a large set of linguistic features, divided into baseline attributes such as pronoun type, number, gender, string and distance (mostly used in other approaches) and innovative features such as grammatical roles, most semantically appropriate candidate or context information about the anaphoric pronoun. From the latter group, the features that improved on the baseline for each corpus were selected, obtaining success rates of 79.55% (Genia), 64.61% (ACE) and 60.42% (MUC).
Summary of the main approaches to biomedical anaphora resolution
Castano et al. Corpus: 46 MedLine abstracts.
Lin et al. Corpus: 32 MedLine abstracts (MedStract). Results: F=0.92 pronominal, F=0.78 nominal.
Kim et al. Approach: Centering theory for pronominal anaphors and scoring method for nominal anaphors. Corpus: 120 biological interactions. Results: F=0.64 pronominal, F=0.59 nominal.
Liang and Lin. Corpus: MedStract + 100 MedLine abstracts. Results: F=0.87 pronominal, F=0.80 nominal.
Segura-Bedmar et al. Approach: Scoring method and a set of semantic and morphological restrictions. Corpus: 49 MedLine abstracts. Results: F=0.85 pronominal, F=0.50 nominal.
Nguyen and Kim. Approach: Maximum Entropy ranker model. Results: success rate 79.55%.
This section describes our approach for anaphora resolution in Drug-Drug Interaction documents. Figure 1 shows the pipeline architecture of our drug-drug interaction prototype. Firstly, texts are processed by the MMTx program. This tool performs sentence splitting, tokenization, POS-tagging, chunking, and linking of phrases with UMLS concepts. In this way, MMTx makes it possible to recognize a variety of biomedical entities occurring in texts. Then, drugs found in these documents are classified into drug families by the DrugNer system, which is based on a set of nomenclature rules recommended by the World Health Organization (WHO) International Nonproprietary Names (INNs) Program to identify and classify pharmaceutical substances. On this basis, anaphora resolution is carried out to account for both nominal phrases referring to drugs and pronouns. Finally, the output of the previous modules is sent to the relation extraction module, which will exploit this information in order to account for drug interactions in biomedical documents.
Two different stages have been distinguished in the creation of this corpus: compilation and annotation.
There is no corpus dedicated to the resolution of the anaphoric expressions occurring in drug interaction descriptions in pharmacological documents, so the first challenge was to build a corpus for research purposes.
DrugBank is an annotated database with about 4900 drug entries. Each entry contains more than 100 data fields that gather detailed chemical and pharmacological information (type, category, brand names, chemical formula, drug interactions, etc).
A collection of 49 unstructured, plain-text documents was taken randomly from the field 'interactions' in the DrugBank database. Documents have on average 40 sentences and 716 words. Documents were downloaded by using an automatic robot developed with the free tool openKapow.
Pronominal anaphora. In this case an entity is referred to by a pronoun: personal (it, they), reflexive (itself, themselves), relative (which, that) and distributive (both, each, either and neither). Pronominal forms in first and second person (I, me, you, your and who) were disregarded because they do not refer to drugs.
Nominal (phrase) anaphora. This is the case of an entity being referred to by a nominal phrase. These phrases consist of a definite article (the), possessive (its, their), demonstrative (this, these, those) or distributive (both, such, each, either, neither) followed by a generic term for drugs (such as antibiotic, medicine, medication, etc.) or a drug property or effect, e.g., 'these anticoagulants', 'its pharmacological effects'.
Distribution of pronominal anaphors in the corpus
Personal (it, they)
Reflexive (itself, themselves)
Relative (which, that)
Distributive (both, each, either, neither)
Demonstrative (these, this, those, that)
Indefinite (all, some, many, one)
Distribution of nominal anaphors in the corpus.
Possessive (its, theirs)
Distributive (both, each, either, neither)
Demonstrative (these, this, those, that)
Indefinite (other, another, all)
The anaphora resolution issue can be split into three different phases: identification of anaphoric expressions, determination of anaphor scope and selection of antecedents.
Rules to recognize pleonastic-it expressions.
IT [MODALVERB [NOT]?]? BE [NOT]? [ADJ|ADV|VP]* [THAT|WHETHER]
It is not known whether other progestational contraceptives are adequate methods of contraception during acitretin therapy.
IT [MODALVERB [NOT]?]? BE [NOT]? ADJ [FOR np] TO VP
If it is not possible to discontinue the diuretic, the starting dose of trandolapril should be reduced.
IT [MODALVERB [NOT]]? [SEEM|APPEAR|MEAN|FOLLOW] [THAT] *
It does not appear that the SSRIs reduce the effectiveness of a mood stabilizer in these populations
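The three rules above can be approximated with ordinary regular expressions over the surface text. This is only a minimal sketch under stated assumptions: the original rules operate on the part-of-speech and chunk analysis, whereas here small hand-written word lists (`AUX`, `BE`) and a generic "up to three words" filler stand in for the MODALVERB, ADJ, ADV and VP categories, and the third rule is simplified to require 'that'. All names are hypothetical.

```python
import re

# Assumption: closed word lists approximate the MODALVERB and BE categories.
AUX = r"(?:may|might|can|could|should|would|must|will|shall|do|does|did)"
BE = r"(?:is|was|be|been)"
PREFIX = rf"\bit\s+(?:{AUX}\s+(?:not\s+)?)?"

PLEONASTIC_PATTERNS = [
    # IT [MODALVERB [NOT]?]? BE [NOT]? [ADJ|ADV|VP]* [THAT|WHETHER]
    re.compile(PREFIX + rf"{BE}\s+(?:not\s+)?(?:\w+\s+){{0,3}}?(?:that|whether)\b"),
    # IT [MODALVERB [NOT]?]? BE [NOT]? ADJ [FOR np] TO VP
    re.compile(PREFIX + rf"{BE}\s+(?:not\s+)?\w+\s+(?:for\s+\w+\s+)?to\s+\w+"),
    # IT [MODALVERB [NOT]]? [SEEM|APPEAR|MEAN|FOLLOW] [THAT] *
    re.compile(PREFIX + r"(?:seems?|appears?|means?|follows?)\s+that\b"),
]


def is_pleonastic_it(sentence: str) -> bool:
    """Return True if the sentence contains a non-referential 'it'."""
    text = sentence.lower()
    return any(pattern.search(text) for pattern in PLEONASTIC_PATTERNS)


print(is_pleonastic_it("It is not known whether other progestational "
                       "contraceptives are adequate methods of contraception."))
print(is_pleonastic_it("It should be administered with caution."))
```

Occurrences of 'it' flagged in this way would be excluded from the list of anaphors to resolve; a word-list approximation like this one will miss or over-match cases that a tag-based rule handles correctly.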
Regarding nominal phrase anaphora, candidates were selected if attached to a drug family (analgesics, anticoagulants, etc.) or to a generic term for drugs (such as 'medicine', 'medication' or 'drug'). Candidates consisting of specific terms for drugs like 'aspirin', 'fluvoxamine', etc., were disregarded. To achieve this, our module uses the concept unique identifier (CUI) provided by MMTx to distinguish between generic terms and proper names for drugs.
Candidate anaphors consisting of a possessive article were restricted by only selecting those phrases attached to the semantic type 'Qualitative Concept' in UMLS, that is, those accounting for drug properties or effects, e.g., 'its pharmacological effect', 'their anticoagulant properties'.
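The selection of nominal anaphor candidates can be illustrated with a small pattern matcher. This is a minimal sketch, not the system's implementation: the system relies on MMTx concept identifiers and UMLS semantic types, while the sketch substitutes a tiny hand-written lexicon (`GENERIC_DRUG_TERMS`) as a stand-in, and it does not cover the 'Qualitative Concept' check for possessive phrases. The determiner list follows the nominal anaphora definition given earlier; the input sentence is an invented example.

```python
import re

# Determiners that can introduce a nominal anaphor.
DETERMINERS = ["the", "its", "their", "this", "these", "those",
               "both", "such", "each", "either", "neither"]

# Assumption: a toy lexicon of generic drug terms and drug families,
# standing in for the CUI-based lookup described in the text.
GENERIC_DRUG_TERMS = ["drug", "drugs", "medicine", "medicines",
                      "medication", "medications", "antibiotic",
                      "antibiotics", "analgesics", "anticoagulants"]

NOMINAL_ANAPHOR = re.compile(
    r"\b(?:%s)\s+(?:%s)\b" % ("|".join(DETERMINERS),
                               "|".join(GENERIC_DRUG_TERMS)),
    re.IGNORECASE,
)


def find_nominal_anaphor_candidates(text: str) -> list:
    """Return phrases made of a determiner followed by a generic drug term."""
    return [match.group(0) for match in NOMINAL_ANAPHOR.finditer(text)]


example = ("These anticoagulants may potentiate the effect of aspirin, "
           "so this drug should be used with caution.")
print(find_nominal_anaphor_candidates(example))
# ['These anticoagulants', 'this drug']
```

Specific drug names such as 'aspirin' are never returned because they are absent from the generic lexicon, which mirrors the restriction that only generic terms and drug families qualify as anaphors.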
Regular expressions to detect correlative expressions.
[BOTH|EITHER|NEITHER] [NP|PP|UNK] [AND|OR|NOR] [NP|PP|UNK]
These pharmacokinetic effects seen during diltiazem coadministration can result in increased clinical effects (e.g., prolonged sedation) of both midazolam and triazolam.
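The correlative pattern can be matched directly over a chunked sentence. The sketch below assumes a hypothetical chunker output in which each chunk is a `(tag, text)` pair; the tag names other than NP, PP and UNK are illustrative and do not reproduce the MMTx output format.

```python
from typing import List, Optional, Tuple

Chunk = Tuple[str, str]  # (chunk tag, chunk text); an assumed representation

CORRELATIVES = {"both", "either", "neither"}
CONJUNCTIONS = {"and", "or", "nor"}
PHRASE_TAGS = {"NP", "PP", "UNK"}


def find_correlative(chunks: List[Chunk]) -> Optional[Tuple[str, str]]:
    """Find [BOTH|EITHER|NEITHER] phrase [AND|OR|NOR] phrase; return both phrases."""
    for i in range(len(chunks) - 3):
        (_, w0), (t1, w1), (_, w2), (t3, w3) = chunks[i:i + 4]
        if (w0.lower() in CORRELATIVES and t1 in PHRASE_TAGS
                and w2.lower() in CONJUNCTIONS and t3 in PHRASE_TAGS):
            return (w1, w3)
    return None


chunks = [("PP", "of"), ("DET", "both"), ("NP", "midazolam"),
          ("CONJ", "and"), ("NP", "triazolam")]
print(find_correlative(chunks))  # ('midazolam', 'triazolam')
```

Recognizing the construction tells the resolver that 'both' here introduces a coordination rather than acting as an anaphor in need of an antecedent.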
Lexical patterns to determine grammatical number.
Exception for singular:
Rules to detect coordinative structures.
( [NP|PP|UNK],)* [NP|PP|UNK] [AND|OR|NOR] [NP|PP|UNK]
While all the selective serotonin reuptake inhibitors ( SSRIs ) e.g. fluoxetine, sertraline and paroxetine inhibit P450 2D6, they may vary in the extent of inhibition.
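A coordinative structure of this shape can be collected by locating each conjunction and then walking left over comma-separated phrases. This is a minimal sketch assuming the same hypothetical `(tag, text)` chunk representation; the function and tag names are illustrative only.

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (chunk tag, chunk text); an assumed representation

CONJUNCTIONS = {"and", "or", "nor"}
PHRASE_TAGS = {"NP", "PP", "UNK"}


def find_coordinations(chunks: List[Chunk]) -> List[List[str]]:
    """Return the members of each (phrase ,)* phrase CONJ phrase structure."""
    groups = []
    for j, (_, text) in enumerate(chunks):
        if text.lower() not in CONJUNCTIONS:
            continue
        if j == 0 or j + 1 >= len(chunks):
            continue
        if chunks[j - 1][0] not in PHRASE_TAGS or chunks[j + 1][0] not in PHRASE_TAGS:
            continue
        members = [chunks[j - 1][1], chunks[j + 1][1]]
        k = j - 1
        # Extend leftwards over "phrase ," pairs preceding the conjunction.
        while k >= 2 and chunks[k - 1][1] == "," and chunks[k - 2][0] in PHRASE_TAGS:
            members.insert(0, chunks[k - 2][1])
            k -= 2
        groups.append(members)
    return groups


chunks = [("NP", "fluoxetine"), ("PUNCT", ","), ("NP", "sertraline"),
          ("CONJ", "and"), ("NP", "paroxetine"), ("VP", "inhibit"),
          ("NP", "P450 2D6")]
print(find_coordinations(chunks))
# [['fluoxetine', 'sertraline', 'paroxetine']]
```

Grouping the coordinated drugs into one unit lets a plural anaphor such as 'they' be linked to the whole list rather than to a single member.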
The choice of a center (antecedent) for a certain anaphora is from the set of entities (centers) of the previous utterance (locality).
Entities mentioned in an utterance are more central than others according to this function (subject>object>other).
Each anaphoric expression in an utterance has exactly one antecedent (center).
In this approach, each anaphoric expression is associated with just one antecedent (third claim). This antecedent is taken from the previous ordered sequence of entities (centers) (first and second claims). Basically, the system tries to match an anaphoric expression against candidates in the same sentence, sorted by position from left to right. The more central an entity is, the more likely it is to be located on the left side of a sentence (subjects are usually at the beginning). If no antecedent matches, the system moves back to the previous sentence and again searches for antecedents from left to right.
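The search order described in this paragraph can be sketched as follows. The sketch assumes each sentence is already reduced to a left-to-right list of noun phrases and looks back exactly one sentence; the function names, the `matches` predicate and the toy input are hypothetical and only illustrate the ordering, not the full system.

```python
from typing import Callable, List, Optional


def centering_candidates(sentences: List[List[str]],
                         sent_idx: int, anaphor_idx: int) -> List[str]:
    """Candidates in search order: phrases preceding the anaphor in its own
    sentence (left to right), then the previous sentence (left to right)."""
    same_sentence = sentences[sent_idx][:anaphor_idx]
    previous = sentences[sent_idx - 1] if sent_idx > 0 else []
    return same_sentence + previous


def resolve(sentences: List[List[str]], sent_idx: int, anaphor_idx: int,
            matches: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate in centering order accepted by `matches`."""
    for candidate in centering_candidates(sentences, sent_idx, anaphor_idx):
        if matches(candidate):
            return candidate
    return None


# Toy input: noun phrases per sentence; the anaphor is 'this drug'.
sentences = [["ketoconazole", "midazolam"], ["this drug", "caution"]]
known_drugs = {"ketoconazole", "midazolam"}
print(resolve(sentences, 1, 0, lambda phrase: phrase in known_drugs))
# ketoconazole
```

Because candidates are scanned from the left, the leftmost compatible phrase (typically the subject) wins, which is how the subject>object>other preference is approximated by position.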
However, it was observed that Centering Theory cannot account for certain types of anaphoric expressions whose antecedents are in most cases found locally. Relative, reflexive and possessive anaphoric expressions find their antecedent in the immediately preceding context in most cases, so it was decided not to apply Centering Theory to these kinds of expressions and instead to link them to the closest nominal phrase that satisfies their semantic and morphological restrictions.
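For these locally resolved expressions the search simply runs in the opposite direction, from the anaphor backwards. A minimal sketch, assuming the phrases preceding the anaphor are available as a left-to-right list and that `is_compatible` (a hypothetical predicate) encodes the semantic and morphological restrictions:

```python
from typing import Callable, List, Optional


def closest_antecedent(phrases_before_anaphor: List[str],
                       is_compatible: Callable[[str], bool]) -> Optional[str]:
    """Return the nearest preceding phrase accepted by `is_compatible`."""
    for phrase in reversed(phrases_before_anaphor):
        if is_compatible(phrase):
            return phrase
    return None


# Toy input: phrases preceding a relative pronoun such as 'which'.
phrases = ["warfarin", "patients", "digoxin"]
known_drugs = {"warfarin", "digoxin"}
print(closest_antecedent(phrases, lambda phrase: phrase in known_drugs))
# digoxin
```

With a left-to-right centering order the same input would yield 'warfarin', so the direction of the scan is what distinguishes the two strategies.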
For each candidate in the ordered list selected in the previous phase, the system checks whether its linguistic features are consistent with the features of the anaphoric expression. Nominal phrases and pronouns must agree in number with their antecedents. Nominal phrases in a coordinative or appositive relation are taken as the same center (antecedent). Additionally, nominal phrase anaphors following centering restrictions are required to match nominal phrases representing drugs, in particular those phrases classified by MMTx under one of the following semantic types: pharmacological substance (phsu), antibiotic (antb) or clinical drug (clnd). Likewise, these phrases must not be abstract drugs (drug families or phrases such as 'the medicine' or 'this drug'), but specific drugs.
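These restrictions amount to a filter applied to each candidate in turn. The sketch below is one possible reading of the paragraph, under stated assumptions: the `Phrase` record, its field names and the toy candidates are hypothetical, number is reduced to 'sg'/'pl', and the semantic-type and abstract-drug checks are applied only to nominal anaphors.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# UMLS semantic type abbreviations named in the text.
DRUG_SEMANTIC_TYPES = frozenset({"phsu", "antb", "clnd"})


@dataclass
class Phrase:
    text: str
    number: str                      # "sg" or "pl"
    semantic_types: FrozenSet[str]   # semantic types assigned to the phrase
    is_abstract: bool = False        # drug family or generic term


def is_valid_antecedent(anaphor_number: str, anaphor_is_nominal: bool,
                        candidate: Phrase) -> bool:
    """Check number agreement and, for nominal anaphors, the drug restrictions."""
    if candidate.number != anaphor_number:
        return False
    if anaphor_is_nominal:
        if not (candidate.semantic_types & DRUG_SEMANTIC_TYPES):
            return False
        if candidate.is_abstract:
            return False
    return True


def select_antecedent(anaphor_number: str, anaphor_is_nominal: bool,
                      ordered_candidates: List[Phrase]) -> Optional[Phrase]:
    """Return the first candidate, in the given order, that passes all checks."""
    for candidate in ordered_candidates:
        if is_valid_antecedent(anaphor_number, anaphor_is_nominal, candidate):
            return candidate
    return None


# Toy candidates for the plural nominal anaphor 'these drugs'.
candidates = [
    Phrase("aspirin", "sg", frozenset({"phsu"})),
    Phrase("the medications", "pl", frozenset({"phsu"}), is_abstract=True),
    Phrase("warfarin and heparin", "pl", frozenset({"phsu"})),
]
chosen = select_antecedent("pl", True, candidates)
print(chosen.text if chosen else None)  # warfarin and heparin
```

The coordinated phrase 'warfarin and heparin' is modeled as a single plural center, reflecting the rule that phrases in a coordinative relation are taken as one antecedent.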
Global results for the baseline and the approach.
Results for pronominal anaphora resolution.
Results for nominal anaphora resolution.
Syntax changes from one domain to another. Most approaches in the biomedical domain deal with MedLine documents covering any biomedical topic, whereas our documents focus on drug interactions. Consequently, we consider that the language style of our documents is oriented towards the expression of such relations. Only works  and  deal with documents accounting for protein interactions.
Other works mostly address anaphora resolution by using a set of morphosyntactic properties, so resolution is determined by the way a document has been analyzed. For example, expressions like 'these drugs' or 'this medication' need to be analyzed by a knowledge resource that identifies and analyzes them both syntactically (they are nominal phrases in the subject, object or other type of position in the sentence) and semantically (they stand for drugs). Some approaches make use of open-domain analyzers like  with RASP. Conversely, other approaches make use of the Genia corpus, which has been manually tagged and does not contain annotation errors (this has a definite influence on results). The degree of precision in annotation is extremely important since results depend on it. Our system makes use of MMTx which, although it has proved useful for the analysis of biomedical texts, makes several syntactic and semantic parsing errors.
In our opinion, of the approaches referred to in the Background section, the BioAR system is the closest to ours. As discussed, that approach addresses anaphora resolution in the domain of protein interactions, relies on an ad-hoc tool called BioIE to deal with the morphosyntactic complexity of this kind of document, and resolves anaphors with an approach that also uses Centering Theory. As can be seen in Table 1, our work obtains results similar to those of BioAR for nominal phrase anaphora resolution and better results for pronominal anaphora.
Compiling a comprehensive database of drug-drug interactions is a relation extraction task that requires the resolution of anaphoric expressions in biomedical and pharmacological texts. It is believed that anaphora resolution would improve the recall of any extraction method and it would be particularly useful for semiautomated compilation of drug-drug interactions.
The described approach for anaphora resolution uses Centering Theory in order to select the scope of the anaphoric expressions and assign the correct antecedent. In contrast, a simple heuristic that selects the closest nominal phrase has proved experimentally useful in this domain for some types of expressions, namely relative pronouns and possessive nominal anaphors.
A key component of the approach is the use of several domain resources, including the MMTx biomedical parser and the UMLS meta-thesaurus. Other approaches that have dealt with biomedical documents have used domain-independent parsers that do not adequately handle the syntactic complexity of biomedical language, including terminology. Unfortunately, MMTx only provides shallow syntactic information, so full syntactic parsing can be expected to improve the performance of the linguistic rule-based analyzer. UMLS has been useful for identifying the anaphors and implementing semantic restrictions on candidate resolution.
Future work will consider the overall contribution of the anaphora resolution module to the broader task of drug-drug interaction extraction and its evaluation on a larger corpus. Although sources of interaction information like Medline abstracts and DrugBank may share a common literary style, the distribution of interactions is very different and also deserves investigation. Moreover, semantic information about drug families provided by DrugNer can be valuable for improving the resolution of certain nominal anaphors. For example, DrugNer could identify 'venlafaxine' as an antidepressant drug, which would help to correctly resolve the anaphor 'the antidepressant effect' in the following sentence: Coadministration of naloxone with venlafaxine did not modify the antidepressant effect.
Additional extensions of this work include extending the coverage of the approach to other kinds of biomedical entities (such as genes, diseases or drug targets), increasing the size of the corpus in order to draw more reliable conclusions, and applying machine learning techniques that have been successfully applied in other domains.
This research paper is supported by projects TIN2007-67407-C03-01 and S-0505/TIC-0267. The authors are grateful to Maria Segura Bedmar, manager of the Drug Information Center of the Mostoles University Hospital, Spain, for her valuable assistance in the annotation of the corpus and evaluation of the system.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 2, 2010: Third International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.