A linguistic rule-based approach to extract drug-drug interactions from pharmacological documents
© Segura-Bedmar et al; licensee BioMed Central Ltd. 2011
Published: 29 March 2011
A drug-drug interaction (DDI) occurs when one drug influences the level or activity of another drug. The increasing volume of scientific literature overwhelms health care professionals trying to keep up to date with all published studies on DDI.
This paper describes a hybrid linguistic approach to DDI extraction that combines shallow parsing and syntactic simplification with pattern matching. Appositions and coordinate structures are interpreted based on shallow syntactic parsing provided by the UMLS MetaMap tool (MMTx). Subsequently, complex and compound sentences are broken down into clauses from which simple sentences are generated by a set of simplification rules. A pharmacist defined a set of domain-specific lexical patterns to capture the most common expressions of DDI in texts. These lexical patterns are matched with the generated sentences in order to extract DDIs.
We have performed different experiments to analyze the performance of the different processes. The lexical patterns achieve a reasonable precision (67.30%), but very low recall (14.07%). The inclusion of appositions and coordinate structures improves recall (25.70%), but lowers precision (48.69%). The detection of clauses does not improve the performance.
Information Extraction (IE) techniques can provide an interesting way of reducing the time spent by health care professionals on reviewing the literature. Nevertheless, no approach has yet been proposed to extract DDI from texts. To the best of our knowledge, this work proposes the first integral solution for the automatic extraction of DDI from biomedical texts.
A DDI occurs when one drug influences the level or activity of another, for example, raising its blood levels and possibly intensifying its side effects or decreasing drug concentrations and thereby reducing its effectiveness. The detection of DDI is an important research area in patient safety since these interactions can become very dangerous and increase health care costs. Although there are different databases supporting health care professionals in the detection of DDI, these databases are rarely complete, since their update periods can reach three years . Drug interactions are frequently reported in journals of clinical pharmacology and technical reports, making medical literature the most effective source for the detection of DDI. Thus, the management of DDI is a critical issue due to the overwhelming amount of information available on them .
Information Extraction (IE) can be of great benefit in the pharmaceutical industry, allowing the identification and extraction of relevant information on DDI and providing an interesting way of reducing the time spent by health care professionals on reviewing the literature. Moreover, the development of tools for automatically extracting DDI is essential for improving and updating the drug knowledge databases. Nevertheless, no approach has yet been proposed to extract DDI from biomedical texts.
Most research has centered around biological relationships (genetic and protein interactions (PPI)) due mainly to the availability of annotated corpora in the biological domain, a fact that facilitates the evaluation of approaches. In general, current approaches can be divided into three main categories: linguistic-based, pattern-based and machine learning-based approaches.
Main approaches for PPI extraction
Link grammar + patterns
dependency parsing + pattern matching
Verspoora et al. 
semantic grammar + pattern matching
Link grammar parser + SVM1
Chen et al. 
Airola et al., 
Aimed, BioInfer, HPRD50, IEAP, LLL
The comparison among different works is not always possible because many of them have been evaluated on different corpora. Therefore, it is risky to draw conclusions on the performance of the different techniques. In general terms, the linguistic-based approaches perform well for capturing relatively simple binary relationships between entities in a sentence, but fail to extract more complex relationships expressed in various coordinate and relational clauses . We believe that the performance of linguistic-based approaches is strongly influenced by the shortage of biomedical parsers. General purpose parsers, which have been trained on generic newswire texts, are not able to deal with the complexity of the biomedical sentences that tend to cause problems due to their length and high degree of ambiguity .
Pattern-based approaches usually achieve high precision, but low recall. They are not capable of handling long and complex sentences, so common in biomedical texts. Furthermore, these approaches are limited by the extent of the patterns, since relations spanning several sentences cannot be detected by them. Linguistic phenomena including modality and mood, which can alter or even reverse the meaning of the sentence, have hardly ever been studied by the pattern-based approaches. Thus, pattern-based approaches are not able to correctly process anything other than short and straightforward sentences , which, on the other hand, are quite rare in biomedical texts.
In general, machine learning-based approaches have achieved better performance than linguistic-based and pattern-based ones, as demonstrated in the last BioCreative challenge. One important advantage of these approaches is that they can be easily extended to a new set of data or to a new task or domain. However, machine learning-based approaches depend heavily on annotated corpora for training and testing. Corpus annotation is expensive work, usually requiring extensive time and labor.
Although many approaches have been proposed to extract biomedical relations, only a few of them achieve successful results. One important reason is that only a few approaches have dealt with the issue of the complexity of biomedical sentences . However, language structures such as apposition, coordination and complex sentences are very common in the biomedical literature. We think that the detection of these linguistic phenomena is essential to successfully tackle the extraction of biomedical relations, in particular, DDI.
Most biomedical corpora (BioInfer, BioCreAtIvE-PPI or AIMed) have focused on describing genetic or protein interactions, but none contains DDI. While NLP techniques are relatively domain-portable, corpora are not. For this reason, we have created the first annotated corpus that studies the phenomena of interactions among drugs.
The DrugDDI corpus consists of 579 documents describing DDI. These documents were randomly selected from the DrugBank database and analyzed by the UMLS MetaMap Transfer (MMTx) tool, which performs sentence splitting, tokenization, POS-tagging, shallow syntactic parsing, and linking of phrases with Unified Medical Language System (UMLS) Metathesaurus concepts. Thus, MMTx makes it possible to recognize a variety of biomedical entities, including drugs. The DrugDDI corpus consists of 66,021 phrases, of which 22.6% (14,930) are drugs. It contains 3,775 sentences with two or more drugs, although only 2,044 sentences have at least one interaction. A total of 3,160 DDI were annotated at sentence level with the assistance of a pharmacist. The average number of interactions per document is 5.46 and per sentence 0.54.
Aspirin may decrease the effects [of probenecid] PP , [sulfinpyrazone] NP , and [phenylbutazone] NP
In order to extract them, it is necessary to interpret the coordinate structure in it: probenecid, sulfinpyrazone, and phenylbutazone, in which the conjunction and coordinates the conjunct probenecid with sulfinpyrazone and with phenylbutazone.
Although a wide variety of structures can be conjoined, not all coordinations are acceptable. The Coordination of Likes Constraint (CLC) (also called the Law of Coordination of Likes) asserts that syntactically different categories cannot be conjoined. However, based on observation of the corpus, this constraint is too restrictive for the kind of parsing provided by MMTx. For example, the above sentence demonstrates that being of the same syntactic category is too strong a requirement for conjuncts in a coordinate construction, since a prepositional phrase, of probenecid, can be conjoined with two noun phrases: sulfinpyrazone and phenylbutazone. In fact, we have observed in the corpus that coordinate structures involving constituents with different syntactic categories are very common. Sometimes this is because MMTx is not able to determine the syntactic type of a phrase and classifies it as an unknown phrase (that is, with the tag UNK).
Patterns to detect coordinate, correlative and appositive structures.
([NP|PP|ADJ|UNK],) * [NP|PP|ADJ|UNK] CONJ [NP|PP|ADJ|UNK]
(VP,) * VP CONJ VP
[BOTH|EITHER|NEITHER][NP|PP|UNK] [AND|OR|NOR] [NP|PP|UNK]
[NP | PP | UNK|APPOSITION]
APPOSITIVE (,)? (()? MARKER [APPOSITIVE (,)?]+ (AND|OR)? (APPOSITIVE)? ())?
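As an illustration only, the first coordinate pattern can be sketched as a regular expression over the sequence of chunk tags. The chunk representation (a list of `(tag, text)` pairs with commas as their own chunks), the function name `find_coordinations` and the optional comma before the conjunction are assumptions of this sketch, not part of the described system.

```python
import re

CONJUNCT_TAGS = {"NP", "PP", "ADJ", "UNK"}
CONJUNCT = r"(?:NP|PP|ADJ|UNK)"

# ([NP|PP|ADJ|UNK],)* [NP|PP|ADJ|UNK] CONJ [NP|PP|ADJ|UNK]
# Assumption: a comma is also tolerated right before CONJ ("a, b, and c").
COORD = re.compile(
    rf"(?<!\S)(?:{CONJUNCT} , )*{CONJUNCT} (?:, )?CONJ {CONJUNCT}(?!\S)"
)

def find_coordinations(chunks):
    """Return the conjunct texts of each coordinate structure found."""
    tags = " ".join(tag for tag, _ in chunks)
    found = []
    for match in COORD.finditer(tags):
        # Tags are space-separated, so counting spaces gives chunk indices.
        start = tags[:match.start()].count(" ")
        end = start + match.group().count(" ") + 1
        found.append([text for tag, text in chunks[start:end]
                      if tag in CONJUNCT_TAGS])
    return found

example_chunks = [
    ("NP", "Aspirin"), ("VP", "may decrease"), ("NP", "the effects"),
    ("PP", "of probenecid"), (",", ","), ("NP", "sulfinpyrazone"),
    (",", ","), ("CONJ", "and"), ("NP", "phenylbutazone"),
]
coordinations = find_coordinations(example_chunks)
```

On the aspirin example, the prepositional phrase and the two noun phrases are grouped into one coordinate structure, which is the behaviour that relaxing the Coordination of Likes Constraint is meant to allow.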
There are divergent views within Linguistics with regard to what is or is not an apposition (also called appositional or appositive structure). Some authors restrict the category of apposition to coreferential noun phrases (called appositives) that are juxtaposed and refer to the same extralinguistic entity. Others expand this definition to include constructions such as clauses and sentences as possible elements of an apposition. A third view admits as apposition only those constructions which can be linked by a marker of apposition.
Although the above approaches provide insights into the category of apposition, they provide either an inadequate or an incomplete description of apposition. The objective of this work is not to provide a formal and complete description of apposition, but rather to identify appositions, in particular, those that contain drugs. Thus, we only deal with appositions that are linked by a marker of apposition, since this kind of apposition appears frequently in the sentences that contain DDIs. Markers are helpful clues for detecting these structures. The markers of apposition that we have used in this approach are: such as, like, including, for example, e.g., and i.e. Appositions that are not linked by any marker are also frequent in scientific texts; however, the lack of markers makes the detection of this kind of apposition extremely difficult. Moreover, we have observed that they hardly ever occur in expressions describing DDI.
[Catecholamine-depleting drugs] NP , such as [Reserpine] NP , may have an additive effect when given [with beta-blocking agents] PP
(1) Catecholamine-depleting drugs with beta-blocking agents, and (2) Reserpine with beta-blocking agents.
Thus, it is essential to detect and resolve the appositions occurring in sentences, prior to the application of the lexical patterns responsible for DDI extraction. The appositions are firstly encapsulated and then unfolded when the relation is obtained by any lexical pattern.
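The encapsulate-then-unfold idea can be sketched as follows. This is a minimal illustration under assumed data structures: chunks are `(tag, text)` pairs, markers of apposition are pre-tagged as `MARKER`, and the function names are hypothetical.

```python
from itertools import product

NOMINAL = {"NP", "PP", "UNK"}

def encapsulate_appositions(chunks):
    """Merge 'head (,) MARKER appositive (, appositive)*' into one chunk.

    Every output chunk is (tag, list_of_member_texts).
    """
    out, i = [], 0
    while i < len(chunks):
        tag, text = chunks[i]
        j = i + 1
        if tag in NOMINAL:
            if j < len(chunks) and chunks[j][0] == ",":
                j += 1
            if j < len(chunks) and chunks[j][0] == "MARKER":
                members = [text]
                j += 1
                # Collect appositives, skipping commas and conjunctions.
                while j < len(chunks) and chunks[j][0] in NOMINAL | {",", "CONJ"}:
                    if chunks[j][0] in NOMINAL:
                        members.append(chunks[j][1])
                    j += 1
                out.append(("APPOSITION", members))
                i = j
                continue
        out.append((tag, [text]))
        i += 1
    return out

def unfold(left_members, right_members):
    """Expand a relation between two encapsulated groups into drug pairs."""
    return list(product(left_members, right_members))

apposition_chunks = [
    ("NP", "Catecholamine-depleting drugs"), (",", ","), ("MARKER", "such as"),
    ("NP", "Reserpine"), (",", ","), ("VP", "may have"),
    ("NP", "an additive effect"), ("VP", "when given"),
    ("PP", "with beta-blocking agents"),
]
encapsulated = encapsulate_appositions(apposition_chunks)
pairs = unfold(encapsulated[0][1], ["beta-blocking agents"])
```

Here `pairs` contains the two interactions listed for the example sentence: the head of the apposition and its appositive are each paired with beta-blocking agents.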
Biomedical texts usually consist of extremely long sentences. Long sentences are usually complex or compound-complex sentences, that is, contain two or more clauses. For example, the following sentence contains two independent clauses (marked with clause1 and clause2).
Coadministration of CRIXIVAN and other drugs [that inhibit CYP3A4] rel [may decrease the clearance of indinavir] clause 1 and [may result in increased plasma concentrations of indinavir] clause 2.
Both clauses have the same subject: Coadministration of CRIXIVAN and other drugs that inhibit CYP3A4. This subject includes a relative clause (marked with rel) whose subject is other drugs.
Lexical patterns to extract DDIs.
DRUG MODAL ? ADV ? INTERACT syn WITH WORD0..5 (OF)? DRUG
DRUG MODAL ? ADV ? INCREASE syn WORD0..5 (OF)? DRUG
DRUG MODAL ? ADV ? DECREASE syn WORD0..5 (OF)? DRUG
DRUG MODAL ? ADV ? ALTER syn WORD0..5 (OF)? DRUG
DRUG MODAL ? BE ADV ? INCREASE syn WORD0..5 (BY)? DRUG
DRUG MODAL ? BE ADV ? DECREASE syn WORD0..5 (BY)? DRUG
DRUG MODAL ? BE ADV ? ALTER syn WORD0..5 (BY)? DRUG
COADMINISTRATION OF DRUG (WITH|AND|PLUS) DRUG MODAL ? ADV ? [INCREASE syn |DECREASE syn |INTERACT syn |ALTER syn ]
COADMINISTRATION OF DRUG (WITH|AND|PLUS) DRUG MODAL ? BE ? ADV ? RESULT syn (TO|WITH|IN) [INCREASE syn |DECREASE syn |INTERACT syn |ALTER syn ]
CAUTION MODAL ? ADV ? BE ? USED WHEN DRUG WORD ? (WITH|AND|PLUS) DRUG BE ? ADMINISTERED syn CONCURRENTLY ?
PATIENTS TREATED (WITH)? DRUG (WITH|AND|PLUS) DRUG (CONCURRENTLY)? MODAL BE OBSERVED syn
INTERACTION (OF|BETWEEN) DRUG (AND|WITH|PLUS) DRUG MODAL? (BE)? WORD0..3 (OBSERVED syn |INCREASE syn |DECREASE syn |ALTER syn )
How does MMTx label the verb phrases?
Verb phrases detected by MMTx: [Formal drug interaction studies] NP [have] VP [not] ADV [been] V/be [conducted] VP [with ORENCIA.] PP
Verb phrases joined by the VP-pattern: [Formal drug interaction studies] NP [have not been conducted] VP [with ORENCIA.] PP
Verb phrases detected by MMTx: [The combination] NP [of methotrexate] PP [with acitretin] PP [is] V/be [also] ADV [contraindicated] VP
Verb phrases joined by the VP-pattern: [The combination] NP [of methotrexate] PP [with acitretin] PP [is also contraindicated] VP
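The two examples above suggest that adjacent verbal chunks are merged into a single verb phrase. A minimal sketch of such a merge is given below; the exact VP-pattern is not reproduced here, so the rule used (merge any maximal run of VP, ADV and V/be chunks that contains at least one verb) is an assumption, as is the function name.

```python
VERBAL_TAGS = {"VP", "ADV", "V/be"}

def join_verb_phrases(chunks):
    """Merge runs of adjacent VP / ADV / V/be chunks into one VP chunk."""
    joined, run = [], []

    def flush():
        if not run:
            return
        if any(tag != "ADV" for tag, _ in run):
            # The run contains a verb: emit it as a single verb phrase.
            joined.append(("VP", " ".join(text for _, text in run)))
        else:
            # A run of adverbs only is left untouched.
            joined.extend(run)
        run.clear()

    for tag, text in chunks:
        if tag in VERBAL_TAGS:
            run.append((tag, text))
        else:
            flush()
            joined.append((tag, text))
    flush()
    return joined

mmtx_chunks = [
    ("NP", "Formal drug interaction studies"), ("VP", "have"), ("ADV", "not"),
    ("V/be", "been"), ("VP", "conducted"), ("PP", "with ORENCIA."),
]
joined_chunks = join_verb_phrases(mmtx_chunks)
```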
Once it has been determined that the sentence contains two or more clauses, the following step is to determine the type of sentence. Such information will be very useful in detecting the clause boundaries. In the English language, a compound sentence is composed of two or more independent clauses joined by a conjunction that can be a coordinator (coordinating conjunction: for, and, nor, but, or, yet, so), a correlative conjunction (both, either, whether... or; not only... but also) or an independent marker word (however, moreover, furthermore, consequently, nevertheless, therefore). Semicolons and commas can also function as conjunctions. If an independent marker occurs at the beginning of the sentence, then a semicolon or a comma should separate the clauses. If the second independent clause starts with an independent marker, then a semicolon or a comma is needed before the marker . The independent markers can also occur in simple sentences, as in the following sentence: However, initial dose modification is generally not necessary.
A complex sentence has an independent clause joined with one or more subordinate clauses. Subordinate clauses contain both a subject and a verb, but do not express a complete thought. A complex sentence always has a relative pronoun (who, that, which, whoever, whom, whomever, whose, whichever, whatever) or a subordinator (after, although, as, as if, because, before, even if, even though, if, in order to, since, though, unless, until, whatever, whether, when, whenever, while) that links the clauses. If the complex sentence begins with a subordinator, that is, the subordinate clause is at the beginning of the sentence, then the subordinate clause should end with a comma. On the other hand, if the independent clause is at the beginning of the sentence and the subordinator is in the middle, then no comma is required.
Initial patterns for clause splitting.
CLAUSE1(,|;)? [indepMarker|coordinator |;|,] CLAUSE2
indepMarker (,) ? CLAUSE1[,|;] CLAUSE2
depMarker (,) ? CLAUSE subordinate , CLAUSE main
CLAUSE main [depMarker | ; |,] CLAUSE subordinate
relativePronoun (NP|PP|UNK|ADJ|APOS|COORD)? VP [NP|PP|UNK|ADJ|APOS|COORD]
In brief, the algorithm works as follows. The input of the algorithm is the sentence in which the verb phrases have been joined by the VP-pattern. First of all, the algorithm must check that the sentence contains two or more clauses. Then, the sentence is processed for as long as it contains any separator marker. A separator marker can be a coordinator, an independent marker, a dependent marker, a semicolon or a comma. The coordinators and subordinators must be labeled by MMTx as CONJ phrases; otherwise, they are not considered conjunctions. Then, the algorithm iteratively finds candidate clauses, that is, substrings of the sentence between markers. If a candidate clause contains a verb phrase, then it is considered a clause. The algorithm is able to decide the kind of clause, that is, independent or subordinate.
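The core loop described above can be sketched in a few lines. This is a simplified illustration rather than the algorithm itself: it does not classify clauses as independent or subordinate, it assumes that coordinations have already been encapsulated into single chunks (tagged `COORD` here), and the tag `INDEP` for independent markers and the rule for attaching verbless fragments are assumptions of the sketch.

```python
SEPARATOR_TAGS = {"CONJ", "INDEP", ";", ","}

def has_verb(chunks):
    return any(tag == "VP" for tag, _ in chunks)

def split_clauses(chunks):
    """Split a chunked sentence into candidate clauses at separator markers."""
    # A sentence with fewer than two verb phrases is treated as one clause.
    if sum(tag == "VP" for tag, _ in chunks) < 2:
        return [list(chunks)]

    # Cut the chunk sequence at every separator marker.
    candidates, current = [], []
    for tag, text in chunks:
        if tag in SEPARATOR_TAGS:
            candidates.append(current)
            current = []
        else:
            current.append((tag, text))
    candidates.append(current)

    # A candidate is a clause only if it contains a verb phrase.
    # Assumption: verbless fragments attach to a neighbouring clause.
    clauses, pending = [], []
    for candidate in candidates:
        if has_verb(candidate):
            clauses.append(pending + candidate)
            pending = []
        elif clauses:
            clauses[-1].extend(candidate)
        else:
            pending.extend(candidate)
    return clauses

sentence_chunks = [
    ("NP", "Trimeprazine"), ("VP", "also decreases"), ("NP", "the effect"),
    ("COORD", "of heparin and oral anticoagulants"), (",", ","),
    ("CONJ", "while"), ("NP", "MAOIs"), ("VP", "can increase"),
    ("NP", "the effect"), ("PP", "of trimeprazine"),
]
clause_texts = [" ".join(text for _, text in clause)
                for clause in split_clauses(sentence_chunks)]
```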
Once appositions and coordinate structures have been recognized, and compound and complex sentences have been split into clauses, it is possible to apply a set of rules for sentence simplification. These rules turn complex and compound sentences into simple sentences. Then, the pattern-based approach for DDI extraction is applied to these simpler sentences.
Rules to generate new simplified sentences from the clauses. The clause CLAUSE REL ( NP ) means that it is attached to the noun phrase NP.
MARKER(,)? CLAUSE1, CLAUSE2
CLAUSE1(,)? MARKER CLAUSE2
CLAUSE1 NP CLAUSE REL ( NP ) CLAUSE2
(1) CLAUSE1 NP CLAUSE2
(2) NP CLAUSE REL ( NP )
[Because] MARKER [busulfan is eliminated from the body via conjugation with glutathione] [use of acetaminophen prior to (72 hours) or concurrent with BUSULFEX may result in reduced busulfan clearance based upon the known property of acetaminophen to decrease glutathione levels in the blood and tissues] .
[Although] MARKER [the interactions observed in these studies do not appear to be of major clinical importance] , [BREVIBLOC should be titrated with caution in patients being treated concurrently with digoxin, morphine, succinylcholine or warfarin.]
[Trimeprazine also decreases the effect of heparin and oral anticoagulants,] [while] MARKER [MAOIs can increase the effect of trimeprazine.]
Since the excretion of oxipurinol is similar to that of urate, uricosuric agents, which increase the excretion of urate, are also likely to increase the excretion of oxipurinol and thus lower the degree of inhibition of xanthine oxidase.
Since the excretion of oxipurinol is similar to that of urate, uricosuric agents are also likely to increase the excretion of oxipurinol and thus lower the degree of inhibition of xanthine oxidase.
Uricosuric agents (which) increase the excretion of urate.
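The relative-clause rule (CLAUSE1 NP CLAUSE REL(NP) CLAUSE2) can be illustrated with a small function that regenerates the two simplified sentences. The function name and its string-based interface are hypothetical; dropping the relative pronoun in the second sentence follows the uricosuric agents example above.

```python
RELATIVE_PRONOUNS = {"who", "that", "which", "whoever", "whom", "whomever",
                     "whose", "whichever", "whatever"}

def simplify_relative(clause1, noun_phrase, relative_clause, clause2):
    """Return (main sentence, sentence built from the relative clause)."""
    # (1) CLAUSE1 NP CLAUSE2: the relative clause is removed.
    main = " ".join(part for part in (clause1, noun_phrase, clause2) if part)

    # (2) NP CLAUSE_REL: the noun phrase becomes the subject and the
    # leading relative pronoun is dropped.
    words = relative_clause.split()
    if words and words[0].lower() in RELATIVE_PRONOUNS:
        words = words[1:]
    relative = f"{noun_phrase} {' '.join(words)}."
    relative = relative[0].upper() + relative[1:]
    return main, relative

main_sentence, relative_sentence = simplify_relative(
    "Since the excretion of oxipurinol is similar to that of urate,",
    "uricosuric agents",
    "which increase the excretion of urate",
    "are also likely to increase the excretion of oxipurinol and thus "
    "lower the degree of inhibition of xanthine oxidase.",
)
```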
ADV is any adverbial except ’NOT’. For example, also, potentially, etc.
INTERACT syn =[INTERACT|INTERFERE]
INCREASE syn =[AUGMENT|ELEVATE|ENHANCE|EXACERBATE|EXTEND|GO_UP|INCREASE|INTENSIFY|POTENTIATE|PROMOTE|PROLONG|RAISE|RISE|STIMULATE]
DECREASE syn =[DECREASE|DIMINISH|LESSEN]
ALTER syn =[ACCELERATE|ANTAGONIZE|ALTER|CHANGE|INDUCE|INFLUENCE|INHIBIT]
RESULT syn =[RESULTS|ASSOCIATED|SHOWN|RESULTED|OBSERVED|DETERMINED]
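To make the notation concrete, the pattern DRUG MODAL? ADV? INCREASE syn WORD0..5 (OF)? DRUG can be approximated by a regular expression over text in which drug names have already been replaced by labels. Several parts of this sketch are assumptions: the labels are written `DRUG1`, `DRUG2`, the modal list is illustrative, adverbs are approximated without POS tags as "also" or any word ending in "-ly", and the inflection handling is deliberately crude (for example, it misses "intensifies").

```python
import re

INCREASE_SYN = ["AUGMENT", "ELEVATE", "ENHANCE", "EXACERBATE", "EXTEND",
                "GO_UP", "INCREASE", "INTENSIFY", "POTENTIATE", "PROMOTE",
                "PROLONG", "RAISE", "RISE", "STIMULATE"]

DRUG = r"DRUG\d+"
MODAL = r"(?:may|might|can|could|will|would|should)"  # assumed list
ADV = r"(?:also|\w+ly)"  # assumed approximation; never matches "not"
WORD_0_5 = r"(?:(?!DRUG\d)\S+ ){0,5}"  # up to five non-drug words

def inflections(lemma):
    """Crude regex for a few inflected forms of a verb lemma (assumption)."""
    word = lemma.lower().replace("_", " ")
    if word.endswith("e"):
        return re.escape(word[:-1]) + r"(?:e|es|ed|ing)"
    return re.escape(word) + r"(?:s|ed|ing)?"

INCREASE = "(?:" + "|".join(inflections(lemma) for lemma in INCREASE_SYN) + ")"

P_INCREASE = re.compile(
    rf"({DRUG}) (?:{MODAL} )?(?:{ADV} )?{INCREASE} {WORD_0_5}(?:of )?({DRUG})",
    re.IGNORECASE,
)

hit = P_INCREASE.search("DRUG1 may increase the plasma concentrations of DRUG2.")
miss = P_INCREASE.search("DRUG1 may not increase the plasma concentrations of DRUG2.")
```

Because the adverb slot cannot match "not", the negated sentence is not matched at all, which mirrors the constraint that ADV is any adverbial except NOT.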
This section explains in detail the experiments that we have carried out to evaluate the performance of the DDI extraction. As a baseline system, called allDDIs, we consider the case in which every pair of drugs that co-occur in a sentence is assumed to interact. This baseline yields the maximum recall, but low precision (11%) and a baseline F-measure of 19%. In the most basic experiment, neither coordinations, appositions nor clauses are tackled; that is, the lexical patterns are directly applied to the text of sentences. First of all, sentences are parsed by MMTx and drug names are identified by the DrugNer system. Then, only those sentences that contain two or more drug names are selected and the drug names are replaced by the label DRUG. index, where index shows the order of each drug in the list of drugs that occur in the sentence. Finally, the set of lexical patterns is applied to the text of the sentence.
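The preprocessing step and the allDDIs baseline can be sketched as follows. The label format (`DRUG1`, `DRUG2`, ...) and the function names are assumptions made for the illustration, and drug names are taken as already recognized and listed in order of occurrence.

```python
from itertools import combinations

def label_drugs(sentence, drug_names):
    """Replace each recognized drug name by an indexed DRUG label."""
    labelled = sentence
    for index, name in enumerate(drug_names, start=1):
        # Replace only the first remaining occurrence of each name.
        labelled = labelled.replace(name, f"DRUG{index}", 1)
    return labelled

def all_ddis(drug_names):
    """Baseline: every pair of co-occurring drugs is proposed as a DDI."""
    return list(combinations(drug_names, 2))

drugs = ["Aspirin", "probenecid", "sulfinpyrazone", "phenylbutazone"]
labelled_sentence = label_drugs(
    "Aspirin may decrease the effects of probenecid, sulfinpyrazone, "
    "and phenylbutazone",
    drugs,
)
baseline_pairs = all_ddis(drugs)
```

For a sentence with four drugs the baseline proposes six pairs, which illustrates why it reaches maximum recall at the cost of precision.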
When a sentence has been correctly matched with a pattern, it must be checked if the matching string includes the negative adverb (NOT). If it is not included, then a possible interaction has been found. Drug names that occur in the matching are retrieved, and the pair of drug names is proposed as a DDI.
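A minimal sketch of this check is shown below. The pattern used is a simplified, hypothetical stand-in for one lexical pattern (it lets up to two arbitrary words precede the verb so that a negation can fall inside the match), and `drugA` and `drugB` are placeholder names.

```python
import re

# Hypothetical stand-in for a lexical pattern, not one of the patterns above.
PATTERN = re.compile(r"(DRUG\d+) (?:\S+ ){0,2}increases? (?:\S+ ){0,5}?(DRUG\d+)")
NEGATION = re.compile(r"\bnot\b", re.IGNORECASE)

def propose_ddi(labelled_sentence, drug_names):
    """Return the interacting pair of drug names, or None."""
    match = PATTERN.search(labelled_sentence)
    if match is None:
        return None
    # Discard the match if the matched string includes the negative adverb.
    if NEGATION.search(match.group(0)):
        return None
    # Map the DRUG labels back to the original drug names.
    first, second = (int(label[len("DRUG"):]) for label in match.groups())
    return drug_names[first - 1], drug_names[second - 1]

names = ["drugA", "drugB"]
accepted = propose_ddi("DRUG1 also increases the levels of DRUG2", names)
rejected = propose_ddi("DRUG1 does not increase the levels of DRUG2", names)
```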
Table 8 shows the global and individual pattern performance. The basic experiment achieves a reasonable precision (67.30%), but very low recall (14.07%). The average number of DDI detected by each pattern is 35.5 (the total number of DDI in the DrugDDI corpus is 3,160). Regarding the individual pattern performance, the highest recall is achieved by pattern P2 and the highest precision by pattern P8. In the second experiment, recall is improved by the inclusion of the appositions and coordinate structures, but precision is lower. The average number of DDI detected by each pattern is 64.83. Pattern P2 still achieves the highest recall, and the highest precision is obtained by pattern P10.
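For reference, the F-measure with β = 1 is the harmonic mean of precision and recall. The short sketch below applies the standard formula to the global precision and recall figures quoted in the text; the resulting values are derived here and are not copied from Table 8.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; precision and recall given in the same unit (e.g. %)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Global precision and recall (%) reported for the two experiments.
f_basic = f_measure(67.30, 14.07)         # lexical patterns only
f_coord_appos = f_measure(48.69, 25.70)   # plus appositions and coordination

print(f"basic: {f_basic:.2f}%  with appositions/coordination: {f_coord_appos:.2f}%")
```

The gain in recall outweighs the loss in precision, so the derived F-measure rises from roughly 23% to roughly 34% between the two experiments.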
The last experiment combines the detection of appositions and coordinate structures with clause splitting and the simplification rules. First of all, appositions and coordinate structures are detected by applying the previously described procedure (algorithm 2) step by step up to the sixth step. Then, algorithm 1 is applied to the sentences in order to split complex and compound sentences into their clauses. New sentences are generated from these clauses by the simplification rules. Finally, the pattern matching procedure (algorithm 2) is applied to these new sentences from the seventh step onward.
Evaluation of linguistic structures resolution.
Rest of Clauses
Results on DDI extraction are shown in Table 8. While the inclusion of appositions and coordinate structures improved recall, and therefore the F-measure, the detection of clauses did not improve overall performance.
Although we are aware that the evaluation of the syntactic simplification is too shallow to reach definite conclusions about performance, it suggests that the chaining of errors may have a larger impact. In addition, many interactions occurring in complex sentences span several clauses (for example, The Cmax of norethindrone was 13% higher when it was coadministered with gabapentin). The lexical patterns are not able to capture these interactions, which would require a more complex semantic interpretation.
In this paper, we have proposed a hybrid method that combines the resolution of complex linguistic constructions and pattern matching.
Regarding the resolution of the linguistic constructions, as pointed out in the Results section, most of the errors are due to mistakes introduced at the MMTx level and to the difficulty of resolving nested clauses, which are so frequent in biomedical texts. Also, we are aware that our clause splitting method is too simplistic to deal with the complexity of biomedical sentences.
While studies have not shown DRUG 1 interact with DRUG 2, caution should be exercised.
A deeper treatment of negation should discover that the phrase studies have not shown has a larger scope that includes the interaction.
Future directions include trying to identify and resolve the errors of MMTx and analyzing the effect on the DDI extraction performance, improving our clause splitting algorithm, proposing new suitable simplification rules to regenerate the simple sentences from clauses, checking what occurs if the resolutions are applied in a different order, studying the utility of other corpora such as Genia-GR  or Penn Treebank  and other parsers such as Stanford  or MiniPar , and increasing the size of the corpus and annotating it with these linguistic constructions. In addition, we will carry out a more exhaustive treatment of negation and modality in sentences. We will also study the overall contribution of our anaphora resolution approach  to the broader task of DDI extraction.
Concerning the performance in the extraction of DDI, the variability of natural language expression makes it difficult for our method to accurately detect all semantic relations occurring in text since sentences conveying the same relation may be composed lexically and syntactically differently. Inversely, sentences that are lexically common may not necessarily convey the same relation. Thus, our lexical patterns are not enough to identify many of the interactions. Future work will include the application of bootstrapping techniques to find additional patterns like the SPINDEL system . Continuing the work presented in , we also plan to apply advanced machine learning techniques to extract DDIs.
This work has been partially supported by the Spanish research projects: MA2VICMR consortium (S2009/TIC-1542, http://www.mavir.net), a network of excellence funded by the Madrid Regional Government and TIN2007-67407-C03-01 (BRAVO: Advanced Multimodal and Multilingual Question Answering). The authors are grateful to María Segura Bedmar, manager of the Drug Information Center of the Móstoles University Hospital, Spain, for her valuable assistance in the annotation of the corpus and evaluation of the system.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 2, 2011: Fourth International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S2.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.