Skip to main content

Extracting causal relations on HIV drug resistance from literature



In HIV treatment it is critical to have up-to-date resistance data of applicable drugs since HIV has a very high rate of mutation. These data are made available through scientific publications and must be extracted manually by experts in order to be used by virologists and medical doctors. Therefore there is an urgent need for a tool that partially automates this process and is able to retrieve relations between drugs and virus mutations from literature.


In this work we present a novel method to extract and combine relationships between HIV drugs and mutations in viral genomes. Our extraction method is based on natural language processing (NLP) which produces grammatical relations and applies a set of rules to these relations. We applied our method to a relevant set of PubMed abstracts and obtained 2,434 extracted relations with an estimated performance of 84% for F-score. We then combined the extracted relations using logistic regression to generate resistance values for each <drug, mutation> pair. The results of this relation combination show more than 85% agreement with the Stanford HIVDB for the ten most frequently occurring mutations. The system is used in 5 hospitals from the Virolab project to preselect the most relevant novel resistance data from literature and present those to virologists and medical doctors for further evaluation.


The proposed relation extraction and combination method has a good performance on extracting HIV drug resistance data. It can be used in large-scale relation extraction experiments. The developed methods can also be applied to extract other type of relations such as gene-protein, gene-disease, and disease-mutation.


The Human immunodeficiency virus (HIV) is the cause of acquired immunodeficiency syndrome (AIDS). HIV infection is now recognized as a pandemic. As of January 2006 the World Health Organization estimate that AIDS has killed over 25 million people since it was first recognized in 1981 [1]. Treatment of HIV infection consists of highly active antiretroviral therapy (HAART), a multi-drug treatment and has been shown to be effective in suppressing viral replication in many patients. However, the long-term use of these drugs leads to drug resistance caused by the viral mutations that occur under drug pressure. The resulting treatment failure requires new treatment regimens that can suppress the new mutations [2]. Therefore, in HIV treatment, it is critical to have up-to-date drug resistance data for selecting a treatment regimen to which the virus is still susceptible in the presence of resistant mutations.

To assist physicians in selecting the most suitable treatment regimen, currently there are two methods available to predict HIV drug resistance: a rule-based approach [3] and recently a computational approach [4, 5]. For the former systems such as Stanford HIVDB and RegaDB, HIV drug resistance data are updated with resistance data manually gleaned from scientific publications by experts in this field. However, the amount of biomedical literature regarding to HIV drug resistance is increasing rapidly and it is becoming highly labour intensive for experts to collect reliable drug resistance information in a convenient and effective manner. Thus, a significant amount of drug resistance data remains hidden in biomedical literature. Therefore there is a need for computational methods that automate parts of this process and that can assist in retrieving and updating causal relations between drugs and virus mutations from literature.

Several approaches for extracting relations of interest (e.g. protein-protein, gene-protein) in biomedical texts have been reported [6]. The approaches range from co-occurrence to natural language processing (NLP) techniques. Co-occurrence is the simplest approach for relation extraction of entities within sentences. It assumes that if two entities are repeatedly mentioned together, they are somehow related. This approach provides high sensitivity (measuring the 'coverage') but very low specificity (measuring the accuracy) [7].

Other approaches use pattern-based techniques to extract relations that increase specificity, unfortunately at the cost of significantly lower sensitivity [8]. The patterns are either manually defined or automatically learned through annotated data. Manual patterns are generated by domain experts through the analysis of entities connected by a specific relation from text. Automatic patterns are generated by learning from text surrounding entity pairs known to have the relationship of interest. However, the more detailed the analysis of the text, the more patterns must be taken into account to deal with the large amount of surface grammatical variation in the texts [9].

Systems that are based on NLP techniques use either shallow parsing, which divides the sentence into chunks [10, 11] or full parsing, which provides complete syntactic analysis of sentence structures. Since full parsing produces more elaborate syntactic information than shallow parsing, relation extraction systems based on full parsing can potentially provide better results [12]. The output of the parser is represented as constituent parse trees or dependency parse trees. Based on syntactic patterns or the shortest path between entities in the dependency trees, two approaches can then be applied to extract relations from parse trees: either a rule set which is manually defined [1316] or machine learning techniques (e.g. SVM) are used [1719].

Recent relation extraction methods focus on extraction of protein-protein interactions or protein-gene interactions [2022]; a limited number of methods also deal with contradiction of extracted relations by assigning a strength score based on the amount of contradiction [23, 24]. Much less attention has been paid to the extraction of other types of relationships and combination of extracted relations: this research area still remains largely untouched [25].

In this paper, we introduce a novel method to extract and combine relationships between mutations in viral genomes and HIV drugs, hereafter referred to as causal relations, which express changes in the resistance to the HIV drugs which are attributed to the presence or absence of certain mutations on the HIV genome. Our system distinguishes itself from previous research on relation extraction in a number of ways. First, we apply rules to extract relations from grammatical relations of sentence constituents. Next, we combine extracted relations to generate a unique resistance value for each <drug, mutation> pair. To the best of our knowledge, this is the first attempt to apply automatic relation discovery in the field of HIV drug-ranking.


The work-flow of the proposed method is shown in Figure 1. The system consists of the following components:

Figure 1
figure 1

Work flow of the system.

  1. 1.

    Text retrieval

  2. 2.

    Text preprocessing

  3. 3.

    Relation extraction

The text retrieval component collects relevant abstracts from PubMed and filters out irrelevant sentences. The text preprocessing component then simplifies sentences, parses them using the Stanford Lexicalized Parser version 1.6 [26] and applies grammatical relations to generate sentence components. The relation extraction component applies a set of rules to sentence components to extract candidate relations. Finally, the extracted relations are combined using a logistic regression classifier.

As the combination of relation extraction methods has previously been shown to increase overall system performance [23, 27], we decided to combine both co-occurrence and NLP methods to enhance the overall performance. For the text retrieval, we apply co-occurrence to obtain high sensitivity, while in the relation extraction phase, we apply NLP and rule-based methods to gain high specificity.

Text retrieval

The text retrieval phase consists of two steps: collection of abstracts and selection of candidate sentences from those abstracts. To collect relevant abstracts, we prepare a list of drug names by collecting them from websites related to HIV treatment such as the Stanford HIVDB and RegaDB. The system queries PubMed using the drug names as keywords. The obtained abstracts in XML format are parsed using the LingPipe parser and then stored in a local database. Next, abstracts are split into sentences and the system selects candidate sentences that belong to either one of the following cases:

  • A single sentence: if a sentence contains at least one mutation and one drug then it is selected.

  • An inter-sentence: if two sentences are adjacent, one sentence contains at least one drug, and the other contains at least one mutation, then these sentences are selected.

In order to identify mutations in text, the system uses regular expressions. The regular expression for a mutation consists of an optional single letter code for the amino acid followed by a position consisting of one to three digits and ending with an amino-acid code letter [28]. Examples are K65R, I84V, and 103N. Groups of amino acids which can appear as mutations at a single position are notated with the separators "/" or "-", such as 54A/M/V.

Text preprocessing

The text preprocessing phase consists of three steps: simplification of sentences, parsing the simplified sentences, and generating grammatical relations.

Simplifying sentences

Generic English parsers tend to perform poorly when applied directly to biomedical texts [29]. This is because the sentences in abstracts for such texts frequently use long and complex noun phrases and contain technical terms which are specific to the biomedical domain. For these reasons we simplify the sentences in a number of ways to make them more amenable to the parser. This process has been proposed in previous work by [15]. We further enhance this process by grouping mutations and drugs. The simplification process consists of 5 steps:

Removing parenthetical remarks

Words inside a pair of parenthesis () are removed except those that contain drug names or mutations.

Replacing "known" terms

Common terms such as "human immunodeficiency virus type 1 (HIV-1)" are replaced by their well established abbreviations.

Grouping mutation and drug names

The drug names and or mutations in sentences are replaced by a predefined name. In case there is an enumerated list of drug names/mutations (either conjunctive or disjunctive), the system also replaces this group by a new name. For each sentence, the system maintains a list of generated words with the original words as a reference to be used in the extraction phase.

Normalizing sentences

Special characters, such as "-", "+" or "/" between words, may cause parse errors and are therefore removed.

Anaphora resolution

A simple anaphora resolution algorithm is implemented to resolve a list of predefined pronouns such as this drug, these drugs, etc., which refer to drug names or mutations in the sentence.

The following example illustrates the result of this simplification process:

  • Original sentence: 'A371V and Q509L increased resistance to lamivudine and abacavir, but not stavudine or didanosine'.

  • Simplified sentence: 'MUTATION0 increased resistance to DRUG0, but not DRUG1'.

Recognized keywords

In addition to mutations and drug names, the system also recognizes relation words and manner words. We prepared a list of relation words that indicate causal relations between drugs and mutations by manually analyzing sample sentences. This list is shown in Table 1. Furthermore, during this process we also collected adjectives and adverbs that describe the 'manner' of the relation such as high, strong, full, low, weak, etc., as shown in Table 2. For each sentence, a list of these keywords is maintained and used in the extraction phase. For more details of the simplification process see Additional file 1.

Table 1 Examples of relation words and their categories
Table 2 Examples of manner words and their corresponding groups.

Parsing sentences and generating grammatical relations

Before parsing, each simplified sentence is checked for a triplet <mutation, relation, drug>, in which mutation and drug are predefined names resulting from the simplification process. Sentences containing the required triplet are parsed. The parser generates the output in the form of the Penn Treebank. Figure 2 shows an example of the Stanford parser output. The input for the parser is the simplified sentence of the previous step.

Figure 2
figure 2

Penn Treebank output of the Stanford parser.

The parse trees are then subjected to a set of English grammatical relation rules which is bundled with the parser to generate sentence constituents such as subject, object, preposition etc, which are then used as the input of relation extraction phase. The built-in rule set consists of 49 rules, however, we only apply 11 rules that generate the most common relations, and this is shown in Table 3.

Table 3 Main grammatical relations and some of their values generated from parse tree in Figure 2.

Relation extraction

Most of the relations in biomedical texts in the English language can be expressed in two main forms:

  • Clause form: a relation between entities is expressed by a relational verb in the form of subject and predicate (A - relation - B).

  • Phrase form: a relation between entities is expressed by a relational noun and makes use of prepositions to connect entities (Relation - A - B).

Based on these relation forms, we define two rules:

Rule 1a : This rule applies to relations in the following form:

This is the most common relation form found in texts. If keyword1 is MUTATION then keyword2 is DRUG and vice versa. The procedure to extract relation of rule 1a is carried out as follows:

  • Input: lists of sorted relation words, manner words, predefined keywords and components that belong to the predicate of the current clause.

  • Requirement: The nominal subject (nsubj) must contain a predefined keyword (MUTATION/DRUG).

Step 1: Find a relation pair:

  1. a.

    Pick a relation word from the sorted list of relation words.

  2. b.

    Find a keyword from the sorted list of predefined keywords at distance 1 to 4 words from the relation word. This keyword either belongs to the same component as the relation word or belongs to an adjacent component. If found, go to step 2, otherwise pick another relation word from the list.

Step 2: Find manner words:

  1. a.

    Find a manner word from the sorted list of manner words at distance 1 to 3 words from the relation word, this manner word either belongs to the same component as the relation word or belongs to an adjacent component.

  2. b.

    Continue to find other manner words at distance 1 to 2 words from this manner word.

Step 3: Extract a relation:

  1. a.

    Form a pattern to extract a relation with the data found in step 1 and step 2.

  2. b.

    If the list of relation words is not empty then go to step 1.

Step 4: Extract relations from a conjunction component (conj) that only contains a predefined keyword:

  1. a.

    If a relation pair is found adjacent to this conj component, then use the relation word that is closest to the keyword of this component to form a relation pair.

  2. b.

    Find a manner word in a similar approach as step 2.

  3. c.

    Form a pattern to extract the relation from this component.

Note: In case there is more than one predefined word in nsubj, the first keyword is selected then the procedure will repeat for the other keywords.

Rule 1b: The same as rule 1a, but switch the role of subject and predicate for passive sentences.

Rule 1c: Preposition (keyword1) + Predicate (Relation word + key-word2)

This rule is similar to rule 1a, but instead of looking for a predefined keyword in nsubj, the system finds a predefined keyword in the preposition component (prep) that is located before the predicate.

Rule 2: This rule is applied to relations in a phrase form. First, the system calculates distance (measured by word) and numbers of occurrence of each of the following pairs in the current phrase: <Mutation, Relation>, <Drug, Relation>, <Relation, Mutation>, <Relation, Drug>. Based on these values, a heuristic algorithm forms a triple <relation, keyword1, keyword2>. Secondly, searching for manner words is done in the same way as described for rule 1a.

Check for negation

We can classify the negation into two cases: the negation words located outside and inside a relation. We only focus on the case where negation words are located inside a relation (see Figure 3). The other case is ignored since its frequency is very low and requires extensive semantic analysis. A more comprehensive analysis of negation can be found in [30]. Checking for negation is done the same way as checking for manner words.

Figure 3
figure 3

Extracting relations from grammatical relations of a simplified sentence.

Depending on the sentence components generated for each sentence, the system then decides to apply rule 1a, 1b or 1c, when these rules fail to extract relations, rule 2 is applied. For example, when applying rule 1a to the sentence components of the sentence in the previous step, the system forms the relation pairs as shown in Figure 3. The extracted relations are as follows:

MUTATION0 increased resistance to DRUG0

MUTATION0 not increased resistance to DRUG1

Post relation extraction

The keywords in the extracted relations are then replaced with the original mutations or drug names. Next this group of drug names and mutations are disentangled and split into single values. For example, with the extracted relation: 'MUTATION0 increased resistance to DRUG0'. The relation after replacing keywords is as follows:

A317V and Q509 L increased resistance to 3TC and ABC.

Relation combination

The extracted relations obtained in the relation extraction step are expressed in different manners and may also contain contradictory relations. In addition, these relations are usually taken out of context so they do not represent the true nature of the relation as it was specified in original sentences. Our task here is to determine a resistance type for each <mutation, drug> pair from these pieces of evidence, which can have the following properties:

  • Containing false positive relations due to relation extraction method or relations are taken out of context.

  • Relations are in textual descriptive form with different manners, and come from different sources.

  • Extracted relations contradict with each others (resistant vs. susceptible), this is the most common case.

Table 4 shows examples of extracted relations between K65R mutation and D4T. The relation combination process is carried out in two steps: grouping relations with the same mutations and drugs, and calculating the resistance type for each <drug, mutation> pair.

Table 4 Output extracted relations between K65R mutation and D4T when running the system over all candidate sentences.

Grouping relations

First, the mutations in each relation are checked for consistency. Mutations with amino acid letters and those without amino acid letters are converted to a standardized form. Mutations ending with more than one amino acid letter are split into an atomic mutation, for instance M184I/V is converted into two atomic mutations, M184V and M184I. Second, extracted relations that have the same drug and mutation are put into the same group. In each group, the relations are categorized into 4 subgroups according to their resistance properties: resistant, susceptible, responsive, and associated. In addition, negative relations are removed from each subgroup. The categorizing process uses a list of predefined relation words some of which are shown in Table 1.

Calculating resistance types

Since the relations in the association and response subgroups do not indicate clear evidence on drug resistance, we only use relations from the resistance and susceptible subgroups to calculate a resistance value. For each subgroup, we divide relations into six subsets based on their manner words that indicate the degree of the relation. Example of the manner words is shown in Table 2. The result of this division leads to 12 subsets of relations as illustrated in Figure 4. Since the resistance value for common <drug, mutation> pairs are available in expert systems such as Stanford HIVDB or RegaDB, we transform the current problem into a well-known regression problem [4, 31, 32] that is to predict the resistance value for each <drug, mutation> pair. We use the output for each <drug, mutation> pair from Stanford HIVDB as the gold standard, and use the number of extracted relations in each subset as feature values. Now the problem is to find optimized weight factor for each subset in the following equation:

Figure 4
figure 4

Example of categorizing extracted relations of the K65R mutation and D4T.


where E(y) is the predicted value of the <mutation, drug> pair y; ri and si are the number of relations in each subset of the resistance and susceptible subgroup {high, increased, medium, decreased, low, no manner} and wj (j = 1, ..., 12) denote the corresponding weights of these sets.

The procedure to determine the weight factors as follows:

  • Divide the extracted relations into two datasets, one for determining the weight factors (learning) and one for testing. Extracted relations for each <drug, mutation> pair can belong to either datasets. This is to make sure that learning data and test data do not overlap thus avoiding bias the final results.

  • Only select a <drug, mutation> pair for training if it has at least 3 extracted relations. Assign the resistance value (resistance/susceptible) for this pair by taking this value from Stanford HIVDB.

  • Use the logistic regression function from WeKa package version 3.6 [33] to find the weigh factors.

When the weight factors in equation 1 are obtained, we then apply these values to predict the resistance value for the remaining data. Again, these predicted values are compared with the output from HIVDB system.

Results and discussion


We use the text retrieval component with a list of 22 FDA approved drugs (see Additional file 2) to collect candidate abstracts from the PubMed database. We obtained 129,448 unique abstracts when the search was carried out on June 10, 2009. Among these collected abstracts there were 74,321 abstracts containing at least one drug name in the body text, 9,651 abstracts contained mutations, and 5,615 abstracts contained both drugs and mutations. When applying the text filter to find candidate sentences, we obtained 2,937 candidate sentences which contained both drugs and mutations. Of these 1,913 were single sentences and 1,024 were inter-sentences that contained the triple <mutation, relation, drug>.

Evaluation metrics

We use recall, precision, and the F-score as metrics to evaluate the performance of our system to extract relations based on the following calculations:

where TP, FN, and FP are defined as:

TP (true positives): is the number of relations that were correctly extracted from input sentences.

FN (false negatives): is the number of relations that the system failed to extract from input sentences.

FP (false positives): is the number of relations that were incorrectly extracted from input sentences.

The F-score is the harmonic mean of recall and precision.

We asked independent medical doctors and virologists from the ViroLab project, with a strong background in drug resistance, to assess the extracted relations.

Relation extraction performance

In order to evaluate the performance of the relation extraction method, we prepared two datasets: one dataset consist of sentences taken from PubMed abstracts and the other consists of sentences taken from the Stanford HIVDB comments which derived from full text papers. Table 5 gives an overview of the number of positive and negative relations in two datasets.

Table 5 Datasets statistics

Evaluation on PubMed dataset

For dataset from PubMed, there were 1543 out of the 2937 candidate sentences containing a triple <mutation, relation word, drug>. From these, we randomly selected 500 sentences, none of which had been used for developing rules. In this dataset, there are 921 instances of negative relations (46%) and 1095 instances of positive relations (54%). The evaluation by experts against the output of these 500 sentences shows that the system can extract 1023 (896 true positives and 127 false positives) instances of relations with a precision, recall, and F-score of 87%, 82%, and 84.5%, respectively (see table 6).

Table 6 Performance of the system compared with the baseline method over 2 datasets.

Evaluation of the HIVDB comments dataset

In the second evaluation, we wanted to test the performance of the proposed method on both single and inter-sentences. We used all of 130 sentences taken from the Stanford HIVDB comments, of which there are 32 sentences (24.6%) containing no relation and 98 sentences (75.4%), consisting of 56 single sentences and 42 inter-sentences. Among these 98 sentences, there are 261 instances of negative relations and 307 instances of positive relations. The evaluation shows that the system can extract 275 relations (267 true positives and 8 false positives) with a precision, recall, and F-score of 97%, 87%, and 91.7%, respectively. Table 6 shows the evaluation results of the system over these two datasets. For more details of extracted relations, see Additional file 3, Additional file 4, and Additional file 5 respectively.

Analysis of the results

The results in Table 6 show that the performance of the system is comparable with existing relation extraction systems for such tasks as protein-protein interaction or protein-gene interaction, of which most do not take into account the degree of the relations [25, 34]. Furthermore, there is no gold standard corpus available to evaluate the results of our system, making it hard to compare the proposed method directly with other relation extraction systems. Therefore, we used a co-occurrence method as a baseline to compare with our method. This method (Base_C) predicts a <mutation, drug> pair occurring in the same sentence as a relation. Table 6 shows that our method has a significantly better performance than the baseline method on both datasets with F-scores of 84.5% and 91% compare to 69% and 70% of the Base_C.

The extraction results of the Stanford HIVDB dataset show a higher precision than the extraction results of datasets taken from abstracts. The reason is that the sentences taken from the Stanford HIVDB were composed in a clear, consistent way, and expressed the relations in an explicit form. In addition, the mutation and drug names are also written in a standard format and the sentences are relatively short. As a consequence, the system can archive better results. In contrast, sentences taken from abstracts are long and often have complex structures and thus are prone to more errors.

To identify the source of the errors, we analyzed the sentences that the system failed to extract (false negative) or extracted incorrectly (false positive). The causes of these errors are parser errors, non specific rules, semantic problems, negation, and anaphora resolution:

Parser errors and grammatical relation errors

The most frequent errors were caused by the parser (23/62 i.e. 23 of the 62 failures were due to parser errors). Since the parser is not trained on biomedical texts, it often returns inaccurate parse trees, which in turn generate incorrect grammatical relations. As a result, the system applies inappropriate rules to extract relations. For instance, "G48 M causes high-level SQV resistance and intermediate resistance to NFV, ATV, IDV, and NFV". In this example, the parser returns a parse tree in the form of noun phrase (NP) instead of a clause form. However, in some cases, the system can still extract relations by applying rule 2 on noun phrase such as this example.

Non specific rules

The second major source of errors is due to cases where the rules are not covered (18/62), as for instance in: "Additional insertion of M184V into the zidovudine background doubled the resistance, whereas 44/118 did not lead to a further increase". In such cases, the distance from relation word to keyword is longer than the defined values set in the rules. This can be corrected by relaxing the defined rules; however, this would also mean reducing the precision.

Semantic problems

In some cases, the errors were caused by semantic problems (12/62). This occurred when a relation is implied or hyponyms are used. For example: "The PI mutation I50L causes clinically relevant resistance and increased susceptibility to atazanavir and other PIs respectively".

There were only a very few cases where the source of error was caused by negation or anaphora resolution. Currently we do not take those sparse situations into account.

Relation combination performance

We extracted relations from all candidate sentences of the collected abstracts and obtained 2,434 extracted relations. After grouping relations and dropping relations belong to response and association groups, we obtained 612 <mutation, drug> pairs. However, among these, there were only 163 pairs containing more than or equal 3 extracted relations. We selected 63 <mutation, drug> pairs for training the logistic regression function. The remaining 100 pairs are used to predict resistance values. To evaluate the results of the relation combination process, we selected relations of the 10 most common mutations. For each <mutation, drug> pair to be chosen as output relations, it is required that this pair has at least 3 extracted relations from the text. In addition, we have also calculated the resistance type based on three levels of resistance: susceptible (S), intermediate resistant (I) and resistant (R) using the same method as proposed for two levels. Table 7 shows examples of the output of our system on K65R mutation and its related drugs.

Table 7 Prediction results of mutation K65R and its related drugs.

The result in Table 7 shows that the output of our system has the same resistance type compared with the Stanford HIVDB on K65R mutation. There are 3 relations that have a different resistance type between the Stanford HIVDB and the RegaDB. This discordance is due to the fact that there are cases where RegaDB does not take into account the single <mutation, drug> pairs, therefore the RegaDB gives as an output "susceptible", e.g. in case of <K65R, D4T> and <K65R, ABC> pairs. In contract, our system and the Stanford HIVDB do have evidence for these resistance pairs. In addition, there is one relation that only appears in our system i.e. "K65R-intermediate resistance-DDC".

Table 8 shows a summary of the output results of the 10 common mutations, which account for 33% of extracted relations (615 instances over 54 <mutation, drug> pairs) and cover 3 common drug classes (PI, NRTI, NNRTI). The results are compared manually with the Stanford HIVDB system. The percentage of the agreement between two systems based on two levels of resistance (S, R) are 85%, and based on three levels of resistance (S, I, R) are 76%. By following the reference links provided by the Stanford HIVDB, we discovered that the main reason for the differences of the output between our system and the Stanford HIVDB is that there are many relations which can only be found from full texts, not from abstracts. In addition, the Stanford HIVDB also uses experimental data (e.g., n-fold value of resistance), while our system only uses pure text to synthesise the relations.

Table 8 Summary of the prediction results of the 10 most frequent mutations and their related drugs extracted from text compares with the HIVDB on two levels and three levels of resistance: susceptible (S), intermediate resistant (I), and resistant (R).

Furthermore, there are many pairs of <mutation, drug> where the number of extracted relations is below the threshold we have set, so these pairs are not considered by the system and do not appear in the output results. However, we also discovered that our system can extract new relations that do not appear in the Stanford HIVDB as shown in example of K65R mutation above.

Atomic value vs. group values

The atomic relations obtained by splitting a group of mutations from original relations are also the cause for disagreement between the output of the system and the Stanford HIVDB result. This was due to the fact that, in some contexts, the resistance only occurs if these mutations come together, but does not occur in a single mutation. For future work, we will take this issue into account.

PubMed abstracts are certainly a good source for extracting causal relations on HIV drug resistance; however, the number of extracted relations from abstracts is relatively low, only 5% of abstracts may potentially provide evidence for drug resistance. Therefore a more advanced form of publishing as proposed in [35] might provide a better solution for collecting data. In addition processing of full texts can be considered. The system can be used as annotator to extract relations from full text articles, the results are then considered as raw relations, which can be evaluated by experts. This system can save experts a significant amount of time otherwise spent finding relevant sentences which provide evidence for drug resistance. For convenience, the system provides summarized data and original texts, from which the relations were extracted, to support the experts in the verification of the results.


We have proposed a new method for extracting causal relations between drugs and mutations by applying two rule sets over grammatical relations. Our system can extract relations from single sentences and from inter-sentences. In addition, by grouping mutations and drugs, the system also reduces the number of conjunctions and the enumeration lists of entities, thus making the process of extracting relations much quicker and less error prone.

We have also described a method to combine extracted relations which combines the manner of each individual relation and deals with contradictory relations in order to determine resistance type for each <drug, mutation> pair. The output of the relation combination shows promising results with 85% and 76% agreement to the Stanford HIVDB on two and three levels of resistance respectively. Furthermore, the system can also provide new relations and additional sources of evidence to analyze the discordance between expert systems.

The proposed algorithm uses publicly available NLP tools. Therefore, it is easy to setup a similar system, and it is suitable for extracting relations in case where an annotated corpus is not available. The algorithm can also be applied to extract other types of relations in which entities have a distinct category such as gene-protein, gene-disease, and disease-mutation. In such cases, the system needs to provide a list of relation words, manner words, and Name Entity Recognition (NER) module.

The performance of the system is adequate. The system processed 129,448 abstracts on a Centrino Duo 1.8 GHz laptop in 35 minutes, of which 97% of the time was used by the parser, 2% was used for filtering and simplification of the sentences, and 1% of the time was used for the actual extraction and combination of the relations. The system is clearly capable to be used in large-scale relation extraction experiments. The system is used in 5 hospitals from the Virolab project to preselect the most relevant novel resistance data from literature and present those to virologists and medical doctors for further evaluation.


  1. AIDS epidemic update: December 2006[]

  2. Douglas DR, David MM, Martin D, Warner CG, Daria H, Pomerantz RJ: The Challenge of Finding a Cure for HIV Infection. Science 2009, 323: 1304–1307. 10.1126/science.1165706

    Article  Google Scholar 

  3. Vercauteren J, Vandamme AM: Algorithms for the interpretation of HIV-1 genotypic drug resistance information. Antiviral Research 2006, 71: 335–342. 10.1016/j.antiviral.2006.05.003

    Article  CAS  PubMed  Google Scholar 

  4. Lengauer T, Sing T: Bioinformatics-assisted anti-HIV therapy. Nature Reviews 2006, 4: 790–797. 10.1038/nrmicro1477

    CAS  PubMed  Google Scholar 

  5. Saigo H, Uno T, Tsuda K: Mining complex genotypic features for predicting HIV-1 drug resistance. Bioinformatics 2007, 23: 2455–2462. 10.1093/bioinformatics/btm353

    Article  CAS  PubMed  Google Scholar 

  6. Cohen AM, Hersh WR: A survey of current work in biomedical text mining. Brief Bioinform 2005, 6: 57–71. 10.1093/bib/6.1.57

    Article  CAS  PubMed  Google Scholar 

  7. Erhardt RA, Schneider R, Blaschke C: Status of text-mining techniques applied to biomedical text. Drug Discovery Today 2006, 11: 315–325. 10.1016/j.drudis.2006.02.011

    Article  CAS  PubMed  Google Scholar 

  8. Saric J, Jensen LJJ, Ouzounova R, Rojas I, Bork P: Extraction of regulatory gene/protein networks from Medline. Bioinformatics 2006, 22: 645–650. 10.1093/bioinformatics/bti597

    Article  CAS  PubMed  Google Scholar 

  9. Ananiadou S, Kell DBB, Tsujii J: Text mining and its potential applications in systems biology. Trends Biotechnol 2006, 24: 571–579. 10.1016/j.tibtech.2006.10.002

    Article  CAS  PubMed  Google Scholar 

  10. Huang M, Zhu X, Li M: A hybrid method for relation extraction from biomedical literature. Int J Med Inform 2006, 75: 443–455. 10.1016/j.ijmedinf.2005.06.010

    Article  PubMed  Google Scholar 

  11. Koike A, Niwa Y, Takagi T: Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics 2005, 21: 1227–1236. 10.1093/bioinformatics/bti084

    Article  CAS  PubMed  Google Scholar 

  12. Zweigenbaum P, Demner-Fushman D, Yu H, Cohen KB: Frontiers of biomedical text mining: current progress. Brief Bioinform 2007, 8: 358–375. 10.1093/bib/bbm045

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 2004, 20: 604–611. 10.1093/bioinformatics/btg452

    Article  CAS  PubMed  Google Scholar 

  14. Fundel K, Küffner R, Zimmer R: RelEx - Relation extraction using dependency parse trees. Bioinformatics 2007, 23: 365–371. 10.1093/bioinformatics/btl616

    Article  CAS  PubMed  Google Scholar 

  15. Jang H, Lim J, Lim JH, Park SJ, Lee KC, Park SH: Finding the evidence for protein-protein interactions from PubMed abstracts. Bioinformatics 2006, 22: e220-e226. 10.1093/bioinformatics/btl203

    Article  CAS  PubMed  Google Scholar 

  16. Rinaldi F, Schneider G, Kaljurand K, Hess M, Andronis C, Konstandi O, Persidis A: Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach. Artif Intell Med 2007, 39: 127–136. 10.1016/j.artmed.2006.08.005

    Article  PubMed  Google Scholar 

  17. Erkan G, Ozgur A, Radev DR: Semi-Supervised Classification for Extracting Protein Interaction Sentences using Dependency Parsing. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning: 28–30 June 2007; Prague. Edited by: Eisner J. ACL; 2007:228–237.

    Google Scholar 

  18. Katrenko S, Adriaans P: Learning Relations from Biomedical Corpora Using Dependency Trees. Knowledge Discovery and Emergent Complexity in Bioinformatics 2007, 4366: 61–80. full_text

    Article  Google Scholar 

  19. Kim JH, Mitchell A, Attwood TK, Hilario M: Learning to extract relations for protein annotation. Bioinformatics 2007, 23: i256-i263. 10.1093/bioinformatics/btm168

    Article  CAS  PubMed  Google Scholar 

  20. Chowdhary R, Zhang J, Liu J: Bayesian inference of protein-protein interactions from biological literature. Bioinformatics 2009, 25: 1536–1542. 10.1093/bioinformatics/btp245

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Kim S, Yoon J, Yang J: Kernel approaches for genic interaction extraction. Bioinformatics 2008, 24: 118–126. 10.1093/bioinformatics/btm544

    Article  CAS  PubMed  Google Scholar 

  22. Kim MY: Detection of Gene Interactions Based on Syntactic Relations. J Biomed Biotechnol 2008, 2008: 371710.

    PubMed  PubMed Central  Google Scholar 

  23. Abulaish M, Dey L: Biological relation extraction and query answering from MEDLINE abstracts using ontology-based text mining. Data Knowl Eng 2007, 61: 228–262. 10.1016/j.datak.2006.06.007

    Article  Google Scholar 

  24. Giles C, Wren J: Large-scale directional relationship extraction and resolution. BMC Bioinformatics 2008, 9: S11. 10.1186/1471-2105-9-S9-S11

    Article  PubMed  PubMed Central  Google Scholar 

  25. Zhou D, He Y: Methodological Review: Extracting interactions between proteins from the literature. J of Biomedical Informatics 2008, 41: 393–407. 10.1016/j.jbi.2007.11.008

    Article  CAS  PubMed  Google Scholar 

  26. Klein D, Manning CD: Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics: 7–12 July 2003; Sapporo. Edited by: Hinrichs WE. ACL; 2003:423–430.

    Google Scholar 

  27. Malik R, Franke L, Siebes A: Combination of text-mining algorithms increases the performance. Bioinformatics 2006, 22: 2151–2157. 10.1093/bioinformatics/btl281

    Article  CAS  PubMed  Google Scholar 

  28. Horn F, Lau AL, Cohen FE: Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics 2004, 20: 557–568. 10.1093/bioinformatics/btg449

    Article  CAS  PubMed  Google Scholar 

  29. Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics 2006, 7: 119–129. 10.1038/nrg1768

    Article  CAS  PubMed  Google Scholar 

  30. Sanchez-Graillet O, Poesio M: Negation of protein protein interactions: analysis and extraction. Bioinformatics 2008, 23: i424–432. 10.1093/bioinformatics/btm184

    Article  Google Scholar 

  31. Liao JG, Chin KV: Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics 2007, 23: 1945–1951. 10.1093/bioinformatics/btm287

    Article  CAS  PubMed  Google Scholar 

  32. Torvik VI, Smalheiser NR: A quantitative model for linking two disparate sets of articles in MEDLINE. Bioinformatics 2007, 23: 1658–1665. 10.1093/bioinformatics/btm161

    Article  CAS  PubMed  Google Scholar 

  33. Witten IH, Frank E: Data Mining: Practical machine learning tools and techniques. 2nd edition. Morgan Kaufmann, San Francisco; 2005.

    Google Scholar 

  34. Miyao Y, Sagae K, Saetre R, Matsuzaki T, Tsujii J: Evaluating contributions of natural language parsers to protein-protein interaction extraction. Bioinformatics 2009, 25: 394–400. 10.1093/bioinformatics/btn631

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Leitner F, Valencia A: A text-mining perspective on the requirements for electronically annotated abstracts. FEBS Letters 2008, 582: 1178–1181. 10.1016/j.febslet.2008.02.072

    Article  CAS  PubMed  Google Scholar 

Download references


This research was supported by the European ViroLab INFSO-IST-027446 and Vietnamese overseas training program (Project 322).

The authors sincerely thank Professor Tu-Bao Ho and Dr Sophia Katrenko for their useful comments.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Quoc-Chinh Bui.

Additional information

Authors' contributions

The results reported in this paper are part of the PhD research of the first author, QCB. BON conceived the study and made the initial design. PMAS, BON and CAB participated in the analysis, editing the paper and as PhD supervisors. All authors read and approved the final document.

Electronic supplementary material

Additional file 1: Simplification_process. A MS Word document provides details of simplification process. (DOC 48 KB)


Additional file 2: List_of _approved_HIV_drugs. A MS Word document provides a list of 22 FDA approved drug names. (DOC 50 KB)


Additional file 3: 500_PubMed_results. A text file containing the list of 500 sentences taken from abstracts and the extracted relations corresponding to each input sentence. (TXT 159 KB)


Additional file 4: 130_HIVDB_results. A text file containing the list of 130 sentences taken from the Stanford HIVDB rules and the extracted relations corresponding to each input sentence. (TXT 32 KB)


Additional file 5: Performance_evaluation. A MS Word document provides details of the evaluation of the extraction method on 500 sentences taken from PubMed abstracts. (DOC 32 KB)

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Bui, QC., Nualláin, B.Ó., Boucher, C.A. et al. Extracting causal relations on HIV drug resistance from literature. BMC Bioinformatics 11, 101 (2010).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: