Comparing a knowledge-driven approach to a supervised machine learning approach in large-scale extraction of drug-side effect relationships from free-text biomedical literature
© Xu and Wang; licensee BioMed Central Ltd. 2015
Published: 18 March 2015
Systems approaches to studying drug-side-effect (drug-SE) associations are emerging as an active research area for both drug target discovery and drug repositioning. However, a comprehensive drug-SE association knowledge base does not exist. In this study, we present a novel knowledge-driven (KD) approach to effectively extract a large number of drug-SE pairs from published biomedical literature.
Data and methods
For the text corpus, we used 21,354,075 MEDLINE records (119,085,682 sentences). First, we used known drug-SE associations derived from FDA drug labels as prior knowledge to automatically find SE-related sentences and abstracts. We then extracted a total of 49,575 drug-SE pairs from MEDLINE sentences and 180,454 pairs from abstracts.
On average, the KD approach achieved a precision of 0.335, a recall of 0.509, and an F1 of 0.392, which is significantly better than an SVM-based machine learning approach (precision: 0.135, recall: 0.900, F1: 0.233), with a 73.0% increase in F1 score. Through integrative analysis, we demonstrate that the higher-level phenotypic drug-SE relationships reflect lower-level genetic, genomic, and chemical drug mechanisms. In addition, we show that the extracted drug-SE pairs can be directly used in drug repositioning.
In summary, we automatically constructed a large-scale, higher-level drug phenotype relationship knowledge base, which may have great potential in computational drug discovery.
It has been increasingly recognized that similar side effects of seemingly unrelated drugs can be caused by their common off-targets and that drugs with similar side effects are likely to share molecular targets. Therefore, systems approaches to studying side effect relationships among drugs and integration of this drug phenotypic data with drug-related genetic, genomic, proteomic, and chemical data will facilitate drug target discovery and drug repositioning. The availability of a comprehensive drug-side effect (SE) relationship knowledge base is critical for these tasks. Current drug phenotype-driven systems approaches rely exclusively on drug-SE associations extracted from FDA drug labels. However, there exists a large amount of additional drug-SE relationship knowledge in the large body of published biomedical literature. In this study, we present a novel knowledge-driven approach to automatically extract a large number of drug-SE pairs from 21 million published biomedical abstracts. We systematically analyzed extracted drug-SE pairs in combination with drug-related gene targets, metabolism, pathways, gene expression and chemical structure data. We show that these extracted drug-SE pairs have great potential in drug discovery.
Systems approaches to studying the phenotypic relationships among drugs can facilitate rapid drug target discovery and drug repositioning. Computational approaches to predicting drug targets have often been based on chemical similarity measures and docking strategies [7, 16]. Similarly, many computational strategies for drug repositioning have been explored. The majority of these approaches leverage known drug properties such as chemical similarity, molecular activity similarity, molecular docking, and gene expression profile similarity. In a seminal paper, Campillos et al. used phenotypic side-effect similarities among drugs to predict new targets for drugs. However, their analysis was limited to drug-SE relationships derived solely from the FDA drug labels. In one of our recent studies, we showed that much of the drug-SE association knowledge in the biomedical literature has not yet been captured in FDA drug labels.
Currently, more than 21 million biomedical records are available on MEDLINE. While many biomedical relationship extraction tasks have focused on extracting relationships between drugs, diseases, proteins, or genes [2, 18, 19], extracting drug-SE relationships from MEDLINE has been less explored. Recently, Gurulingappa et al. trained and tested a supervised machine learning classifier to classify drug-condition pairs in a set of 2972 manually annotated case reports. That study focused on a limited set of drugs, side effects, and case reports. It is unclear how their approach can be effectively scaled up to the whole of MEDLINE to build a large-scale drug-SE relationship knowledge base. Recently, we developed an approach to boost drug safety signal detection from the FDA Adverse Event Reporting System (FAERS) using evidence from MEDLINE. We developed an automatic approach to extract anticancer drug-specific side effects from MEDLINE by developing specific filtering and ranking schemes. We developed a pattern-based learning approach to accurately extract drug-SE pairs from MEDLINE sentences. We combined automatic table classification and relationship extraction to extract anticancer drug-side effect pairs from full-text articles. In this study, we present a knowledge-driven (KD) text-classification-based approach to extract drug-SE pairs from MEDLINE sentences. Unlike our previous studies, where we extracted drug-SE pairs from unclassified sentences, here we classified sentences into drug-SE-related and -unrelated before relationship extraction. Our approach also differs from other text classification-based approaches, which often train text classifiers on annotated training datasets to find drug-SE-related sentences; instead, we implicitly classified MEDLINE sentences using known drug-SE pairs.
Since our study did not explicitly train a text classifier, it is highly dynamic and effective: it can easily incorporate any changing prior knowledge and quickly extract drug-SE pairs from the whole of MEDLINE (21,354,075 abstracts and 119,085,682 sentences).
Our study is based on two key observations: (1) multiple side effects for a drug are often reported in the same sentences or abstracts; and (2) if a sentence contains a known drug-SE pair, then this sentence is likely to be SE-related, and other pairs in this SE-related sentence are likely to be drug-SE pairs. For example, the sentence "At the final irinotecan dose of 50 mg/m(2), grade 3 or higher toxicity included diarrhea (26%), neutropenia (21%), nausea (18%), fatigue (16%), anorexia (13%), and thrombosis/embolism (13%)" (PMID 19139178) contains a known drug-SE pair "irinotecan-diarrhea." Based on this fact, we know that this sentence is SE-related and that the other five pairs in this sentence are likely to be drug-SE pairs. On the other hand, the sentence "Weekly docetaxel, cisplatin, and irinotecan (TPC): results of a multicenter phase II trial in patients with metastatic esophagogastric cancer" contains no known drug-SE pair; therefore, no pair will be extracted from this sentence even though it contains three drug-disease pairs. In this study, we used all known drug-SE pairs derived from FDA drug labels as prior knowledge to find SE-related MEDLINE sentences and abstracts, from which many additional drug-SE pairs that have not been included in FDA drug labels are then extracted. We compared the KD approach to a support vector machine (SVM)-based approach.
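The two observations above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`find_terms`, `kd_extract`), the substring matching, and the small lexicons are assumptions made for the example, and the noun-phrase restriction and Lucene retrieval described later are omitted. The term "cancer" is placed in the SE lexicon only to show that a sentence with candidate pairs but no known pair yields nothing.

```python
from itertools import product

def find_terms(sentence, lexicon):
    """Return lexicon terms found in the sentence (case-insensitive substring match)."""
    text = sentence.lower()
    return {term for term in lexicon if term in text}

def kd_extract(sentences, known_pairs, drugs, side_effects):
    """Collect new drug-SE candidates from sentences that contain a known pair."""
    extracted = set()
    for sentence in sentences:
        found_drugs = find_terms(sentence, drugs)
        found_ses = find_terms(sentence, side_effects)
        candidates = set(product(found_drugs, found_ses))
        # A sentence counts as SE-related only if it holds at least one known pair.
        if candidates & known_pairs:
            extracted |= candidates - known_pairs
    return extracted

# Illustrative lexicons and prior knowledge (assumed for this example).
DRUGS = {"irinotecan", "docetaxel", "cisplatin"}
SIDE_EFFECTS = {"diarrhea", "neutropenia", "nausea", "fatigue",
                "anorexia", "thrombosis", "cancer"}
KNOWN = {("irinotecan", "diarrhea")}

SENTENCES = [
    "At the final irinotecan dose of 50 mg/m(2), grade 3 or higher toxicity "
    "included diarrhea (26%), neutropenia (21%), nausea (18%), fatigue (16%), "
    "anorexia (13%), and thrombosis/embolism (13%)",
    "Weekly docetaxel, cisplatin, and irinotecan (TPC): results of a multicenter "
    "phase II trial in patients with metastatic esophagogastric cancer",
]

new_pairs = kd_extract(SENTENCES, KNOWN, DRUGS, SIDE_EFFECTS)
```

With these inputs, `new_pairs` holds the five irinotecan pairs from the first sentence, and the second sentence contributes nothing because none of its candidate pairs is known.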
Data and methods
The entire experimental process consists of the following steps: (1) build a local MEDLINE search engine; (2) develop, evaluate and compare the KD approach to an SVM-based approach; (3) extract drug-SE pairs from MEDLINE; and (4) systematically analyze the correlation between drug-associated side effects and drug gene targets, metabolism genes, chemical similarity, and disease indications.
Build a local MEDLINE search engine
We downloaded a total of 21,354,075 MEDLINE citations (119,085,682 sentences) published between 1965 and 2012 from the U.S. National Library of Medicine (http://mbr.nlm.nih.gov/Download/index.shtml). Each sentence was syntactically parsed with the Stanford Parser using the Amazon Cloud computing service (a total of 3,500 instance-hours on High-CPU Extra Large Instances). We used the publicly available information retrieval library Lucene (http://lucene.apache.org) to create a local MEDLINE search engine with indices created on both sentences and their corresponding parse trees. We recently used this local search engine to extract disease-manifestation relationships and anticancer drug-SE pairs from MEDLINE.
Develop, evaluate and compare the KD approach to the SVM-based approach
We downloaded a total of 100,049 known drug-SE pairs from SIDER (Side Effect Resource), a public, machine-readable side effect resource that was automatically constructed from FDA package inserts. These pairs are used as prior knowledge for the KD approach. Using each drug-SE pair from the prior knowledge data as a search query to the local MEDLINE search engine, we retrieved all MEDLINE sentences and abstracts that contain at least one known drug-SE pair. These sentences are deemed SE-related. We then extracted drug-SE co-occurrence pairs from these SE-related sentences, with the restriction that both drug and SE names must be noun phrases in the parse trees of the sentences. We have recently shown that this restriction can increase the precision of biomedical relationship extraction from MEDLINE [19–22]. The drug and SE lists (996 drugs (generic names) and 4,199 SE terms) are from SIDER.
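The noun-phrase restriction can be illustrated with a small, self-contained sketch. It assumes that parse trees are available as Penn Treebank-style bracketed strings (the format the Stanford Parser can emit) and that "must be a noun phrase" means the term exactly matches the text of some NP node; both the helper names and that exact-match interpretation are assumptions, not details given in the text.

```python
import re

def parse_tree(bracketed):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words

def noun_phrases(node):
    """Yield the lower-cased text of every NP subtree."""
    if isinstance(node, str):
        return
    label, children = node
    if label == "NP":
        yield " ".join(leaves(node)).lower()
    for child in children:
        yield from noun_phrases(child)

def is_noun_phrase(term, tree):
    """True if the term exactly matches the text of some NP in the tree."""
    return term.lower() in set(noun_phrases(tree))

# Hand-written toy parse, for illustration only.
tree = parse_tree("(S (NP (NN irinotecan)) (VP (VBD caused) (NP (NN diarrhea))))")
keep_pair = is_noun_phrase("irinotecan", tree) and is_noun_phrase("diarrhea", tree)
```

In this toy tree both terms are NPs, so the pair would be kept; a term that only appears as a verb or inside a larger phrase would fail the check under the exact-match assumption.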
For the SVM-based approach, we first classified MEDLINE sentences into drug-SE-related and -unrelated using a pre-trained SVM sentence classifier. We then extracted drug-SE co-occurrence pairs from positively classified sentences. A two-class SVM-based sentence classifier was trained using the implementation in WEKA. The positive training data consisted of a total of 320,175 sentences, each of which contained at least one known drug-SE pair from SIDER (the testing data for the 10 evaluation drugs were excluded). An equal number of negative sentences was randomly selected from the rest of the MEDLINE sentences. The SVM-based sentence classifier used a polynomial kernel, bag-of-words features, TF-IDF weighting, stemming and stopword removal. Bag-of-words features were used since the appearance of one specific word such as 'toxicity' is often enough to determine whether a sentence is drug-SE-related. Ten-fold cross-validation was used in training the SVM classifier. For both the KD and SVM-based approaches, the input sentences are the same: sentences that contain at least one of the 10 drugs and at least one term from the SE lexicon. We evaluated and compared the performance using the same testing datasets. Precision, recall and F1 scores for these 10 drugs were calculated.
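For reference, the evaluation measures can be computed from sets of extracted and gold-standard pairs as in the sketch below. The function names are illustrative, and averaging the per-drug scores (a macro average) is an assumption about how the reported averages were obtained; the text only states that scores were calculated for the 10 drugs.

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall and F1 for a set of extracted pairs against a gold set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def macro_average(per_drug_scores):
    """Average (precision, recall, F1) tuples across drugs, one tuple per drug."""
    n = len(per_drug_scores)
    return tuple(sum(score[i] for score in per_drug_scores) / n for i in range(3))

# Toy example with placeholder SE names.
extracted = {("drug_a", "se1"), ("drug_a", "se2"), ("drug_a", "se3"), ("drug_a", "se4")}
gold = {("drug_a", "se1"), ("drug_a", "se2"), ("drug_a", "se5")}
p, r, f1 = precision_recall_f1(extracted, gold)
```

Note that under macro averaging the mean F1 is generally not the harmonic mean of the mean precision and mean recall, since F1 is computed per drug before averaging.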
Extract drug-SE pairs from 21 million MEDLINE records
After evaluating the KD-approach using 10 drugs, we then scaled this approach to extract drug-SE pairs from all MEDLINE sentences and abstracts. We used all 100,049 drug-SE pairs from SIDER as prior knowledge to classify all MEDLINE sentences and abstracts into SE-related and -unrelated. For the classification, we used each of the 100,049 drug-SE pairs from SIDER as a search query to the local MEDLINE search engine. Both sentences and abstracts containing the pair were retrieved as SE-related. Sentences that are not retrieved are assumed to be SE-unrelated and ignored. In total, we extracted 49,575 drug-SE pairs from sentences and 180,454 pairs from abstracts using the drug and SE lists (996 drugs and 4,199 SE terms) we compiled from SIDER. These extracted drug-SE pairs were used in the subsequent semantic analysis.
Analyze the correlations between drug-associated side effects and genetic, genomic and chemical drug properties
Many current computational approaches for drug target discovery [7, 16] and drug repositioning [7, 8, 12, 13] used only lower-level genetic, genomic, and chemical drug properties. In this study, we investigated whether the large number of higher-level phenotypic drug-SE relationships that we extracted from MEDLINE implicitly capture lower-level drug mechanisms and can therefore be leveraged for drug target discovery and drug repositioning. In extracting drug-SE pairs from MEDLINE, we used the generic names of FDA-approved drugs. For the correlation analysis, we used drug generic names to link drug-SE pairs to drug-related information from different databases.
Correlation with drug target genes
Drug side effects are often caused by drugs acting on their target genes. We investigated whether drug-drug pairs that shared SEs tend to share gene targets. We downloaded a total of 13,635 drug-target gene associations from DrugBank, a knowledge base for drugs, drug actions and drug targets. The drug-gene pairs comprise 3,454 drugs and 1,816 genes. We first mapped drugs of drug-SE pairs extracted from MEDLINE to drugs of drug-gene pairs from DrugBank. For drug-drug pairs that share SEs at different cutoffs, we calculated the average number of shared gene targets.
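The cutoff-based calculation can be sketched as follows. This is an illustration under stated assumptions: the function name `average_shared` is hypothetical, a cutoff is interpreted as "at least this many shared SEs", only drugs present in both resources are compared, and the drug, SE and gene names are placeholders rather than real data.

```python
from itertools import combinations

def average_shared(drug_to_ses, drug_to_attrs, cutoff):
    """Average number of shared attributes over drug pairs sharing >= cutoff SEs.

    Returns None when no drug pair meets the cutoff.
    """
    drugs = sorted(set(drug_to_ses) & set(drug_to_attrs))
    counts = []
    for a, b in combinations(drugs, 2):
        shared_ses = drug_to_ses[a] & drug_to_ses[b]
        if len(shared_ses) >= cutoff:
            counts.append(len(drug_to_attrs[a] & drug_to_attrs[b]))
    return sum(counts) / len(counts) if counts else None

# Placeholder data: three drugs, their side effects and gene targets.
SES = {
    "drug_a": {"se1", "se2", "se3"},
    "drug_b": {"se1", "se2", "se3"},
    "drug_c": {"se1"},
}
TARGETS = {
    "drug_a": {"G1", "G2"},
    "drug_b": {"G1", "G2"},
    "drug_c": {"G3"},
}

by_cutoff = {k: average_shared(SES, TARGETS, k) for k in (1, 2, 4)}
```

The same function applies unchanged if the attribute sets are metabolism genes or disease indications instead of gene targets; a positive correlation would show up as the average rising with the cutoff.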
Correlation with drug metabolizing genes
Drug metabolism plays a critical role in drug-associated side effects. We investigated whether drug-drug pairs that shared SEs also share drug metabolism genes. We downloaded a total of 4,399 drug-gene pairs from PharmGKB, a repository of drug pharmacogenetics information. For drug-drug pairs that share SEs at different cutoffs, we calculated the average number of shared metabolism genes.
Correlation with genetic, genomic and chemical drug-drug relationships
Drug-related pathway, genomic and chemical relationship information was obtained from STITCH, a resource of known and predicted interactions of chemicals and proteins. In STITCH, chemicals are linked to other chemicals and proteins by four types of relationships: chemical reactions from manually curated pathway databases ("Database"), literature associations ("Textmining"), similar 2D structures ("Similarity") and similar activities ("Experimental") based on drug-induced perturbation on the gene expression level. We used chemical-chemical relationships from curated pathway database ("Database", 342,072 chemical-chemical pairs), chemical 2D structure ("Similarity", 607,588 chemical-chemical pairs) and gene expression ("Experimental", 238,380 chemical-chemical pairs). The text mining-based co-occurrence pairs were not used since they provide no explicit semantic relationships for chemical-chemical pairs. For drug-drug pairs that share SEs at different cutoffs, we calculated the average chemical similarity scores.
Correlation with disease indications
If high-level phenotypic drug-SE relationships implicitly capture known and unknown drug-related genetic, genomic and chemical information, then drug-SE pairs may be directly used for drug repositioning, as suggested in a recent review article. In this study, we investigated whether drug-drug pairs that shared SEs tend to share disease indications. We recently extracted a total of 52,000 drug-disease pairs from ClinicalTrials.gov, a registry of federally and privately supported clinical trials conducted in the United States and around the world. The drug-disease pairs contain 2,035 drugs and 9,591 diseases. For drug-drug pairs that share SEs at different cutoffs, we calculated the average number of shared disease indications.
The KD approach is more effective than the SVM-based approach in extracting drug-SE pairs from MEDLINE
Compare knowledge-driven approach (KD) to SVM for ten drugs
Drug-associated side effects positively correlate with drug-associated gene targets
Drug-associated side effects positively correlate with drug-associated metabolism genes
Drug-associated side effects positively correlate with drug-associated pathways and gene expression
Drug-associated side effects positively correlate with drug-associated disease indications
Our current study has several limitations and can be significantly improved in future studies. First, we used drug-SE pairs from SIDER as prior knowledge for the KD approach. The overall performance depends on the accuracy and comprehensiveness of the SIDER database. Errors and uncertainties in the knowledge base can propagate into the relationship extraction process and adversely affect the precision. For example, the drug-disease treatment pair 'ondansetron-pain' was incorrectly specified as a drug-SE pair in SIDER. Because of this error, our algorithm classified the following sentence as SE-related: "Ondansetron, lidocaine, tramadol, and fentanyl were effective in preventing and decreasing the level of rocuronium injection pain" (PMID 12032018). Three additional pairs (lidocaine-pain, tramadol-pain, and fentanyl-pain) were incorrectly extracted as drug-SE pairs. Since SIDER was constructed from FDA drug labels using text-mining approaches, errors may be inevitable for a completely automatic method. Currently, we are manually extracting drug-associated side effects from FDA drug labels. Second, our algorithm cannot extract the correct pairs from sentences with multiple drugs and multiple side effect names (n × m), even though the sentences are side effect-related. For example, consider the sentence "... decreases in hemoglobin, nausea/vomiting, and hyperbilirubinemia were observed to be influenced by the previous use of irinotecan (OR = 3.07, P = 0.003), mitomycin (OR = 2.28, P = 0.004), and cisplatin (OR = 1.60, P = 0.007), respectively" (PMID: 17577624). Three drugs and three SEs are specified in the sentence, but only three of the nine (3 × 3) co-occurrence pairs are valid drug-SE pairs. This is a difficult problem not only for the KD approach but also for automatic relationship extraction in general. In this case, human curation may be necessary.
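The n × m problem in the second limitation can be made concrete with a short sketch using the drugs and side effects from the example sentence. Co-occurrence extraction corresponds to a Cartesian product, whereas the "respectively" construction implies a positional pairing. The positional alignment shown here is only an illustration of what the correct output looks like; it is not a method used in this study, and the function names are hypothetical.

```python
from itertools import product

def cooccurrence_pairs(drugs, side_effects):
    """All drug-SE combinations, as produced by sentence-level co-occurrence."""
    return list(product(drugs, side_effects))

def positional_pairs(drugs, side_effects):
    """Pair the i-th drug with the i-th SE, as implied by 'respectively'."""
    return list(zip(drugs, side_effects))

# Terms in the order they appear in the example sentence (PMID: 17577624).
side_effects = ["decreases in hemoglobin", "nausea/vomiting", "hyperbilirubinemia"]
drugs = ["irinotecan", "mitomycin", "cisplatin"]

all_pairs = cooccurrence_pairs(drugs, side_effects)    # 9 candidate pairs
valid_pairs = positional_pairs(drugs, side_effects)    # 3 pairs stated by the sentence
spurious = [pair for pair in all_pairs if pair not in valid_pairs]
```

Here six of the nine co-occurrence pairs are spurious, which is why sentences of this form lower precision unless they are handled separately or curated by hand.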
Even with the above-mentioned limitations, we demonstrated that the large number of drug-SE pairs extracted from MEDLINE reflect drug-related genetic, genomic and chemical information and can have potential in computational drug target discovery and drug repositioning. Currently, we are developing integrative systems approaches for drug repositioning by fully exploiting data ranging from lower-level genetic connections to intermediate-level genomic data to higher-level phenotype data in order to build integrative models of genetic, genomic, and phenotypic complexity.
We have developed a novel KD approach and used it to extract a total of 49,575 drug-SE pairs from 119,085,682 MEDLINE sentences and 180,454 pairs from 21,354,075 MEDLINE abstracts (records). We show that the KD approach performed significantly better than an SVM-based machine learning approach. We demonstrated that the large-scale drug-SE association database that we have built provides an invaluable data resource for computational drug target discovery and drug repositioning.
RX is funded by Case Western Reserve University/Cleveland Clinic CTSA Grant (UL1 RR024989), the Eunice Kennedy Shriver National Institute Of Child Health & Human Development of the National Institutes of Health under Award Number DP2HD084068, the Training grant in Computational Genomic Epidemiology of Cancer (CoGE) (R25 CA094186-06), and Grant #IRG-91-022-18 to the Case Comprehensive Cancer Center from the American Cancer Society. QW is partly funded by ThinTek LLC.
Publication charges for this article have been funded by the Training grant in Computational Genomic Epidemiology of Cancer (CoGE) (R25 CA094186-06).
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 5, 2015: Selected articles from the 10th International Symposium on Bioinformatics Research and Applications (ISBRA-14): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S5.
- Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P: Drug target identification using side-effect similarity. Science. 2008, 321 (5886): 263-266. 10.1126/science.1158140.
- Cohen KB, Hunter LE: Text Mining for Translational Bioinformatics. PLoS computational biology. 2013, 9 (4): e1003044. 10.1371/journal.pcbi.1003044.
- Fliri AF, Loging WT, Thadeio PF, Volkmann RA: Analysis of drug-induced effect patterns to link structure and side effects of medicines. Nature chemical biology. 2005, 1 (7): 389-397. 10.1038/nchembio747.
- Gurulingappa H, Mateen-Rajput A, Toldo L: Extraction of potential adverse drug events from medical case reports. Journal of biomedical semantics. 2012, 3 (1): 15. 10.1186/2041-1480-3-15.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter. 2009, 11 (1): 10-18. 10.1145/1656274.1656278.
- Hurle MR, Yang L, Xie Q, Rajpal DK, Sanseau P, Agarwal P: Computational drug repositioning: From data to therapeutics. Clinical Pharmacology and Therapeutics. 2013.
- Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, Roth BL: Predicting new molecular targets for known drugs. Nature. 2009, 462 (7270): 175-181. 10.1038/nature08506.
- Kinnings SL, Liu N, Buchmeier N, Tonge PJ, Xie L, Bourne PE: Drug discovery using chemical systems biology: repositioning the safe medicine Comtan to treat multi-drug and extensively drug resistant tuberculosis. PLoS computational biology. 2009, 5 (7): e1000423. 10.1371/journal.pcbi.1000423.
- Klein D, Manning CD: Accurate unlexicalized parsing. Proceedings of the 41st Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics. 2003, 1: 423-430.
- Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P: A side effect resource to capture phenotypic effects of drugs. Molecular systems biology. 2010, 6 (1).
- Kuhn M, Szklarczyk D, Franceschini A, von Mering C, Jensen LJ, Bork P: STITCH 3: zooming in on protein-chemical interactions. Nucleic acids research. 2012, 40 (D1): D876-D880. 10.1093/nar/gkr1011.
- Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Golub TR: The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006, 313 (5795): 1929.
- Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, Butte AJ: Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci Transl Med. 2011, 3 (96): 96ra77.
- Whirl-Carrillo M, McDonagh EM, Hebert JM, Gong L, Sangkuhl K, Thorn CF, Altman RB, Klein TE: Pharmacogenomics Knowledge for Personalized Medicine. Clinical Pharmacology and Therapeutics. 2012, 92 (4): 414-417. 10.1038/clpt.2012.96.
- Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Woolsey J: DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic acids research. 2006, 34 (suppl 1): D668-D672.
- Xie L, Evangelidis T, Xie L, Bourne PE: Drug discovery using chemical systems biology: weak inhibition of multiple kinases may contribute to the anti-cancer effect of nelfinavir. PLoS computational biology. 2011, 7 (4): e1002037. 10.1371/journal.pcbi.1002037.
- Xu R, Wang Q: Automatic signal prioritizing and filtering approaches in detecting post-marketing cardiovascular events associated with targeted cancer drugs from the FDA Adverse Event Reporting System (FAERS). Journal of Biomedical Informatics. 2014, 47: 171-7.
- Xu R, Wang Q: A semi-supervised approach to extract pharmacogenomics-specific drug-gene pairs from biomedical literature for personalized medicine. Journal of Biomedical Informatics. 2013, 46 (4): 585-593.
- Xu R, Li L, Wang Q: Towards building a disease-phenotype relationship knowledge base: large scale extraction of disease-manifestation relationship from literature. Bioinformatics. 2013.
- Xu R, Wang Q: Large-scale combining signals from both biomedical literature and the FDA Adverse Event Reporting System (FAERS) to improve post-marketing drug safety signal detection. BMC Bioinformatics. 2014, 15: 17. 10.1186/1471-2105-15-17.
- Xu R, Wang Q: Automatic construction and integrated analysis of a cancer drug side effect knowledge base. Journal of the American Medical Informatics Association. 2014.
- Xu R, Wang Q: Automatic construction of a large-scale and accurate drug-side-effect association knowledge base from biomedical literature. Journal of Biomedical Informatics. 2014, 51: 191-9.
- Xu R, Wang Q: Combining automatic table classification and relationship extraction in extracting anticancer drug-side effect pairs from full-text articles. Journal of Biomedical Informatics. 2014.
- Xu R, Wang Q: Large-scale extraction of drug-disease treatment pairs from biomedical literature for drug repurposing. BMC Bioinformatics. 2013, 14 (1): 181. 10.1186/1471-2105-14-181.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.