The experimental process consists of four steps: (1) build a local MEDLINE search engine; (2) develop and evaluate the KD approach and compare it to an SVM-based approach; (3) extract drug-SE pairs from MEDLINE; and (4) systematically analyze the correlation between drug-associated side effects and drug gene targets, metabolism genes, chemical similarity, and disease indications.
Build a local MEDLINE search engine
We downloaded a total of 21,354,075 MEDLINE citations (119,085,682 sentences) published between 1965 and 2012 from the U.S. National Library of Medicine (http://mbr.nlm.nih.gov/Download/index.shtml). Each sentence was syntactically parsed with the Stanford Parser [9] using the Amazon cloud computing service (a total of 3,500 instance-hours on High-CPU Extra Large instances). We used the publicly available information retrieval library Lucene (http://lucene.apache.org) to create a local MEDLINE search engine with indices on both sentences and their corresponding parse trees. We have recently used this local search engine to extract disease-manifestation relationships [19] and anticancer drug-SE pairs from MEDLINE [21].
Develop, evaluate and compare the KD approach to the SVM-based approach
We downloaded a total of 100,049 known drug-SE pairs from SIDER (Side Effect Resource), a public, machine-readable side effect resource that was automatically constructed from FDA package inserts [10]. These pairs were used as prior knowledge for the KD approach. Using each drug-SE pair from the prior knowledge data as a search query to the local MEDLINE search engine, we retrieved all MEDLINE sentences and abstracts that contain at least one known drug-SE pair. These sentences were labeled SE-related. We then extracted drug-SE co-occurrence pairs from these SE-related sentences, with the restriction that both the drug name and the SE name must be noun phrases in the parse tree of the sentence. We have recently shown that this restriction can increase the precision of biomedical relationship extraction from MEDLINE [19–22]. The drug and SE lists (996 drugs, by generic name, and 4,199 SE terms) are from SIDER.
We compared drug-SE extraction from sentences classified using the KD approach with extraction from sentences classified using an SVM-based text classifier (described later). For the comparison, we selected the ten drugs from SIDER that are associated with the largest numbers of SEs and compared the performance of the KD approach to that of the SVM approach in extracting drug-SE pairs for each of them (Figure 1). For each drug, we randomly split its drug-SE pairs into two equal parts: a training dataset and a testing dataset. For the KD approach, we first retrieved all sentences that contain at least one of the 10 drugs and at least one SE term from the SE lexicon (4,199 SE terms). We then classified these sentences as SE-related or SE-unrelated. A sentence was labeled SE-related if it contains at least one known drug-SE pair from the training dataset (the prior knowledge). We then extracted additional drug-SE pairs from these SE-related sentences.
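The split and the KD classification step can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names `split_pairs` and `kd_extract` are hypothetical, the example sentences are invented, and plain substring matching stands in for the Lucene queries and the noun-phrase restriction on parse trees.

```python
import random

def split_pairs(se_terms, seed=0):
    """Randomly split one drug's known SE terms into two equal halves."""
    shuffled = sorted(se_terms)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])

def kd_extract(sentences, drug, training_ses, se_lexicon):
    """Return SE terms that co-occur with `drug` in SE-related sentences.

    A sentence is SE-related if it mentions the drug and at least one SE
    from the training half (the prior knowledge).
    Assumption: substring matching replaces noun-phrase matching.
    """
    extracted = set()
    for sentence in sentences:
        text = sentence.lower()
        if drug not in text:
            continue
        mentioned = {se for se in se_lexicon if se in text}
        if mentioned & training_ses:   # sentence contains a known drug-SE pair
            extracted |= mentioned     # keep every SE co-occurring with the drug
    return extracted

# Invented toy data, for illustration only.
sentences = [
    "Patients on aspirin reported nausea and tinnitus.",
    "Aspirin was compared with placebo for headache.",
    "Ibuprofen caused nausea in two patients.",
]
lexicon = {"nausea", "tinnitus", "headache"}
found = kd_extract(sentences, "aspirin", {"nausea"}, lexicon)
train, test = split_pairs({"nausea", "tinnitus", "rash", "fever"})
```

In this toy run the second sentence is ignored because it contains no known pair, so "headache" is not extracted for aspirin, while "tinnitus" is picked up from the first sentence alongside the known SE "nausea".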
For the SVM-based approach, we first classified MEDLINE sentences as drug-SE-related or -unrelated using a pre-trained SVM sentence classifier. We then extracted drug-SE co-occurrence pairs from the positively classified sentences. The two-class SVM-based sentence classifier was trained using the implementation in Weka [5]. The positive training data consisted of a total of 320,175 sentences, each of which contained at least one known drug-SE pair from SIDER (the testing data for the above 10 drugs were excluded). An equal number of negative sentences was randomly selected from the remaining MEDLINE sentences. The classifier used a polynomial kernel, bag-of-words features, TF-IDF weighting, stemming and stopword removal. Bag-of-words features were used because the appearance of a single word such as 'toxicity' is often enough to determine whether a sentence is drug-SE-related. Ten-fold cross-validation was used in training the SVM classifier. The input sentences were the same for both the KD and SVM-based approaches: sentences that contain at least one of the 10 drugs and at least one term from the SE lexicon. We evaluated and compared performance using the same testing datasets. Precision, recall and F1 scores were calculated for each of the 10 drugs.
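As a rough illustration of the per-drug scoring, the following Python sketch computes precision, recall and F1 against the held-out testing half. The function name `precision_recall_f1` and the toy SE terms are hypothetical. The sketch also assumes that SEs already present in the training half are removed from the extracted set before scoring; the text does not state how such pairs were handled.

```python
def precision_recall_f1(extracted, training, testing):
    """Score the SE terms extracted for one drug against its testing half."""
    # Assumption: training SEs are already known, so they are not scored.
    candidates = set(extracted) - set(training)
    true_pos = len(candidates & set(testing))
    precision = true_pos / len(candidates) if candidates else 0.0
    recall = true_pos / len(testing) if testing else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented toy data: two of three candidates are in the testing half,
# and two of three testing SEs are recovered.
p, r, f1 = precision_recall_f1(
    extracted={"nausea", "tinnitus", "rash", "fever"},
    training={"nausea"},
    testing={"tinnitus", "rash", "vomiting"},
)
```

Because both approaches receive the same input sentences and the same testing halves, the same scoring function can be applied to the output of either one.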
Extract drug-SE pairs from 21 Million MEDLINE records
After evaluating the KD approach on 10 drugs, we scaled it up to extract drug-SE pairs from all MEDLINE sentences and abstracts. We used all 100,049 drug-SE pairs from SIDER as prior knowledge to classify all MEDLINE sentences and abstracts as SE-related or SE-unrelated. For the classification, we used each of the 100,049 drug-SE pairs from SIDER as a search query to the local MEDLINE search engine. Both sentences and abstracts containing the pair were retrieved as SE-related. Sentences that were not retrieved were assumed to be SE-unrelated and ignored. In total, we extracted 49,575 drug-SE pairs from sentences and 180,454 pairs from abstracts using the drug and SE lists (996 drugs and 4,199 SE terms) that we compiled from SIDER. These extracted drug-SE pairs were used in the subsequent semantic analysis.
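The query-based classification can be pictured with a toy inverted index in Python. This is a stand-in for the Lucene-based search engine, not a description of it: the functions `build_index` and `retrieve_se_related`, the substring matching, and the example sentences are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(sentences, vocabulary):
    """Map each vocabulary term to the ids of the sentences that mention it."""
    index = defaultdict(set)
    for sent_id, sentence in enumerate(sentences):
        text = sentence.lower()
        for term in vocabulary:
            if term in text:
                index[term].add(sent_id)
    return index

def retrieve_se_related(index, known_pairs):
    """Return ids of sentences containing at least one known drug-SE pair."""
    related = set()
    for drug, se in known_pairs:
        # A query for a pair matches sentences that mention both terms.
        related |= index.get(drug, set()) & index.get(se, set())
    return related

# Invented toy data, for illustration only.
sentences = [
    "Aspirin was associated with nausea and tinnitus.",
    "Ibuprofen caused nausea.",
    "Aspirin reduced fever.",
]
vocabulary = {"aspirin", "ibuprofen", "nausea", "tinnitus"}
index = build_index(sentences, vocabulary)
related = retrieve_se_related(index, [("aspirin", "nausea")])
```

Sentences whose ids are not returned correspond to the ones treated as SE-unrelated and ignored.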
Analyze the correlations between drug-associated side effects and genetic, genomic and chemical drug properties
Many current computational approaches for drug target discovery [7, 16] and drug repositioning [7, 12, 8, 13] use only lower-level genetic, genomic, and chemical drug properties. In this study, we investigated whether the large body of higher-level phenotypic drug-SE relationship data that we extracted from MEDLINE implicitly captures lower-level drug mechanisms and can therefore be leveraged for drug target discovery and drug repositioning. In extracting drug-SE pairs from MEDLINE, we used the generic names of FDA-approved drugs. For the correlation analysis, we used these generic names to link drug-SE pairs to drug-related information from different databases.
Correlation with drug target genes
Drug side effects are often caused by drugs acting on their target genes. We investigated whether drug-drug pairs that share SEs tend to share gene targets. We downloaded a total of 13,635 drug-target gene associations from DrugBank [15], a knowledge base for drugs, drug actions and drug targets. These drug-gene pairs comprise 3,454 drugs and 1,816 genes. We first mapped the drugs in the drug-SE pairs extracted from MEDLINE to the drugs in the drug-gene pairs from DrugBank. For drug-drug pairs that share SEs at different cutoffs, we calculated the average number of shared gene targets.
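The cutoff-based calculation can be sketched in Python as follows. The function name `average_shared_by_cutoff`, the drug and gene identifiers, and the reading of a cutoff k as "at least k shared SEs" are assumptions for illustration; the text does not specify whether a cutoff means at least k or exactly k.

```python
from itertools import combinations

def average_shared_by_cutoff(drug_ses, drug_genes, cutoffs):
    """For each cutoff k, average the number of shared genes over drug-drug
    pairs that share at least k side effects (assumed reading of 'cutoff')."""
    # Only drugs present in both resources can be compared.
    drugs = sorted(set(drug_ses) & set(drug_genes))
    results = {}
    for k in cutoffs:
        shared_counts = [
            len(drug_genes[a] & drug_genes[b])
            for a, b in combinations(drugs, 2)
            if len(drug_ses[a] & drug_ses[b]) >= k
        ]
        # None marks a cutoff that no drug-drug pair reaches.
        results[k] = sum(shared_counts) / len(shared_counts) if shared_counts else None
    return results

# Invented toy data, for illustration only.
drug_ses = {
    "drug_a": {"nausea", "rash", "fever"},
    "drug_b": {"nausea", "rash"},
    "drug_c": {"nausea"},
}
drug_genes = {
    "drug_a": {"G1", "G2"},
    "drug_b": {"G1", "G2"},
    "drug_c": {"G3"},
}
averages = average_shared_by_cutoff(drug_ses, drug_genes, cutoffs=[1, 2, 3])
```

Passing a different drug-to-set mapping in place of `drug_genes` (for example metabolism genes or disease indications) would apply the same calculation to the other shared-property counts in this section.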
Correlation with drug metabolizing genes
Drug metabolism plays a critical role in drug-associated side effects. We investigated whether drug-drug pairs that share SEs also share drug metabolism genes. We downloaded a total of 4,399 drug-gene pairs from PharmGKB [14], a repository of drug pharmacogenetics information. For drug-drug pairs that share SEs at different cutoffs, we calculated the average number of shared metabolism genes.
Correlation with genetic, genomic and chemical drug-drug relationships
Drug-related pathway, genomic and chemical relationship information was obtained from STITCH [11], a resource of known and predicted interactions of chemicals and proteins. In STITCH, chemicals are linked to other chemicals and proteins by four types of relationships: chemical reactions from manually curated pathway databases ("Database"), literature associations ("Textmining"), similar 2D structures ("Similarity") and similar activities ("Experimental") based on drug-induced perturbation at the gene expression level [12]. We used chemical-chemical relationships from curated pathway databases ("Database", 342,072 chemical-chemical pairs), chemical 2D structure ("Similarity", 607,588 chemical-chemical pairs) and gene expression ("Experimental", 238,380 chemical-chemical pairs). The text mining-based co-occurrence pairs were not used because they provide no explicit semantic relationships for chemical-chemical pairs. For drug-drug pairs that share SEs at different cutoffs, we calculated the average chemical similarity scores.
Correlation with disease indications
If high-level phenotypic drug-SE relationships implicitly capture known and unknown drug-related genetic, genomic and chemical information, then drug-SE pairs may be used directly for drug repositioning, as suggested in a recent review article [6]. In this study, we investigated whether drug-drug pairs that share SEs tend to share disease indications. We recently extracted a total of 52,000 drug-disease pairs from ClinicalTrials.gov [24], a registry of federally and privately supported clinical trials conducted in the United States and around the world. The drug-disease pairs contain 2,035 drugs and 9,591 diseases. For drug-drug pairs that share SEs at different cutoffs, we calculated the average number of shared disease indications.