Predicting potential adverse events using safety data from marketed drugs

Background While clinical trials are considered the gold standard for detecting adverse events, these trials are often not sufficiently powered to detect difficult-to-observe adverse events. We developed a preliminary approach to predict 135 adverse events using post-market safety data from marketed drugs. Adverse event information available from FDA product labels and scientific literature for drugs that have the same activity at one or more of the same targets, structural and target similarities, and the duration of post-market experience were used as features for a classifier algorithm. The proposed method was studied using 54 drugs and a probabilistic approach to performance evaluation using bootstrapping with 10,000 iterations.
Results Out of 135 adverse events, 53 had a high probability of having high positive predictive value. Cross validation showed that 32% of the model-predicted safety label changes occurred within four to nine years of approval (median: six years).
Conclusions This approach predicts 53 serious adverse events with high positive predictive values where well-characterized target-event relationships exist. Adverse events with well-defined target-event associations were better predicted compared to adverse events that may be idiosyncratic or related to secondary target effects that were poorly captured. Further enhancement of this model with additional features, such as target prediction and drug binding data, may increase accuracy.


Background
The Food and Drug Administration's (FDA) proposed process modernization to support new drug development involves establishing a unified post-market safety surveillance framework to monitor the benefits and risks of drugs across their lifecycles [1]. While clinical trials are considered the gold standard for detecting and labeling adverse events, these trials are not sufficiently powered to detect less common adverse events. Additionally, some adverse events emerge when a drug is used in clinical practice outside of the specified inclusion/exclusion criteria. Some adverse events may have high prevalence in specific subpopulations who were not enrolled in the clinical trials or subgroups who cannot be identified based on information collected from patients in the trials. For example, a substantially increased risk of Stevens-Johnson syndrome in patients positive for the HLA-B*1502 allele taking carbamazepine was not identified until decades after approval [2]. In addition, concomitant medications (drug-drug interactions) and comorbidities may also contribute to adverse events, and these interactions are not always adequately present or captured in clinical trials. Therefore, post-market safety surveillance is crucial.
FDA uses the FDA Adverse Event Reporting System (FAERS) [3] and the Sentinel Initiative [4] to obtain information about adverse events occurring after drug approval. In 2017, over 1.8 million adverse event cases were reported to the FDA, including nearly 907,000 serious reports and over 164,000 fatal cases [5]. While traditional pharmacovigilance relies on data mining systems, these methods have reporting biases and require manual review of cases to determine reporting accuracy. Recently, there has been a strong interest in developing prediction algorithms to assist in post-market surveillance to overcome such weaknesses and make post-market pharmacovigilance more efficient.
Adverse event information from a variety of sources such as FAERS, literature, genomic data, and social media has been used to both evaluate adverse events and make predictions. For example, FAERS and similar post-market databases have demonstrated utility in adverse event prediction; Xu and Wang showed that FAERS, combined with literature, had great utility in detecting safety signals [6]. Others have used chemical structure as the basis for adverse event predictions. Vilar and colleagues used molecular fingerprint similarity to drugs with a known association with rhabdomyolysis to further support and prioritize rhabdomyolysis signals found in FAERS [7]. Another option has been to use social media reports to identify new adverse events for drugs before they are reported to regulatory agencies or in peer-reviewed literature; Yang and colleagues used a partially supervised classification method to identify reports of adverse events on the discussion forum for Medhelp [3]. Other sources of information for adverse event prediction and detection include electronic health records, drug labels and even bioassay data [8][9][10]. Additionally, a wide variety of algorithms have been used to make adverse event predictions, including logistic regression models, support vector machines, and ensemble methods [8,11,12]. Many of these models have experienced varying degrees of success but overall demonstrate the great potential of developing an adverse event prediction model using a classifier.
However, many of these methodologies have focused on predicting a specific adverse event (e.g. cardiovascular events) or drug class (e.g. oncology drugs) [12][13][14]. Algorithms that can predict a wide variety of adverse events for multiple drug classes are important to enhance post-market safety surveillance. We have previously developed a genetic algorithm to predict approximately 900 adverse events using FDA product labels and FAERS data [15]. In this study, we build on this algorithm to predict 135 adverse events of high priority to regulatory review using safety data from marketed drugs with one or more shared molecular targets. We hypothesize that drugs that have similar modes of action at the same targets will have a similar adverse event profile because of shared structural features and likely target binding characteristics. We additionally expect adverse events that are more closely associated with drug targets (such as serotonin syndrome) to be well-predicted via this methodology. Some idiosyncratic reactions may also be captured well because the shared structural features likely play a role in these reactions where the targets and actions have not yet been fully characterized.

Results
Inclusion and exclusion criteria resulted in 54 test drugs and 213 unique comparator drugs, leading to 287 test-comparator drug combinations. The 54 test drugs used in this study had one to 37 comparator drugs, with one and two comparators being most frequent, as identified by DrugBank (Fig. 1a), and were on the market for four to nine years (Fig. 1b). Tanimoto similarity scores between test drugs and comparator drugs ranged between 0.02 and 1, with 0.51 being the mean and 0.5 being the mode. Eighteen test drug-comparator associations included a biologic, as defined by a Tanimoto score of −1 (Fig. 1c). Target cosine similarity scores between test drugs and comparator drugs ranged between 0 and 1, with 0.45 being the mean and 1 being the mode (Fig. 1d). Seventy-nine comparator drugs were approved before 1982, while the most recently approved comparator drug had five years of time in market (Fig. 1e). The 54 test drugs are known to bind to 126 targets based on DrugBank data (summarized in Supplemental Table 1).
The prevalence of the 135 adverse events considered in this study is summarized in Fig. 2. The overall prevalence of adverse events was higher in the comparator drugs.
Prediction models were not made for 26 adverse events that were not observed or observed only in one test drug label (accident, anaphylactoid reaction, aplastic anaemia, apnoea, atrioventricular block, azotaemia, cardiomyopathy, cerebral infarction, coagulopathy, colitis, colitis ulcerative, Crohn's disease, dermatitis bullous, dermatitis exfoliative, gastric ulcer, granulocytopenia, hepatic necrosis, hypokinesia, injury, myopathy, oliguria, respiratory depression, road traffic accident, skin ulcer, thrombosis, and ulcer).
Fig. 1 Characteristics of test drugs, comparator drugs and test-comparator drug combinations. a) Distribution of number of comparator drugs for test drug. b) Distribution of time on market for test drugs. c) Tanimoto score distribution for test-comparator drug combinations. d) Target similarity score distribution for test-comparator drug combinations. e) Distribution of time on market for comparator drugs
Results at varying thresholds (the minimum percentage of comparator drugs which are predicted positive for an adverse event to result in a positive prediction) for the safety label change evaluation and the number of adverse events with left-skewed positive predictive value, which demonstrated a high probability for high positive predictive value, are summarized in Table 1. Based on these results, we selected 70% as the optimum threshold. This resulted in the highest number of adverse events with high positive predictive values along with a high percentage of predicted safety label changes that were also issued by FDA (32%). All performance histograms at 70% threshold for each adverse event are provided in supplementary materials. Positive predictive value histograms of two well-predicted (i.e. left-skewed histograms) adverse events (febrile neutropenia and hypertension) and two poorly-predicted (i.e. right-skewed histograms) adverse events (bacterial infection and haemorrhage) are shown in Fig. 3. Fifty-three adverse events showed 100% as the positive predictive value mode, with the median between 50 and 100%, 25% quantile between 0 and 100%, and 75% quantile at 100%, which suggests left-skewed distributions. By having a left-skewed distribution for positive predictive value, these adverse events were considered well-predicted, which suggests high probability of having high positive predictive value (Table 2). Additionally, these adverse events had a sensitivity mode between 0 and 100%, specificity mode of 100%, and negative predictive value mode of 50-100%.
Fifty-six adverse events had a positive predictive value mode between 0 and 33%, which suggested a right-skewed distribution, and thus were considered poorly-predicted (Table 3). While the positive predictive value was low, all these adverse events did have high specificity (mode: 76-100%) and negative predictive value (mode: 55-91%). Two adverse events, bacterial infection and fungal infection, additionally had high sensitivity (mode: 100%) (Table 3).

Discussion
In this study we developed a preliminary approach to predict 135 adverse events of high priority to regulatory review using post-market safety data from marketed drugs that have the same activity at one or more of the same targets. We identified 53 adverse events that were well-predicted with this approach and chose to use a threshold which optimizes positive predictive value. These adverse events had varying sensitivity, but high specificity and negative predictive value. A model with high positive predictive value but low sensitivity will miss some true adverse events, but this was deemed acceptable for this study. In discussions about optimizing either positive predictive value or sensitivity in this study, it was deemed more important to identify adverse events that are most likely to be true and save time and effort sifting through false positives. In practice, a balance between sensitivity and positive predictive value would likely be optimal in conjunction with a manual review of predictions. Adverse event predictions based on molecular targets have multiple applications. We may be able to identify difficult-to-observe events that are not commonly seen in clinical trials to statistical significance. Predicted adverse events may be able to augment post-marketing surveillance activities by providing a list of adverse events to monitor.
Table 2 Performance and prevalence of adverse events that were well-predicted by the algorithm
Table 3 Performance and prevalence of adverse events that were poorly-predicted by the algorithm
If an adverse event is discovered during pre-market evaluation or post-market utilization, examination of other drugs with similar pharmacologic mechanism and activity may help evaluate causality of the event and determine if further studies are necessary based on information from all comparators, not necessarily limited to those with the same indication. Particularly, examination of secondary targets may be useful, as this may explain the emergence of an adverse event or why a particular drug is at lower risk for adverse events traditionally labeled as a class adverse event. While the preliminary approach presented here is considered a tool for hypothesis generation, further evaluation and refinement will determine if it is useful in regulatory safety review.
The method reported in this study matches safety data based on drug activity at one or more of the same known targets. This may limit the predictive ability, as some adverse events may be idiosyncratic or be associated with unknown secondary targets, and thus the mechanisms responsible for the event have not yet been identified. Associations may still be identified, however, if overlapping structural features capture this unknown shared idiosyncratic activity. This method can be expanded to match a drug not only based on drug activity at one or more of the same targets, but also considering other features which characterize the drug activity, such as Anatomical Therapeutic Chemical (ATC) codes or binding strength (Ki). ATC codes, developed by the World Health Organization, may provide insight into drugs that are related by mechanism or therapeutic use [16]. Binding strength to targets of interest, which may be obtained from literature or databases such as the Psychoactive Drug Screening Program [17] or ChEMBL [18], may provide further classification of target similarity by identifying comparator drugs that bind to targets of interest at a similar order of magnitude. The model also does not capture drug dose that may be needed to produce the required target activity.
Fifty-six adverse events were predicted with low positive predictive value. Therefore, a positive prediction for these adverse events should be carefully reviewed by experts before reaching a conclusion. In practice, expert review augments this by assessment of FDA Adverse Event Reporting System (FAERS) reports, literature, and more recently evaluations using insurance claims and electronic health data. Reviewers may examine predictions made by this algorithm by reviewing literature and other databases to identify plausible mechanisms for the drug eliciting the reaction, or review cases in FAERS and electronic health records. More detail about evaluation of safety signals at the FDA can be found in Szarfman et al. [19]. Analysis of the poor-performing adverse events in this study identified several clinical patterns: hemorrhage (including "haemorrhage", "haematoma", and "rectal haemorrhage"), infection (including "cellulitis", "fungal infection", and "bacterial infection"), and psychiatric (including "paranoia", "delirium", and "hallucination") adverse events were among the worst-performing events by positive predictive value. Many of these adverse events may be idiosyncratic or related to unknown secondary target effects, and therefore it is difficult to predict an adverse event based on the known drug targets. This study may have been limited by the known targets that are available in DrugBank, as DrugBank may not contain all known secondary targets for all drugs. To better capture adverse events that may be related to secondary drug targets, target prediction for the test drugs and comparator drugs may be incorporated to better match comparator drugs to test drugs. DrugBank contains limited target predictions, so another source would be used.
This study had several limitations. First, the current version of Embase only allows users to extract manually curated adverse events by date for one drug at a time, which makes this process time-intensive for a large set of test drugs and their comparators and thus limited the number of drugs used in this study. We tried to address this limitation by using a probabilistic approach of performance evaluation using bootstrapping. Creating a tool to automate extraction of these adverse events may alleviate the manual burden. Additionally, text-mining FDA labels for adverse events is most accurate when used on a structured document, and thus we elected to use test drugs that had labels available in SPL format. While an assessment of the text-mining for 20 labels showed positive predictive value, sensitivity, and F-score at approximately 90% (unpublished data, Racz et al., 2018), we anticipate larger text-mining errors. This assessment identified patterns in the text-mining algorithm that may lead to errors, and the query is currently being updated to improve performance. Finally, several adverse events were not observed or observed with low prevalence in the test drug set. Further analysis of these adverse events identified some events that may be associated with targets that were not substantially analyzed. This includes events such as "respiratory depression", which is particularly associated with drugs such as benzodiazepines and opioids and their related receptors [20], and "hypokinesia", which may be associated with dopamine receptors [21]. Other adverse events, such as "anaphylactoid reaction" and "apnea", may be reported interchangeably with other MedDRA Preferred Terms, such as "anaphylactic reaction" and "sleep apnea", respectively; therefore, these terms may be reported in lower frequency. To better capture this, we may consider alternative groupings or adding additional terms to complete a mechanistically related grouping.

Conclusions
This classifier algorithm predicts significant adverse events that are of high priority for regulatory monitoring, some of which may be difficult to observe in clinical trials. The prediction algorithm uses evidence of adverse events available through FDA product labels and scientific literature for drugs that have the same activity at one or more of the same targets along with structural and target similarities and the duration of post-market experience. For this study, we prioritized achieving high positive predictive value for the adverse event prediction. The model achieved high positive predictive value on 53 out of 135 adverse events, including several adverse events with well-characterized target relationships. We found that 32% of the model-predicted safety label changes were FDA-issued within four to nine years after approval.

Selection of adverse events for evaluation
This methodology predicts 135 adverse events identified by FDA medical experts and reviewers to be of high priority to regulatory review and the pharmacovigilance efforts of the Office of Surveillance and Epidemiology. High priority was determined by FDA pharmacovigilance experts as events that are serious, may be life-threatening or debilitating, or represent frequent events that result in the need for safety label changes. These 135 adverse events were derived using 167 MedDRA Preferred Terms, grouped by mechanistic similarity according to FDA medical experts. For example, "pancreatitis" and "pancreatitis acute" are mechanistically similar and may be reported interchangeably, thus they were captured as one adverse event, "pancreatitis". The 135 adverse events and the 167 MedDRA Preferred Terms used to define them are listed in Table 4. MedDRA is the Medical Dictionary for Regulatory Activities and is the international medical terminology developed under the auspices of the International Council for Harmonization of Technical Requirements for Pharmaceuticals for Human Use [22]. MedDRA Preferred Terms are medical concepts for symptoms, signs, diagnoses, indications, investigations, procedures, and medical, social, or family history. The FDA Adverse Event Reporting System (FAERS) currently codes reported adverse events as MedDRA Preferred Terms, and all terms from other sources were converted to MedDRA Preferred Terms as described below.
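The grouping of Preferred Terms into adverse events can be illustrated with a minimal Python sketch. Only the pancreatitis grouping is taken from the text; the dictionary name `PT_TO_EVENT`, the function `group_terms`, and the choice to ignore unmapped terms are assumptions made for illustration and do not reproduce the mapping in Table 4.

```python
# Hypothetical mapping from MedDRA Preferred Terms to grouped adverse events.
# Only the pancreatitis grouping is described in the text; the complete
# mapping (167 terms to 135 events) is given in Table 4 and is not shown here.
PT_TO_EVENT = {
    "pancreatitis": "pancreatitis",
    "pancreatitis acute": "pancreatitis",
}


def group_terms(preferred_terms):
    """Return the set of grouped adverse events for a list of Preferred Terms."""
    events = set()
    for term in preferred_terms:
        key = term.strip().lower()
        # Assumption of this sketch: terms outside the mapping are ignored.
        if key in PT_TO_EVENT:
            events.add(PT_TO_EVENT[key])
    return events


if __name__ == "__main__":
    print(group_terms(["Pancreatitis acute", "pancreatitis"]))
```

Both terms collapse to the single adverse event "pancreatitis", so a label mentioning either one is scored identically.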

Drug set selection
Selection of test drugs Fifty-four drugs approved by FDA between 2008 and 2013 were chosen for this analysis. Analyses were based on available Structured Product Labeling for products and required both an original label and a subsequent version of the label for this assessment. As Structured Product Labeling began in 2006, 2008 was selected to allow time for the requirement to be adequately implemented. The year 2013 was selected as the upper bound to allow at least four years of post-market experience to 2017, which is the median time for a regulatory action on a safety event (e.g. updating a drug label) [23]. Of the drugs approved between 2008 and 2013, drugs were included as long as there was at least one other U.S. marketed drug with the same pharmacological activity at one or more of the same known targets. Additional inclusion criteria were systemic exposure (e.g. not ophthalmic only) and multiple doses (i.e. drugs with single dose administration were excluded) due to an increased likelihood of multiple and significant adverse events.
Selection of comparator drugs Comparator drugs, defined as drugs that have the same activity (i.e. agonist or antagonist) at one or more of the same targets as the test drug, were chosen using DrugBank [24]. Test and comparator drug targets were identified if the drug had "pharmacological action" at the target (i.e. the column "pharmacological action" in DrugBank must read "yes" as opposed to "no" or "unknown") and must have a defined action column in DrugBank (i.e. "antagonist" or "agonist") at the
Comparator drug adverse events were text-mined using Linguamatics I2E (Enterprise Release, Linguamatics Limited, Cambridge, United Kingdom). Adverse events were extracted as MedDRA Preferred Terms from Boxed Warnings, Warnings and Precautions, and Adverse Reactions sections. For each comparator drug, the FDA product label in use at the time of the respective test drug approval was used as the source for text-mining (e.g. if a test drug was approved on November 1, 2010, the comparator drug labels that were in use on November 1, 2010 were mined).
For each drug label and adverse event, the presence or absence of a MedDRA Preferred Term was indicated by "1" or "0", respectively. The classifiers were trained on and performance was analyzed using test drug label data from 2017. To assess the algorithm's ability to predict future safety label changes at the approval date (described in detail in "Classifier" below), the difference between drug label data from 2017 and the label at approval (2008-2013) was used.
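The binary encoding and the label-change comparison described above can be sketched as follows. This is an illustrative Python sketch rather than the study's code; the function names and the three-event example are hypothetical, and the example label contents are invented for demonstration only.

```python
def encode_label(label_events, all_events):
    """Map each adverse event to 1 if it appears on the label, otherwise 0."""
    return {event: int(event in label_events) for event in all_events}


def label_changes(approval_vec, later_vec):
    """Events absent at approval (0) but present on the later label (1)."""
    return sorted(
        event
        for event, present in later_vec.items()
        if present == 1 and approval_vec.get(event, 0) == 0
    )


if __name__ == "__main__":
    # Illustrative event list and label contents, not data from the study.
    events = ["febrile neutropenia", "hypertension", "pancreatitis"]
    at_approval = encode_label({"hypertension"}, events)
    in_2017 = encode_label({"hypertension", "pancreatitis"}, events)
    print(label_changes(at_approval, in_2017))
```

In this toy case the only event added between the two label versions is "pancreatitis", which is the kind of safety label change the evaluation tries to anticipate.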

Adverse events from scientific literature
Adverse events from scientific literature were mined using Embase Biomedical Database (Elsevier B. V, Amsterdam, The Netherlands), a biomedical database covering journals

Comparator drug duration in market
Comparator time in market was included as a feature. The longer a drug has been marketed, the more adverse events, particularly difficult to observe adverse events, are identified and evaluated for labeling. The duration in market for comparator drugs was determined from the Orange Book [26]. Drugs that were approved before 1982 have an approval date listed as "Approved Prior to Jan 1, 1982"; the duration in market for these drugs was imputed to be 36 years (1982 to 2017).

Structural similarity
Structural similarity was included as a feature as it was hypothesized that the more structurally similar a comparator drug was to a test drug, the more likely they were to share pharmacology, including unknown secondary pharmacology that was not included in this analysis and may contribute to similar idiosyncratic reactions. Structural similarities of each test drug to its respective comparator drugs were determined using Tanimoto scores. Simplified Molecular Input Line Entry System (SMILES) structures for all test and comparator drugs were imported into the Tanimoto Matrix workflow in the KNIME Analytics Platform (version 3.3.2) [27]. Structures were then converted to MACCS 166-bit fingerprints, and structural similarity between the test drug and the respective comparator drug was determined. For biologics where a similarity score was not available, −1 was imputed as the Tanimoto score.
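For readers unfamiliar with the Tanimoto score, the following Python sketch shows how it is computed from two binary fingerprints. It is not the KNIME workflow used in the study: fingerprints are represented here as sets of "on" bit indices, generating MACCS keys from SMILES is out of scope, and the names `tanimoto`, `structural_similarity`, and `BIOLOGIC_SCORE` are hypothetical.

```python
BIOLOGIC_SCORE = -1.0  # value imputed when no fingerprint is available


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch when both are empty
    return len(a & b) / union


def structural_similarity(fp_test, fp_comparator):
    """Return the Tanimoto score, or -1 if either fingerprint is missing."""
    if fp_test is None or fp_comparator is None:
        return BIOLOGIC_SCORE
    return tanimoto(fp_test, fp_comparator)


if __name__ == "__main__":
    # Toy fingerprints: two shared bits out of four distinct bits.
    print(structural_similarity({1, 2, 3}, {2, 3, 4}))
    print(structural_similarity(None, {2, 3, 4}))
```

Using −1 for biologics keeps them outside the 0 to 1 range of real scores, so the classifier can treat "no structure available" as its own case.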

Target similarity
Target similarity, or how closely the target profile of each comparator aligned with that of the test drug, was included as a feature as it was hypothesized that the more targets a comparator shares with a test drug, the more likely it is that a comparator and test drug share adverse events. The set of known pharmacological targets for each test drug and corresponding comparator drugs was extracted from DrugBank [24]. Target similarities of each test drug with its comparator drugs were determined using target-based cosine similarity scores. To compute them, a trivalent drug-by-target matrix was constructed such that for each drug-target pair an entry of "1" indicates drug-target activation, an entry of "-1" indicates drug-target inhibition, and an entry of "0" indicates no pharmacological activity. Cosine similarities between the test drug and its comparator drugs were then computed using the standard cosine similarity, $\cos(\mathbf{t}, \mathbf{c}) = \frac{\mathbf{t} \cdot \mathbf{c}}{\lVert \mathbf{t} \rVert \, \lVert \mathbf{c} \rVert}$, where $\mathbf{t}$ and $\mathbf{c}$ denote the rows of the drug-by-target matrix for the test drug and the comparator drug, respectively.
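A minimal Python sketch of the target-based cosine similarity follows, assuming the standard definition (dot product divided by the product of the vector norms). The three-target example and the function name are hypothetical, and returning 0 for an all-zero vector is a convention chosen here, not something stated in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two trivalent target vectors (1, -1, or 0)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a drug with no recorded targets scores 0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical targets T1..T3: 1 = activation, -1 = inhibition, 0 = none.
    test_drug = [1, -1, 0]
    comparator = [1, 0, 0]
    print(round(cosine_similarity(test_drug, comparator), 3))
```

Here the comparator shares one of the test drug's two targets with the same action, giving a similarity of about 0.707; a comparator with an identical target profile would score 1.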

Classifier
Five features were defined for each comparator drug-test drug-adverse event association: 1) presence or absence of an adverse event in the FDA drug label for the comparator drug; 2) presence or absence of an adverse event in scientific literature for the comparator drug; 3) structural similarity between comparator drug and test drug; 4) target similarity between comparator drug and test drug; and 5) duration the comparator drug was on the market (Fig. 4), all of which were treated as independent of each other. These features were used to train a Naïve Bayes classifier, using presence or absence of an adverse event in the 2017 FDA drug label for the test drug as the training label (see section Adverse Events from FDA Drug Labels for details). Given the wide range of prevalence of adverse events, we anticipated that the contribution of prevalence to model prediction would be high. Therefore a Naïve Bayes classifier was chosen in order to take into account both the prior probability (i.e. prevalence of an adverse event) and the likelihood of an adverse event being present. All statistical calculations were conducted in R version 3.2.2 (R Foundation for Statistical Computing, Vienna, Austria) and the Naïve Bayes classifier from package e1071 was used [28] (see supplemental materials for code). Due to the limited number of drugs available for testing and the high dimensionality of prediction (135 adverse events), 10,000 bootstrapping steps were conducted by selecting a random set of 44 drugs to train the Naïve Bayes classifier, while leaving 10 drugs for testing at each iteration (i.e. 10,000 of the $\binom{54}{44}$ possible train-test splits). A prediction was made by each comparator drug-test drug association for an adverse event of interest. Therefore, since a single test drug can have multiple comparator drugs, there may be multiple predictions for one test drug for each adverse event of interest.
To resolve this, if the percentage of comparator drug-test drug combinations that predicted the adverse event of interest was above a predefined threshold, the adverse event was considered a positive prediction for the test drug. To identify the optimum threshold, performance was calculated at thresholds of 0, 10, 30, 50, 60, 70, and 90%.
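The resampling split and the threshold rule can be sketched in Python as below. This is an illustrative sketch and not the R code from the supplemental materials: the classifier itself is omitted, the function names are hypothetical, and the strict "greater than" comparison reflects one reading of "above a predefined threshold".

```python
import random


def split_drugs(drugs, n_train=44, rng=random):
    """Randomly pick n_train drugs for training; the rest are held out."""
    shuffled = list(drugs)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def aggregate(comparator_predictions, threshold_pct):
    """Return 1 if the share of positive comparator predictions exceeds the threshold."""
    if not comparator_predictions:
        return 0
    pct = 100.0 * sum(comparator_predictions) / len(comparator_predictions)
    return int(pct > threshold_pct)


def positive_predictive_value(predicted, observed):
    """True positives divided by all positive predictions; None if there are none."""
    tp = sum(1 for p, o in zip(predicted, observed) if p == 1 and o == 1)
    fp = sum(1 for p, o in zip(predicted, observed) if p == 1 and o == 0)
    if tp + fp == 0:
        return None
    return tp / (tp + fp)


if __name__ == "__main__":
    train, test = split_drugs(range(54), rng=random.Random(0))
    print(len(train), len(test))
    # Three of four comparators predict the event: 75% exceeds a 70% threshold.
    print(aggregate([1, 1, 1, 0], 70))
```

Repeating the split many times and recording the positive predictive value at each iteration yields the per-event histograms whose skew is used to judge how well an adverse event is predicted.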

Funding
This project was supported in part by a research fellowship from the Oak Ridge Institute for Science and Education through an interagency agreement between the Department of Energy and the Food and Drug Administration (FDA). The funding body played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Availability of data and materials
The datasets supporting the conclusions of this article are included within the article (and its additional files).

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
RR's spouse is an employee of AstraZeneca. All other authors have no competing interests to declare.