Drug labeling documents
Drug labeling documents used in this study are in the Structured Product Labeling (SPL) format. SPL is a document markup standard approved by Health Level Seven International (HL7) and mandated by the FDA since 2005 as the standard XML format in which manufacturers report and share drug product information. An SPL document contains a wealth of material associated with a drug (e.g., text, tables, safety and use information, active ingredients, package inserts, packaging type), and SPL is required for all human drug products, including over-the-counter and biologic drug products. The FDA’s Center for Drug Evaluation and Research manages SPL submissions and approvals for US marketed drug products. In SPL documents, each labeling section title is coded with Logical Observation Identifiers Names and Codes (LOINC), a set of universal codes used to identify or exchange medical information. For example, the LOINC code for BW is 34066-1, and the LOINC code for WP is 43685-7. We used LOINC codes to parse the three ADR-related sections (BW, WP, AR) from the XML-based SPL file.
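The LOINC-based section lookup can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name `extract_adr_sections` and the sample document are hypothetical, the AR code (34084-4) is not quoted in the text and should be verified against the LOINC list, and real SPL files carry far more structure than the toy input shown here.

```python
import xml.etree.ElementTree as ET

# HL7 v3 namespace used by SPL documents.
NS = {"hl7": "urn:hl7-org:v3"}

# LOINC codes for the sections of interest. The BW and WP codes are quoted in
# the text; the AR code is an assumption to be checked against the LOINC list.
SECTION_CODES = {
    "34066-1": "BW",
    "43685-7": "WP",
    "34084-4": "AR",
}


def extract_adr_sections(spl_xml):
    """Return {section abbreviation: plain text} for each ADR section found."""
    root = ET.fromstring(spl_xml)
    sections = {}
    for section in root.iter("{urn:hl7-org:v3}section"):
        code = section.find("hl7:code", NS)
        if code is None:
            continue
        label = SECTION_CODES.get(code.get("code"))
        if label is None:
            continue  # not one of the three ADR-related sections
        text = section.find("hl7:text", NS)
        if text is None:
            continue
        # itertext() flattens nested markup; split/join normalizes whitespace.
        sections[label] = " ".join("".join(text.itertext()).split())
    return sections


# Toy SPL fragment: one Boxed Warning section and one unrelated section.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <component><structuredBody>
    <component><section>
      <code code="34066-1" codeSystem="2.16.840.1.113883.6.1"/>
      <title>BOXED WARNING</title>
      <text><paragraph>Risk of <content>hepatotoxicity</content>.</paragraph></text>
    </section></component>
    <component><section>
      <code code="34089-3" codeSystem="2.16.840.1.113883.6.1"/>
      <text><paragraph>Description text.</paragraph></text>
    </section></component>
  </structuredBody></component>
</document>"""

print(extract_adr_sections(SAMPLE))
```

Keying the lookup on the section code rather than the section title avoids depending on how individual manufacturers word their headings.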
FDALabel database
The FDALabel database (https://www.fda.gov/scienceresearch/bioinformaticstools/ucm289739.htm) was used to collect the drug labeling documents for this study [13]. FDALabel is developed and maintained by the FDA as a web-based application that provides access to the most up-to-date drug labeling data, supporting their use in regulatory science, drug development, and scientific research. In its latest version, FDALabel allows drug information to be queried by labeling section (e.g., BW, WP, and AR). SPL documents, the source data for FDALabel, are archived by the FDA and can be downloaded from DailyMed [41]. The current version of the FDALabel database (3/20/2017) has 94,657 SPLs, which include human prescription drugs, biological products, and over-the-counter (OTC) drugs.
FDA-approved NDA drug list
In the current version of FDALabel, 34,681 of the 94,657 SPLs are human prescription drug labeling documents (hereafter called “drug labeling”). Of note, one prescription drug can have multiple SPLs because of differences in regulatory applications, dosage forms, routes of administration, manufacturers, etc. For this study, duplicate SPLs with the same Unique Ingredient Identifier (UNII) were removed, and only the SPL with the most recent effective date was retained for each UNII. The drug list used in this study was selected using the following sequential criteria: (I) human prescription drug; (II) New Drug Application (NDA) drug; (III) single active ingredient UNII; (IV) most recent SPL of the same UNII of a drug. Finally, 1164 unique drug SPLs were extracted. The detailed drug list is provided in Additional file 5.
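The four sequential criteria can be sketched in Python as follows. The record fields (`set_id`, `category`, `application_type`, `unii`, `effective`) and the sample records are hypothetical and do not reflect the FDALabel schema; the sketch only shows the order of the filters and the "most recent SPL per UNII" rule.

```python
from datetime import date


def select_drug_list(spls):
    """Apply criteria I-IV in order and return {UNII: most recent SPL}."""
    latest = {}
    for spl in spls:
        if spl["category"] != "human prescription":  # (I)
            continue
        if spl["application_type"] != "NDA":  # (II)
            continue
        if len(spl["unii"]) != 1:  # (III) single active ingredient
            continue
        unii = spl["unii"][0]
        kept = latest.get(unii)
        # (IV) keep only the SPL with the latest effective date per UNII.
        if kept is None or spl["effective"] > kept["effective"]:
            latest[unii] = spl
    return latest


# Hypothetical records for illustration only.
SPLS = [
    {"set_id": "a", "category": "human prescription", "application_type": "NDA",
     "unii": ["U1"], "effective": date(2015, 1, 1)},
    {"set_id": "b", "category": "human prescription", "application_type": "NDA",
     "unii": ["U1"], "effective": date(2016, 6, 1)},
    {"set_id": "c", "category": "human otc", "application_type": "NDA",
     "unii": ["U2"], "effective": date(2016, 1, 1)},
    {"set_id": "d", "category": "human prescription", "application_type": "ANDA",
     "unii": ["U3"], "effective": date(2016, 1, 1)},
    {"set_id": "e", "category": "human prescription", "application_type": "NDA",
     "unii": ["U4", "U5"], "effective": date(2016, 1, 1)},
]

selected = select_drug_list(SPLS)
print({unii: spl["set_id"] for unii, spl in selected.items()})
```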
Extracting MedDRA standardized terms for ADR study using Oracle text search
In this study, MedDRA version 19.0 was used, which contains, in total, 75,818 LLTs, 21,920 PTs, 1732 HLTs, 335 HLGTs, and 27 SOCs. MedDRA has anatomical, physiological, and etiological SOCs. AEs or ADRs coded by MedDRA LLTs are classified per MedDRA’s predefined hierarchy and can be aggregated using SOCs. Of the 27 SOCs, 22 are “disorder” SOCs with PTs that are highly related to ADRs, such as Cardiac disorders and Psychiatric disorders. We removed the 5 SOCs that were not ADR specific: Injury, poisoning and procedural complications (Inj&P), Investigations (Inv), Social circumstances (SocCi), Surgical and medical procedures (Surg), and Product issues (Prod).
We extracted ADRs in drug labeling with LLTs through an Oracle Text querying strategy and then linked the LLTs to their corresponding PTs for frequency counting. We counted each PT only once per section per labeling, regardless of how many times the PT, or its subordinate LLTs, occurred within the specific labeling section. Although PTs can be linked to multiple SOCs, for our SOC level analysis, only the primary SOC was considered.
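The counting rule (each PT counted once per section per labeling, however many of its LLTs occur) can be sketched in Python. The function name `count_pts`, the tuple layout of the hits, and the small LLT-to-PT mapping are illustrative assumptions, not the data structures used in the study.

```python
from collections import Counter


def count_pts(hits, llt_to_pt):
    """Count, per (section, PT), the number of labelings that mention the PT.

    hits: iterable of (labeling_id, section, llt) tuples, one per LLT match.
    llt_to_pt: mapping from each LLT to its parent PT.
    """
    seen = set()
    for labeling_id, section, llt in hits:
        pt = llt_to_pt.get(llt)
        if pt is not None:
            # The set collapses repeated matches and sibling LLTs of one PT.
            seen.add((labeling_id, section, pt))
    return Counter((section, pt) for _, section, pt in seen)


# Illustrative mapping and matches (hypothetical labeling identifiers).
LLT_TO_PT = {
    "Heart attack": "Myocardial infarction",
    "Myocardial infarction": "Myocardial infarction",
}
HITS = [
    ("label1", "BW", "Heart attack"),
    ("label1", "BW", "Myocardial infarction"),  # same PT, same section
    ("label1", "AR", "Heart attack"),
    ("label2", "BW", "Heart attack"),
]

print(count_pts(HITS, LLT_TO_PT))
```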
The MedDRA term extraction process was conducted using Oracle Text search. First, the full-text sections of the labeling SPLs, stored as XML, were parsed into the Oracle database based on LOINC codes [13]. The text index was built in the Oracle database using basic NLP procedures, including stop word removal, stemming, and pattern matching [42, 43]. Then, the indexed text was searched with MedDRA LLTs, and the matched LLTs were mapped to PTs. Specifically, the LLTs and PTs were extracted for each drug labeling document from the three ADR-related sections (i.e., BW, WP, and AR) as well as from the whole document using structured query language (SQL). The resulting drug-PT matrix was used for further data analysis.
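As a rough illustration of dictionary-based term matching, the Python sketch below searches a section's text for LLT phrases. It is a simplified stand-in and not Oracle Text: it performs only case-insensitive whole-phrase matching, with no stop word removal or stemming, and the function name `find_llts` and the example terms are assumptions.

```python
import re


def find_llts(text, llts):
    """Return the set of LLTs that occur as whole phrases in the text."""
    # Lowercase and collapse whitespace so phrases split across lines match.
    normalized = " ".join(text.lower().split())
    found = set()
    for llt in llts:
        pattern = r"\b" + re.escape(llt.lower()) + r"\b"
        if re.search(pattern, normalized):
            found.add(llt)
    return found


section_text = """Cases of hepatic
failure and nausea were reported."""
print(find_llts(section_text, ["Hepatic failure", "Nausea", "Rash"]))
```

A production index would additionally handle inflected forms, which this sketch deliberately leaves out.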
Fisher’s exact test of SOC significance
Fisher’s exact test was performed for each individual SOC, comparing the number of PTs belonging to the SOC that occurred in BW drugs with the total number of PTs occurring in that SOC for the FDA-approved NDA drug list. Because multiple SOCs were tested, a Bonferroni-corrected threshold (p < 0.002) was used to determine whether SOCs had significantly enriched Boxed Warnings (Additional file 2).
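For readers unfamiliar with the test, the following Python sketch computes a one-sided Fisher's exact p-value from the hypergeometric distribution and a Bonferroni threshold. Several points are assumptions rather than details stated above: the choice of a one-sided (enrichment) alternative, the layout of the 2x2 table, and a family-wise alpha of 0.05 divided across the 22 retained SOCs. The example table is a textbook case, not study data.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided p-value for enrichment in the 2x2 table [[a, b], [c, d]].

    Sums hypergeometric probabilities of tables with the same margins whose
    top-left cell is at least as large as the observed value a.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, row1)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


p = fisher_exact_greater(3, 1, 1, 3)
print(p, p < bonferroni_threshold(0.05, 22))
```

Under these assumptions the threshold is 0.05 / 22, roughly 0.0023, which is in line with the cutoff quoted in the text.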
Anatomical therapeutic chemical (ATC) codes
The Anatomical Therapeutic Chemical (ATC) classification system classifies drugs by organ or system of involvement, as well as by chemical, therapeutic, and pharmacological properties. In this study, drugs were categorized into 54 ATC classes at the therapeutic/pharmacological level (the second level in the ATC hierarchy). Details can be found in Additional file 6. If a drug had multiple ATC codes, all ATCs were counted separately. ATC information for the 1164 drugs was retrieved from the DrugBank database [44]. We first mapped drugs via the active ingredient and then mapped the remaining drugs via active moiety UNIIs. In this way, 989 drug-ATC relationships were identified and used to group the drugs into ATC classes.
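Because the second ATC level corresponds to the first three characters of a full ATC code, the grouping step can be sketched in a few lines of Python. The function name `group_by_atc_level2` and the drug identifiers are hypothetical, and counting a drug once per second-level group (even if it has several codes within that group) is an assumption of this sketch.

```python
from collections import defaultdict


def group_by_atc_level2(drug_atc):
    """Map each second-level ATC group (e.g., 'L01') to its set of drugs.

    drug_atc: mapping from a drug identifier to its list of full ATC codes.
    A drug with codes in several groups is placed in every such group.
    """
    groups = defaultdict(set)
    for drug, codes in drug_atc.items():
        for code in codes:
            groups[code[:3]].add(drug)  # first three characters = level 2
    return dict(groups)


# Hypothetical drug identifiers with ATC-style codes.
DRUG_ATC = {
    "drug_a": ["N05AH03"],
    "drug_b": ["N05AX08", "N06AB03"],
    "drug_c": ["N06AB03"],
}

print(group_by_atc_level2(DRUG_ATC))
```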
Hierarchical clustering analysis
Two-way hierarchical cluster analysis (HCA) is an unsupervised learning approach primarily used for pattern discovery [45]. In this analysis, HCA was used to investigate the grouping of ADRs (along with associated PTs) for BW drugs (i.e., drugs with a BW) in terms of their similarities across drug classes (ATC). PT frequencies were log2-transformed before HCA. Extracted PT data and ATC group data were organized into a data matrix in which each row represented a single MedDRA PT and each column represented an ATC second-level group. The frequency of each PT is the number of drugs in an ATC group whose labeling contained that PT.
Some ATC groups have multiple drugs, such as antineoplastic agents (L01), psycholeptics (N05), and psychoanaleptics (N06). However, some ATC groups contain only one BW drug, such as antifungals for dermatological use (D01) and pituitary and hypothalamic hormones and analogues (H01). To reduce possible noise from low frequency values, we compiled a preprocessed data matrix containing only ATC groups with at least 5 drugs. Similarly, only PTs that appeared in at least 5 drugs overall were included. For the final analysis, 129 of 460 PTs and 25 of 54 ATCs were retained in the preprocessed data matrix (Additional file 7), which was analyzed by cluster analysis using the heatmap.1 function in R (version 3.2.1).
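The matrix preparation that precedes clustering can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the function name `build_matrix` and its inputs are hypothetical, PT counts are taken over BW drugs only, and `log2(frequency + 1)` is used so that zero frequencies remain defined, since the handling of zeros is not specified above. The clustering itself was done in R and is not reproduced here.

```python
from math import log2


def build_matrix(bw_pts, drug_groups, min_drugs=5, min_pt_count=5):
    """Return (PT rows, ATC columns, log2-transformed frequency matrix).

    bw_pts: {drug: set of PTs found in its BW section}
    drug_groups: {drug: set of second-level ATC groups}
    """
    # BW drugs per ATC group; keep only groups with enough drugs.
    members = {}
    for drug in bw_pts:
        for group in drug_groups.get(drug, ()):
            members.setdefault(group, set()).add(drug)
    groups = sorted(g for g, drugs in members.items() if len(drugs) >= min_drugs)

    # Number of drugs mentioning each PT; keep only sufficiently frequent PTs.
    pt_total = {}
    for pts in bw_pts.values():
        for pt in pts:
            pt_total[pt] = pt_total.get(pt, 0) + 1
    pts_kept = sorted(pt for pt, n in pt_total.items() if n >= min_pt_count)

    # Cell value: drugs in the ATC group whose labeling contains the PT.
    matrix = []
    for pt in pts_kept:
        row = []
        for group in groups:
            freq = sum(1 for drug in members[group] if pt in bw_pts[drug])
            row.append(log2(freq + 1))  # +1 is an assumption to handle zeros
        matrix.append(row)
    return pts_kept, groups, matrix


# Toy input with lowered thresholds so the filters are visible.
BW_PTS = {
    "d1": {"Hepatotoxicity", "Death"},
    "d2": {"Hepatotoxicity"},
    "d3": {"Death"},
}
DRUG_GROUPS = {"d1": {"L01"}, "d2": {"L01"}, "d3": {"N05"}}

print(build_matrix(BW_PTS, DRUG_GROUPS, min_drugs=2, min_pt_count=2))
```

In the toy input, group N05 is dropped because it contains a single drug, mirroring the exclusion of single-drug groups such as D01 and H01.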