
Mining hidden knowledge for drug safety assessment: topic modeling of LiverTox as a case study



Given its significant impact on public health and drug development, drug safety has been a focal point and research emphasis across multiple disciplines and stakeholders beyond scientific investigation, including consumer advocates, drug developers and regulators. This concern and effort have led to numerous databases of drug safety information becoming available in the public domain, most of which contain substantial textual data. Text mining offers an opportunity to leverage the hidden knowledge within these textual data to enhance the understanding of drug safety and thus improve public health.


In this proof-of-concept study, topic modeling, an unsupervised text mining approach, was performed on the LiverTox database developed by the National Institutes of Health (NIH). LiverTox is structured as one document per drug, and each document contains multiple sections summarizing clinical information on drug-induced liver injury (DILI). We hypothesized that these documents might contain specific textual patterns that could be used to address key DILI issues. We focused the study on drug-induced acute liver failure (ALF), a severe form of DILI with limited treatment options.


After topic modeling of the "Hepatotoxicity" sections of LiverTox across 478 drug documents, we identified a hidden topic relevant to Hy's law, a widely accepted rule for identifying drugs with a high risk of causing ALF in humans. Using this topic, a total of 127 drugs were further implicated, 77 of which had clear ALF-relevant terms in the "Outcome and management" sections of LiverTox. For the remaining 50 drugs, evidence supporting risk of ALF was found for 42 drugs in other public databases.


In this case study, the knowledge buried in the textual data was extracted to identify drugs with the potential to cause ALF by applying topic modeling to the LiverTox database. This knowledge further guided the identification of drugs with similar potential, most of which could be verified and confirmed. This study highlights the utility of topic modeling to leverage information within textual drug safety databases, which provides new opportunities in the big data era to assess drug safety.


In recent years, numerous drug safety databases have been made publicly available, e.g., LiverTox [1], SIDER [2], TOXNET, the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS), and PubMed. These databases contribute significantly to the research community, facilitating an enhanced understanding of drug safety issues [3]. Mining large-scale drug safety data is a promising avenue for drug regulation [4]. Some databases integrate safety data from various sources in free-text format; for these, text mining can be an effective way to leverage the textual information, gain knowledge of drug safety, and thus address critical safety issues that are difficult to approach with other databases.

Topic modeling is a widely used text mining approach for analyzing large volumes of unlabeled documents in order to discover hidden textual patterns [5]. Previous studies in our group demonstrated that topic modeling could be used effectively to analyze adverse events in FDA-approved drug labels for drug safety assessment [6], and to identify opportunities for drug repositioning [7]. The National Institutes of Health (NIH) LiverTox database provides comprehensive clinical information on drug-induced liver injury (DILI), summarized in free-text format in several sections.

In this study, we extended our text mining effort with topic modeling to the LiverTox database to ask whether additional knowledge, beyond what is described in the documents, could be extracted to guide an enhanced DILI assessment. We placed our emphasis on drug-induced acute liver failure (ALF), a severe form of DILI with limited treatment options and therefore a significant public health impact. With topic modeling, we identified a topic indicating a drug's liability to cause ALF based on the text in the "Hepatotoxicity" sections of LiverTox. The identified topic further guided the identification of other drugs with similar liability and, importantly, most of them could be verified and confirmed with additional data. This proof-of-concept study demonstrates the potential utility of applying topic modeling to existing text documents in the public domain to gain knowledge that can serve as a predictive means for the enhanced assessment of drug safety.


LiverTox database

The NIH LiverTox database was developed by the Liver Disease Research Branch of the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and the National Library of Medicine (NLM) to promote basic and clinical research on DILI [1]. It is a free online source of textual documents on DILI summarized from various databases, the scientific literature, and the interpretations of the curators. LiverTox contains a set of documents (one document per drug), and each document contains multiple sections. Each section provides a different set of DILI information, including introduction, background, hepatotoxicity, mechanism of injury, outcome and management, and others (e.g., case report, product information, chemical structure, and references). In this study, only the "Hepatotoxicity" section was used for topic modeling because it mainly contains the DILI-relevant clinical observations. The findings were compared against the information in other sections, such as "Outcome and management", to demonstrate the utility of the method. Where no clear ALF evidence was presented in those sections, the results were compared to data from other sources. The "Hepatotoxicity" section for each drug contains, on average, 200-400 words that summarize the DILI-relevant information, including clinical features, time to onset and recovery, liver enzymes (frequency of elevation, fold change, and serum levels), liver injury pattern, immunoallergic and autoimmune features, and other hepatotoxicity-relevant data. A total of 478 documents (i.e., 478 drugs) were used for topic modeling.

Topic modeling

Latent Dirichlet allocation (LDA), one of the most popular topic modeling approaches [5, 8–10], was applied to explore the LiverTox database. We used the LDA implementation in Mallet, a Java-based package, for topic modeling [11]. The number of topics that optimally represents the content of all documents is a key parameter of a topic model. It can be determined by fitting models with different numbers of topics to the data and estimating the fit of each model by the likelihood of the data given that model [10]. To obtain sparse topic and word distributions, the Dirichlet hyperparameters alpha (α) and beta (β) were set to 0.1 and 0.01, respectively. Before topic modeling, English stop-words and numerical digits were removed. In addition, three words (i.e., liver, injury, and elevation) that appeared in more than 80% of the documents were also removed, because words with high frequency across documents do not provide discriminative information for topics. The words in each document were then tokenized and passed to LDA to train a topic model. The model yields two probability distributions: one gives a probability value (θ) for each topic that indicates its relevance to each document, and the other assigns a probability value to each word that indicates its relevance to each topic.
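The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: the stop-word list is a tiny stand-in for a full English list, the five example sentences are invented, and the function names `tokenize` and `drop_frequent_words` are hypothetical. Training the LDA model itself (done with Mallet in the study) is not reproduced here.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a full English list.
STOP_WORDS = {"the", "of", "and", "in", "to", "a", "is", "with", "for", "are"}

def tokenize(text):
    """Lowercase a document and keep alphabetic tokens only, which also drops digits."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def drop_frequent_words(token_docs, max_doc_fraction=0.8):
    """Remove words that occur in more than max_doc_fraction of the documents."""
    doc_freq = Counter()
    for tokens in token_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    limit = max_doc_fraction * len(token_docs)
    too_common = {w for w, n in doc_freq.items() if n > limit}
    cleaned = [[t for t in tokens if t not in too_common] for tokens in token_docs]
    return cleaned, too_common

# Invented example sentences, not taken from LiverTox.
docs = [
    "Liver injury with jaundice in 2 of 10 patients",
    "Liver enzyme elevation is rare",
    "Liver injury with rash after 4 weeks",
    "Acute liver failure is reported",
    "Mild liver test changes without symptoms",
]
tokenized = [tokenize(d) for d in docs]
cleaned, too_common = drop_frequent_words(tokenized)
print(too_common)
print(cleaned[0])
```

In this toy corpus only "liver" exceeds the 80% document-frequency threshold, which mirrors the reasoning in the text: a word present in nearly every document cannot help distinguish one topic from another.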

Identification of ALF-Topic

As listed in Table 1, 26 drugs known to cause ALF were selected and used to identify the topic most relevant to ALF in the topic model. Of these, 23 drugs were annotated by Suzuki et al. with a justified causality assessment from the ALF survey conducted in the United States [12]. The other 3 drugs (i.e., benzbromarone, tolcapone, and troglitazone) were withdrawn from the market due to drug-induced ALF.

Table 1 Summary of topic model for the 26 known ALF drugs in LiverTox database.

For the 26 known ALF drugs, a mean topic distribution was calculated to determine the topic that best represents these drugs. This so-called ALF-Topic is defined as the topic j for which the mean probability value θj is the greatest among all topics. In this model, other drugs highly associated with ALF-Topic are expected to be related to ALF.
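The selection rule above (average θ over the known ALF drugs, then take the topic with the largest mean) can be sketched as follows. This is a minimal illustration under stated assumptions: the drug names and θ values are invented toy data with only three topics, and the function names are hypothetical.

```python
def mean_topic_distribution(doc_topics, reference_drugs):
    """Average the per-document topic probabilities (theta) over the reference drugs."""
    rows = [doc_topics[d] for d in reference_drugs]
    n_topics = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(n_topics)]

def select_reference_topic(doc_topics, reference_drugs):
    """Return the index of the topic with the highest mean theta, plus all means."""
    means = mean_topic_distribution(doc_topics, reference_drugs)
    best = max(range(len(means)), key=means.__getitem__)
    return best, means

# Toy theta values (assumption): one row per drug, one column per topic.
theta = {
    "drug_a": [0.1, 0.7, 0.2],
    "drug_b": [0.2, 0.5, 0.3],
    "drug_c": [0.6, 0.1, 0.3],
    "drug_d": [0.8, 0.1, 0.1],  # not a reference drug, so it is ignored here
}
reference = ["drug_a", "drug_b", "drug_c"]

alf_topic, means = select_reference_topic(theta, reference)
print(alf_topic, [round(m, 3) for m in means])
```

Averaging over the reference set means that no single drug has to have the selected topic as its dominant one; in the toy data, drug_c is dominated by a different topic, yet the second topic still has the highest mean.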

Investigation of drugs implicated by ALF-Topic

To investigate whether there was any evidence to support the ALF-implicated drugs identified by ALF-Topic, we searched for ALF evidence in their "Outcome and management" sections in the LiverTox database, in the safety sections of the FDA-approved drug labels, in literature reporting ALF cases with established causality, and in the FAERS post-marketing ALF case reports from 1969 to 2013. The workflow of this study is depicted in Figure 1.

Figure 1

Workflow of this study. The 26 known ALF drugs are from Suzuki et al. [12]. ALF: acute liver failure; FAERS: FDA Adverse Event Reporting System; Hy's law: a well-accepted rule to incriminate ALF [17].


Identification of ALF-Topic

The study started with the determination of the optimal number of topics for the LiverTox dataset. Models were fitted with the number of topics varying between 10 and 150, and the model with 40 topics gave the highest likelihood of the data (Figure 2). Then, the mean probability value (θ) of each topic was calculated for the 26 known ALF drugs. As shown in Figure 3, Topic-37 had a significantly higher probability value (0.36) for these drugs than the baseline (0.02) from the rest of the topics (p < 0.01). Therefore, this topic was considered an ALF-recognizing/predicting topic and denoted as ALF-Topic.
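The model-selection step can be sketched as a simple search over candidate topic counts. This is only an outline under stated assumptions: `toy_log_likelihood` is a made-up stand-in for fitting an LDA model and evaluating the log-likelihood of the data, its values are not the study's results, and the candidate list is arbitrary.

```python
def choose_num_topics(candidates, log_likelihood_fn):
    """Score each candidate number of topics and return the best one with all scores."""
    scores = {t: log_likelihood_fn(t) for t in candidates}
    best = max(scores, key=scores.get)
    return best, scores

def toy_log_likelihood(num_topics):
    """Stand-in scoring curve (assumption): peaks at 20 topics purely for illustration.

    In practice this function would train a topic model with `num_topics`
    topics and return the log-likelihood of the data given that model.
    """
    return -1000.0 - (num_topics - 20) ** 2

best_t, scores = choose_num_topics([5, 10, 20, 40], toy_log_likelihood)
print(best_t, scores)
```

Passing the scoring function as an argument keeps the search independent of the toolkit used to fit the models.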

Figure 2

Log-likelihood of the data (D) given the model (M) with different settings of the number of topics (T).

Figure 3

Mean probability values of all 40 topics for the 26 known ALF drugs

The following 15 words were the most prevalent in ALF-Topic: "case, acute, hepatic, therapy, serum, pattern, week, clinical, report, jaundice, hepatocellular, patient, typical, severe, aminotransferase". Three of these words (i.e., jaundice, hepatocellular, and severe) were unique to this topic and were not simultaneously present in the first 15 words of any of the other 39 topics. These specific words might imply a clinical phenomenon likely to indicate the potential of a drug to cause ALF. Thus, ALF-Topic could be applied to identify other ALF-related drugs in this model based on the probability values of this topic for those drugs.

Application of ALF-Topic

For 12 (12/26; ~46%) of the known ALF drugs listed in Table 1, ALF-Topic (i.e., Topic-37) was ranked as the first topic, and this proportion was significantly higher than that of the other topics (p < 0.05). For five other drugs (i.e., Benzbromarone, Halothane, Phenytoin, Simvastatin, and Sulfasalazine), ALF-Topic was ranked second, while for Ibuprofen it was ranked third. There were no ALF drugs with ALF-Topic ranked in fourth place. These results suggested that a drug with ALF-Topic ranked among its first three topics might have a high likelihood of being associated with ALF. This criterion (i.e., ALF-associated drugs would have ALF-Topic ranked among their first three topics in the model) was applied to the rest of the drugs in the LiverTox database, and a total of 127 drugs were identified.
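The top-three criterion can be sketched as a rank filter over each drug's topic distribution. The following is a minimal illustration with invented toy data (five topics, four hypothetical drugs); the function names are hypothetical, and ties in θ are not handled specially.

```python
def topic_rank(theta_row, topic):
    """1-based rank of `topic` when topics are sorted by descending probability."""
    order = sorted(range(len(theta_row)), key=lambda j: theta_row[j], reverse=True)
    return order.index(topic) + 1

def implicated_drugs(doc_topics, topic, known, top_n=3):
    """Drugs outside the known set whose `topic` ranks within their top_n topics."""
    return sorted(
        drug
        for drug, row in doc_topics.items()
        if drug not in known and topic_rank(row, topic) <= top_n
    )

# Toy theta values (assumption); topic index 1 plays the role of ALF-Topic.
theta = {
    "drug_a": [0.05, 0.60, 0.20, 0.10, 0.05],  # known reference drug, rank 1
    "drug_b": [0.40, 0.30, 0.15, 0.10, 0.05],  # rank 2, so implicated
    "drug_c": [0.40, 0.10, 0.30, 0.15, 0.05],  # rank 4, so not implicated
    "drug_d": [0.35, 0.20, 0.30, 0.10, 0.05],  # rank 3, so implicated
}
implicated = implicated_drugs(theta, topic=1, known={"drug_a"})
print(implicated)
```

A rank-based cutoff avoids choosing an absolute probability threshold, at the cost of treating a topic ranked third with a low probability the same as one ranked third with a high probability.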

Confirmation of drugs identified by ALF-Topic

Among the identified drugs, 77/127 (60.6%) were described in their "Outcome and management" sections with ALF-related terms such as "liver failure", "hepatic failure", "liver transplantation", "fatal/death", or "fulminant hepatitis" (Additional file 1: Table S1). The remaining 50/127 (39.4%) ALF-implicated drugs were not reported to cause ALF in the LiverTox database (Table 2). We examined the safety sections of the FDA-approved drug labels and found that 13/50 (26%) drugs were described as carrying ALF risk in the Warnings & Precautions and/or Adverse Reactions sections (Table 2). For another 7 drugs (7/50; 14%), drug-induced ALF with established causality was reported in the literature [12–15]. Of the remaining 30 drugs, 22 (22/50; 44%) had ALF case reports in the FAERS (Additional file 2: Table S2), which were obtained by searching the FAERS with the Medical Dictionary for Regulatory Activities (MedDRA) preferred terms "acute hepatic failure" and/or "hepatic failure". Of the 8 drugs left, 4 were herbal medicines (i.e., Aloe Vera, Ba Jiao Lian, Chi R Yun, and Shosaikoto/daisaikoto) that are not recorded in the FAERS, and no ALF case was reported for the other 4 (i.e., Clofibrate, Methocarbamol, Pentamidine, and Reserpine). In summary, among the 127 identified drugs, evidence supporting risk of ALF was found for 119 (119/127; 93.7%).
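The tally behind these percentages can be checked with a short calculation. The counts below are the ones reported in the paragraph above; the variable names and the dictionary layout are only for illustration.

```python
total_identified = 127

# Number of drugs with supporting ALF evidence, by source (counts from the text).
evidence = {
    "LiverTox 'Outcome and management' sections": 77,
    "FDA-approved drug labels": 13,
    "Literature with established causality": 7,
    "FAERS case reports": 22,
}

supported = sum(evidence.values())
unsupported = total_identified - supported        # 4 herbal medicines + 4 drugs
outside_livertox = total_identified - evidence["LiverTox 'Outcome and management' sections"]

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(supported, pct(supported, total_identified))      # 119 93.7
print(outside_livertox, pct(outside_livertox, total_identified))  # 50 39.4
print(unsupported)                                      # 8
```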

Table 2 Summary of 50 drugs implicated by ALF-Topic without apparent ALF evidence in the LiverTox database.


In this proof-of-concept study, topic modeling was demonstrated to be a promising approach for leveraging information from drug safety databases composed of textual data. As a case study, the LiverTox database was explored by topic modeling to discover a hidden pattern for identifying drugs that potentially cause ALF. We deliberately chose to analyze the LiverTox "Hepatotoxicity" section alone so that the findings could be verified against other sections of LiverTox, demonstrating the potential utility of topic modeling in the field of drug safety. Specifically, ALF-Topic was first discovered from the "Hepatotoxicity" sections of the drug documents and interpreted through the prevalence of three specific words (i.e., jaundice, hepatocellular, and severe). This topic was then applied to identify ALF-related drugs in the LiverTox database. Finally, evidence supporting risk of ALF for the identified drugs was sought in the "Outcome and management" sections of LiverTox or, where unavailable there, in other public databases.

ALF-Topic was found to be relevant to the well-known Hy's law [16, 17], which states that drug-induced hepatocellular liver injury accompanied by jaundice has a poor prognosis, with 10–50% fatality from ALF. The predictive power of Hy's law has been verified by the analysis of extensive studies in Spain and Sweden [18, 19], and it has been recommended by the FDA for assessing the potential of a drug to cause severe DILI [20]. In this study, ALF-Topic identified 127 drugs in the LiverTox database, and approximately 60% (77/127) of these drugs were implicated in causing ALF in their "Outcome and management" sections. For the remaining drugs, supporting evidence was found for 20 drugs in the safety sections of their FDA-approved drug labels or in the literature with established ALF causality. ALF case reports were identified in the FAERS for another 22 drugs, 6 of which were predicted to be ALF-positive in an in vitro experiment. While in vitro data might not directly indicate ALF potential in humans, these 6 drugs behaved much more like the ALF-positive control drugs in that experiment. Evidence of ALF from the FAERS should be interpreted cautiously, because causality may not be fully established. For example, although ALF cases involving Phenelzine were reported in the FAERS, it was emphasized that Phenelzine might not be the suspect drug [21]. In addition to possibly overestimating risk, the FAERS only receives reports from the United States. For example, no ALF case was reported for Ethionamide in the FAERS even though it is known to cause ALF in the United Kingdom [22].

For the 127 identified drugs, evidence supporting risk of ALF was found for 119. This result strongly suggests that not only the specific wording but also the probabilistic/statistical relationships among words in the hidden structure of the textual documents were crucial for incriminating drugs for ALF. It is worth pointing out that it is beyond the scope of this exercise to ask ALF-Topic to identify all ALF-related drugs, because ALF mechanisms are complex and the 26 known ALF drugs selected for determining ALF-Topic do not necessarily represent the entire landscape of ALF. For example, a hepatocellular liver injury pattern is not observed for Efavirenz and Dicloxacillin, which have the potential to cause ALF [12]. Atorvastatin and Ethambutol, known as ALF drugs [12], are not reported to cause either jaundice or hepatocellular liver injury in the LiverTox database.


We explored the LiverTox database using topic modeling, and discovered the hidden knowledge to identify drugs with potential to cause ALF. Our proof-of-concept study demonstrates the applicability of topic modeling to leverage information within the textual drug safety databases, which will provide new opportunities for drug safety assessment.



ALF: Acute Liver Failure
DILI: Drug-Induced Liver Injury
FAERS: Food and Drug Administration (FDA) Adverse Event Reporting System
LDA: Latent Dirichlet Allocation
NIDDK: National Institute of Diabetes and Digestive and Kidney Diseases
NIH: National Institutes of Health
NLM: National Library of Medicine
MedDRA: Medical Dictionary for Regulatory Activities


  1. Hoofnagle JH, Serrano J, Knoben JE, Navarro VJ: LiverTox: a website on drug-induced liver injury. Hepatology. 2013, 57 (3): 873-874. 10.1002/hep.26175.

  2. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P: A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. 2010, 6: 343.

  3. Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P: Drug target identification using side-effect similarity. Science. 2008, 321 (5886): 263-266. 10.1126/science.1158140.

  4. Fang H, Su Z, Wang Y, Miller A, Liu Z, Howard PC, Tong W, Lin SM: Exploring the FDA Adverse Event Reporting System to Generate Hypotheses for Monitoring of Disease Characteristics. Clin Pharmacol Ther. 2014, 95 (5): 496-498. 10.1038/clpt.2014.17.

  5. Blei DM, Ng AY, Jordan MI: Latent Dirichlet allocation. J Mach Learn Res. 2003, 3 (4-5): 993-1022.

  6. Bisgin H, Liu Z, Fang H, Xu X, Tong W: Mining FDA drug labels using an unsupervised learning technique--topic modeling. BMC Bioinformatics. 2011, 12 (Suppl 10): S11. 10.1186/1471-2105-12-S10-S11.

  7. Bisgin H, Liu Z, Kelly R, Fang H, Xu X, Tong W: Investigating drug repositioning opportunities in FDA drug labels through topic modeling. BMC Bioinformatics. 2012, 13 (Suppl 15): S6. 10.1186/1471-2105-13-S15-S6.

  8. Hofmann T: Probabilistic Latent Semantic Indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99). 1999, 50-57.

  9. Griffiths TL, Steyvers M: A probabilistic approach to semantic representation. Proceedings of the Twenty-Fourth Annual Conference of Cognitive Science Society. 2002, 381-386.

  10. Griffiths TL, Steyvers M: Finding scientific topics. Proc Natl Acad Sci USA. 2004, 101 (Suppl 1): 5228-5235.

  11. McCallum AK: MALLET: a machine learning for language toolkit. 2002.

  12. Suzuki A, Andrade RJ, Bjornsson E, Lucena MI, Lee WM, Yuen NA, Hunt CM, Freston JW: Drugs associated with hepatotoxicity and their reporting frequency of liver adverse events in VigiBase: unified list based on international collaborative work. Drug Saf. 2010, 33 (6): 503-522. 10.2165/11535340-000000000-00000.

  13. Reuben A, Koch DG, Lee WM, Acute Liver Failure Study G: Drug-induced acute liver failure: results of a U.S. multicenter, prospective study. Hepatology. 2010, 52 (6): 2065-2076. 10.1002/hep.23937.

  14. Mindikoglu AL, Magder LS, Regev A: Outcome of liver transplantation for drug-induced acute liver failure in the United States: analysis of the United Network for Organ Sharing database. Liver Transpl. 2009, 15 (7): 719-729. 10.1002/lt.21692.

  15. Polson J, Lee WM, American Association for the Study of Liver D: AASLD position paper: the management of acute liver failure. Hepatology. 2005, 41 (5): 1179-1197. 10.1002/hep.20703.

  16. Zimmerman HJ: Hepatotoxicity: the adverse effects of drugs and other chemicals on the liver. 1999, Philadelphia, Lippincott Williams & Wilkins, 2

  17. Reuben A: Hy's law. Hepatology. 2004, 39 (2): 574-578. 10.1002/hep.20081.

  18. Andrade RJ, Lucena MI, Fernandez MC, Pelaez G, Pachkoria K, Garcia-Ruiz E, Garcia-Munoz B, Gonzalez-Grande R, Pizarro A, Duran JA: Drug-induced liver injury: an analysis of 461 incidences submitted to the Spanish registry over a 10-year period. Gastroenterology. 2005, 129 (2): 512-521. 10.1016/j.gastro.2005.05.006.

  19. Bjornsson E, Olsson R: Outcome and prognostic markers in severe drug-induced liver disease. Hepatology. 2005, 42 (2): 481-489. 10.1002/hep.20800.

  20. U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research (CDER), Center for Biologics Evaluation and Research (CBER): Guidance for industry drug-induced liver injury: premarketing clinical evaluation. 2009.

  21. Gomez-Gil E, Salmeron JM, Mas A: Phenelzine-induced fulminant hepatic failure. Ann Intern Med. 1996, 124 (7): 692-693.

  22. Hollinrake K: Acute hepatic necrosis associated with ethionamide. Br J Dis Chest. 1968, 62 (3): 151-154. 10.1016/S0007-0971(68)80006-3.


This research was funded by HHS/FDA/NCTR (to WT). We thank Reagan Kelly, Qiang Shi, Yuping Wang, and Hong Fang for comments and discussion. KY is grateful to the National Center for Toxicological Research (NCTR) of the U.S. Food and Drug Administration (FDA) for postdoctoral support through the Oak Ridge Institute for Science and Education (ORISE). KI worked at NCTR during her summer sabbatical, which was supported through ORISE and by the Ministry of Education and Science, Republic of Serbia (project 175064, 2011-2014).


This study and its publication were fully financed by the U.S. Food and Drug Administration (E0721511).

The views presented in this article do not necessarily reflect current or future opinion or policy of the US Food and Drug Administration. Any mention of commercial products is for clarification and not intended as endorsement.

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 17, 2014: Selected articles from the 2014 International Conference on Bioinformatics and Computational Biology. The full contents of the supplement are available online at

Author information



Corresponding authors

Correspondence to Katarina Ilic or Weida Tong.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

WT and KI designed and supervised this study. KY and XX performed text mining experiments. KY, KI, MC, AS, and WT contributed to the writing of this manuscript. JZ performed the in vitro experiment.

Electronic supplementary material


Additional file 1: Table S1. Summary of 77 ALF-implicated drugs specified to cause ALF in the LiverTox database. (DOC 153 KB)


Additional file 2: Table S2. Number of ALF case reports by searching acute hepatic failure and/or hepatic failure in the FAERS for the 30 drugs implicated by the ALF-Topic. (DOC 52 KB)

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit

The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Yu, K., Zhang, J., Chen, M. et al. Mining hidden knowledge for drug safety assessment: topic modeling of LiverTox as a case study. BMC Bioinformatics 15 (Suppl 17), S6 (2014).
