Application of dynamic topic models to toxicogenomics data
© The Author(s). 2016
Published: 6 October 2016
All biological processes are inherently dynamic. Biological systems evolve transiently or sustainably according to sequential time points after perturbation by environment insults, drugs and chemicals. Investigating the temporal behavior of molecular events has been an important subject to understand the underlying mechanisms governing the biological system in response to, such as, drug treatment. The intrinsic complexity of time series data requires appropriate computational algorithms for data interpretation. In this study, we propose, for the first time, the application of dynamic topic models (DTM) for analyzing time-series gene expression data.
A large time-series toxicogenomics dataset was studied. It contains over 3144 microarrays of gene expression data corresponding to rat livers treated with 131 compounds (most are drugs) at two doses (control and high dose) in a repeated schedule containing four separate time points (4-, 8-, 15- and 29-day). We analyzed, with DTM, the topics (consisting of a set of genes) and their biological interpretations over these four time points. We identified hidden patterns embedded in this time-series gene expression profiles. From the topic distribution for compound-time condition, a number of drugs were successfully clustered by their shared mode-of-action such as PPARɑ agonists and COX inhibitors. The biological meaning underlying each topic was interpreted using diverse sources of information such as functional analysis of the pathways and therapeutic uses of the drugs. Additionally, we found that sample clusters produced by DTM are much more coherent in terms of functional categories when compared to traditional clustering algorithms.
We demonstrated that DTM, a text mining technique, can be a powerful computational approach for clustering time-series gene expression profiles with the probabilistic representation of their dynamic features along sequential time frames. The method offers an alternative way for uncovering hidden patterns embedded in time series gene expression profiles to gain enhanced understanding of dynamic behavior of gene regulation in the biological system.
All biological processes including perturbation-responses are inherently dynamic. Investigating the temporal behavior of these dynamic processes is an important part of biological research. With the advancement of technology and reduction in cost, study of time-series gene expression has become routine . The objectives of these types of research cannot be achieved without appropriate computational algorithms and methods. For example, a targeted perturbation like drug treatment activates or inhibits certain molecules in the cellular system in a transient or sustained manner; however, if we ignore these intrinsic dynamics of molecular changes due to lack of analysis techniques, we may miss out critical biological findings.
To analyze time-series gene expression profiles, several approaches have been used which can be divided into two classes. One of the classes is conventional clustering algorithms such as hierarchical, k-means clustering and self-organizing maps, which do not consider any dependencies between temporally successive profiles. In other words, even if we permute the order of time points, the results of these algorithms would not change. Additionally, another drawback of these approaches is the mutual exclusiveness of genes with respect to their involvement in biological processes responding to exposure. The second class of approaches is the clustering algorithms primarily designed to analyze time-series expression. For example, Aach and Church introduced dynamic time warping algorithm for the alignment of expression profiles in different time series . Schliep introduced Hidden Markov Model widely used in speech recognition to consider time dependencies along sequential timeline of time series gene expression data . These algorithms have been constantly under improvement [4–6]. Ramoni modeled the dynamics of genes by autoregressive equation and genes with the highest posterior probability from the same autoregressive equation were clustered together . Additionally, there are software packages to analyze time-series gene expression data such as CAGED , STEM  and so on. CAGED applies regression analysis to cluster genes on the basis of their trajectories over multiple time points, while STEM first defines a set of representative temporal profiles, then assigns genes to one of several predefined temporal trajectories. These methodologies are more focused on clustering of genes showing similar expression patterns over time without an explicit consideration of sample-gene-time relationship.
In this study, we propose dynamic topic model (DTM) as a novel approach to cluster time-series gene expression profiles. DTM was originally developed by Blei to analyze the time evolution of topics in large document collections in the field of text mining . DTM is an extension of Latent Dirichlet Allocation (LDA). LDA and other similar text mining methodologies haven been successfully applied to gene expression data analysis, some from our group [10–12], based on the similarity in data structure between text and gene expression. For example, Manuele et al. applied two different topic modeling approaches, PLSA (Probabilistic Latent Semantic Analysis) and LDA, to cancer classification using gene expression profiles . Patrick et al. used a modified topic modeling technique to cluster drugs and genes . Bing et al. applied a correspondence LDA model to discover microRNA regulated modules by identifying the microRNA and mRNA co-occurring frequently within the same latent variable . Yu et al. applied topic modeling to discover functional modules with a biologically meaningful interpretation in RNA-Seq toxicogenomics data .
However, to the best of our knowledge, DTM has not been explored as an applicable method for the analysis of time-series gene expression profiles. It extended the static LDA to take into consideration time evolution that existed in a real document collection. Static LDA assumes that documents from the same set of topics are exchangeable, the probability of which is invariant to permutation. However, that assumption completely ignores one significant variable, i.e., time, that is present in documents organized according to sequential time where the topics evolve over time. DTM assumes that the topics associated with time, t, evolve from the topics associated with the previous time, t-1. It treats documents as mixtures of topics in which words are represented by the probability distribution across all time points. This representation can be analogously applied to biological systems since a biological component (topic) is different between the initial unstable stage, which is shortly after drug treatment, and the steady stage into which cellular system enters after a certain time period. In other words, the acting genes tend to evolve over time, consistent with the dynamic behavior of cellular and molecular effects after drug exposure changes over time. The advantage of using DTM is that it is a soft clustering technique which does not assume mutual exclusivity and permits multiple topic assignment with a probabilistic way to the same sample and gene, reflecting true biological complexity.
Here, we applied DTM to a set of time-series gene expression profiles generated by the Japanese Toxicogenomics Project  to investigate the dynamic feature of gene expression following drug treatment. Our study was built on the assumption that gene expression profiles resemble a set of documents, where each gene expression profile (a document) consists of mixtures of biological processes (that can be thought of as topics), and each biological process in turn consists of a set of genes (that can be thought of as the words used to present a topic). Based on the estimated latent topics, we clustered samples based on topic distributions followed by clustering of genes according to their contributions to each topic. Finally, the results provided us with a wide range of biological insights into how the genes’ activities are changed over time upon drug-perturbation.
Materials and methods
The Japanese Toxicogenomics Project generated large-scale gene expression profiles for the same compounds tested in rat livers and kidneys as well as using both rat and human primary hepatocytes . The datasets are organized in TG-GATEs. In its first phase, 131 compounds were profiled, most of which are drugs. Each compound was tested on three different assay platforms (i.e., in vitro assay, in vivo repeated dose study and in vivo single dose experiment) with a design including multiple doses and time points. In this study, we only utilized in vivo repeated dose experiments in which three doses (low, medium and high) and four time points (4, 8, 15, 29 days) were tested for 131 compounds. Among all dose levels, we selected high-dose repeated treatment under all-time points from rat livers, which consists of 3144 arrays (=131 compounds × 3 replicates × 4 timepoints × 2 doses). Further information about TG-GATEs can be found in Uehara et al. . The TG-GATEs dataset used in this study was downloaded from CAMDA 2013 (http://dokuwiki.bioinf.jku.at/doku.php/start).
Gene expression data processing
The probe-level microarray data were quantile normalized followed with mapping of a probe set into its corresponding genes , then multiple probes were summarized into one corresponding gene’s intensity ratio using FARMS . Next, we generated a “document” for each drug-time condition; it contained “words”, each represented genes differentially expressed by comparing the treated against the matched control. Thus, each document was then represented by its differentially expressed genes (DEGs) (words). A total of 514 documents (131 compounds × 4 time points) were generated; each document was tagged with each time point, such as 4, 8, 15 and 29 days. A total of 12,088 genes were present in this experiment. We considered the same gene with different transcriptional directions (i.e., up and down) as two different words, leading to a corpus of 24,176 words. The frequency of a word appearing in each document was determined by multiplying 100 to the fold change of DEGs.
All variants of LDA are probabilistic, which usually involves a series of processes to determine the optimal parameters to maximize the posterior probability of the observed data. In static LDA, α and βk are the Dirichlet prior parameters on the topic distributions over document and the word distribution over topic k, respectively. Different from a static LDA, DTM adopts logistic normal distribution for two prior distributions (topic per document and word per topic) and hence is more complex compared to static LDA, which assures conjugacy between prior and posterior distributions. Specifically, static LDA assumes that the words of each document are independently drawn from a mixture of multinomial. However, this implicit assumption of independency is not appropriate, because the topic (a set of words) in a document collection evolves over time. Our goal is to explicitly address the dynamics of the underlying topics as a function of sequential time. DTM provides a solution to this problem by assuming that topics at time i evolved from the topics at time i-1 with the reflection of real organization of document collections. DTM assumes that the data is divided by time slice, modeling the documents of each slice with a static topic model, where the topics associated with slice t evolve from the topics associated with slice t – 1. In a static LDA model, it assumes that the topic-specific word distributions are drawn from a Dirichlet distribution. However, DTM does not assume Dirichlet distribution to approximate posterior inference, the word distributions over multiple time points are chained by Gaussian distribution. Due to the nonconjugacy of the Gaussian and multinomial models, Blei applies variation approximations such as Kalman filters and nonparametric wavelet regression to approximate posterior inference.
In this study, the open-source DTM C++ package was applied from the author’s website (https://www.cs.princeton.edu/~blei/topicmodeling.html). The modeling results include two different distributions: multinomial distribution over topics for each document and multinomial distributions over words for each time point associated with each topic. In our analysis, the number of topics was heuristically determined by closely examining two hyperparameters, alpha and top_chain_var which defines the number of topics. Specifically, alpha controls the shape of the topic distribution of a sample. A smaller alpha results in each document to be more probabilistically associated with fewer topics. The top_chain_var determines how similar topics would be over multiple time points. A smaller top_chain_var leads to similar word distributions over multiple time points. In our study, we have tested several parameter settings for alpha and top_chain_var and found that the varied values do not have a significant effect on our interpretation of the sample clustering results and topic distribution over time points. Thus, choose the default value of (alpha = 0.01, top_chain_var = 0.005) and, at this condition, we feel that the choice of 20 topics is sufficient to balance between extreme generalization of the model and maximizing the chance of an informative discovery.
Clustering samples and genes
After building a probabilistic model for our observed temporal DEGs using DTM, two distributions (matrix) were generated: topic distribution over document and a series of word distributions over multiple time points for each topic. The former includes the conditional probability of each topic given a sample, P(T|D). This probability is a signature of the sample, which can be used to assess sample similarities. The latter represents the conditional probability of each gene given a topic at a particular time point, P(W|T) time , indicating which genes are important to a given topic in a particular time point. As we have four time points (4, 8, 15 and 29 days), four different P(W|T) were obtained, i.e., P(W|T) 4days , P(W|T) 8days , P(W|T) 15days , and P(W|T) 29days . First, to group documents, each document was assigned to the topic with the largest conditional probability value of P(T|D). The other distribution, P(W|T) time was used for clustering genes. Since DTM is designed to cluster words co-occurring frequently across whole documents, the genes with a high rank in the same topic are likely involved in the same biological process. To take advantage of this information, functional pathway analysis was performed for each topic using the Fisher’s exact test with data from Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/), and Gene Ontology (GO) (http://geneontology.org/). Above all, the strongest benefit of DTM is its ability to monitor the behavior of the genes at the given time points and thus aid in investigating the significantly active genes at each time point. Each gene was ranked for each time point, and functional analysis was conducted for the top 300 genes according to their rank.
Results and discussion
This study consists of several steps: (1) generation of documents of DEG lists for each compound at each time point; (2) Building a generative probabilistic model using DTM to maximize the posterior probability of observed temporal DEGs; (3) Assignment of the topic with largest conditional probability value to each compound-time condition; (4) ranking DEGs according to their conditional probability of each topic and assessment of topic evolution over time (4) topic analysis in the biological context. From these procedures, we obtained two outputs, one of them is P(T|D), the topic distributions for a given compound-time condition, and the other is P(W|T) time , a series of distributions over genes at multiple time points for each topic.
Study of topics
DTM provides the distribution of topics for a given compound-time condition, P(T|D). This can be used for the assessment of the association between a specific condition and a specific topic. We used this statistical probability to group the conditions by connecting them with topics. These results are provided in Additional file 1: Table S1 that includes Mode of Action (MoA) and therapeutic category information for the 131 drugs. We found that drugs with the same MoA category tend to be highly associated with the same topic. Specifically, all of the eight drug-time condition associated with topic 5 are PPAR α agonists (WY-14643 and fenofibrate) at all four time points, indicating that these drugs have consistent DEGs (vocabularies) across the whole time points and that the PPARα agonist action is not time sensitive compared to other compounds. Other than topic 5, topic 1 is also found to be associated with PPARα agonists (clofibrate and gemfibrozil) at all four time points. However, unlike topic 5, topic 1 also includes several other drugs that are not PPARα agonists, such as amiodarone, aspirin, bendazac, benzbromarone, chloramphenicol and simvastatin. Benbromarone is not a PPARα agonist; but it is known to have a high binding affinity for PPARα, showing potential as a PPARα agonist . It has also been reported that the adverse effect of amiodarone is related to expression of PPARα target genes, implying the possibility of PPARα as one of its off-targets . In addition to these two drugs, the relationship between PPARα and aspirin and simvastatin has also been studied [22, 23]. Another example is topic 17, which showed high selectivity, 68 % (21 / 31) for COX inhibitors, including non-steroidal anti-inflammatory drugs such as ibuprofen, mefenamic acid, phenylbutazone, sulindac, diclofenac, naproxen, nimesulide and indomethacin. We also found that some of the topics are associated with only a single drug. For instance, all the samples over four time points of ethambutol, thioacetamide and ethionine were assigned topic 4, 12 and 15, respectively, showing their distinct drug effect with less sensitivity to time.
Associating topics with functional pathways
Each topic is composed a set of genes. Genes are ranked according to the probability of topic-gene matrix. The table shows the top 10 genes at the time point of 4 days
Assessment of word evolution over time
Comparison with other bi-clustering methods
To supplement the drawbacks of traditional clustering algorithms, DTM was explored as one of the model based algorithms for the analysis of time-series gene expression data, which has not been applied previously to our best knowledge. Therefore, our study is significant as a pilot study that explores the feasibility of applying a text mining approach to time-series biological datasets. As a result of our investigation, we identified hidden patterns embedded in time series toxicogenomics. From the topic distribution for each document, a number of drugs were successfully clustered by their shared MoA, for example, PPARɑ agonists and COX inhibitors. The biological meaning underlying each topic was interpreted using diverse sources of information such as functional analysis of the pathways and therapeutic uses of the drugs, which could provide a better understanding of drug perturbation mechanisms. Additionally, we found that sample clusters produced by DTM are much more coherent in terms of functional categories than the ones from traditional clustering algorithms. Above all, time specific activity distribution according to each sequential time provided tremendous opportunity to uncover the underlying toxicological dynamic changes. In summary, our study found that DTM has several distinct advantages. Firstly, it can reduce data dimension very effectively in terms of the latent variable (i.e., topic), with the assumption of time-dependency present in toxicogenomics. Secondly, it also allows samples and genes to be associated with multiple topics in an intuitive probabilistic manner without the mutual exclusivity assumption, reflecting the complexity of real biological system. Most importantly, topic dynamics over time relapse could provide new biological insights into the evolution of gene regulation.
This work was supported by the U.S. Food and Drug Administration and the National Center for Advancing Translational Sciences, National Institutes of Health. The views expressed in this article are those of the authors and do not necessarily reflect the statements, opinions, views, conclusions, or policies of the National Center for Advancing Translational Sciences, National Institutes of Health, U.S. Food and Drug Administration, or the United States government. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 13, 2016: Proceedings of the 13th Annual MCBIOS conference. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-13.
Publication of this article was funded by ORISE postdoc program at FDA.
Availability of data and materials
The dataset supporting the conclusions of this article is available in Open TG-GATEs (http://toxico.nibiohn.go.jp/english/index.html) and could be downloaded from CAMDA 2013 (http://dokuwiki.bioinf.jku.at/doku.php/start).
ML created idea and concept of this manuscript, performed data analysis, and wrote the manuscript completely. WT, ZL and RH contributed to the data analysis, verified the calculations, and assisted with writing the manuscript. WT developed the original idea and guided the data analysis and presentation of results. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Bar-Joseph Z, Gitter A, Simon I. Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet. 2012;13(8):552–64.View ArticlePubMedGoogle Scholar
- Aach J, Church GM. Aligning gene expression time series with time warping algorithms. Bioinformatics. 2001;17(6):495–508.View ArticlePubMedGoogle Scholar
- Schliep A, Schönhuth A, Steinhoff C. Using hidden Markov models to analyze gene expression time course data. Bioinformatics. 2003;19 suppl 1:i255–63.View ArticlePubMedGoogle Scholar
- Smith AA, Vollrath A, Bradfield CA, Craven M. Similarity queries for temporal toxicogenomic expression profiles. PLoS Comput Biol. 2008;4(7):e1000116.View ArticlePubMedPubMed CentralGoogle Scholar
- Yoneya T, Mamitsuka H. A hidden Markov model-based approach for identifying timing differences in gene expression under different experimental factors. Bioinformatics. 2007;23(7):842–9.View ArticlePubMedGoogle Scholar
- Zhao C, Hua J, Bittner ML, Ivanov I, Dougherty ER. Identifying mechanistic similarities in drug responses. Bioinformatics. 2012;28(14):1902–10.View ArticlePubMedGoogle Scholar
- Ramoni MF, Sebastiani P, Kohane IS. Cluster analysis of gene expression dynamics. Proc Natl Acad Sci. 2002;99(14):9121–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Ernst J, Bar-Joseph Z. STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics. 2006;7(1):1–11.View ArticleGoogle Scholar
- Blei DM, Lafferty JD. Dynamic topic models. In: Proceedings of the 23rd international conference on Machine learning; 1143859. Pittsburgh: ACM; 2006. p. 113–20.Google Scholar
- Lee M, Huang RL, Tong WD. Discovery of transcriptional targets regulated by nuclear receptors using a probabilistic graphical model. Toxicol Sci. 2016;150(1):64–73.View ArticlePubMedGoogle Scholar
- Chung MH, Wang YP, Tang HL, Zou W, Basinger J, Xu XW, Tong WD. Asymmetric author-topic model for knowledge discovering of big data in toxicogenomics. Front Pharmacol. 2015;6:81.Google Scholar
- Lee M, Liu ZC, Kelly R, Tong WD. Of text and gene - using text mining methods to uncover hidden knowledge in toxicogenomics. BMC Syst Biol. 2014;8:93.Google Scholar
- Bicego M, Lovato P, Oliboni B, Perina A. Expression microarray classification using topic models. In: Proceedings of the 2010 ACM Symposium on Applied Computing; 1774415. Sierre: ACM; 2010. p. 1516–20.Google Scholar
- Flaherty P, Giaever G, Kumm J, Jordan MI, Arkin AP. A latent variable model for chemogenomic profiling. Bioinformatics. 2005;21(15):3286–93.View ArticlePubMedGoogle Scholar
- Liu B, Liu L, Tsykin A, Goodall GJ, Green JE, Zhu M, Kim CH, Li J. Identifying functional miRNA–mRNA regulatory modules with correspondence latent dirichlet allocation. Bioinformatics. 2010;26(24):3105–11.View ArticlePubMedPubMed CentralGoogle Scholar
- Yu K, Gong B, Lee M, Liu Z, Xu J, Perkins R, Tong W. Discovering functional modules by topic modeling RNA-seq based toxicogenomic data. Chem Res Toxicol. 2014;27(9):1528–36.View ArticlePubMedGoogle Scholar
- Uehara T, Ono A, Maruyama T, Kato I, Yamada H, Ohno Y, Urushidani T. The Japanese toxicogenomics project: application of toxicogenomics. Mol Nutr Food Res. 2010;54(2):218–27.View ArticlePubMedGoogle Scholar
- Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005;33(20):e175.View ArticlePubMedPubMed CentralGoogle Scholar
- Hochreiter S, Clevert D-A, Obermayer K. A new summarization method for affymetrix probe level data. Bioinformatics. 2006;22(8):943–9.View ArticlePubMedGoogle Scholar
- Kunishima C, Inoue I, Oikawa T, Nakajima H, Komoda T, Katayama S. Activating effect of benzbromarone, a uricosuric drug, on peroxisome proliferator-activated receptors. PPAR Res. 2007;2007:36092.Google Scholar
- McCarthy TC, Pollak PT, Hanniman EA, Sinal CJ. Disruption of hepatic lipid homeostasis in mice after amiodarone treatment is associated with peroxisome proliferator-activated receptor-α target gene activation. J Pharmacol Exp Ther. 2004;311(3):864–73.View ArticlePubMedGoogle Scholar
- Seo M, Inoue I, Ikeda M, Nakano T, Takahashi S, Katayama S, Komoda T. Statins activate human PPARalpha promoter and increase PPARalpha mRNA expression and activation in HepG2 cells. PPAR Res. 2008;2008:316306.Google Scholar
- Yiqin Y, Meilin X, Jie X, Keping Z. Aspirin inhibits MMP-2 and MMP-9 expression and activity through PPARα/γ and TIMP-1-mediated mechanisms in cultured mouse celiac macrophages. Inflammation. 2009;32(4):233–41.View ArticlePubMedGoogle Scholar
- Zanger UM, Schwab M. Cytochrome P450 enzymes in drug metabolism: regulation of gene expression, enzyme activities, and impact of genetic variation. Pharmacol Ther. 2013;138(1):103–41.View ArticlePubMedGoogle Scholar
- Poulsen LC, Siersbæk M, Mandrup S. PPARs: fatty acid sensors controlling metabolism. Semin Cell Dev Biol. 2012;23(6):631–9.View ArticlePubMedGoogle Scholar
- Lin Y, Li Y, Zhu Y, Zhang J, Li Y, Liu X, Jiang W, Yu S, You X-F, Xiao C, et al. Identification of antituberculosis agents that target ribosomal protein interactions using a yeast two-hybrid system. Proc Natl Acad Sci. 2012;109(43):17412–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Dongol B, Shah Y, Kim I, Gonzalez FJ, Hunt MC. The acyl-CoA thioesterase I is regulated by PPARα and HNF4α via a distal response element in the promoter. J Lipid Res. 2007;48(8):1781–91.View ArticlePubMedGoogle Scholar
- Pritt ML, Hall DG, Recknor J, Credille KM, Brown DD, Yumibe NP, Schultze AE, Watson DE. Fabp3 as a biomarker of skeletal muscle toxicity in the rat: comparison with conventional biomarkers. Toxicol Sci. 2008;103(2):382–96.View ArticlePubMedGoogle Scholar
- Bergmann S, Ihmels J, Barkai N. Iterative signature algorithm for the analysis of large-scale gene expression data. Phys Rev E Stat Nonlin Soft Matter Phys. 2003;67(3 Pt 1):031902.View ArticlePubMedGoogle Scholar
- Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics. 2008;24(5):719–20.View ArticlePubMedGoogle Scholar