 Proceedings
 Open Access
Topic modeling for cluster analysis of large biological and medical datasets
BMC Bioinformatics volume 15, Article number: S11 (2014)
Abstract
Background
The big data moniker is nowhere better deserved than in describing the ever-increasing size and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and high-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets.
Results
In this study, three topic model-derived clustering methods (highest probable topic assignment, feature selection, and feature extraction) are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the effectiveness of clustering on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths.
Conclusion
Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods (highest probable topic assignment, feature selection, and feature extraction) yield clustering improvements for the three different data types. The resulting clusters represent true groupings and subgroupings in the data more effectively than those from traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
Background
Recent advances in biotechnology have generated massive amounts of biological and medical data for disease diagnosis and prognosis, toxicity prediction for unknown compounds, and pathogen identification in outbreak investigations. Identifying pattern and structure among a large number of samples and/or their associated variables requires more powerful statistical methods and data mining techniques. For example, genomic microarray and proteomic technologies are often used to identify genes and proteins that have similar functionality, in order to understand biological processes or identify new biomarkers for targeted therapy [1–6]. Data mining techniques have been developed to classify patients into distinct subgroups for treatment assignment by identifying sets of genomic markers of individual patients. In food safety surveillance, PulseNet, managed by the Centers for Disease Control and Prevention (CDC) (http://www.cdc.gov/pulsenet), has been using pulsed-field gel electrophoresis (PFGE) for the source tracking of foodborne pathogens [7–9]. PulseNet has collected more than 350,000 profiles of over 2,000 Salmonella serotypes. The fingerprint of an isolate is characterized by the presence or absence of bands at designated locations in PFGE analysis. Classification models have been developed to characterize and identify the serotype of isolates in outbreak investigations from the analysis of PFGE fingerprints [8]. The FDA Adverse Event Reporting System (FAERS) database is the primary database for postmarketing safety surveillance of all approved drugs and therapeutic biologic products. The FAERS database consists of over 5,000 drugs and over 16,000 reported adverse events. Data mining methods have been proposed to detect signals of unexpected occurrences in FAERS [10–12].
A dataset can be expressed as a two-way data matrix with rows representing samples and columns representing the measured variables that characterize the corresponding samples. A large dataset may have a large number of samples, such as the PFGE dataset of Salmonella or other foodborne pathogens [8, 9], or a large number of variables, such as a microarray dataset [13, 14]. Analyzing large amounts of multivariate data to discover hidden patterns and the relationships between them presents major challenges in both analysis methodology and data interpretation.
Cluster analysis is a commonly used data mining technique to explore the relationships among attributes, among samples, and between attributes and samples. Clustering algorithms assign samples or attributes to clusters based on their similarity. Cluster analysis can be used as a preliminary method for classification or for finding new classes. Hierarchical clustering tree (HCT) [15] and k-means [16] are the two most popular clustering methods. HCT sequentially merges the most similar cluster sub-nodes, resulting in a tree-like dendrogram. K-means is the most commonly used nonhierarchical clustering algorithm. In k-means clustering, samples are divided into k partitions, or clusters, based on a measure of similarity. Unlike hierarchical clustering, the number of clusters in a k-means analysis must be specified a priori. Simulation studies have shown that k-means and other nonhierarchical clustering algorithms perform poorly when random initial seeds are used; their performance improves when the results from hierarchical methods are used to form the initial partition [17]. Thus, hierarchical and nonhierarchical techniques should be applied as complementary rather than competing clustering techniques.
Topic modeling algorithms are statistical methods that analyze the words of documents to discover the themes that pervade a large collection of documents [18]. The basic idea of topic modeling is that a document is a mixture of latent topics and each topic is expressed by a distribution of words. Latent Dirichlet Allocation (LDA) is the most popular topic modeling method in the field of text mining. LDA is an enhanced version of earlier models [19, 20] and uses two Dirichlet-multinomial distributions to model the relationships between documents and topics and the relationships between topics and words. The output of LDA provides two probability matrices: 1) the (posterior) probability distribution of each document over the topics, and 2) the probability distribution of words in a given topic. LDA analysis commonly uses approximate methods, such as variational inference [21] or Markov chain Monte Carlo (MCMC) [22], to calculate the posterior probabilities. The calculated probability matrices are used to make inferences about the topics and documents for text mining. LDA has been shown to be an effective tool for text mining of large datasets [23, 24], and computational software is freely available [25].
In this study, we propose applying LDA topic modeling to the cluster analysis of large datasets. Three different datasets were selected to represent various types of large biological or medical datasets. These datasets were transformed into document files on which the LDA algorithm was run, and two matrices were generated for each dataset. Three different cluster analysis methods were then applied to the topic model-derived data matrices of the three datasets, and the most accurate method for each type of dataset was determined. The applications of the topic model to various large datasets provide new approaches to improve the accuracy and efficacy of subgroup identification and data mining.
Materials and methods
Datasets
Three large datasets were used to evaluate the proposed approaches in this study. The first dataset was the Salmonella PFGE genotyping data from the CDC [8, 9]. It included 41,232 PFGE profiles of Salmonella outbreak-related isolates. The dataset contained the 20 most common Salmonella serotypes, with about 2,000 isolates for each of the 20 serotypes. Each profile used 1/0 to represent the presence/absence of the electrophoresis bands, and each of the 41,232 profiles was represented by 60 designated band locations in the dataset. As a standard typing method used in Salmonella outbreak investigations, PFGE has been used by many laboratories to determine strain relatedness and confirm an outbreak of a bacterial disease by comparing the band profiles [8, 9]. The serotype information of each profile was considered as the true label to evaluate the clustering results.
The second dataset was the public lung cancer microarray dataset originally collected from the Gene Expression Omnibus [14, 26]. The dataset consisted of 111 lung cancer samples comprising 53 adenocarcinoma and 58 squamous cell carcinoma subtypes. Each sample was expressed by 54,613 continuous-valued variables. The subtype of each sample was considered as the true label to evaluate the clustering results.
The third dataset was the breast cancer microarray dataset originally collected by van 't Veer et al. [13]; it comprised 24,481 continuous-valued gene expression variables from 97 patients. In this work, the data of the patient with "ID54" were removed from the dataset because 10,896 (about 44.5%) of its variables were missing. Variables with incomplete data were also removed from the dataset. The final dataset consisted of 96 patients with 21,907 genomic variables. Although there were no true labels for the samples in the breast cancer dataset, we used survival analysis [27] to evaluate the clustering results for this dataset.
Data preprocessing
In this step, each isolate/sample was transformed into one document and all documents constituted one corpus. For the Salmonella dataset, the PFGE bands were viewed as the words. Each isolate had its corresponding document consisting of the bands present, which had value 1 in the PFGE dataset. After the data preprocessing, the corpus of the Salmonella dataset contained 41,232 documents, where each document contained at most 60 words.
In both the lung and breast cancer microarray datasets, the expression value for each variable (gene) was binarized against its median value: 0 if smaller than the median, or 1 if greater than or equal to the median. Each sample was transformed into one document. The variables with value 1 were considered as the words in the documents. The final corpus of the lung cancer dataset contained 111 documents and each document contained at most 54,613 words. The final corpus of the breast cancer dataset contained 96 documents and each document contained at most 21,907 words.
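The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names `binarize_by_median` and `to_documents`, the toy expression matrix, and the gene names are all hypothetical.

```python
from statistics import median

def binarize_by_median(matrix):
    """Map each value to 1 if it is >= the median of its variable (column), else 0."""
    n_vars = len(matrix[0])
    medians = [median(row[j] for row in matrix) for j in range(n_vars)]
    return [[1 if row[j] >= medians[j] else 0 for j in range(n_vars)]
            for row in matrix]

def to_documents(binary_matrix, names):
    """Turn each sample into a document listing the variables whose value is 1."""
    return [[names[j] for j, v in enumerate(row) if v == 1]
            for row in binary_matrix]

# Toy expression matrix: 3 samples (rows) x 3 hypothetical genes (columns).
expr = [[0.2, 5.0, 1.1],
        [0.9, 4.0, 1.0],
        [0.5, 6.0, 3.0]]
genes = ["g1", "g2", "g3"]

docs = to_documents(binarize_by_median(expr), genes)
for doc in docs:
    print(" ".join(doc))
```

For the PFGE data the profiles are already binary, so only the `to_documents` step would apply, with band locations playing the role of the gene names.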
Topic modeling
For a given dataset, topic modeling with LDA is used to model the relationships between samples and variables. LDA assumes that the dataset is generated by the following process [21]:

1. For each topic k (k∈{1,...,K}), pick a multinomial distribution φ_{k} from a Dirichlet distribution with hyperparameter β.
2. For each sample s (s∈{1,...,S}, where S is the number of samples in the dataset), pick a multinomial distribution θ_{s} from a Dirichlet distribution with hyperparameter α. Then, for each word position n (n∈{1,...,N}, where N is the number of words in the current document):
a) Pick a topic z from a multinomial distribution with parameter θ_{s}.
b) Pick a word w_{n} from a multinomial distribution with parameter φ_{z}.

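Based on the generative process above, the probability of a given dataset D = {D_{1},..., D_{S}} is formalized as

$$p(D \mid \alpha, \beta) = \int \prod_{k=1}^{K} p(\varphi_k \mid \beta) \prod_{s=1}^{S} \int p(\theta_s \mid \alpha) \prod_{n=1}^{N} \sum_{z=1}^{K} p(z \mid \theta_s)\, p(w_n \mid \varphi_z)\, d\theta_s\, d\varphi$$

By maximizing this probability, LDA derives the posterior distributions of θ (the matrix in Figure 1a) and φ (the matrix in Figure 1b), which are used for cluster analysis in our study.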
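To make the generative process concrete, the following sketch samples a small synthetic corpus from it. The function names, the fixed document length, the symmetric hyperparameter values, and the toy vocabulary are assumptions made for illustration. The code only simulates data; it performs no inference and is unrelated to the implementation used in this study.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw one symmetric Dirichlet vector by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(n_samples, n_topics, vocab, doc_len, alpha, beta, seed=0):
    rng = random.Random(seed)
    # Step 1: one word distribution phi_k per topic.
    phi = [sample_dirichlet(rng, beta, len(vocab)) for _ in range(n_topics)]
    theta, docs = [], []
    for _ in range(n_samples):
        # Step 2: one topic distribution theta_s per sample.
        theta_s = sample_dirichlet(rng, alpha, n_topics)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta_s)[0]  # step a): topic
            w = rng.choices(vocab, weights=phi[z])[0]             # step b): word
            words.append(w)
        theta.append(theta_s)
        docs.append(words)
    return theta, phi, docs

vocab = ["band1", "band2", "band3", "band4"]  # hypothetical vocabulary
theta, phi, docs = generate_corpus(
    n_samples=5, n_topics=2, vocab=vocab, doc_len=6, alpha=1.0, beta=0.5
)
```

Here `theta` plays the role of the sample-topic matrix and `phi` the role of the topic-word matrix (stored with one row per topic); inference runs this process in reverse, estimating both from the documents alone.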
The LDA program implemented in MALLET [25] was applied for topic modeling. In MALLET, Gibbs sampling [24], a special case of the MCMC approach, was used to calculate the two matrices in Figure 1. The number of Gibbs sampling iterations was set to 2,000, and all other parameters were left at their MALLET default values in all calculations.
As shown in Figure 1, the sample-topic matrix (Figure 1a) depicts the distribution of the topics in the documents (samples). Each row represents one document as a mixture of topics. Each entry gives the probability of the corresponding topic in the document, and the probabilities in each row sum to 1. The topic-word matrix (Figure 1b) depicts the distribution of the words in a given topic. Each column of the matrix gives the probability distribution of the words in one topic, and the probabilities in each column sum to 1.
Cluster analysis methods
Three topic model-derived clustering methods were proposed based on the two LDA-derived matrices.
1. Topic model-derived clustering based on highest probable topic assignment
This method, called "highest probable topic assignment", was based on the sample-topic matrix (Figure 1a). The LDA-derived topics were taken as the clusters of the dataset, and each sample was assigned to the cluster (topic) with the highest probability in its row of the sample-topic matrix.
2. Topic model-derived clustering based on feature extraction
In this method, LDA was used as a feature extraction approach for cluster analysis. The LDA-derived topics were considered as the new features of the datasets, and the sample-topic matrix (Figure 1a) was treated as a new representation of the original dataset. Conventional clustering algorithms, such as k-means and hierarchical clustering, were then applied to the sample-topic matrix.
3. Topic model-derived clustering based on feature selection
In this method, the topic-word matrix (Figure 1b) was used for feature selection. The words with high probabilities in each LDA-derived topic were selected to express the dataset. This generated a reduced dataset containing only the selected words (variables), on which conventional clustering could be conducted. In this study, the 50 highest-probability words were chosen in each topic.
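The three methods can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code used in the study: all function names and toy values are hypothetical, the k-means routine is a basic Lloyd's algorithm with random initial centers, and the topic-word matrix is stored with one row per topic (the transpose of the layout in Figure 1b).

```python
import random

def assign_highest_topic(sample_topic):
    """Method 1: index of the most probable topic in each row (ties go to the lowest index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in sample_topic]

def kmeans(points, k, n_iter=100, seed=0):
    """Method 2 helper: basic Lloyd's k-means on the rows of `points`."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def select_top_words(topic_word, vocab, n_top):
    """Method 3: union of the n_top most probable words of every topic, in vocabulary order."""
    chosen = set()
    for probs in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda j: probs[j], reverse=True)
        chosen.update(ranked[:n_top])
    return [vocab[j] for j in sorted(chosen)]

# Toy sample-topic matrix (6 samples x 2 topics) and topic-word matrix (2 topics x 4 words).
theta = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
         [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
topic_word = [[0.5, 0.3, 0.1, 0.1],
              [0.05, 0.05, 0.6, 0.3]]
vocab = ["g1", "g2", "g3", "g4"]

hard = assign_highest_topic(theta)                     # method 1
km = kmeans(theta, k=2)                                # method 2: cluster the topic features
selected = select_top_words(topic_word, vocab, n_top=1)  # method 3: reduced variable set
```

In method 3 the selected words define which columns of the original data matrix are kept before conventional clustering; because topics may share top words, the union contains at most (number of topics × n_top) variables.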
Hierarchical cluster analysis
For each of the 30 clusters, the average PFGE band presence (value of 1) or absence (value of 0) over all isolates in the cluster was calculated at each of the 60 designated band locations, giving the characteristic mean of the cluster. The complete-link hierarchical clustering algorithm was then applied to the Euclidean distances between these characteristic means to investigate the relationships among the 30 clusters.
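As a rough sketch of this procedure, the code below computes the characteristic mean of each cluster and then performs a naive complete-link agglomeration on the Euclidean distances between those means. The function names, the cluster labels, and the four-band toy profiles are hypothetical, and the quadratic search is meant only for a small number of clusters such as the 30 used here.

```python
import math

def cluster_means(profiles, labels):
    """Average the 0/1 band vectors of the isolates in each cluster."""
    groups = {}
    for profile, lab in zip(profiles, labels):
        groups.setdefault(lab, []).append(profile)
    return {lab: [sum(col) / len(rows) for col in zip(*rows)]
            for lab, rows in groups.items()}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def complete_link_merges(means):
    """Repeatedly merge the two groups whose farthest members are closest."""
    groups = [[lab] for lab in sorted(means)]
    merges = []
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # Complete link: distance between groups is the largest pairwise distance.
                d = max(euclidean(means[a], means[b])
                        for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((tuple(groups[i]), tuple(groups[j]), d))
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return merges

# Toy data: five isolates with four band locations, already assigned to three clusters.
profiles = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
labels = ["T0", "T0", "T1", "T1", "T2"]

means = cluster_means(profiles, labels)
merges = complete_link_merges(means)
```

The sequence in `merges` corresponds to the order in which branches would join in a dendrogram, with the third element of each tuple giving the merge height.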
Survival analysis
For the breast cancer dataset, survival analysis was performed with the survival package in R, based on the obtained clusters (groups) and the survival time information of the samples. Specifically, the function "survfit" was used to generate the Kaplan-Meier curves [28] for the patients in each cluster, and the function "survdiff" was used for the log-rank test [29] of differences among clusters.
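For readers unfamiliar with the Kaplan-Meier curve, the product-limit estimate behind it can be sketched as below. This is a hypothetical Python illustration with made-up follow-up times, not the R code used in the study, and it omits confidence intervals and the log-rank test.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed event.

    events[i] is 1 if the event was observed at times[i] and 0 if the
    observation was censored at that time.
    """
    survival, curve = 1.0, []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve

# Toy follow-up data for one cluster (times in arbitrary units).
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Computing one such curve per cluster and plotting them together gives the kind of comparison that the per-cluster survival curves provide.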
Normalized mutual information (NMI)
Normalized mutual information [30] was utilized to evaluate the clustering results. NMI is an external validation metric to evaluate the quality of clustering result with respect to the given true labels of the datasets. If random variable Z' denotes the cluster assignments of instances in obtained clustering result, and random variable Z denotes the true class labels, then NMI is defined as follows:
where I(Z';Z) = H(Z) − H(Z|Z') is the mutual information between the random variables Z' and Z, H(Z) is the Shannon entropy of Z, and H(Z|Z') is the conditional entropy of Z given Z' [31]. NMI values range from 0 to 1. In general, the larger the NMI value, the better the clustering quality.
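A small Python sketch of an NMI computation is given below for illustration. The mutual information and entropy follow the definitions above; the normalizer, the geometric mean of H(Z') and H(Z), is one common convention and is an assumption here, as are the function names and toy labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(pred, true):
    """I(Z';Z) computed from the joint and marginal label counts."""
    n = len(pred)
    joint = Counter(zip(pred, true))
    n_pred, n_true = Counter(pred), Counter(true)
    return sum((c / n) * math.log(c * n / (n_pred[a] * n_true[b]))
               for (a, b), c in joint.items())

def nmi(pred, true):
    """Mutual information divided by sqrt(H(Z') * H(Z)) (assumed normalizer)."""
    h_pred, h_true = entropy(pred), entropy(true)
    if h_pred == 0 or h_true == 0:
        return 0.0  # a single cluster or a single class carries no information
    return mutual_information(pred, true) / math.sqrt(h_pred * h_true)

true_labels = ["A", "A", "S", "S"]
perfect = nmi([0, 0, 1, 1], true_labels)      # clusters match the classes exactly
unrelated = nmi([0, 1, 0, 1], true_labels)    # clusters independent of the classes
```

Cluster identifiers only need to be consistent within a result; NMI does not depend on how clusters are numbered.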
Results
In this study, three large datasets representing different types of large biological or medical data were selected to illustrate the applications of topic modeling for cluster analysis. The original datasets were transformed into document files, from which the LDA algorithm generated two matrices for each of the three datasets. Three different topic model-derived clustering methods were applied to the LDA-derived matrices from the three large datasets. After comparing the results (data not shown here), the best-fitting cluster analysis method for each dataset was selected on the basis of biological accuracy.
Analysis of Salmonella PFGE dataset
Topic model-derived clustering based on highest probable topic assignment.
The topic model-derived clustering based on highest probable topic assignment yielded the most accurate classification results for the Salmonella PFGE dataset, as compared to the other two topic model-derived clustering methods (Table S1 in Additional file 1). The LDA algorithm was run on the 41,232 PFGE profiles of 20 serotypes with 30 topics (Table 1). The 30 topics, representing 30 clusters, were labelled with the serotypes of the dominant isolates in the clusters (first column in Table 1). The percentage of the most dominant serotype in each of the 30 clusters was also calculated (fourth column in Table 1). In 24 out of 30 clusters, the percentage of the most dominant serotype was greater than 75%. The exceptions were cluster T8, labelled as serotype Muenchen with 36.60%; T6 and T20, labelled as serotype Typhimurium var. 5; and T0, T21 and T24, labelled as Typhimurium.
To further investigate the relationships between the 30 clusters, complete-link hierarchical clustering analysis was conducted on the Euclidean distances between the characteristic means of the 30 clusters (Materials and methods). In the resulting dendrogram (Figure 2), most of the clusters labelled with the same serotype grouped together, such as the two clusters of Braenderup (T9 and T14), two clusters of Enteritidis (T2 and T11), three clusters of Typhimurium (T0, T21, and T24), two clusters of 4,5,12:i (T3 and T28), two clusters of Saintpaul (T12 and T29), and two clusters of Typhimurium var. 5 (T6 and T20). The only exception was the two clusters of Paratyphi B (T10 and T26, highlighted in red), which fell into two different branches, indicating that the serotype Paratyphi B might have two subtypes.
Analysis of the lung cancer dataset
Topic model-derived clustering based on feature selection.
The topic model-derived clustering based on selection of the highest-probability features emerged as the best-fitting method for the lung cancer dataset after comparison with the other two methods (Table 2). In this method, a prespecified fixed number of words (features) with the highest probability in each topic were selected. The best results were obtained when the number of topics was set to 2 and the number of features per topic to 50. Under this parameter setting, the 54,613 variables of each of the 111 samples in the lung cancer dataset were reduced to 100 selected features by the LDA algorithm. The selected genes are listed in Table S3 in Additional file 1. K-means was then applied to the 100 features for the clustering analysis. The results were compared with two conventional methods, k-means and PCA [32]. The conventional k-means algorithm was applied directly to the original 54,613 continuous-valued variables, while the PCA method [32] was first used to reduce the original 54,613 variables to 10 features, followed by the k-means algorithm. Table 3 compares results for k set to 2, 3 and 4.
The fourth and fifth columns in Table 3 give the numbers of samples of the two subtypes, adenocarcinoma and squamous cell carcinoma, in each cluster, respectively. Each cluster was labelled with the subtype of the majority of its samples. Two criteria, the number of misclassified samples and normalized mutual information (NMI) [30], were used to evaluate the clustering results. NMI was used to compare the clustering result obtained with the true clusters in the dataset. The larger the NMI, the better the clustering result. The results in Table 3 show that the proposed topic model-derived clustering using the feature selection method yields the best clustering on both criteria, as compared to the two conventional cluster analysis methods. Specifically, for k set to 4 in k-means, our proposed method gives the best clustering results, with only 18 samples misclassified.
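The first criterion, the number of misclassified samples under majority labelling, can be sketched as follows. The function name and the toy assignments are hypothetical; "A" and "S" merely stand in for the two subtypes.

```python
from collections import Counter

def count_misclassified(cluster_ids, true_labels):
    """Label each cluster by its majority class and count samples outside that majority."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return sum(sum(counts.values()) - max(counts.values())
               for counts in by_cluster.values())

# Toy example: 7 samples in 2 clusters.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
true_labels = ["A", "A", "S", "S", "S", "S", "A"]
n_wrong = count_misclassified(cluster_ids, true_labels)
```

Because several clusters may receive the same majority label, this count remains meaningful when k exceeds the number of true subtypes, as with k set to 3 or 4.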
Analysis of the breast cancer dataset
Topic model-derived clustering based on feature extraction.
The 96 patients from the breast cancer dataset were best clustered using the method based on feature extraction. The number of topics was set to 10 for feature extraction; k-means was then applied to the 10 derived features.
Survival analysis was conducted to evaluate the clustering results. For k = 3, the 96 breast cancer patients were clustered into three groups, G1 (n = 20), G2 (n = 36), and G3 (n = 40). The Kaplan-Meier survival curves of the three clusters are shown in Figure 3. The p-value of the log-rank test for the differences among the three groups was 0.000174 (Materials and methods). However, the survival curves of G2 and G3 were similar, and the p-value of the log-rank test between G2 and G3 was 0.645, indicating a nonsignificant difference in survival between these two patient groups.
For k = 2, the 96 patients were clustered into G1 (n = 24) and G2 (n = 72). The patients in G3 from the 3-means cluster analysis (Figure 3) were divided into two parts: four patients were grouped into G1 (Figure 4) and 36 patients were grouped into G2 (Figure 4). Figure 4 shows two distinguishable survival curves for G1 and G2, where the p-value of the log-rank test is 4.6e-5, indicating a significant difference between the two groups. From the results for the 96 breast cancer patients, we can conclude that G1 is the higher-risk group.
Discussion
Due to advances in biotechnology and information technology, biological and medical datasets are growing rapidly in size and complexity and, consequently, are becoming increasingly difficult to process and analyze using traditional data mining methods. Multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables. Reducing the dimension of a large dataset to a few clusters makes it possible to use standard statistical tools for all subsequent analyses.
Data mining has been used as a process for discovering hidden knowledge and unexpected patterns, particularly optimal clusters and interesting irregularities in large databases. Topic modeling is an active research field in machine learning and has been widely used as an analytical tool to interpret large datasets in text mining [19–21, 24] and image retrieval [23, 33]. Here we applied topic modeling in a different way: to reduce the dimensions of large datasets and thereby yield more effective clustering analysis of various biological and biomedical data.
We have proposed three topic model-derived clustering methods and evaluated their effectiveness on datasets from three different application fields. For each dataset, one method yielded better results than the others (Table 3; Tables S1 and S2 in Additional file 1). The topic model-derived clustering based on highest probable topic assignment used the LDA-derived topics as the clusters, and the samples in the dataset were assigned to the clusters according to the highest probabilities. This method was found appropriate for data with a large number of samples but a small number of variables, and with no causal relationships between the variables, such as the Salmonella PFGE dataset. The results of this method on the PFGE dataset, shown in Table 1 and Figure 2, not only reflected biological understanding in concordance with previous results [8, 9], but also revealed some hidden patterns and interesting irregularities (see Results). Most of the serotypes were distinguishable and were represented by different topics. The low percentage of the serotype Muenchen in T8 reflected the biological fact that the PFGE patterns of serotype Muenchen were not unique and were very similar to those of other serotypes in topic T8. The five clusters (T0, T6, T20, T21, and T24) labelled as Typhimurium or Typhimurium var. 5 had less than 70% of the most dominant serotype, consistent with the fact that the serotypes 4,5,12:i and Typhimurium var. 5 are variants of Typhimurium and that isolates of the three serotypes share similar PFGE patterns [9, 34]. The two clusters of Paratyphi B (T10 and T26) separated into two distant sub-branches in the dendrogram of Figure 2, indicating the existence of hidden subtypes of the serotype.
The lung cancer and breast cancer datasets represent typical high-dimensional microarray data with thousands of genes measured for each sample. For this type of high-dimensional data, the proposed methods of topic model-derived clustering based on feature selection and on feature extraction yielded more accurate results than the method based on highest probable topic assignment (Tables 2, 3 and S2, Figures 3 and 4). In these two methods, the LDA algorithm effectively reduced the high dimensionality of the original datasets to a small number of features, from which the subsequent traditional clustering algorithms were able to generate more accurate results. Both methods are appropriate for use on high-dimensional datasets, such as microarray datasets. The differences between the two methods arise from the fact that the method based on feature extraction works on the sample-topic matrix, while the method based on feature selection works on the topic-word matrix. Therefore, the selection of the most appropriate method also depends on the research application.
The goal of personalized medicine requires stratifying subgroups of disease to tailor treatment to the individual characteristics, needs, and preferences of a patient subgroup during all stages of care, including prevention, diagnosis, treatment, and follow-up. There were two subtypes in the lung cancer dataset, adenocarcinoma and squamous cell carcinoma. Patients with different lung cancer subtypes need different therapies in clinical treatment. The proposed method of topic model-derived clustering based on feature selection yielded more effective clustering results on this dataset than the other two topic model-derived methods (Table 2), as well as the two conventional clustering methods, k-means and PCA (Table 3). The two topics obtained by LDA were considered as representatives of the two subtypes of lung cancer. The method of topic model-derived clustering based on highest probable topic assignment, in which only one topic is used to describe the differences between samples, may not be appropriate for microarray datasets with tens of thousands of genes per sample. In the proposed method of topic model-derived clustering based on feature selection, the 50 genes with the highest probability were selected to represent each topic, and the genes from the two topics together reduced the dimensionality from 54,613 variables to 100 selected genes. Cluster analysis performed much better on the dimension-reduced dataset than the other methods in segregating lung cancer patients into the two subtypes (Table 3). The selected genes for each topic (Table S3 in Additional file 1) will be further analysed for subtype prediction and pathway identification.
Since no subgroup information is available for the breast cancer dataset, we sought to determine whether there are hidden relationships in the dataset. The proposed method based on feature extraction worked on the sample-topic matrix and gave the best clustering results among the three proposed methods (Table S2 in Additional file 1). In personalized medicine, prognostic predictors (biomarkers) are identified to predict the overall course of disease outcome for treatment recommendation. The clinical endpoint of the breast cancer dataset is patient survival time. For this endpoint, prognostic biomarker signatures typically classify patients into a high-risk group and a low-risk group. The high-risk group would be recommended to receive more aggressive treatment, and the low-risk group to receive standard treatment or no treatment. The results obtained in this study yield potential prognostic predictors for treatment selection (Figures 3 and 4).
Conclusions
Topic modeling could be beneficially applied to various large datasets from biological or medical research areas. Each of the three proposed topic model-derived clustering methods (highest probable topic assignment, feature selection, and feature extraction) yielded the best clustering results for a distinct type of dataset. The application of the topic modeling approach to cluster analysis of large datasets can greatly improve the accuracy and efficacy of subgroup identification, and the three proposed methods provide new approaches for data mining of large datasets in biological and biomedical research.
Disclaimer
The findings and conclusions in this article have not been formally disseminated by the US Food and Drug Administration (FDA) and should not be construed to represent the FDA determination or policy.
References
 1.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531537. 10.1126/science.286.5439.531.
 2.
Director's Challenge Consortium for the Molecular Classification of Lung A, Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ: Gene expressionbased survival prediction in lung adenocarcinoma: a multisite, blinded validation study. Nature medicine. 2008, 14 (8): 822827. 10.1038/nm.1790.
 3.
Woodcock J: The prospects for "personalized medicine" in drug development and drug therapy. Clinical pharmacology and therapeutics. 2007, 81 (2): 164169. 10.1038/sj.clpt.6100063.
 4.
Avigan MI: Pharmacogenomic biomarkers of susceptibility to adverse drug reactions: just around the corner or pie in the sky?. Personalized Medicine. 2009, 6 (1): 6778. 10.2217/17410541.6.1.67.
 5.
Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E: PGC1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature genetics. 2003, 34 (3): 267273. 10.1038/ng1180.
 6.
Tsai CA, Chen JJ: Multivariate analysis of variance test for gene set analysis. Bioinformatics. 2009, 25 (7): 897903. 10.1093/bioinformatics/btp098.
 7.
Kotetishvili M, Stine OC, Kreger A, Morris JG, Sulakvelidze A: Multilocus sequence typing for characterization of clinical and environmental Salmonella strains. Journal of clinical microbiology. 2002, 40 (5): 16261635. 10.1128/JCM.40.5.16261635.2002.
 8.
Zou W, Chen HC, Hise KB, Tang H, Foley SL, Meehan J, Lin WJ, Nayak R, Xu J, Fang H: Metaanalysis of pulsedfield gel electrophoresis fingerprints based on a constructed Salmonella database. PloS one. 2013, 8 (3): e5922410.1371/journal.pone.0059224.
 9.
Zou W, Lin WJ, Hise KB, Chen HC, Keys C, Chen JJ: Prediction system for rapid identification of Salmonella serotypes based on pulsedfield gel electrophoresis fingerprints. Journal of clinical microbiology. 2012, 50 (5): 15241532. 10.1128/JCM.0011112.
 10.
O'Neill RT, Szarfman A: Some US Food and Drug Administration perspectives on data mining for pediatric safety assessment. Current Therapeutic Research. 2001, 62 (9): 650663. 10.1016/S0011393X(01)800710.
 11.
Harpaz R, Perez H, Chase HS, Rabadan R, Hripcsak G, Friedman C: Biclustering of adverse drug events in the FDA's spontaneous reporting system. Clinical pharmacology and therapeutics. 2011, 89 (2): 243250. 10.1038/clpt.2010.285.
 12.
Chen HC, Tsong Y, Chen JJ: Data mining for signal detection of adverse event safety data. Journal of biopharmaceutical statistics. 2013, 23 (1): 146160. 10.1080/10543406.2013.735780.
 13.
van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415 (6871): 530-536. 10.1038/415530a.
 14.
Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research. 2002, 30 (1): 207-210. 10.1093/nar/30.1.207.
 15.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.
 16.
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nature genetics. 1999, 22 (3): 281-285. 10.1038/10343.
 17.
Panel on Discriminant Analysis, Classification, and Clustering: Discriminant analysis and clustering. Statistical Science. 1989, 4 (1): 34-69.
 18.
Blei DM: Probabilistic Topic Models. Communications of the ACM. 2012, 55 (4): 77-84. 10.1145/2133806.2133826.
 19.
Papadimitriou CH, Tamaki H, Raghavan P, Vempala S: Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences. 2000, 61 (2): 217-235. 10.1006/jcss.2000.1711.
 20.
Hofmann T: Probabilistic latent semantic indexing. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1999, 50-57. 10.1145/312624.312649.
 21.
Blei DM, Ng AY, Jordan MI: Latent Dirichlet Allocation. Journal of Machine Learning Research. 2003, 3: 993-1022.
 22.
Jordan MI: Learning in Graphical Models. 1999, MIT Press, Cambridge, MA
 23.
Blei DM, Jordan MI: Modeling annotated data. The Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2003, 127-134.
 24.
Griffiths TL, Steyvers M: Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America. 2004, 101 (suppl. 1): 5228-5235.
 25.
McCallum AK: MALLET: A Machine Learning for Language Toolkit. 2002, [http://mallet.cs.umass.edu]
 26.
Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A: Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2006, 439 (7074): 353-357. 10.1038/nature04296.
 27.
Singh R, Mukhopadhyay K: Survival analysis in clinical trials: Basics and must know areas. Perspectives in clinical research. 2011, 2 (4): 145-148. 10.4103/2229-3485.86872.
 28.
Kaplan EL, Meier P: Nonparametric estimation from incomplete observations. Journal of the American statistical association. 1958, 53 (282): 457-481. 10.1080/01621459.1958.10501452.
 29.
Harrington DP, Fleming TR: A class of rank test procedures for censored survival data. Biometrika. 1982, 69 (3): 553-566. 10.1093/biomet/69.3.553.
 30.
Strehl A, Ghosh J, Mooney R: Impact of similarity measures on web-page clustering. Workshop on Artificial Intelligence for Web Search (AAAI 2000). 2000, 58-64.
 31.
Cover TM, Thomas JA: Elements of information theory. 2012, John Wiley & Sons
 32.
Mardia KV, Kent JT, Bibby JM: Multivariate Analysis. 1979, London: Academic Press
 33.
Datta R, Joshi D, Li J, Wang JZ: Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys. 2008, 40 (2): 5. 10.1145/1348246.1348248.
 34.
CDC: National Salmonella Surveillance Annual Data Summary, 2009. 2009, Atlanta, Georgia: US Department of Health and Human Services, CDC
Acknowledgements
This work and its publication were funded by the FDA. Dr. Weizhong Zhao acknowledges the support of a fellowship from the Oak Ridge Institute for Science and Education, administered through an interagency agreement between the U.S. Department of Energy and the U.S. Food and Drug Administration. Dr. Zhao also thanks the National Natural Science Foundation of China for its support (Nos. 61105052, 61202398, 61272295). We are grateful to Ms. Beth Juliar and Dr. Roger Perkins for critical reading of this manuscript.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 11, 2014: Proceedings of the 11th Annual MCBIOS Conference. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S11.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
WZ (Zhao) performed all the calculations and data analysis, and wrote the first draft of the manuscript. WZ and JC developed the methods, had the original idea, and guided the data analysis and presentation of results. WZ and JC collected and generated the data. All authors contributed to data verification, approach evaluation, and assisted with writing the manuscript. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhao, W., Zou, W. & Chen, J.J. Topic modeling for cluster analysis of large biological and medical datasets. BMC Bioinformatics 15, S11 (2014). https://doi.org/10.1186/1471-2105-15-S11-S11
Keywords
 Topic modeling
 cluster analysis
 large biological and biomedical datasets
 data mining