Datasets
Three datasets were used to evaluate the proposed approaches in this study. The first was the Salmonella PFGE genotyping dataset from the CDC [8, 9]. It included 41,232 PFGE profiles of Salmonella outbreak-related isolates, covering the 20 most common Salmonella serotypes with about 2,000 isolates per serotype. Each profile used 1/0 to represent the presence/absence of electrophoresis bands, with each of the 41,232 profiles represented at 60 designated band positions. As a standard typing method in Salmonella outbreak investigations, PFGE has been used by many laboratories to determine strain relatedness and to confirm bacterial disease outbreaks by comparing band profiles [8, 9]. The serotype of each profile was taken as the true label for evaluating the clustering results.
The second dataset was a public lung cancer microarray dataset originally collected from the Gene Expression Omnibus [14, 26]. It consisted of 111 lung cancer samples, comprising 53 adenocarcinoma and 58 squamous cell carcinoma samples. Each sample was described by 54,613 continuous-valued variables. The subtype of each sample was taken as the true label for evaluating the clustering results.
The third dataset was the breast cancer microarray dataset originally collected by van 't Veer et al. [13]; it contained 24,481 continuous-valued gene expression variables for 97 patients. In this work, the patient with "ID54" was removed from the dataset because 10,896 (about 44.5%) of its variables were missing. Variables with missing values were also removed. The final dataset consisted of 96 patients with 21,907 genomic variables. Although there were no true labels for the samples in the breast cancer dataset, survival analysis [27] was used to evaluate the clustering results for this dataset.
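A minimal sketch of this cleaning step, assuming the expression table is held in a pandas DataFrame with patients as rows (indexed by ID) and genes as columns; the toy values and gene names are hypothetical stand-ins for the original files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the expression table: rows = patients (indexed by ID), columns = genes.
# The real dataset has 97 patients x 24,481 genes; the gene names here are hypothetical.
expr = pd.DataFrame(
    {"gene1": [0.2, np.nan, 0.5], "gene2": [1.1, np.nan, -0.3], "gene3": [np.nan, 0.4, 0.9]},
    index=["ID53", "ID54", "ID55"],
)

# Drop the patient whose profile is mostly missing (about 44.5% missing in the real data)
expr = expr.drop(index="ID54")

# Drop every remaining gene (column) that still contains a missing value
expr = expr.dropna(axis=1, how="any")
# For the real dataset this leaves 96 patients and 21,907 variables.
```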
Data preprocessing
In this step, each isolate/sample was transformed into one document, and all documents constituted one corpus. For the Salmonella dataset, the PFGE bands were viewed as the words. Each isolate's document consisted of the bands present in its profile, i.e., those with value 1 in the PFGE dataset. After preprocessing, the corpus of the Salmonella dataset contained 41,232 documents, each with at most 60 words.
For both the lung and breast cancer microarray datasets, the expression value of each variable (gene) was binarized based on its median value: 0 if smaller than the median and 1 if greater than or equal to the median. Each sample was then transformed into one document, with the variables having value 1 treated as the words of the document. The final corpus of the lung cancer dataset contained 111 documents, each with at most 54,613 words; the final corpus of the breast cancer dataset contained 96 documents, each with at most 21,907 words.
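A minimal sketch of this document construction, assuming the data are available as a matrix with samples as rows and bands/genes as named columns; the helper names and toy values are illustrative.

```python
import pandas as pd

def binarize_by_median(expr: pd.DataFrame) -> pd.DataFrame:
    """Binarize each gene by its median: 1 if >= median, else 0 (used for the microarray data)."""
    return (expr >= expr.median(axis=0)).astype(int)

def matrix_to_documents(binary: pd.DataFrame) -> list:
    """Turn each row of a 0/1 matrix into a 'document' listing the names of the columns with value 1."""
    return [" ".join(col for col, val in row.items() if val == 1) for _, row in binary.iterrows()]

# Toy example: two isolates, three bands (names are hypothetical)
toy = pd.DataFrame({"band01": [1, 0], "band02": [1, 1], "band03": [0, 1]}, index=["iso1", "iso2"])
corpus = matrix_to_documents(toy)   # ['band01 band02', 'band02 band03']
# Microarray data would first be passed through binarize_by_median().
```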
Topic modeling
For a given dataset, topic modeling with LDA is utilized to model the relationships between samples and variables. LDA assumes that the dataset is generated by the following process [21]:
1. For each topic k ∈ {1, ..., K}, pick a multinomial distribution φ_k from a Dirichlet distribution with hyperparameter β.
2. For each sample s ∈ {1, ..., S} (where S is the number of samples in the dataset), pick a multinomial distribution θ_s from a Dirichlet distribution with hyperparameter α; then, for each word position in the sample's document:
   a) Pick a topic z from the multinomial distribution with parameter θ_s.
   b) Pick a word w_n (n ∈ {1, ..., N}, where N is the number of words in the current document) from the multinomial distribution with parameter φ_z.
Based on the generative process above, the probability of a given dataset D = {D_1, ..., D_S} is formalized as the likelihood of all S documents under the model, obtained by summing over the latent topic assignments and integrating over the topic distributions θ and φ.
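Written out in the standard smoothed-LDA form implied by this generative process (a reconstruction from the definitions above, with N_s denoting the number of words in document D_s), the corpus probability is

\[
P(D \mid \alpha, \beta)
  = \int \prod_{k=1}^{K} p(\varphi_k \mid \beta)
    \prod_{s=1}^{S} \int p(\theta_s \mid \alpha)
    \prod_{n=1}^{N_s} \sum_{z=1}^{K} p(z \mid \theta_s)\, p(w_{sn} \mid \varphi_z)
    \, d\theta_s \, d\varphi .
\]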
By maximizing this probability, LDA derives the posterior distributions of θ (the matrix in Figure 1a) and φ (the matrix in Figure 1b), which are used for cluster analysis in our study.
The LDA implementation in Mallet [25] was used for topic modeling. In Mallet, Gibbs sampling [24], a special case of the Markov chain Monte Carlo (MCMC) approach, is used to estimate the two matrices in Figure 1. The number of Gibbs sampling iterations was set to 2,000, and all other parameters were kept at their default values in Mallet for all calculations.
As shown in Figure 1, the sample-topic matrix (Figure 1a) depicts the distribution of topics in the documents (samples). Each row represents one document as a mixture of topics, and each entry gives the probability of the corresponding topic in that document; the probabilities in each row sum to 1. The topic-word matrix (Figure 1b) depicts the distribution of words within each topic. Each column gives the probability distribution of the words for one topic; the probabilities in each column sum to 1.
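The study obtained these matrices with Mallet's Gibbs-sampling LDA; as a rough stand-in, the sketch below shows how analogous matrices could be produced with scikit-learn's LatentDirichletAllocation (which uses variational inference rather than Gibbs sampling). The toy corpus and the number of topics K are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# corpus: list of documents built in the preprocessing step (one string per sample)
corpus = ["band12 band17 band33", "band05 band17 band41"]  # toy placeholder documents

# Document-term matrix (samples x words)
X = CountVectorizer().fit_transform(corpus)

# Fit LDA; K is a placeholder for the desired number of topics
K = 5
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topic = lda.fit_transform(X)          # sample-topic matrix: each row sums to 1 (cf. Figure 1a)

# Topic-word distributions: normalize each topic's pseudo-counts so they sum to 1 (cf. Figure 1b)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```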
Cluster analysis methods
Three topic model-derived clustering methods were proposed based on the two LDA-derived matrices.
1. Topic model-derived clustering based on highest probable topic assignment
This method was based on the sample-topic matrix (Figure 1a) and is called "highest probable topic assignment". The LDA-derived topics were taken as the clusters of the dataset, and each sample was assigned to the cluster (topic) with the highest probability in its row of the sample-topic matrix.
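A minimal sketch of this assignment, using a toy sample-topic matrix in place of the LDA output:

```python
import numpy as np

# doc_topic: sample-topic matrix (each row sums to 1); toy example with 3 samples and 2 topics
doc_topic = np.array([[0.80, 0.20],
                      [0.10, 0.90],
                      [0.55, 0.45]])

# Assign each sample to its most probable topic (cluster)
cluster_labels = np.argmax(doc_topic, axis=1)   # -> array([0, 1, 0])
```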
2. Topic model-derived clustering based on feature extraction
In this method, LDA was used as a feature extraction approach for cluster analysis. The LDA-derived topics were considered as new features of the dataset, and the sample-topic matrix (Figure 1a) was treated as a new representation of the original data. Conventional clustering algorithms, such as k-means and hierarchical clustering, were then applied to the sample-topic matrix.
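A sketch of this step, assuming doc_topic is the sample-topic matrix (samples x topics) and using scikit-learn's k-means and agglomerative (hierarchical) clustering; the toy matrix and the number of clusters are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# doc_topic: sample-topic matrix from the fitted LDA model (toy stand-in here)
doc_topic = np.random.default_rng(0).dirichlet(np.ones(5), size=20)   # 20 samples, 5 topics

n_clusters = 2  # placeholder

kmeans_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(doc_topic)
hier_labels = AgglomerativeClustering(n_clusters=n_clusters, linkage="complete").fit_predict(doc_topic)
```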
3. Topic model-derived clustering based on feature selection
In this method, the topic-word matrix (Figure 1b) was used for feature selection. The words with high probabilities in each LDA-derived topic were selected to represent the dataset, producing a reduced dataset containing only the selected words (variables), on which conventional clustering was then conducted. In this study, the top 50 highest-probability words of each topic were selected.
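A sketch of this feature selection, assuming topic_word is the topic-word matrix (topics x words) and X is the document-term matrix; the top-50 cutoff follows the paper, while the toy data and the choice of k-means for the final clustering are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins: 5 topics over 200 words, and a 30-sample binary document-term matrix
rng = np.random.default_rng(0)
topic_word = rng.dirichlet(np.ones(200), size=5)
X = rng.integers(0, 2, size=(30, 200))

# Select the union of the top-50 highest-probability words of every topic
top_n = 50
selected = np.unique(np.argsort(topic_word, axis=1)[:, -top_n:])

# Reduced dataset restricted to the selected words, then conventional clustering
X_reduced = X[:, selected]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
```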
Hierarchical cluster analysis
For each of the 30 clusters, the average of PFGE band presence (value 1)/absence (value 0) over all isolates in the cluster was calculated at the 60 designated band locations, giving the characteristic mean of that cluster. Complete-link hierarchical clustering was then applied to the Euclidean distances between these characteristic means to investigate the relationships among the 30 clusters.
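A sketch of this step with SciPy, assuming profiles is the 0/1 PFGE matrix (isolates x 60 bands) and labels holds each isolate's cluster assignment; both are toy placeholders here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
profiles = rng.integers(0, 2, size=(300, 60))   # toy 0/1 band profiles
labels = rng.integers(0, 30, size=300)          # toy cluster assignments (30 clusters)

# Characteristic mean of each cluster: average band presence/absence over its isolates
cluster_means = np.vstack([profiles[labels == c].mean(axis=0) for c in range(30)])

# Complete-link hierarchical clustering on Euclidean distances between the cluster means
Z = linkage(cluster_means, method="complete", metric="euclidean")
# dendrogram(Z)  # optional visualization of the relationships among the 30 clusters
```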
Survival analysis
For the breast cancer dataset, survival analysis was performed with the survival package in R, based on the obtained clusters (groups) and the survival time information of the samples. Specifically, the function "survfit" was used to generate Kaplan-Meier curves [28] for the patients in each cluster, and the function "survdiff" was used to perform the log-rank test [29] for differences among clusters.
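The analysis above was done with R's survival package (survfit/survdiff); the sketch below shows a roughly equivalent analysis in Python using the lifelines library as a stand-in, with hypothetical column names and toy values for survival time, event status, and cluster label.

```python
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import multivariate_logrank_test

# df: hypothetical table with one row per patient
df = pd.DataFrame({
    "time":    [20, 35, 12, 48, 7, 60],   # survival time (e.g., months)
    "event":   [1, 0, 1, 0, 1, 0],        # 1 = event observed, 0 = censored
    "cluster": [0, 0, 1, 1, 0, 1],        # cluster assignment from the topic model
})

# Kaplan-Meier estimate per cluster (analogous to survfit);
# kmf.plot_survival_function() could be called to draw each curve
for c, grp in df.groupby("cluster"):
    kmf = KaplanMeierFitter().fit(grp["time"], event_observed=grp["event"], label=f"cluster {c}")

# Log-rank test for differences among clusters (analogous to survdiff)
result = multivariate_logrank_test(df["time"], df["cluster"], df["event"])
print(result.p_value)
```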
Normalized mutual information (NMI)
Normalized mutual information (NMI) [30] was used to evaluate the clustering results. NMI is an external validation metric that assesses the quality of a clustering result with respect to the given true labels of a dataset. Let the random variable Z' denote the cluster assignments of the instances in the obtained clustering result and the random variable Z denote the true class labels; NMI is then defined as the mutual information I(Z';Z) normalized by the entropies of Z' and Z, where I(Z';Z) = H(Z) - H(Z|Z') is the mutual information between Z' and Z, H(Z) is the Shannon entropy of Z, and H(Z|Z') is the conditional entropy of Z given Z' [31]. NMI values range from 0 to 1; in general, the larger the NMI value, the better the clustering quality.
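A commonly used form of this normalization, assuming the geometric-mean convention, is

\[
\mathrm{NMI}(Z', Z) = \frac{I(Z'; Z)}{\sqrt{H(Z')\, H(Z)}} .
\]

In practice this quantity can be computed directly, for example with scikit-learn's normalized_mutual_info_score, whose average_method argument should be set to match the chosen normalization (e.g., "geometric" for the form above).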