- Research article
- Open Access
Stepwise iterative maximum likelihood clustering approach
- Alok Sharma^{1, 2, 3}Email author,
- Daichi Shigemizu^{1, 2, 4},
- Keith A. Boroevich^{1},
- Yosvany López^{1, 4},
- Yoichiro Kamatani^{1},
- Michiaki Kubo^{1} and
- Tatsuhiko Tsunoda^{1, 2, 4}Email author
https://doi.org/10.1186/s12859-016-1184-5
© The Author(s). 2016
- Received: 27 November 2015
- Accepted: 12 August 2016
- Published: 24 August 2016
Abstract
Background
Biological/genetic data is a complex mix of various forms or topologies which makes it quite difficult to analyze. An abundance of such data in this modern era requires the development of sophisticated statistical methods to analyze it in a reasonable amount of time. In many biological/genetic analyses, such as genome-wide association study (GWAS) analysis or multi-omics data analysis, it is required to cluster the plethora of data into sub-categories to understand the subtypes of populations, cancers or any other diseases. Traditionally, the k-means clustering algorithm is a dominant clustering method. This is due to its simplicity and reasonable level of accuracy. Many other clustering methods, including support vector clustering, have been developed in the past, but do not perform well with the biological data, either due to computational reasons or failure to identify clusters.
Results
The proposed SIML clustering algorithm has been tested on microarray datasets and SNP datasets. It has been compared with a number of clustering algorithms. On MLL datasets, SIML achieved highest clustering accuracy and rand score on 4/9 cases; similarly on SRBCT dataset, it got for 3/5 cases; on ALL subtype it got highest clustering accuracy for 5/7 cases and highest rand score for 4/7 cases. In addition, SIML overall clustering accuracy on a 3 cluster problem using SNP data were 97.3, 94.7 and 100 %, respectively, for each of the clusters.
Conclusions
In this paper, considering the nature of biological data, we proposed a maximum likelihood clustering approach using a stepwise iterative procedure. The advantage of this proposed method is that it not only uses the distance information, but also incorporate variance information for clustering. This method is able to cluster when data appeared in overlapping and complex forms. The experimental results illustrate its performance and usefulness over other clustering methods. A Matlab package of this method (SIML) is provided at the web-link http://www.riken.jp/en/research/labs/ims/med_sci_math/.
Keywords
- Burkitt Lymphoma
- Cluster Accuracy
- Hill Climbing Algorithm
- Tuning Time
- Support Vector Cluster
Background
In an unsupervised learning procedure, the class label of a training sample is not known and the aim is to partition the data into clusters. The unsupervised learning scheme uses the relationship between samples to perform partitioning. In many biological data (e.g. transcriptome data, genomic data etc.), the number of clusters and class labels are unknown. However, the distribution is sometimes known, which is usually normal. Therefore, it would be an advantage to build a technique that utilizes distance and variance information as it can track clusters with different conformations.
Over last several decades, the k-means clustering algorithm has been used quite significantly in partitioning the biological data. In the most recent multi-omics data analysis tools, like iCluster, and iClusterPlus [1], the underlying clustering method used was also k-means. Some tools in cancer research, like ConsensusCluster (CC) and CCPlus [2, 3], also utilize k-means as one of the common clustering algorithms. Though the k-means clustering algorithm has been extensively applied [4] due to its simplicity and reasonable level of accuracy, it cannot track clusters when samples of different groups are overlapping to each other (i.e., data points of adjacent groups are spread in a way that the groups partly coincide over each other). In biological data, this is sometimes the case, and thereby leads to clusters which may not be accurate. This has a significant implication in biological findings, particularly in cancer subtypes analysis, population stratification analysis in GWAS and multi-omics data analysis. In general, k-means has played a significant role in carrying out analysis for various types of biological data over several years. Since data complexity and quantity are increasing, it is important to develop techniques that can perform clustering by following data topologies.
In the field of unsupervised learning and clustering, several wonderful techniques have emerged. Some of the techniques are briefly summarized here as follows: 1) clustering using some criterion functions e.g. i) sum-of-squared error criterion; ii) related minimum variance criterion, iii) scattering criterion; iv) trace criterion; v) determinant criterion; and, vi) invariant criterion [5, 6]; 2) clustering using iterative optimization [7–9]; 3) hierarchical clustering [10–13]; several hierarchical-based algorithms can be found in the literature; e.g., single-linkage [14], complete-linkage [15], median-linkage [16] and so on. Single linkage (SLink) [14] merges two nearest-neighbor clusters at a time in an agglomerative hierarchical fashion. It uses Euclidean distance to measure the closeness between two clusters (if it is less than an arbitrary threshold). This method is very sensitive to data position, which sometimes creates problem by forming a cluster in a long chain (known as the chaining effect). The complete linkage (CLink) hierarchical approach [15] depends on the farthest-neighbor and reduces the effects of long chains. This technique is also sensitive to outliers. The use of average or median distance could be a way to overcome this sensitiveness. This was done in median linkage (MLink) hierarchical approach [16]; 4) clustering is also performed using Bayes classifier [17–21]; 5) clustering iterative maximum likelihood [22–24]; and, 6) support vector clustering [25–27].
In the recent literature, support vector clustering has gained a lot of attention [26–31]. However, it is expensive in processing time and sometimes fails to find meaningful clusters. In general, clustering methods based on Bayes classifier and maximum likelihood are still the preferred choice compared to support vector clustering for many applications. There are various approaches to implement these clustering methods.
In this paper, we focus on maximum likelihood estimate. There are three ways to implement the maximum likelihood method. 1) Analytic way: likelihood functions are differentiated and equated to zero and the equations are solved to find extrema. The second derivative is then taken to ensure if maxima has reached rather than minima. 2) Grid search: an exhaustive search over a region is conducted to find the parameters that produce largest likelihood. 3) Numerical analysis: an initial value of parameter is used in a hill climbing algorithm or gradient ascent algorithm (e.g. Newton-Rapson, BHHH, DFP) to find the maxima. Maximum likelihood is also estimated via EM algorithm [5, 22, 32–39].
In general, it is impossible to use an analytic approach to find maximum likelihood estimates as the parametric form of data is unknown. Grid search is only possible when the dimensionality of the data is very small. Most of the time, maximum likelihood is computed by a hill climbing algorithm or by the EM algorithm. The potential problem with gradient algorithms is that when likelihood is not differentiable then it is not possible to find gradient to convergence. Considering this, in this paper, we propose a stepwise iterative maximum likelihood (SIML) procedure which does not require derivatives of likelihood functions. It can find all unknown parameters without solving first derivative and second derivatives of likelihood. The experimental results also show promising when compared to many state-of-the-art clustering methods.
Methods
Description of Maximum Likelihood Clustering
where p(x|ω _{ j }, θ _{ j }) is the conditional density, and P(ω _{ j }) is the a priori probability. The log likelihood can be represented by joint density
In the above equations, θ is replaced by unknown mean and covariance parameters for normal distribution case, to yield maximum likelihood estimates. In the literature, parameter θ is iteratively updated to reach the final value \( \widehat{\boldsymbol{\uptheta}} \) using hill climbing algorithms such as the Newton-Raphson method. In general, the computation of first and second derivatives of likelihood is required to find the solution. If the likelihood is differentiable and the a priori probability is non-zero, then convergence can be obtained. However, there is always a possibility of being trapped in a local optima.
Stepwise iterative maximum likelihood method
where L _{ k } is from Eq. 15 or 16.
where μ _{ i } ^{*} , μ _{ j } ^{*} , Σ _{ i } ^{*} and Σ _{ j } ^{*} are means and covariances after the shift.
In order to find the change in log-likelihood functions L _{ i } and L _{ j }, we first introduce the following Lemma.
Lemma 1 If a sample \( \widehat{\mathbf{x}}\in {\chi}_i \) is shifted to cluster χ _{ j } and the changed covariance of χ _{ j } is defined as \( {\varSigma}_j^{*}=\frac{n_j}{n_j+1}{\varSigma}_j+\frac{n_j}{{\left({n}_j+1\right)}^2}\left(\widehat{\mathbf{x}}-{\boldsymbol{\upmu}}_j\right){\left(\widehat{\mathbf{x}}-{\boldsymbol{\upmu}}_j\right)}^{\mathrm{T}} \) then the determinant of Σ _{ j } ^{*} can be given as \( \left|{\varSigma}_j^{*}\right|={\left(\frac{n_j}{n_j+1}\right)}^d\left|{\varSigma}_j\right|\left(1+\frac{1}{n_j+1}{\left(\widehat{\mathbf{x}}-{\boldsymbol{\upmu}}_j\right)}^{\mathrm{T}}{\varSigma}_j^{-1}\left(\widehat{\mathbf{x}}-{\boldsymbol{\upmu}}_j\right)\right) \).
since |c| = c.
Substituting right hand side of Eq. L3 in Eq. L2 proves the Lemma.
and C is same as of Eq. 26.
Stepwise iterative maximum likelihood method procedure
1. Initialization: select initial partitions with means μ _{1}, μ _{2}, …, μ _{ c } and covariance matrices Σ _{1}, Σ _{2}, …, Σ _{ c } | |
2. Loop: Select a sample \( \widehat{\mathbf{x}}\in {\chi}_i \). | |
3. If n _{ i } > 1 then compute | |
4. \( {\delta}_j=\left\{\begin{array}{c}\hfill \varDelta {L}_j,\kern0.75em j\ne i\hfill \\ {}\hfill \varDelta {L}_i,\kern0.75em j=i\hfill \end{array}\right. \) | |
5. Transfer \( \widehat{\mathbf{x}} \) to χ _{ k } if δ _{ k } = max δ _{ j } for all j. | |
6. Update L _{ tot }, μ _{ i }, μ _{ k }, Σ _{ i } and Σ _{ k }. | |
7. If L _{ tot } doesn’t change in n attempts then stop otherwise go to Loop. |
The following sections discuss the characteristic of the SIML method.
Initial settings of the procedure
- 1.
Random initialization: In this scheme, we create c random means around the center of the data. This technique works well when the number of clusters is small. If c is very large then it can miss clusters.
- 2.
K-means initialization: In this scheme, the data is first partitioned into c clusters by using the k-means algorithm. The solution of k-means is used as an initial setting for the SIML method. This method works well even if the number of clusters is large. Most of the time this initialization technique provides good results. However, since the k-means algorithm does not track the data based on covariance information, it has limitations.
- 3.
Initialization based on the solution of c − 1 clusters: The initialization of c clusters is done by using the solution of c − 1 clusters, which would give c − 1 locations. The c th location is the mean of the overall data itself. If only two clusters are required to find, then 2 locations around center of the data is used since the solution of 1-cluster is the center of the data itself.
In this paper, we used all the three schemes for initialization and in general schemes 2 and 3 provide satisfactory results for most of the data conformations.
Numerical stability
Due to numerical difficulties the convergences of an iterative algorithm can be missed (e.g. convergence problem for EM algorithm is discussed in [40]). The problem of numeral difficulties is of particular issue when data dimensionality is high. In this situation, iterative algorithms sometimes do not converge properly. This problem usually appears due to the small numerical values of the covariance matrix. If the eigenvalues of a covariance matrix Σ are small, then its determinant can give a value close to zero due to the fixed point architecture of the hardware. However, this problem can be easily overcome by first conducting eigenvalue decomposition of Σ and using the summation of the logarithm of eigenvalues. It is described as follows:
This will ensure that \( 1-\frac{P}{n_i+1+\epsilon }>0 \).
Small sample size case
When the dimensionality d is much greater than the number of samples n (d ≫ n) then small sample size problem appears [41–44]. Let a sample set χ = {x _{1}, x _{2}, …, x _{ n }} be drawn independently and let the mean and covariance of χ be denoted by μ and Σ, respectively. In the normal density we have a term P = (x − μ)^{T} Σ ^{− 1}(x − μ) to compute which cannot be solved due to singular covariance matrix as its inverse does not exist. A simple extension could be to use the pseudo-inverse of Σ (denoted here as Σ ^{+}). However, this doesn’t solve the problem. If samples x are from χ then P ^{+} = (x − μ)^{T} Σ ^{+}(x − μ) will always be equal to the rank of Σ or basically n − 1 (for d ≫ n). This means that all the samples in a particular cluster would have the same probability and it would not be possible to justify classification of samples based on probability. A second way would be to regularize Σ, however, computing optimal regularization parameter could be a challenging task. One way could be to apply principal component analysis (PCA) procedure on a d -dimensional sample set χ ∈ ℝ^{ d } to transform it to a parsimonious sample set Y ∈ ℝ^{ h } where h < min(d, n) h < min(d, n). Thereafter, the clustering procedure can be performed.
Determination of the number of clusters
It is potentially important to estimate the number of clusters c present in the sample set. Since this information is usually not provided, it is important to obtain the value of c with whatever information we have at hand. Basically, the only information we have is the sample itself. In the maximum likelihood procedure we compute likelihood from sample set. Therefore, this information can be utilized to estimate the number of clusters. In order to evaluate c, we can run the SIML algorithm for a range of clusters e.g. 1 ≤ c ≤ K to see at what point the likelihood function stabilizes or reaches maximum. In this paper, we investigated two ways to compute c. In the first way, we compute the maximum log-likelihood MaxL _{ tot }(c) achieved for all values of c ∈ [1, K]. At a particular value of c the MaxL _{ tot } reaches maximum and does not change much. This would be the estimated value of c. In the second way, we compute the difference between the maximum log-likelihood MaxL _{ tot } achieved and the first value of L _{ tot } after SIML procedure (excluding the initial L _{ tot } value computed from initial settings as this value is based on the first random guess). Therefore, for a particular number of cluster c, we will get this difference likelihood and we denote it as DelL _{ tot } which is equal to MaxL _{ tot } − L _{ tot }(1) or max(L _{ tot }) − L _{ tot }(1), where L _{ tot }(r) defines the value of L _{ tot } at an iteration r. The curve of DelL _{ tot } as a function of c would give a peak at some value of c which would be its best value. In most of the data conformations, MaxL _{ tot } gives reliable results. Nonetheless, both the graphs of MaxL _{ tot } and DelL _{ tot } (as a function of c) are illustrated in the experimental section of the paper.
Results
In order to evaluate the algorithm, we carried out experiments on normal Gaussian data as well as on biological data. We divide this section into 5 subsections. In subsection 1, we illustrate the performance of various methods using three cluster case. Subsection 2 indulges on maximum likelihood plots as a function of number of clusters. In subsection 3, we discuss the processing time of the algorithm. In subsection 4, we discuss the performance in terms of clustering accuracy and rand score of various methods; and, in subsection 5 (parts I and II), we discuss SIML on biological data.
An illustration using three clusters
Likelihood plots
Figure 3a depicts L _{ tot } plot for 3 clusters. When a sample is moved from one cluster to another cluster the value of L _{ tot } is updated. This is an increasing function and the maximum value of L _{ tot } is defined as MaxL _{ tot } in this paper.
Figure 3b depicts MaxL _{ tot } plot. Since in general, the number of cluster c information is unknown, it is therefore crucial to estimate this value. In this paper we showed that by using MaxL _{ tot } plot and DelL _{ tot } plot, it is possible to estimate c. For this, one can provide a range of c values and the value for which MaxL _{ tot } curve converges (reaches highest peak or does not change much) is the estimated c. We use the same data we generated in Fig. 2a and provide 10 values of c as 1 ≤ c ≤ 10. It can be seen from MaxL _{ tot } plot that it converges or peaks at c = 3.
Processing time
Clustering on artificial data
Clustering on real data-I (publically available biological data)
In this section, we utilized various biological data and reported clustering accuracy and rand score. We employed several methods such as k-means, SLink, CLink, MLink and mclust for comparison. The description of biological data is given as follows:
SRBCT dataset [45]: the small round blue-cell tumor dataset consists of the expression of 2308 genes from 83 samples. This is a four class classification problem. The tumors are Burkitt lymphoma (BL), the Ewing family of tumors (EWS), neuroblastoma (NB) and rhabdomyosarcoma (RMS). The dataset consists of 11, 29, 18 and 25 samples of BL, EWS, NB and RMS respectively.
MLL leukemia [46]: This dataset has 3 classes, namely ALL, MLL and AML leukemia. The dataset contains 24 ALL, 20 MLL and 28 AML. The dimension of MLL dataset is 12,582.
ALL subtype dataset [47]: this dataset consists of the expression of 12,558 genes of subtypes of acute lymphoblastic leukemia. The dataset has seven classes namely BCR-ABL, E2A-PBX1, hyperdiploid >50 chromosomes ALL, MLL, T-ALL, TEL-AML1 and other (contains diagnostic samples that did not fit into any of the former six classes). Samples per class are 15, 27, 64, 20, 43, 79 and 79 respectively.
Clustering accuracy on SRBCT dataset
Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
---|---|---|---|---|---|---|
2 | 60.4 | 34.9 | 62.7 | 54.2 | 62.7 | 63.9 |
3 | 67.9 | 39.8 | 69.9 | 71.1 | 69.9 | 66.3 |
4 | 77.1 | 49.4 | 65.1 | 67.5 | 72.3 | 81.9 |
5 | 70.3 | 50.6 | 72.3 | 50.6 | 65.1 | 67.5 |
6 | 64.0 | 39.8 | 53.0 | 53.0 | 57.8 | 69.9 |
Rand score on SRBCT dataset
Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
---|---|---|---|---|---|---|
2 | 69.5 | 32.9 | 69.9 | 60.0 | 62.7 | 68.5 |
3 | 77.2 | 32.0 | 76.6 | 78.4 | 69.9 | 75.3 |
4 | 80.5 | 51.3 | 71.4 | 74.8 | 72.3 | 82.7 |
5 | 78.3 | 53.1 | 81.8 | 53.1 | 65.1 | 75.0 |
6 | 72.4 | 35.8 | 56.5 | 56.5 | 57.8 | 78.7 |
Clustering accuracy on MLL dataset
Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
---|---|---|---|---|---|---|
2 | 56.3 | 40.3 | 45.8 | 45.8 | 80.6 | 58.3 |
3 | 58.8 | 40.3 | 50.0 | 50.0 | 68.1 | 61.1 |
4 | 59.5 | 43.1 | 54.2 | 43.1 | 95.8 | 72.2 |
5 | 81.9 | 43.1 | 72.2 | 69.4 | 94.4 | 95.8 |
6 | 81.9 | 43.1 | 81.9 | 69.4 | 55.6 | 95.8 |
7 | 80.0 | 41.7 | 81.9 | 72.2 | 91.7 | 94.4 |
8 | 81.7 | 43.1 | 79.2 | 68.1 | 90.3 | 62.5 |
9 | 82.8 | 48.6 | 80.6 | 84.7 | 65.3 | 63.9 |
10 | 80.4 | 43.1 | 58.3 | 63.9 | 61.1 | 91.7 |
Rand score on MLL dataset
Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
---|---|---|---|---|---|---|
2 | 63.6 | 35.0 | 41.1 | 41.1 | 80.6 | 72.3 |
3 | 67.5 | 35.0 | 45.7 | 45.7 | 68.1 | 72.6 |
4 | 64.0 | 36.3 | 47.2 | 36.3 | 95.8 | 77.5 |
5 | 80.4 | 36.3 | 75.2 | 70.2 | 94.4 | 94.7 |
6 | 80.4 | 36.3 | 80.4 | 70.2 | 55.6 | 94.7 |
7 | 79.6 | 35.3 | 80.4 | 75.7 | 91.7 | 93.1 |
8 | 80.6 | 36.3 | 78.4 | 67.7 | 90.3 | 69.9 |
9 | 81.2 | 41.1 | 79.3 | 82.6 | 65.3 | 71.6 |
10 | 80.3 | 36.3 | 66.1 | 73.2 | 61.1 | 90.1 |
Clustering accuracy on ALL subtype dataset
Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
---|---|---|---|---|---|---|
2 | 44.8 | 32.1 | 42.8 | 36.1 | 34.3 | 44.0 |
3 | 53.3 | 25.1 | 45.3 | 46.2 | 34.9 | 56.9 |
4 | 57.4 | 25.1 | 51.7 | 49.9 | 33.3 | 61.5 |
5 | 60.4 | 26.0 | 42.8 | 34.6 | 44.7 | 62.4 |
6 | 58.9 | 25.4 | 38.8 | 41.0 | 45.3 | 63.6 |
7 | 58.7 | 24.2 | 47.4 | 36.1 | 49.5 | 56.3 |
8 | 54.5 | 25.7 | 42.8 | 34.8 | 41.9 | 61.2 |
Rand score on ALL subtype dataset
Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
---|---|---|---|---|---|---|
2 | 73.1 | 37.1 | 68.2 | 49.9 | 34.3 | 71.8 |
3 | 79.2 | 20.5 | 73.5 | 62.7 | 34.9 | 77.6 |
4 | 81.6 | 20.4 | 78.3 | 72.4 | 33.3 | 81.2 |
5 | 79.6 | 22.0 | 69.0 | 47.3 | 44.7 | 82.8 |
6 | 79.9 | 21.6 | 69.5 | 67.5 | 45.3 | 83.1 |
7 | 79.9 | 21.0 | 75.2 | 40.6 | 49.5 | 80.2 |
8 | 77.8 | 21.7 | 70.3 | 60.6 | 74.9 | 82.2 |
The estimation of the number of clusters by SIML
Dim | SRBCT | MLL | ALL subtype |
---|---|---|---|
2 | 4 | 3 | 7 |
3 | 4 | 2 | 7 |
4 | 4 | 2 | 8 |
5 | 4 | 3 | 4,7 |
6 | 2,4 | 3 | 7,9 |
7 | 3 | 3,8 | |
8 | 3 | 7 | |
9 | 3 | ||
10 | 6 |
It can be observed from Tables 2 and 3 that SIML achieved the highest clustering accuracy and rand score in 3/5 cases, and MLink and CLink achieved the highest performance in 1/5 case each. For the MLL dataset (Tables 4 and 5), mclust achieved the highest clustering accuracy and rand score in 4/9 cases and 3/9 cases, respectively. SIML was able to achieve 4/9 times highest clustering accuracy and rand score. Apart from SIML and mclust, k-means was also able to get reasonable performance especially for higher dimensions. For ALL subtype dataset (Tables 6 and 7), k-means achieved the highest clustering accuracy in 2/7 cases and highest rand score in 3/7 cases. SIML reported the highest clustering accuracy and rand score in 5/7 cases and 4/7 cases, respectively. These results show that SIML can perform reasonably well for many datasets employed in this work. In Table 8, we provided the summary of the number of clusters estimated by SIML. The corresponding MaxL _{ tot } plots are given in the Additional file 1. It can be seen from Table 8 that SIML estimates correctly the number of clusters most of the time.
Clustering on real data-II (SNPs data)
In this section, we attempt to illustrate the use of SIML on real data case. In practical situation, there are two problems to address in a dataset: 1) how many clusters are present; and, 2) what are the locations of the clusters? [48–50]. Sometimes, it is also necessary to identify or remove some sub-population from the data in order to solve the issue of population stratification. The existence of population stratification unmatched between cases and controls can produce false positives and negatives in GWAS [51]. For this exercise, we utilize data from a collection of 7001 individuals from the BioBank Japan (BBJ) project and 45 Japanese HapMap (JPT) samples [51]. The total number of single nucleotide polymorphisms (SNPs) was 140,387, genotyped via the Perlegen platform. We also included 45 Han Chinese HapMap (CHB) samples and merged these data using PLINK v1.9 (https://www.cog-genomics.org/plink2) on 140,367 common SNPs. Prior to PCA, we performed filtering using similar criteria as of that used by Yamaguchi-Kabata et al. [51]. We removed SNPs with a call rate < 99 %, a MAF < 0.01, and a Hardy-Weinberg equilibrium (HWE) exact test p-value > 10^{− 6}. Individuals with missing calls for > 5 % of SNPs were also removed. After filtering, 6998 BBJ, 44 JPT and 45 CHB samples sharing 117,758 SNPs remained. Consequently, the population consists of 6891 main land Japan (Hondo) samples, 45 CHB samples and 151 Okinawa samples referred as the Ryukyu (RYU) cluster. The Hondo samples can be further subdivided into 628 Kyushu, 908 Kinki, 358 Tokai-Hokoriku, 3975 Kanto-Koshinetsu, 466 Tohoku, 512 Hokkaido and 44 JPT samples. The aim here is to classify RYU and CHB from Hondo so that Hondo only data can be explored for further analysis. We first performed PCA on the filtered data using the R package SNPRelate [52] to reduce the data dimensionality and conducted analysis on 2 dimensional data. Linkage disequilibrium (LD) pruning with a threshold of 0.2 was used to define a representative set of 32,090 SNPs for PCA.
True positives for Hondo, RYU and CHB cluster on BBJ and HapMap data
Hondo | RYU | CHB | |
---|---|---|---|
Methods | (6891) | (151) | (45) |
71.4 % | 85.5 % | 100 % | |
K-means | 4922 | 129 | 45 |
99.9 % | 0 % | 100 % | |
SLINK | 6886 | 0 | 45 |
97.9 % | 92.7 % | 100 % | |
CLINK | 6746 | 140 | 45 |
95.8 % | 92.1 % | 100 % | |
MLINK | 6603 | 139 | 45 |
97.3 % | 94.7 % | 100 % | |
SIML | 6707 | 143 | 45 |
66.8 % | 94.7 % | 0 % | |
mclust | 4602 | 143 | 0 |
From Table 9, we can see that k-means was able to cluster all CHB samples correctly and also attained high true positive for the RYU cluster. However, it displayed comparatively inferior performance for the Hondo cluster. SLink reported very high true positive for Hondo and CHB clusters. However, it completely missed the RYU cluster. CLink, MLink and SIML were able to label all 45 samples of CHB correctly. SIML achieved the highest true positive for RYU among these 3 methods and CLink was slightly better (97.9 %) than SIML (97.3 %) for the Hondo cluster. In this case, mclust did not perform well. Nonetheless, mclust gave a high true positive for RYU cluster. It should be noted here that this data is highly imbalanced. Out of 7087 samples, 6891 samples belong to the Hondo cluster (i.e., almost 97 %) leaving only 3 % of samples for the RYU and CHB clusters. This imbalance creates a problem in a way that majority of samples turn to be labelled under the larger cluster leaving the smaller clusters. Nonetheless, SIML has shown encouraging results.
In summary, SIML successfully estimates the number of clusters as well as the locations. The SIML package was tested on Ubuntu 14.04 LTS OS (with 128 GB memory and Intel Xeon R CPU E5-1660 v2 @ 3.7 GHz x 6). The OS type is 64-bit. For Matlab we used ‘Statistics and Machine Learning Toolbox’.
Conclusions
In this work, through considering conformations of many biological data, we developed a clustering algorithm based on maximum likelihood estimate. The proposed stepwise iterative maximum likelihood (SIML) method is different from other maximum likelihood methods as it does not require the computation of first and second derivative of likelihood functions. This avoids the necessity to have differentiable likelihood functions for convergence. The SIML method was tested on artificial and real data to evaluate its performance. We show that SIML was able to produce promising results over state-of-the-art methods. The SIML method was also able to estimate the number of clusters successfully. The Matlab package of SIML is available from our webpage.
Declarations
Acknowledgements
This work has been supported by the CREST, JST, Japan.
Funding
The project was funded by JST Grant, Japan.
Availability of data and materials
All the 3 microarray datasets (SRBCT, MLL and ALL subtype) are publically available can can be downloaded via author’s webpage or visiting Kent Ridge Bio-medical Repository. The SNP data is managed by RIKEN management only and is not publically available. It is not in our (authors’) jurisdiction to make it available. The Matlab package of SIML is available via visiting authors’ webpage.
Authors’ contributions
AS developed the concept, carried out experiments and written the first draft of the manuscript. DS arranged and assisted in processing the genomic data. KAB: also processed the data and assisted in manuscript write-up. YL performed some experimental tasks. YK and MK provided the data. TT financed and supervised the project. All authors read and approved the final manuscript.
Competing interests
None of the authors have any competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
The Biobank Japan Project collected human genomic DNA after the patients provided written informed consent to participate in this project. This project was approved by the ethical committees at The Institute of Medical Science, The University of Tokyo, and the RIKEN Center for Integrative Medical Sciences (Ref. No. RIKEN Yokohama H17–16).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci U S A. 2013;110(11):4245–50.View ArticlePubMedPubMed CentralGoogle Scholar
- Monti S, Tamayo P, Mesirov J, Golub T. Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Mach Learn. 2003;52:91–118.View ArticleGoogle Scholar
- Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010;26(12):1572–3.View ArticlePubMedPubMed CentralGoogle Scholar
- Jain AK. Data clustering: 50 years beyond K-means. Pattern Recogn Lett. 2010;31(8):651–66.View ArticleGoogle Scholar
- Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed: Wiley-Interscience; 2000.Google Scholar
- Maimon O, Rokach L. Data Mining and Knowledge Discovery Handbook. 2nd ed. USA: Springer-Verlag New York Incorporated; 2010.Google Scholar
- Fisher D. Iterative optimization and simplification of hierarchical clusterings. J Intell Res. 1996;4(1):147–79.Google Scholar
- Dhillon IS, Guan Y, Kogan J, editors. Iterative clustering of high dimensional text data augmented by local search. Maebashi City, Japan: The 2002 IEEE International Conference on Data Mining; 2002.Google Scholar
- Fayyad UM, Reina CA, Bradley PS, editors. Initialization of Iterative Refinement Clustering Algorithms. Proceedings of the 4th International Conference on Knowledge Discovery & Data Mining (KDD98). Menlo Park: AAAI Press; 1998.Google Scholar
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. New York: Springer; 2009.View ArticleGoogle Scholar
- Heller KA, Ghahramani Z. Bayesian hierarchical clustering. Bonn, Germany: Twenty-second International Conference on Machine Learning (ICML); 2005.Google Scholar
- Farrell S, Ludwig C. Bayesian and maximum likelihood estimation of hierarchical response time models. Psychon Bull Rev. 2008;15(6):1209–17.View ArticlePubMedPubMed CentralGoogle Scholar
- Sharma A, Boroevich K, Shigemizu D, Kamatani Y, Kubo M, Tsunoda T. Hierarchical Maximum Likelihood Clustering Approach. USA: IEEE Transactions on Biomedical Engineering. 2016;PP (99). doi:10.1109/TBME.2016.2542212.
- Sibson R. SLINK: An optimally efficient algorithm for the single-link cluster method. Comput J (Br Comput Soc). 1973;16(1):30–4.Google Scholar
- Defays D. An efficient algorithm for a complete link method. Comput J (Br Comput Soc). 1977;20(4):364–6.Google Scholar
- Everitt BS, Landau S, Leese M, Stahl D. Cluster Analysis. 5th ed. UK: John Wiley & Sons; 2011.Google Scholar
- Lock EF, Dunson DB. Bayesian consensus clustering. Bioinformatics. 2013;29(20):2610–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Liu JS, Zhang JL, Palumbo MJ, Lawrence CE. Bayesian Clustering with Variable and Transformation Selections. Bayesian Stat. 2003;7:249–75.Google Scholar
- Latch EK, Dharmarajan G, Glaubitz JC, Jr OER. Relative performance of Bayesian clustering software for inferring population substructure and individual assignment at low levels of population differentiation. Conserv Genet. 2006;7(2):295–302.View ArticleGoogle Scholar
- Chen C, Durand E, Forbes F, François O. Bayesian clustering algorithms ascertaining spatial population structure: a new computer program and a comparison study. Mol Ecol Notes. 2007;7(5):747–56.View ArticleGoogle Scholar
- Ramoni M, Sebastiani P, Cohen P. Bayesian Clustering by Dynamics. Mach Learn. 2002;47(1):91–121.View ArticleGoogle Scholar
- Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from Incomplete Data via the EM Algorithm. J R Stat Soc Ser B. 1977;39(1):1–38.Google Scholar
- Misztal I. Comparison of computing properties of derivative and derivative-free algorithms in variance-component estimation by REML. J Anim Breed Genet. 1994;111(1–6):346–55.View ArticlePubMedGoogle Scholar
- Denoeux T. Maximum Likelihood Estimation from Uncertain Data in the Belief Function Framework. IEEE Trans Knowl Data Eng. 2013;25(1):119–30.View ArticleGoogle Scholar
- Ben-Hur A, Horn D, Siegelmann HT, Vapnik V. Support Vector Clustering. J Mach Learn Res. 2001;2:125–37.Google Scholar
- Lee J, Lee D. An improved cluster labeling method for support vector clustering. IEEE Trans Pattern Anal Mach Intell. 2005;27(3):461–4.View ArticlePubMedGoogle Scholar
- Lee J, Lee D. Dynamic Characterization of Cluster Structures for Robust and Inductive Support Vector Clustering. IEEE Trans Pattern Anal Mach Intell. 2006;28(11):1869–74.View ArticlePubMedGoogle Scholar
- Chiang J-H, Hao P-Y. A new kernel-based fuzzy clustering approach: support vector clustering with cell growing. IEEE Trans Fuzzy Syst. 2003;11(4):518–27.View ArticleGoogle Scholar
- Horng S-J, Su M-Y, Chen Y-H, Kao T-W, Chen R-J, Lai J-L, et al. A novel intrusion detection system based on hierarchical clustering and support vector machines. Expert Systs Appl. 2011;38(1):306–13.View ArticleGoogle Scholar
- Jun S, Park S-S, Jang D-S. Document clustering method using dimension reduction and support vector clustering to overcome sparseness. Expert Systs Appl. 2014;41(7):3204–12.View ArticleGoogle Scholar
- Wang K, Liang C, Liu J, Xiao H, Huang S, Xu J, et al. Prediction of piRNAs using transposon interaction and a support vector machine. BMC Bioinf. 2014;15(1):419.View ArticleGoogle Scholar
- Long JS. Regression Models for Categorical and Limited Dependent Variables. London: Sage Publications; 1997.Google Scholar
- Felsenstein J, Churchill GA. A Hidden Markov Model Approach to Variation Among Sites in Rate of Evolution. Mol Biol Evol. 1996;13(1):93–104.View ArticlePubMedGoogle Scholar
- Jennrich RI, Sampson PF. Newton–Raphson and Related Algorithms for Maximum Likelihood Variance Component Estimation. Technometrics. 1976;18(1):11–7.View ArticleGoogle Scholar
- Adachi J, Hasegawa M. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput Sci Monogr. 1996;28:1-150.Google Scholar
- Berndt ER, Hall BH, Hall RE, Hausman JA. Estimation and Inference in Nonlinear Structural Models. Ann Econ Soc Meas. 1974;3(4):653–65.Google Scholar
- Davidon WC. Variable Metric Method for Minimization. SIAM J Optim. 1991;1(1):1–17.View ArticleGoogle Scholar
- Fletcher R, Powell MJD. A Rapidly Convergent Descent Method for Minimization. Comput J. 1963;6(2):163–8.View ArticleGoogle Scholar
- Fraley C, Raftery AE. MCLUST Version 3 for R: Normal Mixture Modeling and Model-Based Clustering. Seattle, WA, USA: University of Washington; 2006.Google Scholar
- Cd A, Lee JA, Verleysen M. On Convergence Problems of the EM Algorithm for Finite Gaussian Mixtures. Bruges: European Symposium on Artificial Neural Networks (ESANN); 2003. p. 99–106.Google Scholar
- Sharma A, Paliwal KK. Fast principal component analysis using fixed-point algorithm. Pattern Recogn Lett. 2007;28(10):1151–5.View ArticleGoogle Scholar
- Sharma A, Paliwal KK. A Gradient Linear Discriminant Analysis for Small Sample Sized Problem. Neural Process Lett. 2008;27(1):17–24.View ArticleGoogle Scholar
- Sharma A, Paliwal KK. Cancer classification by gradient LDA technique using microarray gene expression data. Data Knowl Eng. 2008;66(2):338–47.View ArticleGoogle Scholar
- Sharma A, Imoto S, Miyano S. A Top-r Feature Selection Algorithm for Microarray Gene Expression Data. IEEE/ACM Trans Comput Biol Bioinform. 2012;9(3):754–64.View ArticlePubMedGoogle Scholar
- Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, Westermann F, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001;7(6):673–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Armstrong SA, Staunton JE, Silverman LB, Pieters R, Boer ML, Minden MD, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet. 2002;30(1):41–7.View ArticlePubMedGoogle Scholar
- Yeoh E-J, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1(2):133–43.View ArticlePubMedGoogle Scholar
- Rahman MM, Davis DN. Fuzzy Unordered Rules Induction Algorithm Used as Missing Value Imputation Methods for K-Mean Clustering on Real Cardiovascular Data. Lect Notes Eng Comput Sci. 2012;2197(1):391–4.Google Scholar
- Mirkin B. Clustering for Data Mining: A Data Recovery Approach. Boca Raton: Chapman & Hall; 2005.View ArticleGoogle Scholar
- Elhamifar E, Vidal R. Sparse Subspace Clustering: Algorithm, Theory, and Applications. IEEE Trans Pattern Anal Mach Intell. 2013;35(11):2765–81.View ArticlePubMedGoogle Scholar
- Yamaguchi-Kabata Y, Nakazono K, Takahashi A, Saito S, Hosono N, Kubo M, et al. Japanese Population Structure, Based on SNP Genotypes from 7003 Individuals Compared to Other Ethnic Groups: Effects on Population-Based Association Studies. Am J Hum Genet. 2008;83(4):445–56.View ArticlePubMedPubMed CentralGoogle Scholar
- Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28(24):3326–8.View ArticlePubMedPubMed CentralGoogle Scholar