Stepwise iterative maximum likelihood clustering approach

Background Biological/genetic data is a complex mix of various forms or topologies which makes it quite difficult to analyze. An abundance of such data in this modern era requires the development of sophisticated statistical methods to analyze it in a reasonable amount of time. In many biological/genetic analyses, such as genome-wide association study (GWAS) analysis or multi-omics data analysis, it is required to cluster the plethora of data into sub-categories to understand the subtypes of populations, cancers or any other diseases. Traditionally, the k-means clustering algorithm is a dominant clustering method. This is due to its simplicity and reasonable level of accuracy. Many other clustering methods, including support vector clustering, have been developed in the past, but do not perform well with the biological data, either due to computational reasons or failure to identify clusters. Results The proposed SIML clustering algorithm has been tested on microarray datasets and SNP datasets. It has been compared with a number of clustering algorithms. On MLL datasets, SIML achieved highest clustering accuracy and rand score on 4/9 cases; similarly on SRBCT dataset, it got for 3/5 cases; on ALL subtype it got highest clustering accuracy for 5/7 cases and highest rand score for 4/7 cases. In addition, SIML overall clustering accuracy on a 3 cluster problem using SNP data were 97.3, 94.7 and 100 %, respectively, for each of the clusters. Conclusions In this paper, considering the nature of biological data, we proposed a maximum likelihood clustering approach using a stepwise iterative procedure. The advantage of this proposed method is that it not only uses the distance information, but also incorporate variance information for clustering. This method is able to cluster when data appeared in overlapping and complex forms. The experimental results illustrate its performance and usefulness over other clustering methods. A Matlab package of this method (SIML) is provided at the web-link http://www.riken.jp/en/research/labs/ims/med_sci_math/. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1184-5) contains supplementary material, which is available to authorized users.


Background
In an unsupervised learning procedure, the class label of a training sample is not known and the aim is to partition the data into clusters. The unsupervised learning scheme uses the relationship between samples to perform partitioning. In many biological data (e.g. transcriptome data, genomic data etc.), the number of clusters and class labels are unknown. However, the distribution is sometimes known, which is usually normal. Therefore, it would be an advantage to build a technique that utilizes distance and variance information as it can track clusters with different conformations.
Over last several decades, the k-means clustering algorithm has been used quite significantly in partitioning the biological data. In the most recent multi-omics data analysis tools, like iCluster, and iClusterPlus [1], the underlying clustering method used was also k-means. Some tools in cancer research, like ConsensusCluster (CC) and CCPlus [2,3], also utilize k-means as one of the common clustering algorithms. Though the k-means clustering algorithm has been extensively applied [4] due to its simplicity and reasonable level of accuracy, it cannot track clusters when samples of different groups are overlapping to each other (i.e., data points of adjacent groups are spread in a way that the groups partly coincide over each other). In biological data, this is sometimes the case, and thereby leads to clusters which may not be accurate. This has a significant implication in biological findings, particularly in cancer subtypes analysis, population stratification analysis in GWAS and multi-omics data analysis. In general, k-means has played a significant role in carrying out analysis for various types of biological data over several years. Since data complexity and quantity are increasing, it is important to develop techniques that can perform clustering by following data topologies.
In the field of unsupervised learning and clustering, several wonderful techniques have emerged. Some of the techniques are briefly summarized here as follows: 1) clustering using some criterion functions e.g. i) sum-ofsquared error criterion; ii) related minimum variance criterion, iii) scattering criterion; iv) trace criterion; v) determinant criterion; and, vi) invariant criterion [5,6]; 2) clustering using iterative optimization [7][8][9]; 3) hierarchical clustering [10][11][12][13]; several hierarchical-based algorithms can be found in the literature; e.g., singlelinkage [14], complete-linkage [15], median-linkage [16] and so on. Single linkage (SLink) [14] merges two nearest-neighbor clusters at a time in an agglomerative hierarchical fashion. It uses Euclidean distance to measure the closeness between two clusters (if it is less than an arbitrary threshold). This method is very sensitive to data position, which sometimes creates problem by forming a cluster in a long chain (known as the chaining effect). The complete linkage (CLink) hierarchical approach [15] depends on the farthest-neighbor and reduces the effects of long chains. This technique is also sensitive to outliers. The use of average or median distance could be a way to overcome this sensitiveness. This was done in median linkage (MLink) hierarchical approach [16]; 4) clustering is also performed using Bayes classifier [17][18][19][20][21]; 5) clustering iterative maximum likelihood [22][23][24]; and, 6) support vector clustering [25][26][27].
In the recent literature, support vector clustering has gained a lot of attention [26][27][28][29][30][31]. However, it is expensive in processing time and sometimes fails to find meaningful clusters. In general, clustering methods based on Bayes classifier and maximum likelihood are still the preferred choice compared to support vector clustering for many applications. There are various approaches to implement these clustering methods.
In this paper, we focus on maximum likelihood estimate. There are three ways to implement the maximum likelihood method. 1) Analytic way: likelihood functions are differentiated and equated to zero and the equations are solved to find extrema. The second derivative is then taken to ensure if maxima has reached rather than minima. 2) Grid search: an exhaustive search over a region is conducted to find the parameters that produce largest likelihood. 3) Numerical analysis: an initial value of parameter is used in a hill climbing algorithm or gradient ascent algorithm (e.g. Newton-Rapson, BHHH, DFP) to find the maxima. Maximum likelihood is also estimated via EM algorithm [5,22,[32][33][34][35][36][37][38][39].
In general, it is impossible to use an analytic approach to find maximum likelihood estimates as the parametric form of data is unknown. Grid search is only possible when the dimensionality of the data is very small. Most of the time, maximum likelihood is computed by a hill climbing algorithm or by the EM algorithm. The potential problem with gradient algorithms is that when likelihood is not differentiable then it is not possible to find gradient to convergence. Considering this, in this paper, we propose a stepwise iterative maximum likelihood (SIML) procedure which does not require derivatives of likelihood functions. It can find all unknown parameters without solving first derivative and second derivatives of likelihood. The experimental results also show promising when compared to many state-of-the-art clustering methods.

Description of Maximum Likelihood Clustering
Here, we briefly discuss maximum likelihood method for clustering [5]. Assume a d -dimensional sample set χ = {x 1 , x 2 , …, x n } having n unlabelled samples, and, c is the number of clusters. Let Ω = {ω j } (for j = 1, 2, …, c) be the state of the nature or class label for j th cluster χ j . Suppose θ = {θ j } (for j = 1 … c) is any unknown parameter (having mean μ and covariance Σ). Then the mixture density is given by where p(x|ω j , θ j ) is the conditional density, and P(ω j ) is the a priori probability. The log likelihood can be represented by joint density Suppose that the joint density is differentiable with respect to θ then from Eqs. 1 and 2 where ∇ θ i L is the gradient of L with respect to θ i . If θ i and θ j are independent and suppose a posteriori probability is given as then from Eqs. 3 and 4, we have The gradient of likelihood (Eq. 5) can be equated to zero ( ∇ θ i L ¼ 0 ) to obtain maximum likelihood estimateθ i . The solution can be therefore obtained by In the above equations, θ is replaced by unknown mean and covariance parameters for normal distribution case, to yield maximum likelihood estimates. In the literature, parameter θ is iteratively updated to reach the final valueθ using hill climbing algorithms such as the Newton-Raphson method. In general, the computation of first and second derivatives of likelihood is required to find the solution. If the likelihood is differentiable and the a priori probability is non-zero, then convergence can be obtained. However, there is always a possibility of being trapped in a local optima.

Stepwise iterative maximum likelihood method
In this section, we describe our proposed method. This method seeks the most optimal partitions in an iterative way. We begin with an initial partition of data and shift a sample from one partition to another partition, and test if such a shift improves the overall log-likelihood. A simple illustration of SIML is given in Fig. 1.
If we define class-based log-likelihood of two clusters χ i and χ j as then we would be interested in knowing how the classbased log-likelihood functions (referred as log-likelihood function hereafter) change if a sample is shifted from χ i to χ j . In order to know this, let us define mean and covariance of χ i and χ j as μ i and μ j , and, as Σ i and Σ j , respectively. The following equations describe mean and covariance: and where n i and n j are number of samples in χ i and χ j , respectively. If the component density is normal and let P(ω i ) = n i /n (where n is the total number of samples) then Eqs. 9 and 10 can be written as Þ¼n i d we can write L i as Similarly, we can write L j as and the total log-likelihood for c clusters can be written as Fig. 1 An illustration of stepwise iterative maximum likelihood method using a c = 2 cluster case. In this illustration, two clusters and are given with likelihood functions L 1 and L 2 , respectively. The center of clusters are depicted by μ 1 and μ 2 (shown as '+' inside two clusters). Initial total likelihood is L old which is the sum of two likelihood functions (L 1 + L 2 ). A sample x∈ is checked for grouping. It is advantageous to shift sample x to cluster only if the new likelihood (L new = L 1 * + L 2 * ) is higher than the old likelihood; i.e., L new > L old where L k is from Eq. 15 or 16. If a samplex∈χ i is shifted to χ j , then the mean and covariance will change as follows (from Eqs. 11, 12, 13 and 14): where μ i * , μ j * , Σ i * and Σ j * are means and covariances after the shift.
In order to find the change in log-likelihood functions L i and L j , we first introduce the following Lemma.
Lemma 1 If a samplex∈χ i is shifted to cluster χ j and the changed covariance of χ j is defined as Proof By taking determinant of Σ j * , we get since for m × m square matrices |AB| = |A||B| and for a scalar c, |cA| = c m |A|. We can write Eq. L1 as From Sylvester's determinant theorem, rectangular matrices A ∈ ℝ m × n and B ∈ ℝ n × m in |I m × m + AB| is equal to |I n × n + BA|. Therefore, we can write since |c| = c. Substituting right hand side of Eq. L3 in Eq. L2 proves the Lemma.
As similar to Lemma 1, the determinant of the change in covariance of χ i can be written as We can now observe the change in L j (Eq. 16) due to the shift of a samplex from χ i to χ j as From Lemma 1 and Eq. 16, we can rewrite Eq. 23 after doing algebraic manipulation as where ΔL j is given by and constant C is given by In a similar manner, change in L i can be obtained by using Eqs. 15 and 22 as where ΔL i is given by and C is same as of Eq. 26. By adding Eqs. 24 and 27, we can get the change in total log-likelihood (L tot * ) since there is no change in any other clusters apart from χ i to χ j ; i.e., from Eqs. 17, 24 and 27 we have where ΔL tot = ΔL j − ΔL i . Therefore, the shift of a samplê x is advantageous if ΔL tot > 0. This will give the following algorithm ( Table 1): The following sections discuss the characteristic of the SIML method.
Initial settings of the procedure Similar to any other iterative based optimization technique, this technique also depends on the initial settings. Therefore, it is important to put consideration into the initial settings. In this paper, we implemented three ways of initializing the partitions: 1) random initialization, 2) k-means based initialization, and 3) initialization based on the solution of c − 1 partitions and the mean. These schemes are described as follows: 1. Random initialization: In this scheme, we create c random means around the center of the data. This technique works well when the number of clusters is small. If c is very large then it can miss clusters. 2. K-means initialization: In this scheme, the data is first partitioned into c clusters by using the k-means algorithm. The solution of k-means is used as an initial setting for the SIML method. This method works well even if the number of clusters is large. Most of the time this initialization technique provides good results. However, since the k-means algorithm does not track the data based on covariance information, it has limitations. 3. Initialization based on the solution of c − 1 clusters: The initialization of c clusters is done by using the solution of c − 1 clusters, which would give c − 1 locations. The c th location is the mean of the overall data itself. If only two clusters are required to find, then 2 locations around center of the data is used since the solution of 1-cluster is the center of the data itself.
In this paper, we used all the three schemes for initialization and in general schemes 2 and 3 provide satisfactory results for most of the data conformations.

Numerical stability
Due to numerical difficulties the convergences of an iterative algorithm can be missed (e.g. convergence problem for EM algorithm is discussed in [40]). The problem of numeral difficulties is of particular issue when data dimensionality is high. In this situation, iterative algorithms sometimes do not converge properly. This problem usually appears due to the small numerical values of the covariance matrix. If the eigenvalues of a covariance matrix Σ are small, then its determinant can give a value close to zero due to the fixed point architecture of the hardware. However, this problem can be easily overcome by first conducting eigenvalue decomposition of Σ and using the summation of the logarithm of eigenvalues. It is described as follows: The If the values of λ are small then |Σ| = 0. This problem can be overcome by simply taking logarithm as In a similar way, the inverse of Σ can cause problems in the term of Eq. 28; i.e., the computation of the term when the size of the covariance matrix is large. In order to make this numerically stable a small quantity ϵ > 0 can be included as follows: This will ensure that 1− P n i þ1þ > 0.

Small sample size case
When the dimensionality d is much greater than the number of samples n (d ≫ n) then small sample size problem appears [41][42][43][44]. Let a sample set χ = {x 1 , x 2 , …, x n } be drawn independently and let the mean and covariance of χ be denoted by μ and Σ, respectively. In the normal density we have a term P = (x − μ) T Σ − 1 (x − μ) to compute which cannot be solved due to singular covariance matrix as its inverse does not exist. A simple extension could be to use the pseudo-inverse of Σ (denoted here as Σ + ). However, this doesn't solve the problem. If samples x are from χ then P + = (x − μ) T Σ + (x − μ) will always be equal to the rank of Σ or basically n − 1 (for d ≫ n). This means that all the samples in a particular cluster would have the same probability and it would not be possible to justify classification of samples based on probability. A second way would be to regularize Σ, however, computing optimal regularization parameter Table 1 Stepwise iterative maximum likelihood method procedure 1. Initialization: select initial partitions with means μ 1 , μ 2 , …, μ c and covariance matrices Σ 1 , Σ 2 , …, Σ c 2. Loop: Select a samplex∈χ i .
3. If n i > 1 then compute Transferx to χ k if δ k = max δ j for all j.
7. If L tot doesn't change in n attempts then stop otherwise go to Loop. could be a challenging task. One way could be to apply principal component analysis (PCA) procedure on a d -dimensional sample set χ ∈ ℝ d to transform it to a parsimonious sample set Y ∈ ℝ h where h < min(d, n) h < min(d, n). Thereafter, the clustering procedure can be performed.

Determination of the number of clusters
It is potentially important to estimate the number of clusters c present in the sample set. Since this information is usually not provided, it is important to obtain the value of c with whatever information we have at hand. Basically, the only information we have is the sample itself. In the maximum likelihood procedure we compute likelihood from sample set. Therefore, this information can be utilized to estimate the number of clusters. In order to evaluate c, we can run the SIML algorithm for a range of clusters e.g. 1 ≤ c ≤ K to see at what point the likelihood function stabilizes or reaches maximum. In this paper, we investigated two ways to compute c. In the first way, we compute the maximum log-likelihood MaxL tot (c) achieved for all values of c ∈ [1, K]. At a particular value of c the MaxL tot reaches maximum and does not change much. This would be the estimated value of c. In the second way, we compute the difference between the maximum log-likelihood MaxL tot achieved and the first value of L tot after SIML procedure (excluding the initial L tot value computed from initial settings as this value is based on the first random guess). Therefore, for a particular number of cluster c, we will get this difference likelihood and we denote it as DelL tot which is equal to MaxL tot − L tot (1) or max(L tot ) − L tot (1), where L tot (r) defines the value of L tot at an iteration r. The curve of DelL tot as a function of c would give a peak at some value of c which would be its best value. In most of the data conformations, MaxL tot gives reliable results. Nonetheless, both the graphs of MaxL tot and DelL tot (as a function of c) are illustrated in the experimental section of the paper.

Results
In order to evaluate the algorithm, we carried out experiments on normal Gaussian data as well as on biological data. We divide this section into 5 subsections. In subsection 1, we illustrate the performance of various methods using three cluster case. Subsection 2 indulges on maximum likelihood plots as a function of number of clusters. In subsection 3, we discuss the processing time of the algorithm. In subsection 4, we discuss the performance in terms of clustering accuracy and rand score of various methods; and, in subsection 5 (parts I and II), we discuss SIML on biological data.

An illustration using three clusters
Since data distribution of GWAS can appear as approximately Gaussian, we generated normal distribution data with 3 different mean and covariance for simulation purposes. Furthermore, if we consider GWAS data from a continent (e.g. Europe) as one cluster, from a country (e.g. Germany) as second cluster and from a city (e.g. Berlin) as a third cluster, then third cluster (Berlin data) will reside inside second cluster (Germany data), and second cluster (Germany data) will reside inside the first cluster (European data). Therefore, clusters will overlap with each other. To simulate this scenario, we generated a sample set with 1, 500 samples, 2 dimensions and 3 clusters as shown in Fig. 2a, and Fig. 2b) that k-means clustered the 3 clusters without considering the distribution information. The processing time to perform the k-means algorithm was 0.82 s. Support vector clustering (CG method) [25] was difficult to perform as it is not possible to provide number of class information. The parameters were tuned so that 3 clusters are outputed. The processing time by this method was 1183.1 s (excluding the tuning time). It can be observed from the Fig. 2c that this method was failed to track the clusters. Next, support vector clustering (using SEP method) [26] was performed. The default parameters gave 45 clusters. Therefore, as similar to the previous CG method, tuning of parameters was carried out to extract only 3 clusters. Processing time was 25.2 s excluding the tuning time. This method also misses the clusters (Fig. 2d). Then we performed the proposed SIML method. This method was able to track all the 3 clusters in 4.49 s per repetition (Fig. 2e). The likelihood plots are discussed in the following section.

Likelihood plots
Here we discussed three plots: log-likelihood (L tot ) versus sample (Fig. 3a), maximum log-likelihood (MaxL tot ) as a function of number of clusters (Fig. 3b) and DelL tot as a function of number of clusters (Fig. 3c). Figure 3a depicts L tot plot for 3 clusters. When a sample is moved from one cluster to another cluster the value of L tot is updated. This is an increasing function and the maximum value of L tot is defined as MaxL tot in this paper. Figure 3b depicts MaxL tot plot. Since in general, the number of cluster c information is unknown, it is therefore crucial to estimate this value. In this paper we showed that by using MaxL tot plot and DelL tot plot, it is possible to estimate c. For this, one can provide a range of c values and the value for which MaxL tot curve converges (reaches highest peak or does not change much) is the estimated c. We use the same data we generated in Fig. 2a and provide 10 values of c as 1 ≤ c ≤ 10. It can be seen from MaxL tot plot that it converges or peaks at c = 3.

Processing time
Here we discuss the processing time of the SIML algorithm. In order to give a complete picture, we investigated the clock time in seconds for samples n = 3, 000, 9, 000, 27,000, 54,000 and 102,000 having 3 clusters. We use the same conformation of data as depicted in Fig. 2a, however, we increased the dimensionality to d = 10, 20, 100 and 200. Figure 4 shows the processing time of the algorithm when processed in Linux platform (Ubuntu 14.04 LTS, 64 bit) with 6 processors (Intel Xeon R CPU E5-1660 v2 @ 3.70 GHz) and with 128 GB memory for a repetition.

Clustering on artificial data
We performed clustering accuracy and rand score test on a set of artificial data. For artificial data, we generated d -dimensional, 4 cluster data such that cluster samples are overlapping to each other (in a similar way as shown in Fig. 2). There are in total 2000 samples (where each cluster having 500 samples). We computed cluster accuracy and rand score for various methods. For statistical stability, we generated data 20 times for a particular dimension d by changing random seed of the normal data. Thereby, we computed average clustering accuracy and average rand score over these 20 attempts for dimension d. We then varied dimension d = 2, 3, …, 20 and reported average clustering accuracy and average rand score in Fig. 5. For comparison, we used centroidbased technique like k-means, hierarchical-based technique like SLink [14], CLink [15] and MLink [16] and model-based technique (using EM algorithm) like mclust [39]. It can be observed from Fig. 5a that mclust and SIML methods perform quite well on Gaussian data. Kmeans algorithm also performs reasonably well on this data. MLink and SLink couldn't perform well. For average rand score (Fig. 5b), CLink, k-means, SIML and mclust are exhibiting reasonable performance. However, mclust and SIML are superior reaching almost 100 rand score. Since mclust and SIML are derived from Gaussian model, their performance on Gaussian data are well compared to other techniques.
Clustering on real data-I (publically available biological data) In this section, we utilized various biological data and reported clustering accuracy and rand score. We employed several methods such as k-means, SLink, CLink, MLink and mclust for comparison. The description of biological data is given as follows: SRBCT dataset [45]: the small round blue-cell tumor dataset consists of the expression of 2308 genes from 83 samples. This is a four class classification problem. The tumors are Burkitt lymphoma (BL), the Ewing family of tumors (EWS), neuroblastoma (NB) and rhabdomyosarcoma (RMS). The dataset consists of 11, 29, 18 and 25 samples of BL, EWS, NB and RMS respectively.  ALL subtype dataset [47]: this dataset consists of the expression of 12,558 genes of subtypes of acute lymphoblastic leukemia. The dataset has seven classes namely BCR-ABL, E2A-PBX1, hyperdiploid >50 chromosomes ALL, MLL, T-ALL, TEL-AML1 and other (contains diagnostic samples that did not fit into any of the former six classes). Samples per class are 15,27,64,20,43,79 and 79 respectively.
To vary the data dimensionality (number of features), we utilized Chi-squared feature selection method to rank the attributes. The dimensionality investigated was d = 2, 3, …, n m /2, where n m is the cluster with minimum number of samples. We then performed cluster analysis (to evaluate clustering accuracy and rand score) on these datasets and compared SIML with the k-means, SLink, CLink, MLink and mclust methods. The results are reported in Tables 2 and 3 (for SRBCT dataset), Tables 4 and 5 (for MLL dataset), Tables 6 and 7 (ALL subtype dataset) and Table 8 (for estimation of number of clusters by SIML). Clustering accuracy is depicted in Tables 2, 3, 4, 5, 6 and rand score is shown in Tables 2,  3, 4, 5, 6 and 7. The methods achieving highest results are depicted in bold faces.
It can be observed from Tables 2 and 3 that SIML achieved the highest clustering accuracy and rand score in 3/5 cases, and MLink and CLink achieved the highest performance in 1/5 case each. For the MLL dataset (Tables 4 and 5), mclust achieved the highest clustering accuracy and rand score in 4/9 cases and 3/9 cases, respectively. SIML was able to achieve 4/9 times highest clustering accuracy and rand score. Apart from SIML and mclust, k-means was also able to get reasonable performance especially for higher dimensions. For ALL subtype dataset (Tables 6 and 7), k-means achieved the highest clustering accuracy in 2/7 cases and highest rand score in 3/7 cases. SIML reported the highest clustering accuracy and rand score in 5/7 cases and 4/7 cases, respectively. These results show that SIML can perform reasonably well for many datasets employed in this work. In Table 8, we provided the summary of the number of clusters estimated by SIML. The corresponding MaxL tot plots are given in the Additional file 1. It can be seen from Table 8 that SIML estimates correctly the number of clusters most of the time.
Clustering on real data-II (SNPs data) In this section, we attempt to illustrate the use of SIML on real data case. In practical situation, there are two problems to address in a dataset: 1) how many clusters are present; and, 2) what are the locations of the clusters? [48][49][50]. Sometimes, it is also necessary to identify or remove some sub-population from the data in order to solve the issue of population stratification. The existence of population stratification unmatched between    The methods achieving highest results are depicted in bold faces cases and controls can produce false positives and negatives in GWAS [51]. For this exercise, we utilize data from a collection of 7001 individuals from the BioBank Japan (BBJ) project and 45 Japanese HapMap (JPT) samples [51]. We first performed PCA on the filtered data using the R package SNPRelate [52] to reduce the data dimensionality and conducted analysis on 2 dimensional data. Linkage disequilibrium (LD) pruning with a threshold of 0.2 was used to define a representative set of 32,090 SNPs for PCA. In summary, this two dimensional data contain three clusters: Hondo, RYU and CHB. Here we first computed true positives (and its corresponding accuracy) for Hondo, RYU and CHB clusters. This would provide us information regarding correctly labelled samples in each cluster. For this purpose, we executed all the methods to provide 3 clusters of the data. The true positives for various methods are depicted in Table 9.
From Table 9, we can see that k-means was able to cluster all CHB samples correctly and also attained high true positive for the RYU cluster. However, it displayed comparatively inferior performance for the Hondo cluster. SLink reported very high true positive for Hondo and CHB clusters. However, it completely missed the RYU cluster. CLink, MLink and SIML were able to label all 45 samples of CHB correctly. SIML achieved the highest true positive for RYU among these 3 methods    and CLink was slightly better (97.9 %) than SIML (97.3 %) for the Hondo cluster. In this case, mclust did not perform well. Nonetheless, mclust gave a high true positive for RYU cluster. It should be noted here that this data is highly imbalanced. Out of 7087 samples, 6891 samples belong to the Hondo cluster (i.e., almost 97 %) leaving only 3 % of samples for the RYU and CHB clusters. This imbalance creates a problem in a way that majority of samples turn to be labelled under the larger cluster leaving the smaller clusters. Nonetheless, SIML has shown encouraging results.
In the next analysis, we did not provide the number of clusters information to study the characteristics of SIML method. The resulting clustering is illustrated in Fig. 6. For this case, the MaxL tot plot gives peak at 3 clusters (Fig. 7) and therefore 3 clusters were used in this case. The true RYU and CHB labels are shown on the plot as circles and diamonds, respectively. Most of Hondo samples are in Cluster 2. There are around 6715 samples in Cluster 2 representing the Hondo region. Almost all CHB are clustered in Cluster 3 and most of RYU are clustered in Cluster 1. Around 8 RYU are clustered in Cluster 2 giving a false negative (FN) error of 8 samples (5.3 %) and no CHB sample is misclassified giving FN error of 0 samples (0 %). Cluster 1 and Cluster 3 can be classified easily and analysis can be conducted on Cluster 2 (Hondo) with very less FN error.
In summary, SIML successfully estimates the number of clusters as well as the locations. The SIML package was tested on Ubuntu 14.04 LTS OS (with 128 GB memory and Intel Xeon R CPU E5-1660 v2 @ 3.7 GHz x 6). The OS type is 64-bit. For Matlab we used 'Statistics and Machine Learning Toolbox'.

Conclusions
In this work, through considering conformations of many biological data, we developed a clustering algorithm based on maximum likelihood estimate. The proposed stepwise iterative maximum likelihood (SIML) method is different from other maximum likelihood methods as it does not require the computation of first and second derivative of likelihood functions. This avoids the necessity to have differentiable likelihood functions for convergence. The SIML method was tested on artificial and real data to evaluate its performance. We show that SIML was able to produce promising results over state-of-the-art methods. The SIML method  was also able to estimate the number of clusters successfully. The Matlab package of SIML is available from our webpage.

Additional file
Additional file 1: Estimation of number of clusters using SIML method. (DOCX 408 kb)