Volume 14 Supplement 1
Computational Intelligence in Bioinformatics and Biostatistics: new trends from the CIBB conference series
Finding reproducible cluster partitions for the kmeans algorithm
 Paulo JG Lisboa†^{1},
 Terence A Etchells†^{1},
 Ian H Jarman†^{1}Email author and
 Simon J Chambers†^{1}
DOI: 10.1186/1471210514S1S8
© Lisboa et al.; licensee BioMed Central Ltd. 2013
Published: 14 January 2013
Abstract
Kmeans clustering is widely used for exploratory data analysis. While its dependence on initialisation is wellknown, it is common practice to assume that the partition with lowest sumofsquares (SSQ) total i.e. within cluster variance, is both reproducible under repeated initialisations and also the closest that kmeans can provide to true structure, when applied to synthetic data. We show that this is generally the case for small numbers of clusters, but for values of k that are still of theoretical and practical interest, similar values of SSQ can correspond to markedly different cluster partitions.
This paper extends stability measures previously presented in the context of finding optimal values of cluster number, into a component of a 2d map of the local minima found by the kmeans algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index. The proposed method is illustrated by application to five synthetic datasets replicating a real world breast cancer dataset with varying data density, and a large bioinformatics dataset.
Background
Structure finding in large datasets is important in first line exploratory data analysis. Clustering methods are commonly used for this purpose and among them kmeans clustering is widely used.
This has led to variations such as penalised and weighted kmeans [1] and in combination with other methods [2]. More complex approaches include hybrid hierarchical kmeans clustering algorithms with fewer parameters to adjust [3]. All of these methods depart considerably from the standard implementations that are still of interest.
Practical applications may require the derivation of a partition of the data that is representative of the best achievable clustering performance of the algorithm. This solution should clearly be reproducible under repeated initialisations. However, it is well known that the kmeans algorithm has guaranteed convergence only to a local minimum of a sum of squares objective function [4], [5], [6], finding the global optimum is in general, NPhard [7]. It is commonly assumed that it is sufficient to carry out a number of random initialisations followed by selection of the best separated solution, for instance measured by the sum of square distances from cluster prototypes (SSQ). Various aspects of this process, from the choice of separation measures to the best number of clusters, k, are guided by heuristic methods, reviewed in more detail later in this section.
It is perhaps surprising that as widely used a method as the standard kmeans algorithm does not have published, to our knowledge, a systematic assessment of whether it is always the case that the best SSQ suffices. We will show that while this is generally the case, as the number of clusters, k, increases within a range of practical interest, solutions with SSQ very close to the optimal value for that value of k can be substantially different from each other. This requires identification of a procedure to draw a single partition set which is both wellseparated and stable, in the sense that a very similar solution will be found by repeating the whole procedure from the beginning.
This is not to say that there is a unique best solution. Rather, the intention is to reproduce convergence to within a set of data partitions which is mutually consistent and similarly well associated with the data structure. The test that is applied is to measure the internal consistency of the clustering solutions obtained for a given value of k. A synthetic data set will be used to evaluate the reliability of the proposed method.
Recent comprehensive reviews of the kmeans algorithm [4], [5], [6] do not describe any prescriptive method that is empirically demonstrated to consistently return a clustering solution well associated with the known data structure, nor do they confirm the reliability of the method empirically by repeated mechanistic application to the same data set.
Historically, the motivation for the kmeans algorithm included minimising variance in stratified sampling [8, 9].This defines a sumofsquares objective function consisting of the withincluster variance of the sample, for which a convergent algorithm was later defined [10] by iterating two main stages  definition of prototypes and allocation of continuous data to each prototype. The prototypes assume special status by virtue of achieving a local optimum for the sample variance. Further optimisation beyond convergence of the batch algorithm is possible using online updates [11]. This is the clustering method used throughout in this paper. These two stages remain the core of generalised kmeans procedures [4]. They were linked to maximum likelihood estimation by Diday and Schroeder [4] and, for modelbased clustering, can be optimised with the EM algorithm [12]. The structure of the method can also be extended with the use of kernels [13] and it can be also applied to discrete data [14].
The illposed nature of the optimisation task in the standard algorithm[4–6] invites more specific guidance in the selection of solutions. An obvious first step is to choose the solution with the best empirical value of the objective function. The link with maximum likelihood points to nonoptimality of the global minimum of the objective function for the purpose of recovering the underlying structure of the data, consistent with the finding that simple minimization of the objective function can lead to suboptimal results [15, 16].This has led to departures from the standard algorithm to specify the choice of initial prototypes [16] and to impose order relationships on the data using adjacency requirements [15]. When attempting to sample a single partition set from the numerous local minima, two indicators may be of use. The first one is to measure the separation between clusters, which relates to a diagnostic test for when a partition set can be trusted [16].
The second indicator is the stability of cluster partitions in relation to each other [17].This is the closest work to the proposed method, albeit focusing on the choice of number of prototypes, k, with the conjecture that the correct number of clusters will result in an improvement in betweencluster stability, estimated by subsampling the data. While we query the definition of the consistency index from a statistical point of view, we have verified this conjecture when applied to multiple random initialisations of the complete data set and suggest this approach as a guide to the choice of cluster number, reflecting the view that ''when dividing data into an optimal number of clusters, the resulting partition is most resilient to random perturbations" [4]. This work is among several relating to the application of stability measures to clustering [18, 19] but they do not provide specific guidance for the selection of partitions.
The paper therefore proposes and evaluates a practical and straightforward framework to guide the application of kmeans clustering, with the property that multiple applications of the framework result in very similar clustering solutions with clearly defined optimality properties, even in the presence of complex data structures containing anisotropic and contiguous clusters, as well as high dimensional data. Application of the framework to the datasets gives a measure of the performance of the framework. To this aim, we specify the task to be the selection of a partition of the data i.e. into nonoverlapping subsets, given a value of 'k' for the number of prototypes, by repeated application of the standard kmeans algorithm. Therefore, this paper does not address ensemble and consensus methods or modelbased extensions of the standard algorithm. Neither do we discuss the relative merits of different data representations by preprocessing, including dimensionality reduction and the choice of distance measure all of which strongly condition the solution space. These approaches represent deviations from the standard algorithm [11], and so are outside the scope of the current paper.
The motivation for seeking a representative, stable clustering solution, with the standard kmeans algorithm is the observation, from the decomposition of the objective function in terms of the principal eigenvalues of the data covariance matrix, that nearoptimal solutions may be found, in which case they will be stable in the sense that any other 'good' partitions will be 'close' to them [7]. Yet, we show with our empirical results that the quality of clustering solutions can vary substantially both in measures of cluster separation and in the consistency from one random initialisation to the next. This is the case even when considering only well separated solutions.
From a more practical perspective, kmeans based algorithms are commonly used in the subtyping of diseases [20] or for the analysis of DNA microarray data [21], [22], [23], [24] where it is commonly used to allow researchers to gain insights and a better understanding of gene similarity. Of importance in these studies is the stability of the solutions obtained, if the results are unstable, any inferences may change with an alternate run of the algorithm. Use of the framework generates a map of clustering solutions where the appearance of structure reveals the typical variation in cluster performance that can be obtained, offering guidance to when to stop sampling. Most importantly it does so using an efficient approach to stabilise the solutions even when applied to complex and challenging data.
It is important to distinguish between the two objectives: the selection of an appropriate value of k and the selection of a stable reproducible solution. The gross structure of the SeCo map of local minima makes a useful indicator of good choices for the assumed cluster number, k. The proposed framework therefore provides an empirical equivalent to the methodology proposed for densitybased clustering by [25], however the purpose of this paper is not to evaluate methods for determining the appropriate value of k.
More closely related to this paper is the concept of optimality of structure recovery and is naturally expressed in obtaining selfconsistency when reproducing solutions. This leads to consideration of the stability of the partitions, for instance to guide consensus clustering [18].
 i.
Optimisation of the SSQ objective function by the kmeans algorithm with online updates can result in significantly different solutions with SSQ values close to empirical minimum.
 ii.
Filtering SSQ values e.g. best 10% allows the degeneracy of similar SSQ values to be resolved by choosing the individual cluster partition with maximum value of the internal consistency index. This is shown to be a representative solution in the sense that it associates well with the data structure on a challenging synthetic data set; moreover, the solution is reproducible since repeated implementations of this procedure will identify very similar partitions.
Section 2 describes the proposed sampling procedures, section 3 introduces the dataset and results to validate the proposed sampling procedures for synthetic data, comprising a mixture of wellseparated and overlapping cohorts of anisotropic multivariate normal distributions.
Methods
Where u_{ i } is the mean of the points in s_{ i }.
An alternative statistical measure of agreement between two partition sets, and also considered, is the Adjusted Rand Index of Hubert and Arabie (ARIHA) [29]. The measure was adjusted to avoid over inflation due to correspondence between two partitions arising from chance. The Cramérs Vindex and ARIHA have a Pearson correlation coefficient of 0.99, higher than the value 0.95 between the Cramérs Vindex and both the unadjusted ARI of Morey and Agresti and the Jaccard index [29].This shows that the two statistical indices are closely related, though not identical, with even better correlation for better correspondence between partitions.
 i.
Apply the cluster partition algorithm to a sample of size N_{ total } of cluster initialisations, each seeded with k randomly selected points sampled from the full data set i.e. the standard initialisation for kmeans
 ii.
Sort by separation score and select a fraction f by ranked score of ΔSSQ, defined as the difference between Total Sum of Squares and the Within Cluster Sum of Squares for a particular solution, returning a working sample of cluster partitions N_{ sample }= N_{ total }*f in number
 iii.
Calculate the N_{ sample }*(N_{ sample }1)/2 pair wise concordance indices C_{ V } for the selected cluster partitions and return the median value med(CV) of all pair wise concordance indices for each partition
 iv.
The Separation and Concordance (SeCo) map comprises the 2dimensional coordinates (ΔSSQ, med(CV)) for the selected cluster partitions.
 v.
Once the landscape of cluster partitions has been mapped using the SeCo map, where there is a spread of solutions with similar ΔSSQ, choose the solutions with the highest value of med(CV).
As the assumed number of partitions, k, is increased, the map generates a scatter of points with increasing ΔSSQ, but with a distribution of med(C_{ V }) that shows the stability of each assumed number of clusters, when fitting the data structure within the constraints of the particular clustering algorithm.
In this paper the total number of initialisations was taken to be 500 and through experimentation, presented in the next section, it is demonstrated that only the top decile by separation need be retained, resulting in a working sample size N_{ sample } = 50 for each value of the assumed cluster number k for general purposes. These parameters can be varied, the total sample size being required to be sufficient for a clear group structure to emerge among the cluster solutions in the SeCo map, while retaining a small enough fraction of the total initialisations to avoid cluttering the map.
Two additional measures are used to evaluate the performance of each method, Accuracy, the proportion of correctly classified objects, and Affinity, a measure of how often a data point is allocated to a particular cohort. Affinity is therefore a row level indicator and is calculated by taking a set of solutions and for each row determining the highest proportion for which an element is assigned to a particular cohort, the mean value of this forms an overall indicator of how often data points swap cohorts for the whole dataset.
Results
Artificial data
Means and covariance matrices for generating the components of the artificial dataset.
Mean  Covariance Matrix (i,j)  

x  y  z  11  12  13  21  22  23  31  32  33  N  
C1  0.799  1.011  3.336  0.336  0.044  0.074  0.044  0.371  0.21  0.074  0.21  0.582  64 
C2  0.441  0.569  2.331  0.428  0.06  0.002  0.06  0.123  0.157  0.002  0.157  0.648  42 
C3  0.649  0.344  4.154  0.62  0.023  0.035  0.023  0.137  0.07  0.035  0.07  0.446  61 
C4  1.077  0.072  2.815  0.366  0.002  0.076  0.002  0.043  0.104  0.076  0.104  0.563  32 
C5  0.39  0.242  0.256  0.536  0.013  0.031  0.013  0.348  0.117  0.031  0.117  0.689  197 
C6  1.358  0.658  1.639  0.309  0.06  0.055  0.06  0.245  0.013  0.055  0.013  0.532  131 
C7  1.261  0.125  0.862  0.323  0.017  0.027  0.017  0.386  0.06  0.027  0.06  0.403  163 
C8  0.593  3.024  0.498  0.776  0.033  0.175  0.033  0.491  0.003  0.175  0.003  0.695  97 
C9  0.251  0.539  0.53  0.711  0.025  0.055  0.025  0.352  0.081  0.055  0.081  0.576  106 
C10  0.374  0.267  1.973  0.39  0.097  0.041  0.097  0.343  0.014  0.041  0.014  0.322  183 
Pairwise indices of cseparation for the synthetic data
C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  

C1  0  
C2  0.7805  0  
C3  1.2105  1.4828  0  
C4  1.5054  1.1924  1.0687  0  
C5  2.4975  1.7636  3.0649  2.3119  0  
C6  3.3913  2.8294  4.476  3.8029  1.1757  0  
C7  3.2516  2.5575  3.7002  2.7302  1.2151  2.2233  0  
C8  2.9776  2.4341  3.0901  2.4774  2.025  2.6082  2.2314  0  
C9  2.0388  1.2969  2.4543  1.6846  0.7109  1.8176  1.2393  2.2086  0  
C10  3.7087  3.0487  4.4727  3.5977  1.2717  1.4141  1.233  2.5497  1.6952  0 
An exact match between the original cohorts and the empirical cluster partitions is not expected because of the mixing and also due to the isotropic Euclidean metric which does not take account of the covariance structure of the data. Therefore instead of the original cohort allocation, kmeans was applied to the dataset and repeated 500 times for k = 10 and the solution with the lowest SSQ chosen to provide the basis of the reference partition.
The cluster means of this reference partition were then taken, and kmeans iterated to convergence on each of the smaller datasets using these centres as the initialisations. These form the reference partitions against which the agreement of empirical kmeans solutions with a 'good partition' of the data will be measured.
Applying the framework to the largest of the datasets (10,000 data points) gives a SeCo map, such as that presented in Figure 2. This shows the SeCo map for 500 runs of kmeans for numbers of partitions in the range 2 to 15. The yaxis shows the ΔSSQ and the xaxis the Internal Median CV, which is the median of all the pairwise Cramérs' V calculations for each solution.
This SeCo map shows there is a wide range of potential solutions for high values of k and that there is substantial variation in values with similar SSQ. As the value of k increases, the variation in the solutions increases, and for k> = 8, we see a drift to the left on the xaxis as solutions become less and less internally consistent. Even for lower values of k there are particular solutions which have much lower median concordance with the other solutions, such as for k = 4 and k = 5.
Cardiotocography data
The cardiotocography dataset [33, 34] comprises 2126 foetal cardiotocograms for which automatic processing has been applied. These measurements were classified by experts and two consensus classification labels applied to each with respect to morphologic pattern and foetal state. Consequently the data has both three and ten cohorts which could be used to evaluate performance.
Evaluation of this dataset was performed using the 10 class underlying partition however no reference partition was derived as with the artificial data, given the purpose is to evaluate performance in a realworld scenario. This dataset is known to be difficult for the kmeans algorithm to correctly partition, and for the ten cohort solution, a previous study has obtained around 40% [35]. For this experiment, the data were scaled to values between 0 and 1, and the equivalent measure of accuracy (mean and standard deviation of the proportion of correctly classified objects over 100 runs) for the single measure (SSQ alone) approach was 37% (s.d. 1.69) and for the dual measure (SSQ, CV) 38% (0.20). For the single and dual metrics the Cramérs' V values were 0.45 (s.d. 0.0103) and 0.46 (s.d. 0.0048) respectively.
Thresholding the objective function

The well separated cohort (number 8 in Figure 1) separates from the other cohorts as early as the three partition solution, and remains separated in all solutions.

For values of k below 7, the cohort structure is stable, whereby an increase in k results in one or more existing cohorts dividing to produce the new cohort.

For partition sets with values of k higher than 8, there is substantial intermixing of the cohorts
Given the strong mixing between the original cohort pairs, these results are consistent with the original cohort structure, as described in Figure 1.
Benchmarking results
The benchmarking was performed to compare the reproducibility of SSQ alone with that of the SeCo framework, using two measures of performance, as the underlying data becomes increasingly sparse. This is achieved by applying kmeans to each of the five artificial datasets (10000, 5000, 2500, 1000 and 500 data points, respectively) and the Cardiotocography dataset 500 times, selecting the solution with the best SSQ and the solution that the framework highlights as being optimal. Solutions for k = 8, k = 9 and k = 10 were selected and compared against the reference partitions previously generated. This process was repeated 100 times, such that the reproducibility and stability of each method could be compared.
Three indicators are used to evaluate the dual measure against the single measure approach, the Cramérs' V statistic is calculated for each of the solutions against a reference partition, which gives a performance measure. The accuracy of the clustering is calculated using the proportion of correctly classified results, and finally a stability indicator, Affinity is calculated, which provides information about the frequency with which individual data points swap cohorts.
These show that for k = 10, SSQ performs well, obtaining near perfect concordance with the reference partition, however in approximately ten per cent of cases, it performs less well, and instead of having concordance of ≈ 1, it is possible for the concordance to drop to ≈ 0.85. For k = 9, a more marked variation in the results occurs, with the best case obtaining ≈ 0.925 in 5% of cases, for approximately 30% of solutions the concordance drops to between 0.8 and 0.85, with the remainder settling at ≈ 0.875. For eight partitions there is little variation in the concordance with the reference partition for most results. This is expected as the SeCo map indicates that for k = 8 the solutions are highly consistent.
Using the SeCo Framework, in Figure 7b, the performance profile is different, with the solutions exhibiting high levels of consistency throughout. For k = 9, all the solutions perform equally well, and whilst there is no longer the higher peak of 5% of solutions, neither is there the dramatically reduced concordance for 30% of results. k = 8 shows the same concordance as before, which again is to be expected and corresponds with the amount of variation between solutions indicated by the SeCo map. For k = 10, the results are again consistent, however not showing a drop in concordance in the last 5% of cases seen before.
Of note is that there is a continuous degradation in performance for this dataset for each value of k for the single measure whereas for the dual measure the profile is far more stable. For both sets of solutions k = 8 has a flat trajectory and produces largely the same solution, and it is interesting to note that for the SSQ values approximately 20% of both the k = 9 and k = 10 solutions would fall below the lower confidence interval for this line, unlike the dual measure where only the k = 10 line would drop below this level and then for much less than 5% of solutions.
Summary results for six datasets comparing accuracy and affinity for the single measure and dual measure framework
Single Measure  Dual Measure  

Dataset  Accuracy (Std. Dev)  Affinity  Accuracy (Std. Dev)  Affinity 
Artificial 500  0.7758 (0.038)  0.922  0.7701 (0.015)  0.931 
Artificial 1,000  0.9263 (0.037)  0.994  0.7773 (0.015)  0.980 
Artificial 2,000  0.7332 (0.058)  0.888  0.7345 (0.003)  0.992 
Artificial 5,000  0.9079 (0.089)  0.937  0.961 (0.008)  0.989 
Artificial 10,000  0.9929 (0.032)  0.993  0.9994 (0.001)  0.999 
Cardiotocography  0.3655 (0.017)  0.792  0.3775 (0.002)  0.983 
Current best practice of using SSQ to select a single kmeans partition set from many, is shown here to perform less consistently than might be expected, and repeated application of this metric has significant potential to produce a suboptimal result. By contrast using a stability measure in conjunction with SSQ has been shown to perform consistently and aside from a particular result, the pattern is stable, in that using the stability measure in conjunction with the separation measure improves the stability and reproducibility for obtaining a solution when using kmeans. In eleven of the twelve benchmark comparisons the SeCo framework performed equivalently to or better than selecting the solution with the lowest SSQ alone.
Conclusions
A framework is proposed for combining two performance measures, one for intracluster separation and the other to measure intercluster stability, which guides the sampling of a single partition after repeated random initialisations of the standard kmeans algorithm. It is shown that mechanistic application of the proposed method returns very well associated cluster solutions for each cluster number, especially relevant for data where wellseparated cluster partitions can show weak association, reflecting poor correspondence in the sense of a contingency table comparing cluster membership, quantified by the concordance index, among clusters with high values of the separation index.
A bioinformatics and multiple synthetic datasets have shown that the sampled solutions are consistently good in their agreement with the known recoverable data structure. Repeated application of the framework to both the synthetic and realworld data show that the performance of the dual measure approach in general better than that of a single metric. This illustrates the main contribution of the paper, namely to show how consistently good clustering solutions can be sampled from the local minima generated by repeated random initialisation, through the use of a visualisation map of the relative performance of each local minimum compared with the rest. The result is to move away from the currently accepted reliance on optimal cluster separation alone, since this can result in unnecessary variation in cluster composition as measured by the cluster stability measures and, furthermore, is shown empirically to be less associated with the known data structure.
Of great importance for bioinformatics is the need for consistent assignment of individuals to a single cluster, the dual measure approach shows a greater likelihood of obtaining stable partitions both in terms of gross structure and also the affinity of individual elements to particular cohorts.
The correlation of cooccurrence of score pairs in the SeCo map with useful choices of cluster number confirm the merits of the stabilitybased principles outlined in [17] but refines and extends the definition and application of this approach. It is conjectured that the framework will extend to generalised kmeans procedures, including the use of medoids for continuous data and kmodes for discrete data.
Further work will use pairwise concordance measures to identify core and peripheral cluster composition, with the aim of obtaining quantitative measures of confidence in cluster membership similar to silhouette diagnostics [25] but without the need for a parametric densitybased approach. And also to evaluate selection of optimal k within the structure of the framework compared to existing methods such as the gap statistic[36].
Declarations
The publication costs for this article were funded by the corresponding author's institution.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 1, 2013: Computational Intelligence in Bioinformatics and Biostatistics: new trends from the CIBB conference series. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S1.
Notes
Declarations
Authors’ Affiliations
References
 Tseng GC: Penalized and weighted Kmeans for clustering with scattered objects and prior information in highthroughput biological data. Bioinformatics. 2007, 23: 224755. 10.1093/bioinformatics/btm320.View ArticlePubMedGoogle Scholar
 Yu S, Liu X, Tranchevent LC, Glänzel W, Suykens JaK, De Moor B, Moreau Y: Optimized data fusion for Kmeans Laplacian clustering. Bioinformatics. 2011, 27: 11826. 10.1093/bioinformatics/btq569.PubMed CentralView ArticlePubMedGoogle Scholar
 Chen B, He J, Pellicer S, Pan Y: Using Hybrid Hierarchical Kmeans (HHK) clustering algorithm for protein sequence motif SuperRuleTree (SRT) structure construction. International Journal of Data Mining and Bioinformatics. 2010, 4: 31610.1504/IJDMB.2010.033523.View ArticlePubMedGoogle Scholar
 Bock HH: Origins and extensions of the kmeans algorithm in cluster analysis. Electronic Journ@l for History of Probability and Statistics. 2008, 4: 118.Google Scholar
 Jain AK: Data clustering: 50 years beyond Kmeans. Pattern Recognition Letters. 2010, 31: 651666. 10.1016/j.patrec.2009.09.011.View ArticleGoogle Scholar
 Steinley D: Kmeans clustering: a halfcentury synthesis. The British journal of mathematical and statistical psychology. 2006, 59: 134. 10.1348/000711005X48266.View ArticlePubMedGoogle Scholar
 Meilă M: The uniqueness of a good optimum for Kmeans. Proceedings of the 23rd international conference on Machine learning  ICML '06. 2006, New York, New York, USA: ACM Press, 625632.View ArticleGoogle Scholar
 Dalenius T: The problem of optimal stratification I. Skandinavisk Aktuarietidskrift. 1950, 203213.Google Scholar
 Dalenius T, Gurney M: The problem of optimal stratification II. Skandinavisk Aktuarietidskrift. 1951, 133148.Google Scholar
 Steinhaus H: Sur la division des corp materiels en parties. Bull Acad Polon Sci. 1956, 1: 801804.Google Scholar
 Hartigan Ja, Wong MA: Algorithm AS 136: A KMeans Clustering Algorithm. Applied Statistics. 1979, 28: 10010.2307/2346830.View ArticleGoogle Scholar
 Alldrin N, Smith A, Turnbull D: Clustering with EM and Kmeans. University of San Diego, California, Tech Report. 2003Google Scholar
 Filippone M, Camastra F, Masulli F, Rovetta S: A survey of kernel and spectral methods for clustering. Pattern Recognition. 2008, 41: 176190. 10.1016/j.patcog.2007.05.018.View ArticleGoogle Scholar
 Huang Z: Extensions to the kmeans algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery. 1998, 2: 283304. 10.1023/A:1009769707641.View ArticleGoogle Scholar
 Steinley D, Hubert L: OrderConstrained Solutions in KMeans Clustering: Even Better Than Being Globally Optimal. Psychometrika. 2008, 73: 647664. 10.1007/s113360089058z.View ArticleGoogle Scholar
 Steinley D: Profiling local optima in Kmeans clustering: developing a diagnostic technique. Psychological methods. 2006, 11: 17892.View ArticlePubMedGoogle Scholar
 BenHur A, Elisseeff A, Guyon I: A stability based method for discovering structure in clustered data. Pac Symp Biocomput. 2002, 617.Google Scholar
 Steinley D: Stability analysis in Kmeans clustering. The British journal of mathematical and statistical psychology. 2008, 61: 25573. 10.1348/000711007X184849.View ArticlePubMedGoogle Scholar
 Lange T, Roth V, Braun ML, Buhmann JM: Stabilitybased validation of clustering solutions. Neural computation. 2004, 16: 1299323. 10.1162/089976604773717621.View ArticlePubMedGoogle Scholar
 Lisboa PJG, Jarman IH, Etchells TA, Chambers SJ, Bacciu D, Whittaker J, Garibaldi JM, OrtegaMartorell S, Vellido A, Ellis IH: Discovering hidden pathways in bioinformatics. Eighth International Meeting on Computational Intelligence Methods in Bioinformatics and Biostatistics. 2011Google Scholar
 Xu R, Wunsch D: Survey of clustering algorithms. IEEE transactions on neural networks/a publication of the IEEE Neural Networks Council. 2005, 16: 64578. 10.1109/TNN.2005.845141.View ArticlePubMedGoogle Scholar
 Tibshirani R, Hastie T, Eisen M, Ross D, Botstein D, Brown P: Clustering methods for the analysis of dna microarray data. 1999Google Scholar
 Suresh R, Dinakaran K, Valarmathie P: Model Based Modified KMeans Clustering for Microarray Data. 2009 International Conference on Information Management and Engineering. 2009, 271273.View ArticleGoogle Scholar
 Wu FX, Zhang WJ, Kusalik A: Determination of the minimum sample size in microarray experiments to cluster genes using kmeans clustering. Third IEEE Symposium on Bioinformatics and Bioengineering, 2003. Proceedings. 2003, IEEE Comput. Soc, 401406.Google Scholar
 Menardi G: Densitybased Silhouette diagnostics for clustering methods. Statistics and Computing. 2009, 21: 295308.View ArticleGoogle Scholar
 Kuncheva LI, Vetrov DP: Evaluation of stability of kmeans cluster ensembles with respect to random initialization. IEEE transactions on pattern analysis and machine intelligence. 2006, 28: 1798808.View ArticlePubMedGoogle Scholar
 Agresti A: Categorical Data Analysis. Wiley, 1990558.
 Cramer H: Mathematical Methods of Statistics. (PMS9). 1999, Princeton University PressGoogle Scholar
 Steinley D: Properties of the HubertArabie adjusted Rand index. Psychological methods. 2004, 9: 38696.View ArticlePubMedGoogle Scholar
 Abd ElRehim DM, Ball G, Pinder SE, Rakha E, Paish C, Robertson JFR, Macmillan D, Blamey RW, Ellis IO: Highthroughput protein expression analysis using tissue microarray technology of a large wellcharacterised series identifies biologically distinct classes of breast cancer confirming recent cDNA expression analyses. International journal of cancer. 2005, 116: 34050. 10.1002/ijc.21004.View ArticleGoogle Scholar
 Lisboa P, Ellis I, Green a, Ambrogi F, Dias M: Clusterbased visualisation with scatter matrices. Pattern Recognition Letters. 2008, 29: 18141823. 10.1016/j.patrec.2008.05.021.View ArticleGoogle Scholar
 Sammon JW: A Nonlinear Mapping for Data Structure Analysis. IEEE Transactions on Computers. 1969, C18: 401409.View ArticleGoogle Scholar
 Ayres de Campos D, Bernardes J, Garrido A, Marques de Sá J, Pereira Leite L: Sisporto 2.0: A program for automated analysis of cardiotocograms. The Journal of MaternalFetal Medicine. 2000, 9: 311318. 10.1002/15206661(200009/10)9:5<311::AIDMFM12>3.0.CO;29.View ArticlePubMedGoogle Scholar
 Frank A, Asuncion A: UCI Machine Learning Repository. 2010Google Scholar
 Chen T, Zhang NL, Liu T, Poon KM, Wang Y: Modelbased multidimensional clustering of categorical data. Artificial Intelligence. 2012, 176: 22462269. 10.1016/j.artint.2011.09.003.View ArticleGoogle Scholar
 Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2001, 63: 411423. 10.1111/14679868.00293.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.