
A practical comparison of two K-Means clustering algorithms

Abstract

Background

Data clustering is a powerful technique for identifying data with similar characteristics, such as genes with similar expression patterns. However, not all implementations of clustering algorithms yield the same performance or the same clusters.

Results

In this paper, we study two implementations of a general method for data clustering: k-means clustering. Our experiments compare the running times and distance efficiency of Lloyd's K-means Clustering and Progressive Greedy K-means Clustering.

Conclusion

Based on our implementations, Lloyd's K-means Clustering algorithm is the more efficient, both in processing time and in terms of mean squared-difference (MSD). This analysis was performed using both a sample of gene expression levels and randomly-generated datasets in three-dimensional space. However, other circumstances may dictate a different choice.

Background

Researchers are inundated with data from which little obvious information is readily accessible; this is especially true in the many disciplines of the life sciences. Viewed as a whole, these data can be confusing and overwhelming to biologists. To make these data more meaningful and to derive important biological understanding from them, researchers have access to many different data processing techniques. One popular and meaningful approach is to cluster data into groups, where each group aggregates data with similar biological characteristics.

Data clustering is a very powerful technique in many application areas. Not only may the clusters have meaning themselves, but clustering also enables efficient data management, since data grouped together will usually be accessed together. An access to one item within a cluster may predict that other items in that cluster will be accessed soon; this can lead to optimized storage strategies that perform much better than if the data were stored randomly.

An easy abstraction for clustering data is based on multi-dimensional proximity relationships. While there may be other relationships among the data items, we focus on a distance relationship between data so that a meaningful and simple analytical conclusion can be drawn from simpler comparisons. Using proximity relationships, data is clustered in such a way that the squared-error distortion is minimized both globally and locally. The effectiveness of the algorithms analyzed is measured against this criterion. The mean squared-error distortion is defined as

$$ d(V, X) = \frac{d(v_1, X)^2 + d(v_2, X)^2 + \cdots + d(v_n, X)^2}{n} $$

where V = {v_1, v_2, ..., v_n} is the set of data points, X = {x_1, x_2, ..., x_k} is the set of cluster centers, d(v_i, X) is the distance from v_i to the closest center in X, and n is the total number of points [1].
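To make this measure concrete, the following C function is a minimal sketch of the computation, assuming three-dimensional points; the Point type and the names sq_dist and distortion are our own illustration rather than code from the implementation described below.

```c
typedef struct { double x, y, z; } Point;

/* Squared Euclidean distance between two 3-D points. */
static double sq_dist(Point a, Point b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

/* Mean squared-error distortion d(V, X): average over all n points of
   the squared distance from each point to its closest center in X. */
double distortion(const Point *v, int n, const Point *x, int k)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        double best = sq_dist(v[i], x[0]);
        for (int j = 1; j < k; j++) {
            double d = sq_dist(v[i], x[j]);
            if (d < best)
                best = d;
        }
        total += best;
    }
    return total / n;
}
```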

Various algorithms exist to implement clustering in terms of proximity measures, and their running times can vary depending on the cluster quality they aim to achieve. In this article, we focus on two widely used k-means clustering algorithms. A k-means clustering algorithm can be formally defined as a function that receives as input a set of points in multi-dimensional space and a number k of desired centers or cluster representatives; one area of active research is the issue of optimally "seeding" the algorithm with the proper value of k and the starting locations of the k cluster centers. With this input, the algorithm produces a partition of the points into k sets, each with a defined center that minimizes the cumulative distance from the points in that set to the center.

We have implemented two versions of the k-means clustering algorithm: Lloyd's K-means Clustering and Progressive Greedy K-means Clustering. The former is the faster of the two and is fairly straightforward. The latter is a more conservative approach and can run for a much longer time, but it can sometimes yield better results in terms of distance measures.

We first describe these algorithms and then examine them experimentally. The results are analyzed based on the running times of the algorithms and the mean squared-error distortion, and the algorithms are compared in terms of complexity and efficiency.

Methods

Algorithm description: Lloyd's K-means Clustering algorithm

Lloyd's K-means Clustering algorithm was designed by S. P. Lloyd [2]. Given a number k, the algorithm separates the data into k clusters, each with a center that acts as its representative. Each iteration reassigns every point to the closest center and then recomputes the centers; the iterations repeat until the centers no longer move. The algorithm is as follows [1]:

1. Assign each data point to the cluster C_i corresponding to the closest cluster representative x_i (1 ≤ i ≤ k).

2. After the assignments of all n data points, compute new cluster representatives according to the center of gravity of each cluster.

While Lloyd's algorithm often converges to a local minimum of the squared-error distortion rather than the global minimum [1], it is the faster of the two algorithms discussed in this paper.

We implemented this algorithm in C using two primary structures for the points: an array of points that is dynamically allocated when the user specifies the input points, and an array for each of the k centers. These latter arrays in turn contain sub-arrays, one for each dimension of the multi-dimensional space, holding the points assigned to that particular center (for our analysis, we used three-dimensional points).
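As a rough illustration of the two steps above, the sketch below implements the iteration in C using a simplified struct-of-doubles representation; the names lloyd_kmeans and sq_dist are illustrative, and our actual implementation stores the points in the arrays described above.

```c
typedef struct { double x, y, z; } Point;

static double sq_dist(Point a, Point b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

/* One run of Lloyd's algorithm. centers[] must hold k initial ("seed")
   centers on entry; assign[] should be initialized to -1 and receives
   the final cluster index of each point. */
void lloyd_kmeans(const Point *v, int n, Point *centers, int k, int *assign)
{
    int changed = 1;
    while (changed) {
        changed = 0;
        /* Step 1: assign each point to its closest cluster representative. */
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int j = 1; j < k; j++)
                if (sq_dist(v[i], centers[j]) < sq_dist(v[i], centers[best]))
                    best = j;
            if (assign[i] != best) {
                assign[i] = best;
                changed = 1;
            }
        }
        /* Step 2: recompute each center as the center of gravity (mean)
           of the points currently assigned to it. */
        for (int j = 0; j < k; j++) {
            Point sum = { 0.0, 0.0, 0.0 };
            int count = 0;
            for (int i = 0; i < n; i++)
                if (assign[i] == j) {
                    sum.x += v[i].x; sum.y += v[i].y; sum.z += v[i].z;
                    count++;
                }
            if (count > 0) {
                centers[j].x = sum.x / count;
                centers[j].y = sum.y / count;
                centers[j].z = sum.z / count;
            }
        }
    }
}
```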

Algorithm description: Progressive Greedy K-means Clustering algorithm

The Progressive Greedy K-means Clustering algorithm is similar to Lloyd's in that it searches for the best center of gravity for each point, but it assigns points to centers using a different technique. In each iteration, Lloyd's algorithm reassigns every point to a new center and then readjusts the centers accordingly. The Progressive Greedy approach does not act upon every point in each iteration; rather, only the point that would most benefit from moving to another cluster is reassigned. Every iteration in the Progressive Greedy algorithm calculates the "cost" of every point in terms of the Euclidean distance (in three-dimensional space), i.e.,

$$ \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2} $$

Each point p = (x_p, y_p, z_p) has a cost associated with it in terms of the current center C_i = (x_i, y_i, z_i) to which it belongs. The point is a candidate to be moved if this Euclidean distance cost can be reduced by moving it from cluster C_i to another cluster C_j = (x_j, y_j, z_j) whose center is closer. In other words, a point is a candidate to be moved from C_i to C_j if

$$ \sqrt{(x_i - x_p)^2 + (y_i - y_p)^2 + (z_i - z_p)^2} - \sqrt{(x_j - x_p)^2 + (y_j - y_p)^2 + (z_j - z_p)^2} > 0. $$

Once all the candidates are calculated, the point with the largest difference is moved. If no point has a difference greater than 0, the algorithm is finished.
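As a concrete reading of this test, the following small C helper (again with illustrative names, not code from our implementation) computes how much a point would gain by moving from its current center to a candidate center; a positive return value marks the point as a move candidate.

```c
#include <math.h>

typedef struct { double x, y, z; } Point;

/* Euclidean distance between two 3-D points. */
static double dist(Point a, Point b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return sqrt(dx * dx + dy * dy + dz * dz);
}

/* Benefit of moving point p from its current center ci to center cj:
   a positive result means cj is strictly closer, so p is a candidate. */
double move_benefit(Point p, Point ci, Point cj)
{
    return dist(p, ci) - dist(p, cj);
}
```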

Each iteration in the Progressive Greedy K-means Clustering algorithm does the following:

1. Calculate the cost of moving each point to each of the other cluster centers, as well as the cost of keeping it at its current cluster center. For every point, store the best alternative if its cost is less than the cost of the current cluster center.

2. If there is a point with a stored improvement, move it. If there is more than one, move the single point whose reassignment yields the greatest improvement.

3. If no move improves any point, the algorithm is finished.

Progressive Greedy K-means Clustering is slower, but this sacrifice is made in an attempt to minimize the squared-error distortion mentioned earlier.

The implementation of Progressive Greedy K-means Clustering uses the same C data structures as were used for Lloyd's.
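The sketch below illustrates the complete greedy loop under the same simplified representation used earlier; greedy_kmeans and its helpers are illustrative names, and assign[] is assumed to hold some initial clustering (for example, each point assigned to its closest seed center).

```c
#include <math.h>

typedef struct { double x, y, z; } Point;

/* Euclidean distance between two 3-D points. */
static double dist(Point a, Point b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return sqrt(dx * dx + dy * dy + dz * dz);
}

/* Recompute every center as the center of gravity of its cluster. */
static void recompute_centers(const Point *v, int n, Point *c, int k,
                              const int *assign)
{
    for (int j = 0; j < k; j++) {
        Point sum = { 0.0, 0.0, 0.0 };
        int count = 0;
        for (int i = 0; i < n; i++)
            if (assign[i] == j) {
                sum.x += v[i].x; sum.y += v[i].y; sum.z += v[i].z;
                count++;
            }
        if (count > 0) {
            c[j].x = sum.x / count;
            c[j].y = sum.y / count;
            c[j].z = sum.z / count;
        }
    }
}

/* Progressive greedy k-means: each iteration moves only the single point
   whose reassignment most reduces its Euclidean cost, then readjusts the
   centers; the loop stops when no move has a positive benefit. */
void greedy_kmeans(const Point *v, int n, Point *centers, int k, int *assign)
{
    for (;;) {
        int best_point = -1, best_dest = -1;
        double best_gain = 0.0;
        /* Step 1: find the candidate with the largest cost reduction. */
        for (int i = 0; i < n; i++) {
            double cur = dist(v[i], centers[assign[i]]);
            for (int j = 0; j < k; j++) {
                double gain = cur - dist(v[i], centers[j]);
                if (gain > best_gain) {
                    best_gain = gain;
                    best_point = i;
                    best_dest = j;
                }
            }
        }
        if (best_point < 0)                 /* Step 3: no improvement left. */
            break;
        assign[best_point] = best_dest;     /* Step 2: move the best point. */
        recompute_centers(v, n, centers, k, assign);
    }
}
```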

Results

Analysis of biological data

M. B. Eisen et al. [3] were one of the first groups to apply the clustering approach to the analysis of gene expression data.

We applied both clustering algorithms to the analysis of microarray data. The clustering algorithms classified gene expression data into clusters such that functionally-related genes are grouped together. In the following example [1], the expression information of ten genes is recorded at three different times (see Table 1). The distance matrix of the ten genes was calculated based on the Euclidean distance in three-dimensional space. The clustering algorithms grouped the gene expression data into clusters satisfying the following two conditions [1]:

  • within a cluster, any two genes should be highly similar to each other (i.e., the distance between them should be small; this condition is called homogeneity), and

  • any two genes from different clusters should be very different from each other (i.e., the distance between them should be large; this condition is called separation).

Table 1 Expression levels of ten genes at three different times.

Both algorithms yielded the same three clusters of the ten genes as follows: {g1, g6, g7}, {g3, g5, g8}, and {g2, g4, g9, g10}. Tables 2 and 3 show, respectively, the running time comparisons and the mean squared-distance comparisons of the two clustering algorithms applied to these biological data.

Table 2 Running time comparison in seconds for different k values.
Table 3 MSD comparisons for different k values (actual values).

Analysis of a randomly-generated data set

We used computer-generated random points to test the two clustering algorithms; presumably, these data contain few natural clusters, which should be close to a "worst case" for the clustering algorithms. Figures 1 to 4 show the running time comparisons of various runs using different values of k and different numbers of points. Each individual value in these Figures is the mean time over multiple runs, expressed in seconds, though what is important here is the relative size of these values.

Figure 1 Running time comparison when k = 3.
Figure 2 Running time comparison when k = 4.
Figure 3 Running time comparison when k = 5 (excludes the running times of the Progressive Greedy algorithm when the number of points exceeds 10,000).
Figure 4 Running time comparison when k = 10 (excludes the running times of the Progressive Greedy algorithm when the number of points exceeds 10,000).

A comparison of mean squared differences is shown in Tables 4 and 5, using different numbers of points and k values of 5 and 10, respectively. In these Tables, the maximum and minimum local cluster mean squares are shown alongside the overall global average MSD.

Table 4 MSD comparisons with different number of points when k = 5 (in millions of actual values).
Table 5 MSD comparisons with different number of points when k = 10 (in millions of actual values).

Conclusion

The advantage of Lloyd's K-means Clustering algorithm over the Progressive Greedy K-means Clustering algorithm is clear from the above comparisons. Based on our implementations, Lloyd's K-means Clustering algorithm is the more efficient, both in processing time and in terms of mean squared-difference. For very large data sets, Lloyd's algorithm is definitely faster. When the number of points exceeds 10,000, the Progressive Greedy K-means Clustering algorithm needs optimization even to be able to handle the very large floating point values associated with computing the mean squared-difference; without such optimization, it would not run at all without generating floating point exceptions. We therefore conclude that Lloyd's K-means Clustering algorithm seems to be the better algorithm. However, other circumstances may dictate a different choice.

References

  1. Jones NC, Pevzner PA: An Introduction to Bioinformatics Algorithms. 2004, The MIT Press


  2. Lloyd SP: Least squares quantization in PCM. IEEE Transactions on Information Theory. 1982, 28: 129-137. 10.1109/TIT.1982.1056489.


  3. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.



Acknowledgements

The authors would like to thank Steven F. Jennings for comments on the preliminary version of this work. This publication was made possible in part by NIH Grant #P20 RR-16460 from the IDeA Networks of Biomedical Research Excellence (INBRE) Program of the National Center for Research Resources.

This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S6.

Author information


Corresponding author

Correspondence to Xiuzhen Huang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

GAW carried out the k-means clustering algorithm design and implementation. XH participated in the design and applications of the algorithms. Both authors have read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Wilkin, G.A., Huang, X. A practical comparison of two K-Means clustering algorithms. BMC Bioinformatics 9 (Suppl 6), S19 (2008). https://doi.org/10.1186/1471-2105-9-S6-S19
