A practical comparison of two K-Means clustering algorithms

Background: Data clustering is a powerful technique for identifying data with similar characteristics, such as genes with similar expression patterns. However, not all implementations of clustering algorithms yield the same performance or the same clusters.
Results: In this paper, we study two implementations of a general method for data clustering: k-means clustering. Our experiments compare the running times and distance efficiency of Lloyd's K-means Clustering and Progressive Greedy K-means Clustering.
Conclusion: Based on our implementation, Lloyd's K-means Clustering algorithm is more efficient, not just in processing time but also in terms of mean squared-difference (MSD). This analysis was performed on both a gene expression level sample and randomly generated data sets in three-dimensional space. However, other circumstances may dictate a different choice in some situations.


Background
Researchers are inundated with data with little obvious information readily accessible; this is especially true in the many disciplines of the life sciences. These data may be very confusing and perplexing to biologists when viewed as a whole. To make these data more meaningful and to derive important biological understanding from these data, researchers have access to many different data processing techniques. One popular and meaningful approach is to cluster data into groups, where each group aggregates data with similar biological characteristics.
Data clustering is a very powerful technique in many application areas. Not only may the clusters have meaning themselves, but clustering allows for efficient data management techniques in that data that is grouped in the same manner will usually be accessed together. Access to data within a cluster may predict that other data in that cluster will be accessed soon; this can lead to optimized storage strategies which perform much better than if the data were randomly stored.
from Symposium of Computations in Bioinformatics and Bioscience (SCBB07) Iowa City, Iowa, USA. 13-15 August 2007
An easy abstraction for clustering data is based on multidimensional proximity relationships. While there may be other relationships among the data items, we focus on a distance relationship between data so that a meaningful and simple analytical conclusion can be drawn from simpler comparisons. Using proximity relationships, data is clustered in such a way that the squared-error distortion is minimized both globally and locally. The effectiveness of the algorithms analyzed is measured against this criterion. The mean squared-error distortion is defined as $d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$, where $d(v_i, X)$ is the distance from point $v_i$ to the closest cluster center in $X = \{x_1, x_2, \ldots, x_k\}$, $V = \{v_1, v_2, \ldots, v_n\}$ is the set of points, and n is the total number of points [1].
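As an illustration of this criterion, the following is a minimal C sketch of the mean squared-error distortion for three-dimensional points. The type `Point3` and the functions `sq_dist` and `distortion` are names chosen for this example only; they are not taken from our implementation.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* A point in three-dimensional space (hypothetical type for this sketch). */
typedef struct { double x, y, z; } Point3;

/* Squared Euclidean distance between two points. */
static double sq_dist(Point3 a, Point3 b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

/* Mean squared-error distortion: the average, over all n points in v, of the
 * squared distance from each point to its closest center in x. */
double distortion(const Point3 *v, size_t n, const Point3 *x, size_t k)
{
    assert(n > 0 && k > 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double best = DBL_MAX;
        for (size_t j = 0; j < k; j++) {
            double d = sq_dist(v[i], x[j]);
            if (d < best)
                best = d;
        }
        total += best;
    }
    return total / (double)n;
}
```

A lower return value indicates that the points lie closer, on average, to their nearest centers.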
There are various algorithms that exist to implement clustering in terms of proximity measures. Depending on the quality of the cluster, the implementation speed of these algorithms can vary. In this article, we focus on two widely used k-means clustering algorithms. A k-means clustering algorithm can be formally defined as a function that receives as input a set of points in multi-dimensional space and a number, k, of desired centers or cluster representatives; one area of active research is the issue of optimally "seeding" the algorithm with the proper value of k and the starting locations of the k cluster centers. With this input, the algorithm produces an output set of point sets such that each point set has a defined center that minimizes the cumulative distance to the center of all points in that set, for all the possible choices of each set.
We have implemented two versions of the k-means clustering algorithm: Lloyd's K-means Clustering and Progressive Greedy K-means Clustering. The former is the faster of the two and is fairly straightforward. The latter is a more conservative approach that can run for a much longer time but can sometimes yield better results in terms of distance measures.
We first describe these algorithms and then discuss some experimental results. These results are analyzed based on the running time of the algorithms and the mean squared-error distortion, and the algorithms are compared in terms of complexity and efficiency.

Algorithm description: Lloyd's K-means Clustering algorithm
Lloyd's K-means Clustering algorithm was designed by S. P. Lloyd [2]. Given a number k, it separates all data in a given partition into k clusters, each with a center that acts as a representative. Each iteration reassigns every point to its closest center and then resets the centers; the iterations repeat until the centers no longer move. The algorithm is as follows [1]:
1. Assign each data point to the cluster $C_i$ corresponding to the closest cluster representative.
2. After the assignment of all n data points, compute new cluster representatives according to the center of gravity of each cluster.
While Lloyd's algorithm often converges to a local minimum of the squared-error distortion rather than the global minimum [1], it is the faster of the two algorithms discussed in this paper.
We implemented this algorithm in C using two primary structures for the points: an array of points that is dynamically allocated when the user specifies the input points, and arrays for each of the k centers. Each center's array in turn contains arrays (one for each dimension of the multidimensional space) for the points that are assigned to that particular center. For our analysis, we have used three-dimensional points.
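The two steps above can be sketched in C as follows. This is a simplified illustration, not our implementation: it stores one cluster index per point instead of the per-center arrays described above, and the names `Point3`, `sq_dist`, `lloyd_kmeans`, and the `max_iter` cap are assumptions made for this example. An empty cluster simply keeps its previous center, which is also an assumption.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* A point in three-dimensional space (hypothetical type for this sketch). */
typedef struct { double x, y, z; } Point3;

static double sq_dist(Point3 a, Point3 b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

/* Runs Lloyd's iterations. On entry, centers[] holds the k starting centers;
 * on return it holds the final centers and assign[i] is the cluster index of
 * points[i]. Returns the number of iterations performed. */
int lloyd_kmeans(const Point3 *points, size_t n, Point3 *centers, size_t k,
                 size_t *assign, int max_iter)
{
    assert(n > 0 && k > 0 && max_iter > 0);
    int iter = 0;
    int moved = 1;
    while (moved && iter < max_iter) {
        /* Step 1: assign every point to its closest center. */
        for (size_t i = 0; i < n; i++) {
            double best = DBL_MAX;
            for (size_t j = 0; j < k; j++) {
                double d = sq_dist(points[i], centers[j]);
                if (d < best) {
                    best = d;
                    assign[i] = j;
                }
            }
        }
        /* Step 2: move each center to the center of gravity of its cluster. */
        moved = 0;
        for (size_t j = 0; j < k; j++) {
            Point3 sum = {0.0, 0.0, 0.0};
            size_t count = 0;
            for (size_t i = 0; i < n; i++) {
                if (assign[i] != j)
                    continue;
                sum.x += points[i].x;
                sum.y += points[i].y;
                sum.z += points[i].z;
                count++;
            }
            if (count == 0)
                continue; /* assumption: an empty cluster keeps its center */
            Point3 c = {sum.x / (double)count, sum.y / (double)count,
                        sum.z / (double)count};
            if (c.x != centers[j].x || c.y != centers[j].y || c.z != centers[j].z)
                moved = 1;
            centers[j] = c;
        }
        iter++;
    }
    return iter;
}
```

The loop stops as soon as one full pass leaves every center unchanged, which corresponds to the termination condition described above.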

Algorithm description: Progressive Greedy K-means Clustering algorithm
The Progressive Greedy K-means Clustering algorithm is similar to Lloyd's in that it searches for the best center of gravity for each point, but it assigns points to a center using a different technique. In each iteration, Lloyd's algorithm reassigns points to new centers and then readjusts the centers accordingly. The Progressive Greedy approach does not act upon every point in each iteration; rather, only the point that would benefit most from moving to another cluster is reassigned. Every iteration in the Progressive Greedy algorithm calculates the "cost" of every point in terms of a Euclidean distance (in three-dimensional space), i.e., $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$. Each point $p = (x_p, y_p, z_p)$ has a cost associated with it in terms of the current center $C_i = (x_i, y_i, z_i)$ to which it belongs. The point is a candidate to be moved if the Euclidean distance cost can be reduced by moving that point from one cluster $C_i$ to another cluster $C_j = (x_j, y_j, z_j)$ that has a closer center. In other words, a point is a candidate to be moved from $C_i$ to $C_j$ if the difference $\sqrt{(x_p - x_i)^2 + (y_p - y_i)^2 + (z_p - z_i)^2} - \sqrt{(x_p - x_j)^2 + (y_p - y_j)^2 + (z_p - z_j)^2}$ is greater than 0. Once all the candidates are calculated, the point with the largest difference is moved. If no point has a difference value greater than 0, the algorithm is finished.
Each iteration in the Progressive Greedy K-means Clustering algorithm does the following:

1. Calculate the cost of moving each point to each of the other cluster centers as well as the cost of its current cluster center. For every point, store the best move if its cost is less than the cost of the current cluster center.
2. If there is a point with a best move, move it. If there is more than one, pick the one point that sees the greatest improvement when moved.
3. If no point can be moved, the algorithm is finished.
The Progressive Greedy K-means Clustering algorithm is slower; this sacrifice in speed is made in an attempt to minimize the squared-error distortion mentioned earlier.
The implementation of Progressive Greedy K-means Clustering uses the same C data structures as were used for Lloyd's.

Analysis of biological data
M. B. Eisen et al. [3] were one of the first groups to apply the clustering approach to the analysis of gene expression data.
We applied both clustering algorithms to the analysis of microarray data. The clustering algorithms classified gene expression data into clusters such that functionally related genes are grouped together. In the following example [1], the expression information of ten genes is recorded at three different times (see Table 1). The distance matrix of the ten genes was calculated based on the Euclidean distance in three-dimensional space. The clustering algorithms grouped the gene expression data into clusters satisfying the following two conditions [1]:
• within a cluster, any two genes should be highly similar to each other (i.e., the distance between them should be small; this condition is called homogeneity), and
• any two genes from different clusters should be very different from each other (i.e., the distance between them should be large; this condition is called separation).
Both algorithms yielded the same three clusters of the ten genes as follows: {g 1 , g 6 , g 7 }, {g 3 , g 5 , g 8 }, and {g 2 , g 4 , g 9 , g 10 }. Tables 2 and 3, respectively, are the running time comparisons and mean squared-distance comparisons of the two clustering algorithms applied to these biological data.

Analysis of a randomly-generated data set
We used computer-generated random points to test the two clustering algorithms; presumably, such data contain few natural clusters and should therefore present close to a "worst case" for the clustering algorithms. Figures 1 to 4 show the running time comparisons of various runs using different values of k and different numbers of points. Each value in these figures is the mean time of multiple runs, expressed in seconds, though what matters here is the relative size of these values.
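A test harness of this kind could be sketched in C as below. The helper names `fill_random_points` and `mean_seconds`, the unit-cube coordinate range, and the use of `rand` and `clock` are assumptions made for this example; the paper does not specify how the points were generated or how the runs were timed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* A point in three-dimensional space (hypothetical type for this sketch). */
typedef struct { double x, y, z; } Point3;

/* Fills p[0..n-1] with pseudo-random points whose coordinates lie in [0, 1].
 * The range is an assumption of this sketch. */
void fill_random_points(Point3 *p, size_t n, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        p[i].x = (double)rand() / RAND_MAX;
        p[i].y = (double)rand() / RAND_MAX;
        p[i].z = (double)rand() / RAND_MAX;
    }
}

/* Mean processor time per run, in seconds, for `runs` runs that together
 * started at clock value `start` and ended at clock value `end`. */
double mean_seconds(clock_t start, clock_t end, int runs)
{
    assert(runs > 0);
    return (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

In use, one would record `clock()` before and after a loop that runs a clustering algorithm several times on freshly generated points, then pass both values and the run count to `mean_seconds`.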
A comparison of mean squared differences is shown in Tables 4 and 5 using different numbers of points and k values of 5 and 10, respectively. In these tables, the maximum and minimum local cluster mean squares are shown alongside the global average MSD.
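The three quantities in these tables could be computed along the lines of the C sketch below. It assumes that the local mean square of a cluster is the mean squared distance of its points to its center and that the global MSD is the same mean taken over all points; the names `Point3`, `MsdSummary`, and `msd_summary` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A point in three-dimensional space (hypothetical type for this sketch). */
typedef struct { double x, y, z; } Point3;

/* Smallest and largest per-cluster MSD, and the MSD over all points. */
typedef struct { double min_local, max_local, global; } MsdSummary;

static double sq_dist(Point3 a, Point3 b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

/* Summarizes the mean squared distance of points to their assigned centers.
 * Empty clusters are skipped when taking the minimum and maximum. */
MsdSummary msd_summary(const Point3 *p, size_t n, const Point3 *centers,
                       size_t k, const size_t *assign)
{
    assert(n > 0 && k > 0);
    MsdSummary s = {0.0, 0.0, 0.0};
    double total = 0.0;
    int seen = 0;
    for (size_t j = 0; j < k; j++) {
        double sum = 0.0;
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            if (assign[i] != j)
                continue;
            sum += sq_dist(p[i], centers[j]);
            count++;
        }
        total += sum;
        if (count == 0)
            continue;
        double local = sum / (double)count;
        if (!seen || local < s.min_local)
            s.min_local = local;
        if (!seen || local > s.max_local)
            s.max_local = local;
        seen = 1;
    }
    s.global = total / (double)n;
    return s;
}
```

A large gap between the minimum and maximum local values indicates that some clusters are much tighter than others even when the global average looks acceptable.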

Conclusion
The advantage of Lloyd's K-means Clustering algorithm over the Progressive Greedy K-means Clustering algorithm is clear from the above comparisons. Based on our implementation, Lloyd's K-means Clustering algorithm is more efficient, not just in processing time but also in terms of mean squared-difference. For very large data sets, Lloyd's algorithm is clearly faster. When the number of points exceeds 10,000, the Progressive Greedy K-means Clustering algorithm needs optimization even to be able to handle the very large floating point values associated with finding the mean squared-difference. Without optimization, Progressive Greedy K-means Clustering would not run without generating floating point exception errors. We therefore conclude that Lloyd's K-means Clustering algorithm seems to be the better algorithm. However, other circumstances may dictate a different choice in some situations.

Figure 1. Running time comparison when k = 3.
Figure 2. Running time comparison when k = 4.
Figure 3. Running time comparison when k = 5 (excludes the running times of Progressive Greedy algorithm when the number of points exceeds 10,000).
Figure 4. Running time comparison when k = 10 (excludes the running times of Progressive Greedy algorithm when the number of points exceeds 10,000).