Internal measures
WCSS and its approximations
Consider a set of n items G = {g_1, ..., g_n}, where g_i, 1 ≤ i ≤ n, is specified by m numeric values, referred to as features or conditions. That is, each g_i is an element in m-dimensional space. Let D be the corresponding n × m data matrix, let C = {c_1, ..., c_k} be a clustering solution for G produced by a clustering algorithm and let

W(c_r) = \sum_{g_i \in c_r} || g_i - \bar{c}_r ||^2    (1)

where \bar{c}_r is the centroid of cluster c_r. Then, we have:

WCSS(k) = \sum_{r=1}^{k} W(c_r)    (2)
Assume now that we have k_max clustering solutions, each with a number of clusters in [1, k_max]. Assume also that we want to use WCSS to estimate, based on those solutions, the real number k* of clusters in our dataset. Intuitively, for values of k < k*, the value of WCSS should decrease substantially as a function of the number of clusters k. Indeed, as we get closer and closer to the real number of clusters in the data, the compactness of each cluster should substantially increase, causing a substantial decrease in WCSS. On the other hand, for values of k > k*, the compactness of the clusters will not increase as much, causing the value of WCSS not to decrease as much. This yields the following heuristic approach [15]: plot the values of WCSS, computed on the given clustering solutions, in the range [1, k_max]; choose as k* the abscissa closest to the "knee" in the WCSS curve. Fig. 2 provides an example. Indeed, the dataset in Fig. 2(a) has two natural clusters and the "knee" in the plot of the WCSS curve in Fig. 2(b) indicates k* = 2.
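To make the heuristic concrete, here is a minimal sketch in Python that computes the WCSS curve and prints it for visual inspection of the knee; scikit-learn's KMeans stands in for a generic clustering algorithm A, and the two-blob toy data are ours, not the dataset of Fig. 2.

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss(D, labels):
    """Equations (1)-(2): sum of squared distances of items to their cluster centroid."""
    total = 0.0
    for r in np.unique(labels):
        members = D[labels == r]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Toy dataset with two natural clusters (illustrative only).
rng = np.random.default_rng(0)
D = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

k_max = 10
curve = []
for k in range(1, k_max + 1):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(D)
    curve.append(wcss(D, labels))

# The "knee" is where the decrease of WCSS(k) levels off; here it should be near k = 2.
for k, w in enumerate(curve, start=1):
    print(k, round(w, 2))
```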
The approximation of the WCSS curve proposed here is based on the idea of reducing the number of executions of a clustering algorithm A needed for the computation of WCSS(k), for each k in a given interval [1, k_max]. In fact, given an integer R ≥ 0, the approximate algorithm uses algorithm A to obtain a clustering solution with k clusters only for values of k that are multiples of R. We refer to R as the refresh step. For all other k's, a clustering solution is obtained by merging two clusters in a chosen clustering solution already available. The procedure below gives the high-level details. It takes as input R, A, D and k_max. Algorithm A must be able to take as input a clustering solution with k clusters and refine it to give as output a clustering solution with the same number of clusters.
Procedure WCSS-R(R, A, D, k_max)

(1) Compute a clustering solution P_{k_max} with k_max clusters using algorithm A on dataset D. Compute WCSS(k_max) using P_{k_max}.

(2) For k := k_max - 1 down to k = 1, execute steps (2.a), (2.b) and (2.c).

(2.a) (Merge) Merge the two clusters in P_{k+1} with minimum Euclidean distance between centroids to obtain a temporary clustering solution \bar{P}_k with k clusters.

(2.b) (Refresh) If (R = 0) or (k mod R > 0), set P_k equal to \bar{P}_k. Else compute a new P_k based on \bar{P}_k. That is, \bar{P}_k is given as input to A, as an initial partition of D into k clusters, and P_k is the result of that computation.

(2.c) Compute WCSS(k) using P_k.
Technically, the main idea in the approximation scheme is to interleave the execution of a partitional clustering algorithm A with a merge step typical of agglomerative clustering. The gain in speed is realized by having a fast merge step, based on k + 1 clusters, to obtain k clusters, instead of a new full-fledged computation of algorithm A, starting from scratch, to obtain the same number of clusters. The approximation scheme would also work for hierarchical algorithms, provided that they comply with the requirement that, given a dataset as input, they return a partition into k groups. However, in this circumstance, the approximation scheme would be a nearly exact replica of the hierarchical algorithm. In conclusion, we have proposed a general approximation scheme, where the gain is realized when the merge step is faster than a complete computation of a clustering algorithm A. We have experimented with K-means-R and with values of the refresh step R = 0, 2, 5, i.e., the partitional clustering algorithm is used only once, every two steps and every five steps, respectively.
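A minimal sketch of the WCSS-R scheme follows, under the assumption that scikit-learn's KMeans can play the role of algorithm A (its init parameter accepts the centroids of the merged partition, which is how the temporary solution \bar{P}_k is handed back to A); the helper names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss(D, labels):
    """WCSS of formula (2): squared distances of items to their cluster centroid."""
    return sum(((D[labels == r] - D[labels == r].mean(axis=0)) ** 2).sum()
               for r in np.unique(labels))

def centroids(D, labels):
    return np.vstack([D[labels == r].mean(axis=0) for r in np.unique(labels)])

def merge_closest(D, labels):
    """Merge step (2.a): fuse the two clusters whose centroids are closest (Euclidean)."""
    cents, ids = centroids(D, labels), np.unique(labels)
    best, pair = np.inf, None
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            d = np.linalg.norm(cents[a] - cents[b])
            if d < best:
                best, pair = d, (ids[a], ids[b])
    merged = labels.copy()
    merged[merged == pair[1]] = pair[0]
    _, merged = np.unique(merged, return_inverse=True)  # relabel to 0..k-1
    return merged

def wcss_r(D, k_max, R):
    labels = KMeans(n_clusters=k_max, n_init=10, random_state=0).fit_predict(D)
    curve = {k_max: wcss(D, labels)}
    for k in range(k_max - 1, 0, -1):
        labels = merge_closest(D, labels)          # temporary solution with k clusters
        if R != 0 and k % R == 0:                  # refresh step (2.b): rerun A seeded with merged centroids
            labels = KMeans(n_clusters=k, init=centroids(D, labels),
                            n_init=1).fit_predict(D)
        curve[k] = wcss(D, labels)                 # step (2.c)
    return curve
```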
KL
KL is based on WCSS, but it is automatic, i.e., a numeric value for k* is returned. Let

DIFF(k) = (k - 1)^{2/m} WCSS(k - 1) - k^{2/m} WCSS(k)    (3)
Recall from the previous subsection the behavior of WCSS, as a function of k and with respect to k*. Based on those considerations, we expect the following behavior for DIFF(k):
- (i) for k < k*, both DIFF(k) and DIFF(k + 1) should be large positive values.
- (ii) for k > k*, both DIFF(k) and DIFF(k + 1) should be small values, and one or both might be negative.
- (iii) for k = k*, DIFF(k) should be large positive, but DIFF(k + 1) should be relatively small (might be negative).
Based on these considerations, Krzanowski and Lai proposed to choose, as the estimate of the number of clusters, the k maximizing:

KL(k) = |DIFF(k)| / |DIFF(k + 1)|    (4)

That is,

k* = argmax_{2 ≤ k ≤ k_max} KL(k)    (5)
Notice that KL(k) is not defined for the important special case of k = 1, i.e., no cluster structure in the data.
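For concreteness, a sketch of the KL computation from a precomputed WCSS curve (e.g., one produced by the sketches above); it assumes the curve covers k = 1 through k_max, and the maximum is taken over the values of k for which both DIFF(k) and DIFF(k + 1) are available.

```python
def kl_estimate(wcss_curve, m):
    """wcss_curve: dict k -> WCSS(k), for k = 1..k_max; m: number of features.
    Returns the KL estimate of k*."""
    def diff(k):  # formula (3)
        return (k - 1) ** (2.0 / m) * wcss_curve[k - 1] - k ** (2.0 / m) * wcss_curve[k]

    k_max = max(wcss_curve)
    kl = {}
    for k in range(2, k_max):                       # need both DIFF(k) and DIFF(k + 1)
        denom = max(abs(diff(k + 1)), 1e-12)        # guard against division by zero
        kl[k] = abs(diff(k)) / denom                # formula (4)
    return max(kl, key=kl.get)                      # formula (5)
```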
Gap and its geometric approximation
The measures presented so far are either useless or not defined for the important special case k = 1. Tibshirani et al. [15] brilliantly combine techniques of hypothesis testing in statistics with the WCSS heuristic to obtain Gap, a measure that can also deal with the case k = 1. In order to describe the method, we briefly recall the null models used to test the hypothesis of no cluster structure in a dataset, i.e., the null hypothesis. We limit ourselves to the two that find common use in microarray data analysis [15], in addition to another closely related and classic one [34]:
(M.1) Ps (The Poisson Model). The items can be represented by points that are randomly drawn from a region R in m-dimensional space. In order to use this model, one needs to specify the region within which the points are to be uniformly distributed. The simplest regions that have been considered are the m-dimensional hypercube and hypersphere enclosing the points specified by the matrix D. Another possibility, which makes the model more data-dependent, is to choose the convex hull enclosing the points specified by D.
(M.2) Pc (The Poisson Model Aligned with Principal Components of the Data). This is basically as (M.1), except that the region R is a hypercube aligned with the principal components of the data matrix D.
In detail, assume that the columns of D have mean zero and let D = UXV^T be its singular value decomposition. Transform via D' = DV. Now, use D' as in (M.1), with sampling in a hypercube R, to obtain a dataset Z'. Back-transform via Z = Z'V^T to obtain a new dataset.
(M.3) Pr (The Permutational Model). Given the data matrix D, one obtains a new data matrix D' by randomly permuting the elements within the rows and/or the columns of D. In order to properly implement this model, care must be taken in specifying a proper permutation for the data since some similarity and distance functions may be insensitive to permutations of coordinates within a point.
That is, although D' is a random permutation of D, it may happen that the distance or similarity among the points in D' is the same as in D, resulting in indistinguishable datasets for clustering algorithms.
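The null models might be instantiated as in the following sketch; the function names are ours, the hypercube variant is used for Ps, and Pr is realized here by permuting the entries within each column.

```python
import numpy as np

def poisson_hypercube(D, rng):
    """(M.1) Ps: points drawn uniformly in the bounding hypercube of the data."""
    lo, hi = D.min(axis=0), D.max(axis=0)
    return rng.uniform(lo, hi, size=D.shape)

def poisson_pca_aligned(D, rng):
    """(M.2) Pc: sample in a hypercube aligned with the principal components."""
    Dc = D - D.mean(axis=0)                       # zero-mean columns
    _, _, Vt = np.linalg.svd(Dc, full_matrices=False)
    Dp = Dc @ Vt.T                                # D' = D V
    Zp = poisson_hypercube(Dp, rng)               # sample as in (M.1)
    return Zp @ Vt                                # back-transform: Z = Z' V^T

def permutational(D, rng):
    """(M.3) Pr: independently permute the entries within each column of D."""
    Z = D.copy()
    for j in range(Z.shape[1]):
        Z[:, j] = rng.permutation(Z[:, j])
    return Z
```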
The intuition behind Gap is brilliantly elegant. Recall that the "knee" in the WCSS curve can be used to predict k*. Unfortunately, the localization of such a point may be subjective. Now, consider the WCSS curves in Fig. 3, plotted using the WCSS(k) values. The curve in green is the WCSS computed with K-means-R on the CNS Rat dataset. The curve in red is the average WCSS curve, computed on ten datasets generated from the original data via the Ps null model. As is evident from the figure, the red curve has a nearly constant slope: an expected behavior on datasets with no cluster structure in them. The vertical lines indicate the gap between the null model curve and the real curve. Since WCSS is expected to decrease sharply up to k* on the real dataset, while it has a nearly constant slope on the null model datasets, the length of the vertical segments is expected to increase up to k* and then to decrease. In fact, in Fig. 3, we see that k* = 7, a value very close to the number of classes (six) in the dataset. Normalizing the WCSS curves via logs and accounting also for the simulation error, such an intuition can be given the form of the procedure reported next, where ℓ is the number of simulation steps and the remaining parameters are as in procedure WCSS-R.
Procedure GP(ℓ, A, D, k_max)

(1) For 1 ≤ i ≤ ℓ, compute a new data matrix D_i, using the chosen null model. Let D_0 denote the original data matrix.

(1.a) For 0 ≤ i ≤ ℓ and 1 ≤ k ≤ k_max, compute a clustering solution P_{i,k} on D_i using algorithm A.

(2) For 0 ≤ i ≤ ℓ and 1 ≤ k ≤ k_max, compute log(WCSS(k)) on P_{i,k} and store the result in matrix SL[i, k].

(2.a) For 1 ≤ k ≤ k_max, compute Gap(k) = (1/ℓ) \sum_{i=1}^{ℓ} SL[i, k] - SL[0, k].

(2.b) For 1 ≤ k ≤ k_max, compute the standard deviation sd(k) of the set of numbers {SL[1, k], ..., SL[ℓ, k]} and let s(k) = sd(k) \sqrt{1 + 1/ℓ}.

(3) Return as k* the first value of k such that Gap(k) ≥ Gap(k + 1) - s(k + 1).
The prediction of k* is based on running the procedure GP a certain number of times and then taking the most frequent outcome as the prediction. We also point out that further improvements and generalizations of Gap have been proposed in [30].
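A compact sketch of procedure GP, under our assumptions: scikit-learn's KMeans as algorithm A, the Ps hypercube null model, and sd(k) taken as the plain standard deviation of the null-model log values.

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss(D, labels):
    return sum(((D[labels == r] - D[labels == r].mean(axis=0)) ** 2).sum()
               for r in np.unique(labels))

def gp(D, k_max, ell, rng):
    """Gap statistic sketch: returns the predicted k*."""
    # D_0 is the original data; D_1..D_ell come from the Ps hypercube null model.
    datasets = [D] + [rng.uniform(D.min(0), D.max(0), D.shape) for _ in range(ell)]
    SL = np.empty((ell + 1, k_max))
    for i, Di in enumerate(datasets):
        for k in range(1, k_max + 1):
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Di)
            SL[i, k - 1] = np.log(wcss(Di, labels))
    gap = SL[1:].mean(axis=0) - SL[0]                        # step (2.a): Gap(k)
    s = SL[1:].std(axis=0) * np.sqrt(1 + 1.0 / ell)          # step (2.b): s(k)
    for k in range(1, k_max):                                # step (3)
        if gap[k - 1] >= gap[k] - s[k]:
            return k
    return k_max
```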
The geometric interpretation of Gap and the behavior of the WCSS curve across null models suggest the fast approximation G-Gap. The intuition, based on experimental observations, is that one can skip the entire simulation phase of Gap without compromising too much the accuracy of the prediction of k*. Indeed, for a given clustering algorithm and null model, the log WCSS curve one expects on null-model data is a straight line whose slope is somewhat analogous to that of the log WCSS curve on the original data, and which dominates it. Therefore, one can simply identify the "knee" in the WCSS by translating the end-points of the log WCSS curve on the original dataset by a given amount a, to obtain two points g_s and g_e. Those two points are then joined by a straight line, which is used in place of the null model curve to compute the segment lengths used to predict k*, i.e., the abscissa of the first maximum among those segment lengths as k increases. An example is provided in Fig. 4 with the WCSS curve of Fig. 3. The prediction is k* = 7, which is very close to the correct k* = 6. We point out that the WCSS curve is used in the figure to make the behavior of the segment lengths clear, since it would be unnoticeable with the log WCSS curve, although the result would be the same.
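A sketch of G-Gap under stated assumptions: the translation amount a is treated as a free parameter, and "first maximum" is read as the first k at which the segment lengths stop increasing.

```python
import numpy as np

def g_gap(wcss_curve, a=1.0):
    """wcss_curve: dict k -> WCSS(k) on the original data; a: translation amount (free parameter)."""
    ks = sorted(wcss_curve)
    logw = np.log([wcss_curve[k] for k in ks])
    # Straight line through the translated end-points g_s and g_e of the log WCSS curve.
    line = np.interp(ks, [ks[0], ks[-1]], [logw[0] + a, logw[-1] + a])
    seg = line - logw                       # segment lengths between the line and the curve
    for i in range(len(ks) - 1):            # first maximum as k increases (our reading)
        if seg[i] >= seg[i + 1]:
            return ks[i]
    return ks[-1]
```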
Clest
Clest estimates k* by iterating the following: randomly partition the original dataset into a learning set and a training set. The learning set is used to build a classifier for the data, which is then used to derive "gold standard" partitions of the training set. That is, the classifier is assumed to be a reliable model for the data. It is then used to assess the quality of the partitions of the training set obtained by a given clustering algorithm. Clest generalizes in many respects an approach proposed by Breckenridge [35] and can be regarded as a clever combination of hypothesis testing and resampling techniques. We detail it in the following procedure. With reference to GP, the new parameters it takes as input are: (a) an external index E, i.e., a function that quantifies how similar two partitions are; (b) p_max, a "significance level" threshold; (c) d_min, a minimum allowed difference between "computed and expected" values; (d) H, the number of resampling steps; (e) a classifier C used to obtain the "gold standard" partitions of the training set.
(1) For 2 ≤ k ≤ k_max, perform steps (1.a)-(1.d).

(1.a) For 1 ≤ h ≤ H, split the input dataset into L and T, the learning and training sets, respectively. Cluster the elements of L into k clusters using algorithm A and build a classifier C. Apply C to T in order to obtain a "gold solution" GS_k. Cluster the elements of T into k groups GA_k using algorithm A. Let SIM[k, h] = E(GS_k, GA_k).

(1.b) Compute the observed similarity statistic t_k = median(SIM[k, 1], ..., SIM[k, H]).

(1.c) For 1 ≤ b ≤ ℓ, generate (via a null model) a data matrix D(b), and repeat steps (1.a) and (1.b) on D(b).

(1.d) Compute the average of the ℓ statistics obtained in step (1.c) and denote it by \bar{t}_k. Finally, compute the p-value p_k of t_k and let d_k = t_k - \bar{t}_k be the difference between the observed statistic and its estimated expected value.

(2) Define the set K = {2 ≤ k ≤ k_max : p_k ≤ p_max and d_k ≥ d_min}.

(3) Based on K, return a prediction for k* as follows: if K is empty then k* = 1, else k* = argmax_{k ∈ K} d_k.
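A sketch of Clest under several assumptions of ours: K-means as algorithm A, a nearest-centroid classifier, the adjusted Rand index as the external index E, the Ps hypercube null model, a 50/50 learning/training split, and the p-value taken as the fraction of null statistics at least as large as the observed one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import adjusted_rand_score

def clest_statistic(D, k, H, rng):
    """Median external-index similarity between classifier-derived and clustered partitions of T."""
    sims = []
    for _ in range(H):
        idx = rng.permutation(len(D))
        half = len(D) // 2                                       # 50/50 split (our assumption)
        L, T = D[idx[:half]], D[idx[half:]]
        labels_L = KMeans(n_clusters=k, n_init=10).fit_predict(L)
        gold = NearestCentroid().fit(L, labels_L).predict(T)     # "gold solution" GS_k
        clustered = KMeans(n_clusters=k, n_init=10).fit_predict(T)  # GA_k
        sims.append(adjusted_rand_score(gold, clustered))
    return np.median(sims)

def clest(D, k_max, H=10, ell=10, p_max=0.05, d_min=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    K = {}
    for k in range(2, k_max + 1):
        t_k = clest_statistic(D, k, H, rng)                      # steps (1.a)-(1.b)
        null = [clest_statistic(rng.uniform(D.min(0), D.max(0), D.shape), k, H, rng)
                for _ in range(ell)]                             # step (1.c)
        p_k = np.mean([t >= t_k for t in null])                  # p-value of t_k
        d_k = t_k - np.mean(null)                                # step (1.d)
        if p_k <= p_max and d_k >= d_min:                        # step (2)
            K[k] = d_k
    return max(K, key=K.get) if K else 1                         # step (3)
```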
ME
Let f denote a sampling ratio, i.e., the percentage of items one extracts when sampling a given dataset. The idea supporting ME is that the inherent structure in a dataset has been identified once one finds a k such that the partitions into k clusters produced by a clustering algorithm are similar when obtained by repeatedly subsampling the dataset. This idea can be formalized by the following algorithm, which takes as input parameters analogous to those of Clest, except for f and Subs, the latter being a subsampling scheme that extracts a fraction f of the items of D. It returns as output a (k_max - 1) × H array SIM, analogous to the one computed by Clest, whose role here is best explained once the procedure has been presented.
Procedure ME(f, H, A, E, D, Subs, k_max)

(1) For 2 ≤ k ≤ k_max and 1 ≤ h ≤ H, perform steps (1.a)-(1.c).

(1.a) Compute two subsamples D(1) and D(2) of D, via Subs.

(1.b) For each D(i), compute a clustering solution GA_{k,i} with k clusters, using algorithm A, 1 ≤ i ≤ 2.

(1.c) Let GA'_{k,1} be GA_{k,1} restricted to the elements common to D(1) and D(2). Let GA'_{k,2} be defined analogously. Let SIM[k, h] = E(GA'_{k,1}, GA'_{k,2}).
Once the SIM array is computed, its values are histogrammed, separately for each value of k, i.e., by rows. The optimal number of clusters is predicted to be the lowest value of k such that there is a transition of the SIM value distribution from being close to one to a wider range of values. An example is given in Fig. 5, where the transition described above takes place at k = 2 for a correct prediction of k* = 2.
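A sketch of ME under our assumptions (K-means as A, the adjusted Rand index as E, subsampling without replacement at ratio f); inspection of the per-k histograms of SIM is left to the reader.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def me_sim(D, k_max, H=20, f=0.8, rng=None):
    """Returns the (k_max - 1) x H SIM array; its rows are then histogrammed per k."""
    rng = rng or np.random.default_rng(0)
    n = len(D)
    SIM = np.empty((k_max - 1, H))
    for k in range(2, k_max + 1):
        for h in range(H):
            s1 = rng.choice(n, size=int(f * n), replace=False)   # subsample D(1)
            s2 = rng.choice(n, size=int(f * n), replace=False)   # subsample D(2)
            lab1 = KMeans(n_clusters=k, n_init=10).fit_predict(D[s1])
            lab2 = KMeans(n_clusters=k, n_init=10).fit_predict(D[s2])
            common = np.intersect1d(s1, s2)                      # elements in both subsamples
            m1 = {i: l for i, l in zip(s1, lab1)}
            m2 = {i: l for i, l in zip(s2, lab2)}
            SIM[k - 2, h] = adjusted_rand_score([m1[i] for i in common],
                                                [m2[i] for i in common])
    return SIM
```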
Consensus
Consensus is also based on resampling techniques. In analogy with WCSS, one needs to analyze a suitably defined curve in order to find k*. It is best presented as a procedure. With reference to GP, the two new parameters it takes as input are a resampling scheme Sub, i.e., a way to sample from D to build a new data matrix, and the number H of resampling iterations.
Procedure Consensus(Sub, H, A, D, k_max)

(1) For 2 ≤ k ≤ k_max, initialize to empty the set M of connectivity matrices and perform steps (1.a) and (1.b).

(1.a) For 1 ≤ h ≤ H, compute a perturbed data matrix D(h) using resampling scheme Sub; cluster its elements into k clusters using algorithm A and D(h). Compute a connectivity matrix M(h) and insert it into M.

(1.b) Based on the connectivity matrices in M, compute a consensus matrix M^(k).

(2) Based on the k_max - 1 consensus matrices, return a prediction for k*.

The resampling scheme in this case extracts, uniformly and at random, a given percentage of the rows of D. As for the connectivity matrix M(h), one has M(h)(i, j) = 1 if items i and j are in the same cluster and zero otherwise. Moreover, we also need to define an indicator matrix I(h) such that I(h)(i, j) = 1 if items i and j are both in D(h) and zero otherwise. Then, the consensus matrix M^(k) is defined as a properly normalized sum of the connectivity matrices of all the perturbed datasets:

M^(k)(i, j) = \sum_h M(h)(i, j) / \sum_h I(h)(i, j)    (6)
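The consensus matrix of equation (6) might be computed as in the following sketch, with K-means standing in for algorithm A and row subsampling at a fixed rate; variable names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(D, k, H=50, rate=0.8, rng=None):
    """Equation (6): normalized sum of connectivity matrices over H perturbed datasets."""
    rng = rng or np.random.default_rng(0)
    n = len(D)
    conn = np.zeros((n, n))     # running sum of connectivity matrices M(h)
    ind = np.zeros((n, n))      # running sum of indicator matrices I(h)
    for _ in range(H):
        rows = rng.choice(n, size=int(rate * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(D[rows])
        ind[np.ix_(rows, rows)] += 1
        for lab in np.unique(labels):
            members = rows[labels == lab]
            conn[np.ix_(members, members)] += 1   # i, j co-clustered in this perturbed dataset
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(ind > 0, conn / ind, 0.0)
```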
Based on M^(k), Monti et al. define a value A(k) measuring the level of stability in cluster assignments, as reflected by the consensus matrix. Formally,

A(k) = \sum_{i=2}^{q} (x_i - x_{i-1}) CDF(x_i)

where x_1 ≤ x_2 ≤ ... ≤ x_q is the sorted list of the entries of M^(k) and CDF is the empirical cumulative distribution defined over the range [0, 1] as follows:

CDF(c) = \sum_{i < j} 1{M^(k)(i, j) ≤ c} / (n(n - 1)/2)

with 1{·} equal to 1 if the condition is true and 0 otherwise. Finally, based on A(k), one can define:

Δ(k) = A(k) if k = 2, and Δ(k) = (A(k + 1) - A(k)) / A(k) if k > 2.

Moreover, Monti et al. suggested the use of the function Δ' for non-hierarchical algorithms. It is defined as Δ, except that one uses A'(k) = max_{k' ∈ [2, k]} A(k'). Based on the Δ or Δ' curve, the value of k* is obtained using the following intuitive idea, based also on experimental observations.
- (i) For each k ≤ k*, the area A(k) markedly increases. This results in an analogous pronounced decrease of the Δ and Δ' curves.
- (ii) For k > k*, the area A(k) has no meaningful increase. This results in a stable plot of the Δ and Δ' curves.

From this behavior, the "rule of thumb" to identify k* is: take as k* the abscissa corresponding to the smallest non-negative value at which the curve starts to stabilize; that is, no big variation in the curve takes place from that point on. An example is given in Fig. 6.
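Under the definitions of A(k) and Δ(k) given above, the rule of thumb might be automated roughly as follows; the stabilization threshold tol is a heuristic of ours, meant to mimic visual inspection of the curve.

```python
import numpy as np

def area_under_cdf(M):
    """A(k): area under the empirical CDF of the consensus matrix entries (upper triangle)."""
    x = np.sort(M[np.triu_indices_from(M, k=1)])
    cdf = np.arange(1, len(x) + 1) / len(x)
    return np.sum(np.diff(x) * cdf[1:])

def predict_k(consensus_matrices, tol=0.025):
    """consensus_matrices: dict k -> consensus matrix, for k = 2..k_max."""
    ks = sorted(consensus_matrices)
    A = {k: area_under_cdf(consensus_matrices[k]) for k in ks}
    delta = {ks[0]: A[ks[0]]}                      # Delta(2) = A(2)
    for k in ks[1:-1]:
        delta[k] = (A[k + 1] - A[k]) / A[k]        # Delta(k) for k > 2
    # Rough proxy for the rule of thumb: smallest k from which the curve varies little.
    for k in sorted(delta)[:-1]:
        if abs(delta[k + 1] - delta[k]) < tol:
            return k
    return ks[-1]
```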
As pointed out in the previous section, our recommendation to use the Δ curve instead of the Δ' curve for non-hierarchical algorithms contradicts the recommendation by Monti et al. The reason is the following: for hierarchical algorithms, A(k) is expected to behave like a non-decreasing function of k. Therefore Δ(k) is expected to be positive or, when negative, not too far from zero. Such monotonicity of A(k) is not expected for non-hierarchical algorithms, and therefore another definition of Δ would be needed to ensure a behavior of this function analogous to that observed for hierarchical algorithms. We find that, for the partitional algorithms we have used, A(k) displays nearly the same monotonicity properties as for the hierarchical algorithms. The end result is that the same definition of Δ can be used for both types of algorithms. To the best of our knowledge, Monti et al. defined the function Δ', but their experimentation was limited to hierarchical algorithms.
FOM and its extensions and approximations
FOM is a family of internal validation measures introduced by Ka Yee Yeung et al. specifically for microarray data and later extended in several directions by Datta and Datta [36]. Such a family is based on the jackknife approach and has been designed for use as a relative measure assessing the predictive power of a clustering algorithm, i.e., its ability to predict the correct number of clusters in a dataset. Experiments by Ka Yee Yeung et al. show that the FOM family of measures satisfies the following properties, with a good degree of accuracy. For a given clustering algorithm, it has a low value in correspondence with the number of clusters that are really present in the data. Moreover, when comparing clustering algorithms for a given number of clusters k, the lower the value of FOM for a given algorithm, the better its predictive power. We now review this work, using the 2-norm FOM, which is the most commonly used instance in the FOM family.
Assume that a clustering algorithm is given the data matrix D with column e excluded. Assume also that, with that reduced dataset, the algorithm produces a set of k clusters C = {c_1, ..., c_k}. Let D(g, e) be the expression level of gene g under condition e and let m_i(e) be the average expression level of condition e for the genes in cluster c_i.
The 2-norm FOM with respect to k clusters and condition e is defined as:

FOM(e, k) = \sqrt{ (1/n) \sum_{i=1}^{k} \sum_{g \in c_i} (D(g, e) - m_i(e))^2 }    (7)

Notice that FOM(e, k) is essentially a root mean square deviation. The aggregate 2-norm FOM for k clusters is then:

FOM(k) = \sum_{e=1}^{m} FOM(e, k)    (8)
Both formulae (7) and (8) can be used to measure the predictive power of an algorithm. The first gives us more flexibility, since we can pick any condition, while the second gives us a total estimate over all conditions. So far, (8) is the formula mostly used in the literature. Moreover, since the experimental studies conducted by Ka Yee Yeung et al. show that FOM(k) behaves as a decreasing function of k, an adjustment factor has been introduced to properly compare clustering solutions with different numbers of clusters. A theoretical analysis by Ka Yee Yeung et al. provides the following adjustment factor:

\sqrt{(n - k)/n}    (9)

When (9) divides (7), we refer to (7) and (8) as adjusted FOMs. We use the adjusted aggregate FOM for our experiments and, for brevity, we refer to it simply as FOM.
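A sketch of the adjusted aggregate 2-norm FOM of equations (7)-(9), with K-means standing in for the clustering algorithm; it assumes k < n so that the adjustment factor is non-zero.

```python
import numpy as np
from sklearn.cluster import KMeans

def adjusted_fom(D, k):
    """Aggregate adjusted 2-norm FOM for k clusters on the n x m data matrix D (k < n)."""
    n, m = D.shape
    total = 0.0
    for e in range(m):
        reduced = np.delete(D, e, axis=1)                       # leave condition e out
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(reduced)
        sq = 0.0
        for i in np.unique(labels):
            col = D[labels == i, e]
            sq += ((col - col.mean()) ** 2).sum()               # deviations from m_i(e)
        fom_e = np.sqrt(sq / n)                                 # equation (7)
        total += fom_e / np.sqrt((n - k) / n)                   # adjusted by (9)
    return total                                                # equation (8), adjusted
```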
The use of FOM in order to establish how many clusters are present in the data follows the same heuristic methodology outlined for WCSS, i.e., one tries to identify the "knee" in the FOM plot as a function of the number of clusters. An example is provided in Fig. 7. Such an analogy between FOM and WCSS immediately suggests extending some of the knowledge available about WCSS to FOM, as follows:
- The approximation of FOM is based on exactly the same ideas and schemes presented for the approximation of WCSS. Indeed, FOM(e, k) in equation (7) can be approximated in exactly the same way as WCSS(k). Then, one uses equation (8) to approximate FOM. We denote those approximations as FOM-R.
- The G-Gap idea can be extended verbatim to FOM to make it automatic and to obtain G-FOM.
- The KL technique can be extended to FOM, although the extension is subtle. Indeed, a verbatim extension of it would yield poor results (experiments not shown). Rather, consider formula (3), with WCSS(k) replaced by FOM(k). As k increases towards k*, DIFF(k) increases, then decreases sharply, and finally assumes nearly constant values as k moves away from k*. Fig. 7 provides a small example of this behavior. So, one can take as k* the abscissa corresponding to the maximum of DIFF(k) in the interval [3, k_max]. We refer to this method as DIFF-FOM.
Availability
All software and datasets involved in our experimentation are available at the supplementary material web site. The software is given as an executable jar file for a Java Runtime Environment. It works on Linux (various versions; see the supplementary material web site), Mac OS X and Windows operating systems. Minimal system requirements are specified at the supplementary material web site, together with installation instructions. Moreover, we also provide the binaries of the K-means and hierarchical methods.