In the following sections we summarize the characteristics of stability-based procedures for assessing the reliability of clusterings, and we introduce our proposed method based on the Bernstein inequality.
Model order selection through stability-based procedures
Let C be a clustering algorithm, ρ(D) a given random perturbation procedure applied to a data set D, and sim a suitable similarity measure between two clusterings (e.g. the Jaccard similarity [13]). Among the random perturbations we recall random projections from a high-dimensional to a low-dimensional subspace [14], or bootstrap procedures that sample a random subset of data from the original data set D [8]. Fixing an integer k (the number of clusters), we define S_k (0 ≤ S_k ≤ 1) as the random variable given by the similarity between two k-clusterings obtained by applying the clustering algorithm C to data pairs D_1 and D_2, obtained by randomly and independently perturbing the original data D.
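For concreteness, the pair-counting form of the Jaccard similarity [13] between two clusterings, represented here as label vectors over the same points, can be sketched as follows (the function name and the label-vector representation are illustrative choices, not part of the original formulation):

```python
from itertools import combinations

def jaccard_similarity(labels_a, labels_b):
    """Pair-counting Jaccard similarity between two clusterings of the
    same points: N11 / (N11 + N10 + N01), where N11 counts point pairs
    co-clustered in both partitions, and N10 / N01 pairs co-clustered
    in only one of the two."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
    denom = n11 + n10 + n01
    return n11 / denom if denom > 0 else 1.0

# The same partition under a relabelling of the clusters has similarity 1:
print(jaccard_similarity([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

Note that the measure depends only on which pairs of points are co-clustered, so it is invariant to how the clusters are labelled.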
If S_k is concentrated close to 1, then the corresponding clustering is stable with respect to the given controlled perturbation and hence reliable. This idea, borrowed from a qualitative method proposed in [15], can be formalized using the integral g(k) of the cumulative distribution F_k of S_k [7]:

g(k) = ∫₀¹ F_k(s) ds     (1)
If g(k) is close to 0 then the values of the random variable S_k are close to 1, hence the k-clustering is stable, while for larger values of g(k) the k-clustering is less reliable. This observation comes from the following fact:
Fact: g(k) = 1 − E[S_k] and Var[S_k] ≤ g(k).

Proof: Let f_k(s) be the probability density function of S_k; then, integrating by parts:

g(k) = ∫₀¹ F_k(s) ds = [s F_k(s)]₀¹ − ∫₀¹ s f_k(s) ds = 1 − E[S_k]

Moreover, since 0 ≤ S_k ≤ 1 implies E[S_k²] ≤ E[S_k]:

Var[S_k] = E[S_k²] − E[S_k]² ≤ E[S_k] − E[S_k]² ≤ 1 − E[S_k] = g(k) ☐
Hence g(k) ≃ 0 implies Var[S_k] ≃ 0. As a consequence, g(k), or equivalently E[S_k], can be used as a good index of the reliability of the k-clusterings (clusterings with k clusters). E[S_k] may be estimated by the empirical mean ξ_k of n replicated similarity measures between pairs of perturbed clusterings:

ξ_k = (1/n) Σ_{j=1}^{n} S_kj     (2)

where S_kj represents the similarity between the two k-clusterings obtained through the application of the algorithm C to the j-th pair of perturbed data.
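A minimal sketch of how ξ_k could be estimated in practice follows; the additive-noise perturbation, the equal-width-bin clustering and the pair-agreement similarity are toy stand-ins for the procedures ρ, C and sim of the text, chosen only to keep the example self-contained:

```python
import random

def empirical_stability(data, k, cluster, perturb, sim, n=20, seed=0):
    """Estimate xi_k (eq. 2): the mean of n similarities S_kj, each computed
    between two k-clusterings of independently perturbed copies of `data`."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n):
        c1 = cluster(perturb(data, rng), k)
        c2 = cluster(perturb(data, rng), k)
        sims.append(sim(c1, c2))
    return sum(sims) / n

# Toy stand-ins (NOT the procedures of the paper):
def perturb(data, rng):
    # small additive Gaussian noise keeps the point set unchanged
    return [x + rng.gauss(0, 0.05) for x in data]

def cluster(data, k):
    # assign each 1-D point to one of k equal-width bins
    lo, hi = min(data), max(data)
    w = (hi - lo) / k or 1.0
    return [min(int((x - lo) / w), k - 1) for x in data]

def pair_agreement(a, b):
    # fraction of point pairs on which the clusterings agree
    pairs = [(i, j) for i in range(len(a)) for j in range(i + 1, len(a))]
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]  # two well-separated groups
xi_2 = empirical_stability(data, 2, cluster, perturb, pair_agreement)
print(xi_2)  # close to 1: the 2-clustering is stable under small noise
```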
We may sort the ξ_k:

ξ_p(1) ≥ ξ_p(2) ≥ … ≥ ξ_p(H)     (3)

where p is an index permutation and H is the number of candidate clusterings (e.g. the number of tested values of k). In this way we obtain an ordering of the clusterings, from the most to the least reliable one.
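With hypothetical ξ_k values, the permutation p of eq. 3 is simply a sort of the candidate k by decreasing empirical stability:

```python
xi = {2: 0.97, 3: 0.99, 4: 0.78, 5: 0.61}  # hypothetical xi_k values, k = 2..5
p = sorted(xi, key=xi.get, reverse=True)   # index permutation p of eq. 3
print(p)  # → [3, 2, 4, 5]: from the most to the least reliable clustering
```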
Exploiting this ordering, we proposed a χ2-based statistical test to detect and estimate the statistical significance of the multiple structures discovered by clustering algorithms [7]. The main drawbacks of this approach consist in an implicit normality assumption for the distribution of the S_k (the random variables that measure the similarity between two perturbed k-clusterings, see above), and in a user-defined threshold parameter that determines when two k-clusterings can be considered similar and “stable”. Indeed, in general we have no guarantee that the S_k random variables are normally distributed; moreover, the “optimal” choice of the threshold parameter seems to be application-dependent and may affect the overall test results.
In this contribution, to address these problems we propose a new statistical method that, adopting a stability-based approach, makes no assumptions about the distribution of the random variables and does not require any user-defined threshold parameter.
Hypothesis testing based on Bernstein inequality
We briefly recall the Bernstein inequality, since it is used to build up our proposed hypothesis testing procedure.
Bernstein inequality. If Y_1, Y_2, …, Y_n are independent random variables such that 0 ≤ Y_i ≤ 1, with μ = E[Y_i], σ² = Var[Y_i] and Ȳ = (1/n) Σ_{i=1}^{n} Y_i, then for any Δ > 0:

P(Ȳ − μ ≥ Δ) ≤ exp(− nΔ² / (2σ² + (2/3)Δ))     (4)
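The right-hand side of the inequality is straightforward to evaluate; the following sketch (our own helper function, with illustrative values of Δ and σ²) shows how the bound tightens as the number n of replicated measures grows:

```python
import math

def bernstein_bound(delta, var, n):
    """Bernstein upper bound on P(mean(Y) - E[Y] >= delta) for n
    independent random variables in [0, 1] with variance `var` (eq. 4)."""
    return math.exp(-(n * delta ** 2) / (2 * var + (2.0 / 3.0) * delta))

# With delta = 0.1 and sigma^2 = 0.05, the bound falls quickly with n:
print(bernstein_bound(0.1, 0.05, 10))   # ≈ 0.55
print(bernstein_bound(0.1, 0.05, 100))  # ≈ 0.0025
```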
Using the Bernstein inequality, we can estimate whether, for a given r, 2 ≤ r ≤ H, there exists a statistically significant difference between the reliability of the best clustering p(1) and the clustering p(r) (eq. 3). In other words, we may state the null hypothesis H_0 and the alternative hypothesis H_a in the following way:

H_0: the p(1) clustering is not more reliable than the p(r) clustering, that is E[S_p(1)] ≤ E[S_p(r)]

H_a: the p(1) clustering is more reliable than the p(r) clustering, that is E[S_p(1)] > E[S_p(r)]
To this end, consider the following random variables:

P_i = S_p(1) − S_p(i),   X_i = ξ_p(1) − ξ_p(i),   2 ≤ i ≤ H     (5)
We start by considering the first and last ranked clusterings p(1) and p(H). In this case the above null hypothesis H_0 becomes E[S_p(1)] ≤ E[S_p(H)], or equivalently E[S_p(1)] − E[S_p(H)] = E[P_H] ≤ 0. The distribution of the random variable X_H (eq. 5) is in general unknown; note, however, that the Bernstein inequality makes no assumption about the distribution of the random variables Y_i (eq. 4). Hence, fixing a parameter Δ ≥ 0, assuming the null hypothesis E[P_H] ≤ 0 to be true, and using the Bernstein inequality, we have:

P(X_H ≥ Δ) ≤ P(X_H − E[P_H] ≥ Δ) ≤ exp(− nΔ² / (2σ² + (2/3)Δ))     (6)
Considering an instance (a measured value) x_H of the random variable X_H, if we let Δ = x_H we obtain the following probability of type I error:

P_err(H) = P(X_H ≥ x_H) ≤ exp(− n x_H² / (2σ² + (2/3) x_H))

with σ² = Var[P_H].
If P_err(H) < α we reject the null hypothesis: a significant difference between the two clusterings is detected at significance level α, and we continue by testing the p(H − 1) clustering. More generally, if the null hypothesis has been rejected for the p(H − r + 1) clustering, 1 ≤ r ≤ H − 2, then we consider the p(H − r) clustering and, using the Boole inequality, we can estimate the type I error:

P_err(H − r) ≤ Σ_{q=0}^{r} exp(− n x_{H−q}² / (2σ_{H−q}² + (2/3) x_{H−q}))     (7)

where x_i is the measured value of X_i and σ_i² = Var[P_i].
As in the previous case, if P_err(H − r) < α we reject the null hypothesis: a significant difference is detected between the reliability of the p(1) and the p(H − r) clusterings, and we iteratively continue the procedure by estimating P_err(H − r − 1).
This procedure stops when either of the following cases occurs:

I) The null hypothesis is rejected at every rank, that is ∀r, 2 ≤ r ≤ H, P_err(r) < α: all the possible null hypotheses have been rejected, and the only reliable clustering at significance level α is the top-ranked one, that is the p(1) clustering.

II) The null hypothesis cannot be rejected at some rank, that is ∃r, 2 ≤ r ≤ H − 1, P_err(r) ≥ α: in this case the clusterings that are significantly less reliable than the top-ranked p(1) clustering are the p(r + 1), p(r + 2), …, p(H) clusterings.
Note that in this second case we cannot state that there is no significant difference between the first r top-ranked clusterings, since the upper bound provided by the Bernstein inequality is not guaranteed to be tight. To answer this question, we may apply the χ2-based hypothesis testing proposed in [7] to the remaining top-ranked clusterings to establish which of them are significant at level α, but in this case we need to assume that the similarity measures between pairs of clusterings are normally distributed.
If we assume that the X_i random variables (eq. 5) are (at least approximately) independent, we can obtain a variant of the previous Bernstein inequality-based approach, which we name Bernstein ind. for brevity. With this approach we should in principle obtain lower p-values, thus assuring lower false-positive rates than the Bernstein test without independence assumptions.
With these independence assumptions, the null hypothesis H_0 and the alternative hypothesis H_a for the Bernstein ind. test can be formulated as follows:

H_0: ∃i, 2 ≤ i ≤ r ≤ H, such that E[S_p(1)] ≤ E[S_p(i)]: there exists at least one p(i) clustering equally or more reliable than the first one within the group of the first r ordered clusterings.

H_a: ∀i, 2 ≤ i ≤ r ≤ H, E[S_p(1)] > E[S_p(i)]: all the clusterings in the group of the first r ordered clusterings are less reliable than the first one.
If we assume that the null hypothesis is true, using the independence among the X_i random variables we may estimate the type I error:

P_err(r) ≤ ∏_{i=2}^{r} exp(− n x_i² / (2σ_i² + (2/3) x_i))     (8)
Starting from r = H, if P_err(r) < α we reject the null hypothesis: a significant difference is detected between the reliability of p(1) and the other clusterings among the first r, and we iteratively continue the procedure by estimating P_err(r − 1). As in the Bernstein test, the procedure is iterated until a single clustering remains (and this will be the only significant one), or until P_err(r) ≥ α; in this case we cannot reject the null hypothesis, and the first r clusterings can be considered equally reliable. Note that, strictly speaking, in this case we can only say that at least one of the first r clusterings is equally or more reliable than the first one.
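Under our reading of the independence-based bound as the product of the individual Bernstein bounds (consistent with the lower p-values noted above), the Bernstein ind. procedure can be sketched as follows; the function names and the example values are illustrative:

```python
import math

def bernstein_ind_test(x, var, n, alpha=0.05):
    """Iterative variant under (approximate) independence of the X_i.

    x[i-2], var[i-2] hold x_i = xi_p(1) - xi_p(i) and an estimate of
    Var[P_i] for ranks i = 2..H.  The type I error for the first r
    clusterings is taken as the product of the individual Bernstein
    bounds, never larger than the union-bound sum of the plain test.
    Returns the largest r such that the first r clusterings cannot be
    distinguished at level alpha (r = 1: only p(1) survives)."""
    H = len(x) + 1
    for r in range(H, 1, -1):        # start from all H clusterings
        p_err = 1.0
        for i in range(2, r + 1):    # product over ranks 2..r
            p_err *= math.exp(-(n * x[i - 2] ** 2)
                              / (2 * var[i - 2] + (2.0 / 3.0) * x[i - 2]))
        if p_err >= alpha:
            return r                 # first r considered equally reliable
    return 1

# Same illustrative data as before: p(1) and p(2) remain indistinguishable.
print(bernstein_ind_test([0.02, 0.30, 0.35], [0.01, 0.01, 0.01], n=100))  # → 2
```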