In the following sections we summarize the characteristics of the stability-based procedures for the assessment of the reliability of clusterings, and we introduce our proposed method based on the Bernstein inequality.
Model order selection through stability-based procedures
Let C be a clustering algorithm, ρ(D) a given random perturbation procedure applied to a data set D, and sim a suitable similarity measure between two clusterings (e.g. the Jaccard similarity [13]). Among the random perturbations we recall random projections from a high-dimensional to a low-dimensional subspace [14], or bootstrap procedures to sample a random subset of data from the original data set D [8]. Fixing an integer k (the number of clusters), we define S_{k} (0 ≤ S_{k} ≤ 1) as the random variable given by the similarity between two k-clusterings obtained by applying the clustering algorithm C to data pairs D_{1} and D_{2} obtained by randomly and independently perturbing the original data D.
If S_{k} is concentrated close to 1, then the corresponding clustering is stable with respect to the given controlled perturbation and hence it is reliable. This idea, borrowed from a qualitative method proposed in [15], can be formalized using the integral g(k) of the cumulative distribution F_{k} of S_{k} [7]:
g(k) = \int_{0}^{1} F_{k}(s)\,ds
(1)
If g(k) is close to 0 then the values of the random variable S_{k} are close to 1 and hence the k-clustering is stable, while for larger values of g(k) the k-clustering is less reliable. This observation comes from the following fact:
Fact: E[S_{k}] = 1 - g(k), \qquad Var[S_{k}] \le g(k)\,(1 - g(k)).
Proof:
Let f_{k}(s) be the probability density function of S_{k}; then, integrating by parts,
E[S_{k}] = \int_{0}^{1} s f_{k}(s)\,ds = \int_{0}^{1} s F_{k}'(s)\,ds = 1 - \int_{0}^{1} F_{k}(s)\,ds = 1 - g(k)
Moreover:
Var[S_{k}] = E[S_{k}^{2}] - E[S_{k}]^{2} \le E[S_{k}] - E[S_{k}]^{2} = g(k)\,(1 - g(k))
☐.
Hence, g(k) ≃ 0 implies Var[S_{k}] ≃ 0. As a consequence, g(k), or equivalently E[S_{k}], can be used as a good index of the reliability of the k-clusterings (clusterings with k clusters). E[S_{k}] may be estimated by the empirical mean ξ_{k} of n replicated similarity measures between pairs of perturbed clusterings:
\xi_{k} = \frac{1}{n}\sum_{j=1}^{n} S_{kj}
(2)
where S_{kj} represents the similarity between two k-clusterings obtained through the application of the algorithm C to a pair of perturbed data sets.
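For concreteness, one common choice of sim is the pairwise Jaccard similarity between two clusterings [13]; a minimal sketch (the function name is ours, not part of the original method):

```python
def jaccard_similarity(labels_a, labels_b):
    """Pairwise Jaccard similarity between two clusterings of the same
    n points: pairs co-clustered in both clusterings divided by pairs
    co-clustered in at least one of them."""
    n = len(labels_a)
    both = either = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            if same_a and same_b:
                both += 1
            if same_a or same_b:
                either += 1
    return both / either if either else 1.0

# identical partitions (up to label renaming) are maximally similar
print(jaccard_similarity([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```

Note that the measure depends only on co-membership of point pairs, so it is invariant to the arbitrary labeling of the clusters.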
We may sort the ξ_{k}:
(\xi_{2}, \xi_{3}, \ldots, \xi_{H+1}) \;\xrightarrow{\mathrm{sort}}\; (\xi_{p(1)}, \xi_{p(2)}, \ldots, \xi_{p(H)})
(3)
where p is an index permutation such that ξ_{p(1)} ≥ ξ_{p(2)} ≥ … ≥ ξ_{p(H)}. In this way we obtain an ordering of the clusterings, from the most to the least reliable one.
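The estimation of the ξ_{k} (eq. 2) and the ordering of eq. 3 can be sketched as follows, assuming the similarities S_{kj} have already been computed for each candidate k (all names are our own):

```python
def rank_clusterings(similarities):
    """similarities: dict mapping k (number of clusters) to the list
    [S_k1, ..., S_kn] of similarities between n perturbed pairs.
    Returns the k values sorted from most to least reliable (the
    permutation p of eq. 3) and the empirical means xi_k (eq. 2)."""
    xi = {k: sum(s) / len(s) for k, s in similarities.items()}
    # permutation p such that xi_p(1) >= xi_p(2) >= ... >= xi_p(H)
    p = sorted(xi, key=xi.get, reverse=True)
    return p, xi

# toy similarities for k = 2, 3, 4 over n = 3 perturbed pairs
sims = {2: [0.9, 0.95, 0.92], 3: [0.7, 0.65, 0.75], 4: [0.8, 0.85, 0.9]}
p, xi = rank_clusterings(sims)
print(p)  # -> [2, 4, 3]
```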
Exploiting this ordering, we proposed a χ^{2}-based statistical test to detect and estimate the statistical significance of multiple structures discovered by clustering algorithms [7]. The main drawbacks of this approach consist in an implicit normality assumption for the distribution of the S_{k} (the random variables that measure the similarity between two perturbed k-clusterings, see above), and in a user-defined threshold parameter that determines when two k-clusterings can be considered similar and "stable". Indeed, in general we have no guarantee that the S_{k} random variables are normally distributed; moreover, the "optimal" choice of the threshold parameter seems to be application-dependent and may affect the overall test results.
In this contribution, to address these problems, we propose a new statistical method that, adopting a stability-based approach, makes no assumptions about the distribution of the random variables and does not require any user-defined threshold parameter.
Hypothesis testing based on Bernstein inequality
We briefly recall the Bernstein inequality, because it is used to build up our proposed hypothesis testing procedure.
Bernstein inequality. If Y_{1}, Y_{2}, …, Y_{n} are independent random variables s.t. 0 ≤ Y_{i} ≤ 1, with \mu = E[Y_{i}], \sigma^{2} = Var[Y_{i}], and \overline{Y} = \sum Y_{i}/n, then
Prob\left\{\overline{Y} - \mu \ge \Delta\right\} \le e^{-\frac{n\Delta^{2}}{2\sigma^{2} + \frac{2}{3}\Delta}}
(4)
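The right-hand side of eq. 4 is straightforward to evaluate; a small helper (with our own naming) for variables bounded in [0, 1]:

```python
import math

def bernstein_bound(n, delta, sigma2):
    """Upper bound of eq. 4 on Prob{mean(Y) - mu >= delta} for n
    independent random variables in [0, 1] with variance sigma2."""
    if delta <= 0:
        return 1.0  # the bound is vacuous for non-positive delta
    return math.exp(-(n * delta ** 2) / (2 * sigma2 + (2.0 / 3.0) * delta))

# the bound decreases as the number of samples n grows
print(bernstein_bound(100, 0.1, 0.05) > bernstein_bound(1000, 0.1, 0.05))  # -> True
```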
Using the Bernstein inequality, we can assess whether, for a given r, 2 ≤ r ≤ H, there exists a statistically significant difference between the reliability of the best p(1) clustering and the p(r) clustering (eq. 3). In other words, we may state the null hypothesis H_{0} and the alternative hypothesis in the following way:
H_{0}: the p(1) clustering is not more reliable than the p(r) clustering, that is E[S_{p(1)}] ≤ E[S_{p(r)}]
H_{a}: the p(1) clustering is more reliable than the p(r) clustering, that is E[S_{p(1)}] > E[S_{p(r)}]
To this end, consider the following random variables:
P_{i} = S_{p(1)} - S_{p(i)} \qquad \text{and} \qquad X_{i} = \xi_{p(1)} - \xi_{p(i)}
(5)
We start by considering the first and last ranked clusterings p(1) and p(H). In this case the above null hypothesis H_{0} becomes: E[S_{p(1)}] ≤ E[S_{p(H)}], or equivalently E[S_{p(1)}] − E[S_{p(H)}] = E[P_{H}] ≤ 0. The distribution of the random variable X_{H} (eq. 5) is in general unknown; note, however, that in the Bernstein inequality no assumption is made about the distribution of the random variables Y_{i} (eq. 4). Hence, fixing a parameter Δ ≥ 0, assuming the null hypothesis E[P_{H}] ≤ 0 to be true, and using the Bernstein inequality, we have:
Prob\left\{X_{H} \ge \Delta\right\} \le Prob\left\{X_{H} - E[P_{H}] \ge \Delta\right\} \le e^{-\frac{n\Delta^{2}}{2\sigma^{2} + \frac{2}{3}\Delta}}
(6)
Considering an instance (a measured value) \widehat{X}_{H} of the random variable X_{H}, if we let \Delta = \widehat{X}_{H} we obtain the following probability of a type I error:
P_{err}\left\{X_{H} \ge \widehat{X}_{H}\right\} \le e^{-\frac{n\widehat{X}_{H}^{2}}{2\sigma_{H}^{2} + \frac{2}{3}\widehat{X}_{H}}}
with \sigma_{H}^{2} = \sigma_{p(1)}^{2} + \sigma_{p(H)}^{2}.
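Under these definitions, the p-value of the p(1) vs. p(H) comparison can be computed directly; a sketch in which the empirical variances of the similarity measures stand in for the unknown σ² (an assumption on our part):

```python
import math

def bernstein_p_value(x_hat, sigma2, n):
    """Type I error bound of eq. 6: p-value for the observed
    difference x_hat = xi_p(1) - xi_p(H), with
    sigma2 = sigma_p(1)^2 + sigma_p(H)^2 and n perturbed pairs."""
    if x_hat <= 0:
        return 1.0  # no observed difference: nothing to reject
    return math.exp(-(n * x_hat ** 2) / (2 * sigma2 + (2.0 / 3.0) * x_hat))
```

For example, with n = 100 perturbed pairs, an observed difference of 0.2 and a pooled variance of 0.02, the bound falls well below α = 0.05, so the null hypothesis would be rejected.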
If P_{err}\left\{X_{H} \ge \widehat{X}_{H}\right\} < \alpha, we reject the null hypothesis: a significant difference between the two clusterings is detected at significance level α, and we continue by testing the p(H − 1) clustering. More generally, if the null hypothesis has been rejected for the p(H − r + 1) clustering, 1 ≤ r ≤ H − 2, then we consider the p(H − r) clustering and, using the Boole inequality, we can estimate the type I error:
P_{err}(H - r) = Prob\left\{\bigvee_{H-r \le i \le H} X_{i} \ge \widehat{X}_{i}\right\} \le \sum_{i=H-r}^{H} Prob\left\{X_{i} \ge \widehat{X}_{i}\right\} \le \sum_{i=H-r}^{H} e^{-\frac{n\widehat{X}_{i}^{2}}{2\sigma_{i}^{2} + \frac{2}{3}\widehat{X}_{i}}}
(7)
As in the previous case, if P_{err}(H − r) < α we reject the null hypothesis: a significant difference is detected between the reliability of the p(1) and p(H − r) clusterings, and we iteratively continue the procedure estimating P_{err}(H − r − 1).
This procedure stops when either of the following cases occurs:

I) The null hypothesis is rejected up to r = H − 2, that is, ∀r, 1 ≤ r ≤ H − 2, P_{err}(H − r) < α: all the possible null hypotheses have been rejected, and the only reliable clustering at significance level α is the top-ranked one, that is the p(1) clustering.

II) The null hypothesis cannot be rejected for some r ≤ H − 2, that is, ∃r, 1 ≤ r ≤ H − 2, P_{err}(H − r) ≥ α: in this case the clusterings that are significantly less reliable than the top-ranked p(1) clustering are the p(r + 1), p(r + 2), …, p(H) clusterings.
Note that in this second case we cannot state that there is no significant difference between the first r top-ranked clusterings, since the upper bound provided by the Bernstein inequality is not guaranteed to be tight. To answer this question, we may apply the χ^{2}-based hypothesis testing proposed in [7] to the remaining top-ranked clusterings to establish which of them are significant at level α, but in this case we need to assume that the similarity measures between pairs of clusterings are normally distributed.
If we assume that the X_{i} random variables (eq. 5) are (at least approximately) independent, we can obtain a variant of the previous Bernstein inequality-based approach, which we name Bernstein ind. for brevity. With this approach we should in principle obtain lower p-values, thus assuring lower false positive rates than the Bernstein test without independence assumptions.
With these independence assumptions, the null hypothesis H_{0} and the alternative hypothesis for the Bernstein ind. test can be formulated as follows:
H_{0}: ∃i, 2 ≤ i ≤ r ≤ H such that E[S_{p(1)}] ≤ E[S_{p(i)}]: there exists at least one p(i) clustering equally or more reliable than the first one in the group of the first r ordered clusterings.
H_{a}: ∀i, 2 ≤ i ≤ r ≤ H, E[S_{p(1)}] > E[S_{p(i)}]: all the clusterings in the group of the first r ordered clusterings are less reliable than the first one.
If we assume that the null hypothesis is true, using the independence among the X_{i} random variables, we may estimate the type I error:
P_{err}(r) = Prob\left\{\bigwedge_{2 \le i \le r} X_{i} \ge \widehat{X}_{i}\right\} = \prod_{i=2}^{r} Prob\left\{X_{i} \ge \widehat{X}_{i}\right\} \le \prod_{i=2}^{r} e^{-\frac{n\widehat{X}_{i}^{2}}{2\sigma_{i}^{2} + \frac{2}{3}\widehat{X}_{i}}}
(8)
Starting from r = H, if P_{err}(r) < α we reject the null hypothesis: a significant difference is detected between the reliability of the p(1) clustering and the other first r clusterings, and we iteratively continue the procedure estimating P_{err}(r − 1). As in the Bernstein test, the procedure is iterated until we remain with a single clustering (which will be the only significant one), or until P_{err}(r) ≥ α, in which case we cannot reject the null hypothesis and the first r clusterings can be considered equally reliable. Note that, strictly speaking, in this case we can only say that at least one of the first r clusterings is equally or more reliable than the first one.
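A sketch of the Bernstein ind. variant, replacing the union bound of eq. 7 with the product of eq. 8 (naming and return convention are ours, and empirical variances again stand in for the true σ_i²):

```python
import math

def bernstein_ind_test(xi, var, n, alpha=0.05):
    """Independence variant (eq. 8): the joint type I error is the
    product of the individual Bernstein bounds over i = 2, ..., r.
    xi, var: means and variances ordered as in eq. 3; returns the
    number r of top-ranked clusterings considered equally reliable."""
    H = len(xi)
    for r in range(H, 1, -1):      # r = H, H-1, ..., 2
        p_err = 1.0
        for i in range(1, r):      # 0-based indices 1, ..., r-1
            x_hat = xi[0] - xi[i]
            sigma2 = var[0] + var[i]
            if x_hat > 0:
                p_err *= math.exp(-(n * x_hat ** 2) / (2 * sigma2 + (2.0 / 3.0) * x_hat))
        if p_err >= alpha:
            return r               # cannot reject: first r clusterings kept
    return 1                       # all hypotheses rejected: only p(1) remains
```

Because a product of factors bounded by 1 is never larger than its largest factor, this variant yields p-values at most as large as the individual Bernstein bounds, consistently with the lower false positive rates mentioned above.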