In this section we present our approach to stability-based model order selection: we introduce randomized maps with bounded distortion to perturb the data, define stability indices based on the distribution of the clustering similarity measures, and finally present our χ²-based test for assessing the significance of the clustering solutions.
Data perturbations using randomized maps with bounded distortions
A major requirement for clustering algorithms is the reproducibility of their solutions on other data sets drawn from the same source; this is particularly true for bio-molecular data, where the robustness of the solutions is of paramount importance in bio-medical applications. From this standpoint the reliability of a clustering solution is tied to its stability: we may consider a clustering reliable if it is stable, that is, if it is maintained across multiple data sets drawn from the same source. In real cases, however, only limited data may be available, and hence we need to introduce multiple "small" perturbations into the original data to simulate multiple "similar" samples from the same underlying unknown distribution. By applying appropriate indices based on similarity measures between clusterings we can then estimate the stability, and hence the reliability, of the clustering solutions.
We propose to perturb the original data using random projections μ : ℝ^d → ℝ^d' from high d-dimensional spaces to lower d'-dimensional subspaces. A related approach is presented in [17], where the authors proposed to perturb the data by randomly choosing a subset of the original features (random subspace projection [18]); the authors did not propose any principled method to choose the dimension of the projected subspace, but a key problem consists in finding a d' such that for every pair of data points p, q ∈ ℝ^d, the distances between the projections μ(p) and μ(q) are approximately preserved with high probability. A natural measure of the approximation is the distortion dist_μ:

dist_μ(p, q) = ||μ(p) − μ(q)|| / ||p − q||   (1)

If dist_μ(p, q) = 1, the distances are preserved; if 1 − ε ≤ dist_μ(p, q) ≤ 1 + ε, we say that an ε-distortion level is introduced.
In [15] we experimentally showed that the random subspace projections used in [17] may introduce large distortions into gene expression data, thus biasing stability indices based on this kind of random projection. For these reasons we propose to apply randomized maps with guaranteed low distortion, according to the Johnson-Lindenstrauss (JL) lemma [19], which we restate in the following way: given a d-dimensional data set D = {p_1, p_2, ..., p_n} ⊂ ℝ^d and a distortion level ε, randomly choosing a d'-dimensional subspace S ⊂ ℝ^d, with d' = c log n/ε², where c is a suitable constant, with high probability (say ≥ 0.95) the random projection μ : ℝ^d → S verifies 1 − ε ≤ dist_μ(p_i, p_j) ≤ 1 + ε for all p_i ≠ p_j.
In practice, using randomized maps that obey the JL lemma, we may perturb the data while introducing only bounded distortions, approximately preserving the metric structure of the original data [15]. Note that the dimension of the projected subspace depends only on the cardinality of the original data and on the desired ε-distortion, and not on the dimension d of the original space.
The embedding exhibited in [19] consists in projections from ℝ^d onto random d'-dimensional subspaces. Similar results may be obtained by using simpler maps [20, 21], represented through random d' × d matrices R = (1/√d')(r_ij), where the r_ij are random variables such that:

E[r_ij] = 0, Var[r_ij] = 1

Strictly speaking, these are not projections, but for the sake of simplicity we also call these embeddings random projections. Examples of random projections are the following:
1. Bernoulli random projections: represented by d' × d matrices whose entries r_ij are uniformly chosen in {−1, 1}, such that Prob(r_ij = 1) = Prob(r_ij = −1) = 1/2 (that is, the r_ij are Bernoulli random variables). In this case the JL lemma holds with c ≃ 4.
2. Achlioptas random projections [20]: represented by d' × d matrices whose entries r_ij are chosen in {−√3, 0, √3}, such that Prob(r_ij = 0) = 2/3 and Prob(r_ij = √3) = Prob(r_ij = −√3) = 1/6. In this case also we have E[r_ij] = 0 and Var[r_ij] = 1, and the JL lemma holds.
3. Normal random projections [21, 22]: this JL-compliant randomized map is represented by a d' × d matrix whose entries r_ij are distributed according to a Gaussian with zero mean and unit variance.
4. Random Subspace (RS) projections [17, 18]: represented by d' × d matrices with entries in {0, 1}, with exactly one 1 per row and at most one 1 per column, the position of the 1 in each row being chosen uniformly at random. Unfortunately, RS does not satisfy the JL lemma.
Using the above randomized maps (with the exception of RS projections), the JL lemma guarantees that, with high probability, the "compressed" examples of the data set, represented by the columns of the matrix D_R = RD, have approximately the same pairwise distances (up to an ε-distortion level) as the corresponding examples in the original space, represented by the columns of the matrix D, as long as d' ≥ c log n/ε².
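As a purely illustrative aid, the following Python sketch (assuming NumPy; the function names, the choice c = 4 and the synthetic data are ours) shows how JL-compliant projection matrices of the kind listed above might be generated, and how the empirical distortion dist_μ of eq. 1 can be checked on a few random pairs of examples.

import numpy as np

def jl_dim(n, eps, c=4):
    # projected dimension d' = c * log(n) / eps^2 suggested by the JL lemma
    return int(np.ceil(c * np.log(n) / eps ** 2))

def normal_projection(d_prime, d, rng):
    # normal random projection: i.i.d. N(0, 1) entries, scaled by 1/sqrt(d')
    return rng.standard_normal((d_prime, d)) / np.sqrt(d_prime)

def achlioptas_projection(d_prime, d, rng):
    # Achlioptas projection: entries in {-sqrt(3), 0, sqrt(3)} with probabilities 1/6, 2/3, 1/6
    values = np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    return rng.choice(values, size=(d_prime, d), p=[1/6, 2/3, 1/6]) / np.sqrt(d_prime)

rng = np.random.default_rng(0)
d, n, eps = 6000, 100, 0.2                      # e.g. 6000 features, 100 examples (synthetic)
D = rng.standard_normal((d, n))                 # columns of D are the examples
d_prime = jl_dim(n, eps)
R = normal_projection(d_prime, d, rng)
D_R = R @ D                                     # compressed data set D_R = RD

for _ in range(5):                              # empirical distortion on a few random pairs
    i, j = rng.choice(n, size=2, replace=False)
    dist = np.linalg.norm(D_R[:, i] - D_R[:, j]) / np.linalg.norm(D[:, i] - D[:, j])
    print(f"pair ({i}, {j}): dist_mu = {dist:.3f}")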
We propose a general MOSRAM (Model Order Selection by RAndomized Maps) algorithmic scheme, which implements the above ideas about random projections with bounded distortions to generate a set of similarity indices between clusterings obtained from pairs of randomly projected data. The main difference with respect to the method proposed in [13] is that in our approach we perturb the original data using a randomized map μ : ℝ^d → ℝ^d':
MOSRAM algorithm
Input:
D : a data set;
k_max : maximum number of clusters;
m : number of similarity measures;
μ : a randomized map;
A : a clustering algorithm;
sim : a clustering similarity measure.
Output:
M(i, k) : a bidimensional list of similarity measures for each k (1 ≤ i ≤ m, 2 ≤ k ≤ k_max)
begin
for k := 2 to k_max
  for i := 1 to m
  begin
    proj_a := μ(D)
    proj_b := μ(D)
    C_a := A(proj_a, k)
    C_b := A(proj_b, k)
    M(i, k) := sim(C_a, C_b)
  end
end.
The algorithm computes m similarity measures for each desired number of clusters k. Every measure is obtained by applying sim to the clusterings C_a and C_b, the outputs of the clustering algorithm A having as inputs k and the projected data proj_a and proj_b. These data are generated through randomized maps μ with a desired distortion level ε. It is worth noting that we make no assumptions about the shape of the clusters, and in principle any clustering algorithm A, randomized map μ, and clustering similarity measure sim may be used (e.g. the Jaccard or the Fowlkes and Mallows coefficients [23]).
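To make the scheme concrete, here is a minimal Python sketch of the MOSRAM loop, assuming NumPy and scikit-learn; the normal random projection used as μ, k-means used as the clustering algorithm A and the Fowlkes and Mallows coefficient used as sim are illustrative choices, not part of the method's definition.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

def mosram(D, k_max, m, d_prime, rng=None):
    # D: (n_examples, d) array; returns the dictionary M[(i, k)] of similarity measures
    rng = np.random.default_rng() if rng is None else rng
    n, d = D.shape
    M = {}
    for k in range(2, k_max + 1):
        for i in range(m):
            # two independent randomized projections mu(D) of the same examples
            Ra = rng.standard_normal((d_prime, d)) / np.sqrt(d_prime)
            Rb = rng.standard_normal((d_prime, d)) / np.sqrt(d_prime)
            proj_a = D @ Ra.T
            proj_b = D @ Rb.T
            # clustering algorithm A applied to the projected data
            Ca = KMeans(n_clusters=k, n_init=10).fit_predict(proj_a)
            Cb = KMeans(n_clusters=k, n_init=10).fit_predict(proj_b)
            # sim: Fowlkes and Mallows coefficient between the two clusterings
            M[(i, k)] = fowlkes_mallows_score(Ca, Cb)
    return M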
Stability indices based on the distribution of the similarity measures
Using the similarity measures obtained through the MOSRAM algorithm, we may compute stability indices to assess the reliability of clustering solutions.
More precisely, let A be a clustering algorithm, ρ a random perturbation procedure (e.g. a resampling or a random projection) and sim a suitable similarity measure between two clusterings (e.g. the Fowlkes and Mallows similarity).
We may define the random variable S_k, 0 ≤ S_k ≤ 1:

S_k = sim(A(D_1, k), A(D_2, k))   (2)

where D_1 = ρ^(1)(D) and D_2 = ρ^(2)(D) are obtained through random and independent perturbations of the data set D; the intuitive idea is that if S_k is concentrated close to 1, the corresponding clustering is stable with respect to the given controlled perturbation and hence it is reliable.
Let f_k(s) be the density function of S_k and F_k(s) its cumulative distribution function. A parameter of concentration implicitly used in [13] is the integral g(k) of the cumulative distribution:

g(k) = ∫₀¹ F_k(s) ds   (3)

Note that if S_k is concentrated close to 1, g(k) is close to 0, and hence it can be used as a measure of stability. Moreover, the following facts show that g(k) is strictly related to both the expectation E[S_k] and the variance Var[S_k] of the random variable S_k:
Fact 1
E[S_k] = 1 − g(k).
Indeed, integrating by parts:

E[S_k] = ∫₀¹ s f_k(s) ds = [s F_k(s)]₀¹ − ∫₀¹ F_k(s) ds = 1 − g(k)   (4)
Fact 2
Var[S_k] ≤ g(k)(1 − g(k)).
Since 0 ≤ S_k ≤ 1 it follows that S_k² ≤ S_k; therefore, using Fact 1:

Var[S_k] = E[S_k²] − E[S_k]² ≤ E[S_k] − E[S_k]² = g(k)(1 − g(k))   (5)
In conclusion, if g(k) ≃ 0 then E[S_k] ≃ 1 and Var[S_k] ≃ 0, i.e. S_k is concentrated close to 1. As a consequence, E[S_k] can be used as an index of the reliability of the k-clustering: if E[S_k] ≃ 1 the clustering is stable, while if E[S_k] ≪ 1 the clustering can be considered less reliable.
We can estimate E[S_k] by means of the m similarity measures M(i, k) (1 ≤ i ≤ m) computed by the MOSRAM algorithm. In fact E[S_k] may be estimated by the empirical mean ξ_k:

ξ_k = (1/m) Σ_{i=1}^{m} M(i, k)   (6)
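For instance, the empirical means ξ_k can be computed directly from the output of the MOSRAM sketch given above (illustrative Python, assuming M is the dictionary returned there):

def stability_indices(M, k_max, m):
    # empirical means xi_k = (1/m) * sum_i M(i, k)  (eq. 6)
    return {k: sum(M[(i, k)] for i in range(m)) / m for k in range(2, k_max + 1)}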
A χ²-based test for the assessment of the significance of the solutions
In this section we propose a method for automatically finding the "optimal" number of clusters and for detecting significant, and possibly multi-level, structures simultaneously present in the data. First of all, let us consider the vector (ξ_2, ξ_3, ..., ξ_{H+1}) (eq. 6) computed by using the output of the MOSRAM algorithm. We may sort this vector in non-increasing order, obtaining (ξ_p(1), ξ_p(2), ..., ξ_p(H)), where p is the permutation such that ξ_p(1) ≥ ξ_p(2) ≥ ... ≥ ξ_p(H). Roughly speaking, this ordering goes from the "most reliable" p(1)-clustering down to the least reliable p(H)-clustering; exploiting it, we would like to establish which are the significant clusterings (if any) discovered in the data.
To this end, for each k ∈ K = {2, 3, ..., H + 1}, let us consider the random variable S_k defined in eq. 2, whose expectation is our proposed stability index. For all k and for a fixed threshold t₀ ∈ [0, 1], consider the Bernoulli random variable B_k = I(S_k > t₀), where I is the indicator function: I(P) = 1 if P is true, I(P) = 0 if P is false. The sum X_k of m i.i.d. copies of B_k is distributed according to a binomial distribution with parameters m and θ_k = Prob(S_k > t₀).
If we hypothesize that all the binomial populations are independently drawn from the same distribution (i.e. θ_k = θ for all k ∈ K), for sufficiently large values of m the random variables X_k are independent and approximately normally distributed, with mean mθ and variance mθ(1 − θ). Consider now the random variable:

Σ_{k∈K} (X_k − X̄)² / (mθ(1 − θ)),   where X̄ = (1/|K|) Σ_{k∈K} X_k

This variable is known to be distributed as a χ² with |K| − 1 degrees of freedom, informally because the constraint between the random variables X_k, k ∈ K, introduces a dependence between them, thus leading to the loss of one degree of freedom. By estimating the variance mθ(1 − θ) with the statistic m θ̂(1 − θ̂), where θ̂ = X̄/m, we conclude that the following statistic

Y = Σ_{k∈K} (X_k − X̄)² / (m θ̂(1 − θ̂))

is approximately distributed according to a χ² with |K| − 1 degrees of freedom (see, e.g. [24] chapter 12, or [25] chapter 30 for more details).
A realization x_k of the random variable X_k (and the corresponding realization y of Y) can be computed by using the output of the MOSRAM algorithm:

x_k = Σ_{i=1}^{m} I(M(i, k) > t₀)
Using y, we can test the following alternative hypotheses:

H₀ : θ_k = θ for all k ∈ K (all the proportions are equal);
H₁ : θ_k ≠ θ_j for some k, j ∈ K (at least one proportion differs).

If y exceeds the critical value of the χ² distribution with |K| − 1 degrees of freedom, we may reject the null hypothesis at the α significance level; that is, we may conclude that with probability 1 − α the considered proportions are different, and hence that at least one k-clustering significantly differs from the others.
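As an illustration, a realization of the test can be computed from the MOSRAM output as in the following Python sketch (assuming SciPy; the helper name and the default threshold t0 = 0.9 are our own choices):

import numpy as np
from scipy.stats import chi2

def chi2_homogeneity_test(M, ks, m, t0=0.9):
    # realizations x_k = sum_i I(M(i, k) > t0) of the binomial variables X_k
    x = np.array([sum(1 for i in range(m) if M[(i, k)] > t0) for k in ks], dtype=float)
    x_bar = x.mean()                        # common estimate of m * theta
    theta_hat = x_bar / m                   # pooled estimate of theta
    var_hat = m * theta_hat * (1 - theta_hat)
    if var_hat == 0:                        # degenerate case: all counts are 0 or all are m
        return 0.0, 1.0
    y = float(np.sum((x - x_bar) ** 2) / var_hat)
    p_value = float(chi2.sf(y, df=len(ks) - 1))   # chi^2 with |K| - 1 degrees of freedom
    return y, p_value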
Using the above statistical test, we propose an iterative procedure to detect the significant number(s) of clusters:

1. Consider the ordered vector ξ = (ξ_p(1), ξ_p(2), ..., ξ_p(H)).

2. Repeat the χ²-based test until no significant difference is detected or the only remaining clustering is p(1) (the top-ranked one). At each iteration, if a significant difference is detected, remove the bottom-ranked clustering from ξ (see the sketch below).

The output of the proposed procedure is the set of the remaining (top-ranked) k-clusterings, which corresponds to the set of estimated stable numbers of clusters (at the α significance level). Equivalently, following the ordering of ξ, we may compute the p-value (the probability of committing an error if we reject the null hypothesis) for all the ordered groups of clusterings, from the full group p(1), ..., p(H) down to the group p(1), p(2), each time removing the bottom-ranked clustering from the ξ vector. Note that if the set of the remaining top-ranked clusterings contains more than one clustering, we may find multiple structures simultaneously present in the data (at the α significance level).
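A minimal Python sketch of this iterative procedure, reusing the stability_indices and chi2_homogeneity_test helpers introduced above (the default values of α and t0 are illustrative):

def significant_clusterings(M, k_max, m, alpha=0.01, t0=0.9):
    # rank the k-clusterings by the stability index xi_k and iteratively drop the
    # bottom-ranked one while the chi^2 test detects a significant difference
    xi = stability_indices(M, k_max, m)
    ordered = sorted(xi, key=xi.get, reverse=True)     # p(1), ..., p(H)
    while len(ordered) > 1:
        _, p_value = chi2_homogeneity_test(M, ordered, m, t0)
        if p_value >= alpha:     # no significant difference among the remaining clusterings
            break
        ordered = ordered[:-1]   # remove the least reliable clustering
    return ordered               # estimated stable number(s) of clusters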