In this section we present our approach to stability-based model order selection. We consider randomized maps with bounded distortion to perturb the data, stability indices based on the distribution of the clustering similarity measures, and finally our χ^{2}-based test for assessing the significance of the clustering solutions.
Data perturbations using randomized maps with bounded distortions
A major requirement for clustering algorithms is the reproducibility of their solutions on other data sets drawn from the same source; this is particularly true for biomolecular data, where the robustness of the solutions is of paramount importance in biomedical applications. From this standpoint the reliability of a clustering solution is tied to its stability: we may consider a cluster reliable if it is stable, that is, if it is maintained across multiple data sets drawn from the same source. In real cases, however, we may have only limited data at our disposal, and hence we need to introduce multiple "small" perturbations into the original data to simulate multiple "similar" samples from the same underlying unknown distribution. By applying appropriate indices based on similarity measures between clusterings we can then estimate the stability, and hence the reliability, of the clustering solutions.
We propose to perturb the original data using random projections μ : ℝ^{d} → ℝ^{d'} from high d-dimensional spaces to lower d'-dimensional subspaces. A related approach is presented in [17], where the authors proposed to perturb the data by randomly choosing a subset of the original features (random subspace projection [18]); however, they did not propose any principled method to choose the dimension of the projected subspace. A key problem consists in finding a d' such that, for every pair of data points p, q ∈ ℝ^{d}, the distances between the projections μ(p) and μ(q) are approximately preserved with high probability. A natural measure of the approximation is the distortion dist_{μ}:

dist_{\mu}(p, q) = \frac{\|\mu(p) - \mu(q)\|_{2}}{\|p - q\|_{2}} (1)

If dist_{μ}(p, q) = 1, the distances are preserved; if 1 - ε ≤ dist_{μ}(p, q) ≤ 1 + ε, we say that an ε-distortion level is introduced.
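For illustration, the distortion of eq. (1) is straightforward to compute; a minimal NumPy sketch (the function name and interface are ours, not part of the method):

```python
import numpy as np

def distortion(mu, p, q):
    """Distortion dist_mu(p, q) of a map mu between two points (eq. 1)."""
    return np.linalg.norm(mu(p) - mu(q)) / np.linalg.norm(p - q)

# Example: the identity map preserves all distances exactly.
identity = lambda x: x
p, q = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
print(distortion(identity, p, q))  # → 1.0
```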
In [15] we experimentally showed that the random subspace projections used in [17] may introduce large distortions into gene expression data, thus introducing bias into stability indices based on this kind of random projection. For these reasons we propose to apply randomized maps with guaranteed low distortions, according to the Johnson-Lindenstrauss (JL) lemma [19], which we restate in the following way: given a d-dimensional data set D = {p_{1}, p_{2},...,p_{n}} ⊂ ℝ^{d} and a distortion level ε, randomly choosing a d'-dimensional subspace S ⊂ ℝ^{d}, with d' = c log n/ε^{2}, where c is a suitable constant, with high probability (say ≥ 0.95) the random projection μ : ℝ^{d} → S verifies 1 - ε ≤ dist_{μ}(p_{i}, p_{j}) ≤ 1 + ε for all p_{i} ≠ p_{j}.
In practice, using randomized maps that obey the JL lemma, we may perturb the data while introducing only bounded distortions, approximately preserving the metric structure of the original data [15]. Note that the dimension of the projected subspace depends only on the cardinality of the original data and the desired ε-distortion, not on the dimension d of the original space.
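As a concrete illustration of this independence from d, the JL bound on d' can be computed directly; a sketch (c is the map-dependent constant, and c = 4 is our illustrative choice):

```python
import math

def jl_dimension(n, eps, c=4.0):
    """Subspace dimension d' = c * log(n) / eps^2 that guarantees
    eps-distortion with high probability (c depends on the chosen map)."""
    return math.ceil(c * math.log(n) / eps ** 2)

# 1000 examples, 10% distortion: d' does not involve the original dimension d.
print(jl_dimension(1000, 0.1))  # → 2764
```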
The embedding exhibited in [19] consists of projections from ℝ^{d} onto random d'-dimensional subspaces. Similar results may be obtained by using simpler maps [20, 21], represented through random d' × d matrices R = \frac{1}{\sqrt{d'}}(r_{ij}), where the r_{ij} are random variables such that:

E[r_{ij}] = 0, Var[r_{ij}] = 1
Strictly speaking, these maps are not projections, but for the sake of simplicity we call even this kind of embedding a random projection. Examples of random projections are the following:

1.
Bernoulli random projections: represented by d' × d matrices R = \frac{1}{\sqrt{d'}}(r_{ij}), where the r_{ij} are uniformly chosen in {-1, 1}, such that Prob(r_{ij} = 1) = Prob(r_{ij} = -1) = 1/2 (that is, the r_{ij} are Bernoulli random variables). In this case the JL lemma holds with c ≃ 4.

2.
Achlioptas random projections [20]: represented by d' × d matrices P = \frac{1}{\sqrt{d'}}(r_{ij}), where the r_{ij} are chosen in {-\sqrt{3}, 0, \sqrt{3}}, such that Prob(r_{ij} = 0) = 2/3 and Prob(r_{ij} = \sqrt{3}) = Prob(r_{ij} = -\sqrt{3}) = 1/6. In this case also we have E[r_{ij}] = 0 and Var[r_{ij}] = 1, and the JL lemma holds.

3.
Normal random projections [21, 22]: this JL-lemma-compliant randomized map is represented by a d' × d matrix R = \frac{1}{\sqrt{d'}}(r_{ij}), where the r_{ij} are distributed according to a Gaussian with zero mean and unit variance.

4.
Random Subspace (RS) [17, 18]: represented by d' × d matrices R = \sqrt{d/d'}(r_{ij}), where the r_{ij} are uniformly chosen in {0, 1}, with exactly one 1 per row and at most one 1 per column. Unfortunately, RS does not satisfy the JL lemma.
Using the above randomized maps (with the exception of RS projections), the JL lemma guarantees that, with high probability, the "compressed" examples of the data set represented by the matrix D_{R} = RD have approximately the same distances (up to an ε-distortion level) as the corresponding examples in the original space, represented by the columns of the matrix D, as long as d' ≥ c log n/ε^{2}.
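To make the constructions above concrete, the following sketch generates Bernoulli, Achlioptas, and normal projection matrices and empirically checks the distortion of one projected pair (the function names, toy sizes, and single-pair check are our own simplifications; a real experiment would check all pairs):

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_matrix(d_new, d):
    # Entries uniformly in {-1, +1}, scaled by 1/sqrt(d')
    return rng.choice([-1.0, 1.0], size=(d_new, d)) / np.sqrt(d_new)

def achlioptas_matrix(d_new, d):
    # Entries in {-sqrt(3), 0, +sqrt(3)} with probabilities 1/6, 2/3, 1/6
    vals = np.array([-np.sqrt(3), 0.0, np.sqrt(3)])
    return rng.choice(vals, size=(d_new, d), p=[1/6, 2/3, 1/6]) / np.sqrt(d_new)

def normal_matrix(d_new, d):
    # Entries i.i.d. standard Gaussian, scaled by 1/sqrt(d')
    return rng.standard_normal((d_new, d)) / np.sqrt(d_new)

# Perturb a toy data set D (columns are examples) and inspect one distortion.
d, n, d_new = 1000, 20, 300
D = rng.standard_normal((d, n))
R = normal_matrix(d_new, d)
D_R = R @ D                      # the "compressed" data set D_R = RD
dist = np.linalg.norm(D_R[:, 0] - D_R[:, 1]) / np.linalg.norm(D[:, 0] - D[:, 1])
print(f"distortion of the first pair: {dist:.3f}")  # close to 1 with high probability
```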
We propose a general MOSRAM (Model Order Selection by RAndomized Maps) algorithmic scheme that implements the above ideas about random projections with bounded distortions to generate a set of similarity indices of clusterings obtained from pairs of randomly projected data. The main difference with respect to the method proposed in [13] is that in our approach we perturb the original data using a randomized mapping μ : ℝ^{d} → ℝ^{d'}:
MOSRAM algorithm
Input:
D : a dataset;
k_{max} : max number of clusters;
m : number of similarity measures;
μ : a randomized map;
\mathcal{C} : a clustering algorithm;
sim : a clustering similarity measure.
Output:
M(i, k) : a bidimensional list of similarity measures for each k (1 ≤ i ≤ m, 2 ≤ k ≤ k_{max})
begin
for k := 2 to k_{max}
   for i := 1 to m
   begin
      proj_{a} := μ(D)
      proj_{b} := μ(D)
      C_{a} := \mathcal{C}(proj_{a}, k)
      C_{b} := \mathcal{C}(proj_{b}, k)
      M(i, k) := sim(C_{a}, C_{b})
   end
end.
The algorithm computes m similarity measures for each desired number of clusters k. Each measure is obtained by applying sim to the clusterings C_{a} and C_{b}, outputs of the clustering algorithm \mathcal{C} having as input k and the projected data proj_{a} and proj_{b}. These data are generated through randomized maps μ with a desired distortion level ε. It is worth noting that we make no assumptions about the shape of the clusters, and in principle any clustering algorithm \mathcal{C}, randomized map μ, and clustering similarity measure sim may be used (e.g. the Jaccard or the Fowlkes and Mallows coefficients [23]).
Stability indices based on the distribution of the similarity measures
Using the similarity measures obtained through the MOSRAM algorithm, we may compute stability indices to assess the reliability of clustering solutions.
More precisely, let \mathcal{C} be a clustering algorithm, ρ a random perturbation procedure (e.g. a resampling or a random projection) and sim a suitable similarity measure between two clusterings (e.g. the Fowlkes and Mallows similarity).
We may define the random variable S_{k}, 0 ≤ S_{k} ≤ 1:

S_{k} = sim(\mathcal{C}(D_{1}, k), \mathcal{C}(D_{2}, k)) (2)

where D_{1} = ρ^{(1)}(D) and D_{2} = ρ^{(2)}(D) are obtained through random and independent perturbations of the data set D; the intuitive idea is that if S_{k} is concentrated close to 1, the corresponding clustering is stable with respect to a given controlled perturbation and hence it is reliable.
Let f_{k}(s) be the density function of S_{k} and F_{k}(s) its cumulative distribution function. A parameter of concentration implicitly used in [13] is the integral g(k) of the cumulative distribution:

g(k) = \int_{0}^{1} F_{k}(s) ds (3)
Note that if S_{k} is concentrated near 1, g(k) is close to 0, and hence it can be used as a measure of stability. Moreover, the following facts show that g(k) is strictly related to both the expectation E[S_{k}] and the variance Var[S_{k}] of the random variable S_{k}:
Fact 1
E[S_{k}] = 1 - g(k).
Indeed, integrating by parts:

E[S_{k}] = \int_{0}^{1} s f_{k}(s) ds = \int_{0}^{1} s F'_{k}(s) ds = 1 - \int_{0}^{1} F_{k}(s) ds = 1 - g(k) (4)
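Facts 1 and 2 can also be checked numerically; a Monte Carlo sketch in which a Beta(8, 2) random variable, our arbitrary stand-in concentrated near 1, plays the role of S_k:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for S_k: a Beta(8, 2) variable on [0, 1], concentrated near 1.
samples = rng.beta(8.0, 2.0, size=200_000)

# g(k): integral of the empirical cumulative distribution F_k over [0, 1].
grid = np.linspace(0.0, 1.0, 1001)
F = np.searchsorted(np.sort(samples), grid, side="right") / len(samples)
step = grid[1] - grid[0]
g = ((F[:-1] + F[1:]) / 2.0).sum() * step    # trapezoidal integration

print(samples.mean(), 1.0 - g)               # Fact 1: E[S_k] = 1 - g(k)
print(samples.var(), g * (1.0 - g))          # Fact 2: Var[S_k] <= g(k)(1 - g(k))
```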
Fact 2
Var[S_{k}] ≤ g(k)(1 - g(k)).
Since 0 ≤ S_{k} ≤ 1 it follows that S_{k}^{2} ≤ S_{k}; therefore, using Fact 1:

Var[S_{k}] = E[S_{k}^{2}] - E[S_{k}]^{2} ≤ E[S_{k}] - E[S_{k}]^{2} = g(k)(1 - g(k)) (5)
In conclusion, if g(k) ≃ 0 then E[S_{k}] ≃ 1 and Var[S_{k}] ≃ 0, i.e. S_{k} is concentrated close to 1. As a consequence, E[S_{k}] can be used as an index of the reliability of the k-clustering: if E[S_{k}] ≃ 1 the clustering is stable, while if E[S_{k}] ≪ 1 the clustering can be considered less reliable.
We can estimate E[S_{k}] by means of the m similarity measures M(i, k) (1 ≤ i ≤ m) computed by the MOSRAM algorithm. In fact, E[S_{k}] may be estimated by the empirical mean ξ_{k}:

\xi_{k} = \sum_{i=1}^{m} \frac{M(i, k)}{m} (6)
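Given the M(i, k) list produced by MOSRAM, the empirical mean of eq. (6) is a one-liner; a sketch (the dictionary layout of M and the toy numbers are our own convention):

```python
def xi(M, k, m):
    """Empirical mean xi_k of the m similarity measures for a given k (eq. 6)."""
    return sum(M[(i, k)] for i in range(m)) / m

# Toy MOSRAM output for m = 4 measures at k = 2:
M = {(0, 2): 0.95, (1, 2): 0.90, (2, 2): 1.00, (3, 2): 0.85}
print(xi(M, 2, 4))
```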
A χ^{2}-based test for the assessment of the significance of the solutions
In this section we propose a method for automatically finding the "optimal" number of clusters and for detecting significant, possibly multilevel structures simultaneously present in the data. First of all, let us consider the vector (ξ_{2}, ξ_{3},...,ξ_{H+1}) (eq. 6) computed using the output of the MOSRAM algorithm. We may sort this vector:
(\xi_{2}, \xi_{3},...,\xi_{H+1}) \stackrel{sort}{\to} (\xi_{p(1)}, \xi_{p(2)},...,\xi_{p(H)}) (7)

where p is the permutation index such that ξ_{p(1)} ≥ ξ_{p(2)} ≥ ... ≥ ξ_{p(H)}. Roughly speaking, this ordering ranks the clusterings from the "most reliable" p(1)-clustering down to the least reliable p(H)-clustering; exploiting it, we would like to establish which are the significant clusterings (if any) discovered in the data.
To this end, for each k ∈ \mathcal{K} = {2, 3,...,H + 1}, let us consider the random variable S_{k} defined in eq. 2, whose expectation is our proposed stability index. For all k and for a fixed threshold t^{o} ∈ [0, 1], consider the Bernoulli random variable B_{k} = I(S_{k} > t^{o}), where I is the indicator function: I(P) = 1 if P is true, I(P) = 0 if P is false. The sum X_{k} = \sum_{j=1}^{m} B_{k}^{j} of m i.i.d. copies of B_{k} is distributed according to a binomial distribution with parameters m and θ_{k} = Prob(S_{k} > t^{o}).
If we hypothesize that all the binomial populations are independently drawn from the same distribution (i.e. θ_{k} = θ for all k ∈ \mathcal{K}), for sufficiently large values of m the random variables \frac{X_{k} - m\theta_{k}}{\sqrt{m\theta_{k}(1 - \theta_{k})}} are independent and approximately normally distributed. Consider now the random variable:

\sum_{k \in \mathcal{K}} \frac{(X_{k} - m\widehat{\theta})^{2}}{m\theta(1 - \theta)} with \widehat{\theta} = \frac{\sum_{k \in \mathcal{K}} X_{k}}{|\mathcal{K}| \cdot m} (8)

This variable is known to be distributed as a χ^{2} with |\mathcal{K}| - 1 degrees of freedom, informally because the constraint introduced by \widehat{\theta} between the random variables X_{k}, k ∈ \mathcal{K}, induces a dependence among them, thus leading to the loss of one degree of freedom. By estimating the variance mθ(1 - θ) with the statistic m\widehat{\theta}(1 - \widehat{\theta}), we conclude that the statistic

Y = \sum_{k \in \mathcal{K}} \frac{(X_{k} - m\widehat{\theta})^{2}}{m\widehat{\theta}(1 - \widehat{\theta})} (9)

is approximately distributed according to \chi^{2}_{|\mathcal{K}| - 1} (see, e.g., [24] chapter 12, or [25] chapter 30 for more details).
A realization x_{k} of the random variable X_{k} (and the corresponding realization y of Y) can be computed by using the output of the MOSRAM algorithm:

x_{k} = \sum_{i=1}^{m} I(M(i, k) > t^{o}) (10)
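The counts x_k of eq. (10) and the statistic y of eq. (9) are direct to compute from the MOSRAM output; a sketch with our own toy numbers and threshold t^{o} = 0.9:

```python
import numpy as np

def count_above(M, k, m, t_o):
    """x_k of eq. (10): how many of the m similarity values exceed t_o."""
    return sum(M[(i, k)] > t_o for i in range(m))

def chi2_statistic(x, m):
    """y of eq. (9) from the counts x_k (one entry per k) and m trials."""
    x = np.asarray(x, dtype=float)
    theta_hat = x.sum() / (len(x) * m)           # pooled estimate (eq. 8)
    return float(np.sum((x - m * theta_hat) ** 2
                        / (m * theta_hat * (1.0 - theta_hat))))

# Toy counts for K = {2, 3, 4} with m = 20 trials each:
y = chi2_statistic([19, 6, 5], m=20)
print(y)  # → 24.4 : far in the chi-square tail with |K| - 1 = 2 d.o.f.

# Toy x_k computation from a small M with m = 3 measures:
M = {(0, 2): 0.95, (1, 2): 0.80, (2, 2): 0.99}
print(count_above(M, 2, 3, 0.9))  # → 2
```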
Using y, we can test the null hypothesis that all the proportions θ_{k} are equal against the alternative that at least one differs: if y \ge \chi^{2}_{\alpha, |\mathcal{K}| - 1} we may reject the null hypothesis at significance level α, that is, we may conclude that with probability 1 - α the considered proportions are different, and hence that at least one k-clustering significantly differs from the others.
Using the above statistical test, we propose an iterative procedure to detect the significant number(s) of clusters:

1.
Consider the ordered vector ξ = (ξ_{p(1)}, ξ_{p(2)},...,ξ_{p(H)})

2.
Repeat the χ^{2}-based test until no significant difference is detected or the only remaining clustering is p(1) (the top-ranked one). At each iteration, if a significant difference is detected, remove the bottom-ranked clustering from ξ.
The output of the proposed procedure is the set of the remaining (top-ranked) k-clusterings, which corresponds to the set of estimated stable numbers of clusters (at significance level α). Equivalently, following the sorting of ξ, we may compute the p-value (the probability of committing an error if we reject the null hypothesis) for all the ordered groups of clusterings, from the p(1),...,p(H) group down to the p(1), p(2) group, each time removing the bottom-ranked clustering from the ξ vector. Note that if the set of the remaining top-ranked clusterings contains more than one clustering, we may find multiple structures simultaneously present in the data (at significance level α).
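The iterative procedure can be sketched as follows, with SciPy's `chi2.sf` supplying the p-value (the data layout, toy counts, and α = 0.05 are our own choices):

```python
import numpy as np
from scipy.stats import chi2

def stable_clusterings(xs, ks, m, alpha=0.05):
    """Iteratively drop the bottom-ranked k until no significant difference
    remains among the surviving counts (assumes 0 < theta_hat < 1).
    xs: counts x_k (eq. 10); ks: the corresponding numbers of clusters."""
    order = np.argsort(xs)[::-1]          # rank by decreasing stability
    xs, ks = [xs[i] for i in order], [ks[i] for i in order]
    while len(xs) > 1:
        x = np.asarray(xs, dtype=float)
        theta_hat = x.sum() / (len(x) * m)
        y = np.sum((x - m * theta_hat) ** 2 / (m * theta_hat * (1 - theta_hat)))
        p_value = chi2.sf(y, df=len(x) - 1)
        if p_value >= alpha:              # no significant difference left
            break
        xs, ks = xs[:-1], ks[:-1]         # remove the bottom-ranked clustering
    return ks                             # estimated stable numbers of clusters

# Toy counts: k = 2 and k = 5 are stable, k = 3 and k = 4 are not,
# so two structures are simultaneously present in the data.
print(stable_clusterings([19, 6, 5, 18], ks=[2, 3, 4, 5], m=20))  # → [2, 5]
```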