Model order selection for bio-molecular data clustering

Background Cluster analysis has been widely applied for investigating structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we have no enough "a priori" biological knowledge to evaluate both the number of clusters as well as their validity. Recently several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters, but despite their successful application to the analysis of complex bio-molecular data, the assessment of the statistical significance of the discovered clustering solutions and the detection of multiple structures simultaneously present in high-dimensional bio-molecular data are still major problems. Results We propose a stability method based on randomized maps that exploits the high-dimensionality and relatively low cardinality that characterize bio-molecular data, by selecting subsets of randomized linear combinations of the input variables, and by using stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A χ2-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant and if possible multi-level structures simultaneously present in the data (e.g. hierarchical structures). Conclusion The experimental results show that our model order selection methods are competitive with other state-of-the-art stability based algorithms and are able to detect multiple levels of structure underlying both synthetic and gene expression data.

ural" number of clusters underlying the data [5,6]. From a machine learning standpoint, this is an intrinsically "illposed" problem, since in unsupervised learning we lack an external objective criterion, that is we have not an equivalent of a priori known class label as in supervised learning, and hence the evaluation of the reliability of the discovered classes becomes elusive and difficult. From a biological standpoint, in many cases we have no sufficient biological knowledge to "a priori" evaluate both the number of clusters (e.g. the number of biologically distinct tumor classes), as well as the validity of the discovered clusters (e.g. the reliability of new discovered tumor classes) [7].
To deal with these problems, several methods for assessing the validity of the discovered clusters and to test the existence of biologically meaningful clusters have been proposed (see [8] for a review).
Recently, several methods based on the concept of stability have been proposed to estimate the " optimal" number of clusters in complex bio-molecular data [9][10][11]. In this conceptual framework multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations.
Different procedures have been introduced to randomly perturb the data, ranging from bootstrapping techniques [9,12,13], to noise injection into the data [14] or random projections into lower dimensional subspaces [15,16].
In particular, Smolkin and Gosh [17] applied an unsupervised version of the random subspace method [18] to estimate the stability of clustering solutions. By this approach, subsets of features are randomly selected multiple times, and clusterings obtained on the corresponding projected subspaces are compared with the clustering obtained in the original space to assess its stability. Even if this approach gives useful information about the reliability of high-dimensional clusterings, we showed that random subspace projections may induce large distortions in gene expression data, thus obscuring their real structure [15]. Moreover, a major problem with data perturbations obtained through random projections from a higher to a lower dimensional space is the choice of the dimension of the projected subspace.
In this paper we extend the Smolkin and Gosh approach to more general randomized maps from higher to lowerdimensional subspaces, in order to reduce the distortion induced by random projections. Moreover, we introduce a principled method based on the Johnson and Lindenstrauss lemma [19] to properly choose the dimension of the projected subspace. Our proposed stability indices are related to those proposed by Ben-Hur et al. [13]: their stability measures are obtained from the distribution of similarity measures across multiple pairs of clustered data perturbed through resampling techniques. In this work we propose stability indices that depend on the distribution of the similarity measures between pairs of clusterings, but data perturbation is realized through random projections to lower dimensional subspaces, in order to exploit the high-dimensionality of bio-molecular data.
Another major problem related to stability-based methods is to estimate the statistical significance of the structures discovered by clustering algorithms. To face this problem we propose a χ 2 -based statistical test that may be applied to any stability method based on the distribution of similarity measures between pairs of clusterings. We experimentally show that by this approach we may discover multiple structures simultaneously present in the data (e.g. hierarchical structures), associating a p-value to the clusterings selected by a given stability-based method for model order selection.

Methods
In this section we present our approach to stability-based model order selection, considering randomized maps with bounded distortion to perturb the data, stability indices based on the distribution of the clustering similarity measures, and finally we present our χ 2 -based test for assessing the significance of the clustering solutions.

Data perturbations using randomized maps with bounded distortions
A major requirement for clustering algorithms is the reproducibility of their solutions with other data sets drawn from the same source; this is particularly true with bio-molecular data, where the robustness of the solutions is of paramount importance in bio-medical applications. From this standpoint the reliability of a clustering solution is tied to its stability: we may consider reliable a cluster if it is stable, that is if it is maintained across multiple data sets drawn from the same source. In real cases, however, we may dispose only of limited data, and hence we need to introduce multiple " small" perturbations into the original data to simulate multiple "similar" samples from the same underlying unknown distribution. By applying appropriate indices based on similarity measures between clusterings we can then estimate the stability and hence the reliability of the clustering solutions.
We propose to perturb the original data using random projections μ : ‫ޒ‬ d → ‫ޒ‬ d' from high d-dimensional spaces to lower d'-dimensional subspaces. A related approach is presented in [17], where the authors proposed to perturb the data randomly choosing a subset of the original features (random subspace projection [18]); the authors did not propose any principled method to choose the dimension of the projected subspace, but a key problem consists in finding a d' such that for every pair of data p, q ∈ ‫ޒ‬ d , the distances between the projections μ(p) and μ(q) are approximately preserved with high probability. A natural measure of the approximation is the distortion dist μ : If dist μ (p, q) = 1, the distances are preserved; if 1 -ε ≤ dist μ (p, q) ≤ 1 + ε, we say that an ε-distortion level is introduced.
In [15] we experimentally showed that random subspace projections used in [17] may introduce large distortions into gene expression data, thus introducing bias into stability indices based on this kind of random projections. For these reasons we propose to apply randomized maps with guaranteed low distortions, according to the Johnson-Lindenstrauss (JL) lemma [19], that we restate in the following way: Given a d-dimensional data set D = {p 1 , p 2 ,...,p n } ⊂ ‫ޒ‬ d and a distortion level ε, randomly choosing a d'- In practice, using randomized maps that obey the JL lemma, we may perturb the data introducing only bounded distortions, approximately preserving the metric structure of the original data [15]. Note that the dimension of the projected subspace depends only on the cardinality of the original data and the desired ε-distortion, and not from the dimension d of the original space.
The embedding exhibited in [19] consists in projections from ‫ޒ‬ d in random d'-dimensional subspaces. Similar results may be obtained by using simpler maps [20,21], represented through random d' × d matrices , where r ij are random variables such that: Strictly speaking, these are not projections, but for sake of simplicity, we call random projections even this kind of embeddings. Examples of random projections are the following:  We propose a general MOSRAM (Model Order Selection by RAndomized Maps) algorithmic scheme, that implements the above ideas about random projection with bounded distortions to generate a set of similarity indices of clusterings obtained by pairs of randomly projected data. The main difference with respect to the method proposed in [13] is that by our approach we perturb the original data using a randomized mapping μ : sim : a clustering similarity measure. Output: The algorithm computes m similarity measures for each desired number of clusters k. Every measure is achieved by applying sim to the clustering C a and C b , outputs of the clustering algorithm , having as input k and the projected data proj a and proj b . These data are generated through randomized maps μ, with a desired distortion level ε. It is worth noting that we make no assumptions about the shape of the clusters, and in principle any clustering algorithm , randomized map μ, and clustering similarity measure sim may be used (e.g. the Jaccard or the Fowlkes and Mallows coefficients [23]).

Stability indices based on the distribution of the similarity measures
Using the similarity measures obtained through the MOS-RAM algorithm, we may compute stability indices to assess the reliability of clustering solutions.
More precisely, let be a clustering algorithm, ρ a random perturbation procedure (e.g. a resampling or a random projection) and sim a suitable similarity measure between two clusterings (e.g. the Fowlkes and Mallows similarity).
We may define the random variable S k , 0 ≤ S k ≤ 1: where D 1 = ρ (1) (D) and D 2 = ρ (2) (D) are obtained through random and independent perturbations of the data set D; the intuitive idea is that if S k is concentrated close to 1, the corresponding clustering is stable with respect to a given controlled perturbation and hence it is reliable.
Let f k (s) be the density function of S k and F k (s) its cumulative distribution function. A parameter of concentration implicitly used in [13] is the integral g(k) of the cumulative distribution: Note that if S k is centered in 1, g(k) is close to 0, and hence it can be used as a measure of stability. Moreover, the following facts show that g(k) is strictly related to both the expectation E[S k ] and the variance Var[S k ] of the random variable S k : Indeed, integrating by parts: Since 0 ≤ S k ≤ 1 it follows ≤ S k ; therefore, using Fact 1: In fact E[S k ] may be estimated by the empirical mean ξ k :

-based test for the assessment of the significance of the solutions
In this section we propose a method for automatically finding the " optimal" number of clusters and to detect significant and possibly multi-level structures simultaneously present in the data. First of all, let us consider the vector (ξ 2 , ξ 3 ,...,ξ H+1 ) (eq. 6) computed by using the output of the MOSRAM algorithm. We may perform a sorting of this vector: where p is the permutation index such that ξ p(1) ≥ ξ p(2) ≥ ... ≥ ξ (H) . Roughly speaking, this ordering represents the "most reliable" p(1)-clustering down to the least reliable p(H)-clustering; exploiting this we would establish which are the significant clusterings (if any) discovered in the data.
To this end, for each k ∈ = {2, 3,...,H + 1}, let us consider the random variable S k defined in eq. 2, whose expectation is our proposed stability index. For all k and for a fixed threshold t o ∈ [0, 1] consider the Bernoulli ran- ing to a binomial distribution with parameters m and θ k = Prob(I(S k > t o )).
If we hypothesize that all the binomial populations are independently drawn from the same distribution (i.e. θ k = θ, for all k ∈ ), for sufficiently large values of m the random variables are independent and approximately normally distributed. Consider now the random variable: This variable is known to be distributed as a χ 2 with | | -1 degrees of freedom, informally because the constraint between the random variables X k , k ∈ introduces a dependence between them, thus leading to a loss of one degree of freedom. By estimating the variance mθ(1 -θ), with the statistic m (1 -), we conclude that the following statistic is approximately distributed according to (see, e.g. [24] chapter 12, or [25] chapter 30 for more details).
A realization x k of the random variable X k (and the corresponding realization y of Y) can be computed by using the output of the MOSRAM algorithm: Using y, we can test the following alternative hypotheses: • Ho: all the θ k are equal to θ(the considered set of k-clusterings are equally reliable) • Ha: the θ k are not all equal between them (the considered set of k-clusterings are not equally reliable) If we may reject the null hypothesis at α significance level, that is we may conclude that with probability 1 -α the considered proportions are different, and hence that at least one k-clustering significantly differs from the others.
Using the above statistical test, we propose an iterative procedure to detect the significant number(s) of clusterings:

Results and discussion
We present experiments with synthetic and gene expression data to show the effectiveness of our approach. At first, using synthetic data, we show that our proposed methods can detect not only the "correct" number of clusters, but also multiple structures underlying the data. Then we apply our MOSRAM algorithm to discover the " natural" number of clusters in gene expression data, and we compare the results with other algorithms for model order selection. In our experiments we used the classical kmeans [26] and Prediction Around Medoid (PAM) [27] clustering algorithms, and we applied the Bernoulli, Achlioptas and Normal random projections, but in this section we show only the results obtained with Bernoulli projections, since with the other randomized maps we achieved the same results without any significant difference. In all our experiments we set the threshold t o (see Section "A χ 2based test for the assessment of the significance of the solutions" to 0.9. Moreover we applied our proposed χ 2 -based procedure to individuate sets of significant k-clusterings into the data. The methods and algorithms described in this paper have been implemented in the mosclust R package, publicly available at [28].

Detection of multiple levels of structure in synthetic data
To show the ability of our method to discover multiple structures simultaneously present in the data, we propose an experiment with a 1000-dimensional synthetic multivariate gaussian data set (sample1) with relatively low cardinality (60 examples), characterized by a two-level hierarchical structure, highlighted by the projection of the data into the two main principal components (Figure 1): indeed a two-level structure, with respectively 2 and 6 clusters is self-evident in the data.
Two clusterings (using the Prediction Around Medoid algorithm) are detected at 10 -4 significance level by applying our MOSRAM algorithm and the proposed χ 2 -based statistical procedure. In particular we performed 100 pairs of Bernoulli projections with a distortion bounded to 1.2 (ε = 0.2), yielding to random projections from 1000 to 479-dimensional subspaces. Indeed Table 1 reports the sorted means of the stability measures together with their variance and the corresponding p-values computed according to the proposed χ 2 -based statistical test, showing that 2 and 6-clusterings are the best scored, as well as the most significant k-clusterings discovered in the data. This situation is depicted in Figure 2, where the histograms of the similarity measures for k = 2 and k = 6 clusters are tightly concentrated near

Experiments with DNA microarray data
To show the effectiveness of our methods with gene expression data we applied MOSRAM and the proposed statistical test to Leukemia [29] and Lymphoma [1] samples. These data sets have been analyzed with other model order selection algorithms previously proposed [10,13,[30][31][32]: at the end of this section we compare the results obtained with the cited methods with our proposed MOSRAM algorithm.

Leukemia
This well known data set [29] is composed by 72 leukemia samples analyzed with oligonucleotide Affymetrix microarrays. The Leukemia data set is composed by a group of 25 acute myeloid leukemia (AML) samples and another group of 47 acute lymphoblastic leukemia (ALL) samples, that can be subdivided into 38 B-Cell and 9 T-Cell subgroups, resulting in a two-level hierarchical structure. We applied the same pre-processing steps performed by the authors of the Leukemia study [29], obtaining 3571 genes from the original 7129 gene expression values. We further selected the 100 genes with the highest variance across samples, since low variance genes are unlikely to be informative for the purpose of clustering [10,31]. We analyzed both the 3571-dimensional data and the data restricted to the 100 genes with highest variance, using respectively Bernoulli projections with ε ∈ {0.1, 0.2, 0.3, 0.4} and projections to 80-dimensional subspaces. In both cases the k-means clustering algorithm has been applied. Figure 3 summarizes the results using gene expression levels of the genes with highest variance. Table 2 reports the sorted means of the stability measures together with their variance and the corresponding p-values computed according to the proposed χ 2 -based statistical test. By these results 2 and 3 clusters are correctly predicted at α = 10 -5 significance level. Indeed the empirical means of the stability measures (eq. 6) for 2 and 3 clusters are quite similar and the corresponding lines (black and red) of the empirical cumulative distributions (ecdfs) cross several times, while the other ecdfs are clearly apart from them ( Figure 3). The same results are approximately obtained also using the 3571-dimensional data with random projections to 428-dimensional subspaces (ε = 0.2), but 2 and 3 clusters are predicted at α = 10 -13 significance level. Similar results are also achieved with ε = 0.1 and ε = 0.3, while with ε = 0.4 the results are less reliable due to the relatively large distortion induced (data not shown). Also using the PAM [27] and hierarchical clustering algorithms with the Ward method [33] we obtained a two-level structure with 2 and 3 clusters at α = 10 -5 significance level.

Lymphoma
Three different lymphoid malignancies are represented in the Lymphoma gene expression data set [1]: Diffuse Large B-Cell Lymphoma (DLBCL), Follicular Lymphoma (FL) and Chronic Lymphocytic Leukemia (CLL). The gene expression measurements are obtained with a cDNA microarray specialized for genes related to lymphoid diseases, the Lymphochip, which provides expression levels for 4026 genes [34]. The 62 available samples are subdivided in 42 DLBCL, 11 CLL and 9 FL. We performed preprocessing of the data according to [1], replacing missing values with 0 and then normalizing the data to zero mean and unit variance across genes. As a final step, according to [10], we further selected the 200 genes with highest variance across samples, obtaining a resulting data set with A two-level hierarchical structure with 2 and 6 clusters is revealed by principal components analysis (data projected into the two components with highest variance) Figure 1 A two-level hierarchical structure with 2 and 6 clusters is revealed by principal components analysis (data projected into the two components with highest variance).  The k-means clustering algorithm has been applied.
The results with highest variance genes are summarized in Figure 4 and Table 3. The statistical test identifies as significant only the 2-clustering. Indeed, looking at the ecdf of the stability index values (left), the 2-clustering (black) is clearly separated from the others. The 3 (red) and 4-clustering (green) graphs, are quite distinct from the others, as shown also by the corresponding empirical mean of the stability index values (Figure 4), but they are also clearly separated from the 2-clustering curve. Accordingly, our proposed χ 2 -based test found a significant difference between the 2-clustering and all the others. Similar results are obtained also with hierarchical clustering and PAM algorithms. Using all the 4026 genes and Bernoulli random projections to 413-dimensional (ε = 0.2) subspaces with the Ward's hierarchical clustering algorithm our method finds as significant the 3-clustering as well as the 2-clustering. In this case also similar results are obtained with ε = 0.1 and ε = 0.3. It is worth noting that the subdivision of Lymphoma samples in 3 classes (DLBCL, CLL and FL) have been defined on histopathological and morphological basis and it has been shown that this classification does not correspond to the bio-molecular characteristics and to clinical outcome classes of non Hodgkin lymphomas. In particular studies based on the gene expression signatures of the DLBCL patients [1] and on their supervised analysis [35], showed the existence of two subclasses of DLBCLs. Moreover Shipp et al. [36] highlighted that FL patients frequently evolve over time and acquire the clinical features of DLBCLs, and Lange et al. [10] found that a 3-clustering solution groups together FL, CLL and a subgroup of DLBCLs, while another subgroup of DLBCLs sets up another cluster, even if the overall stability of the clustering is lower with respect to the 2-clustering solution.
The relationships between FL and subgroups of DLBCL patients are confirmed also by recent studies on the individual stability of the clusters in DLBCL and FL patients [15]. These considerations show also that the stability analysis of patients clusters in DNA microarray analysis are only the first step to discover significant subclasses of pathologies at bio-molecular level, while another necessary step is represented by the bio-medical validation.

Comparison with other methods
We compared the results obtained by the MOSRAM algorithm with other model order selection methods using the Leukemia and Lymphoma data sets analyzed in the previous section. In particular we focused our comparison with other state-of-the-art stability-based methods proposed in the literature.  The Model Explorer algorithm adopts subsampling techniques to perturb the data (data are randomly drawn without replacement) and applies stability measures based on the empirical distribution of the stability measures [13]. This approach is quite similar to ours but we applied random projections to perturb the data and a statistical test to identify significant numbers of clusters, instead of simply qualitatively looking at the distributions of the stability indices. The Figure of Merit measure is based on a resampling approach too, but the stability of the solutions is assessed directly comparing the solution obtained on the full sample with that obtained on the subsamples [32]. We considered also stability-based methods that apply supervised algorithms to assess the quality of the discovered clusterings instead of comparing pairs of perturbed clusterings [10,31]: the main differences between these Histograms of the similarity measure distributions for different numbers of clusters last approaches are the choice of the supervised predictor and other parameters (no guidance is given in [31], while in [10] a more structured approach is proposed). Finally we considered also a non-stability-based method, the Gap statistic, that applies an estimates of the gap between the total sum within-class dissimilarities and a null reference distribution (the uniform distribution on the smallest hyper-rectangle that contains all the data) to assess the "optimal" number of clusters in the data. Table 4 shows the number of clusters selected by the different methods, as well as their "true" number. The "true" number is estimated according to the a priori biological knowledge about the data [1,29] (see Section Experiments with DNA microarray data). The best results achieved with the two gene expression data sets are highlighted with a bounding box. The MOSRAM algorithm achieves results competitive with the other state-of-the-art model order selection methods. Indeed MOSRAM correctly predicts the "true" number of clusters with the Leukemia data set and partially with the Lymphoma data set. Note that the 2clustering prediction with Lymphoma may be considered reliable, as outlined in the corresponding experimental section.
Leukemia data set: empirical cumulative distribution functions of the similarity measures for different number of clusters k involved in microarray chips, the redundancy of information stored in the spots of microarrays are all characteristics in favour of our approach. On the contrary using bootstrapping techniques to obtain smaller samples from just small samples of patients should induce more randomness in the estimate of cluster stability. A resampling based approach appears to be better suited to evaluate the cluster stability of genes, since significantly larger samples are available in this case [12]. The alternative based on noise injection into the data to obtain multiple instance of perturbed data poses difficult statistical problems for evaluating what kind and which magnitude of noise should be added to the data [17].
All the perturbation-based methods need to properly select a parameter to control the amount of perturbation of the data: resampled-based methods need to select the "optimal" fraction of the data to be subsampled; noiseinjection-based methods needs to choice the amount of noise to be introduced; random subspace and random projections-based methods needs to select the proper dimension of the projected data. Anyway, our approach cumulative distribution similarities provides a theoretically motivated method to automatically find an "optimal" value for the perturbation parameter, and in our experiments we observed that values of ε ≤ 0.2 led to reliable results. Moreover our proposed approach provides also a statistical test that may be applied also with other stability-based methods to assess the significance of the discovered solutions.
Despite of the convincing experimental results obtained with stability-based methods there are some drawbacks and open problems associated with these techniques. Indeed, as shown by [8], a given clustering may converge to a suboptimal solution owing to the shape of the data manifold and not to the real structure of the data, thus introducing bias in the stability indices. Moreover in [37] it has been shown that stability based methods based on resampling techniques, when cost-based clustering algorithms are used, may fail to detect the correct number of clusters, if the data are not symmetric. However it is unclear if these results may be extended to other stabilitybased methods (e.g. to our proposed methods based on random projections) or to other more general classes of clustering algorithms.

Conclusion
We proposed a stability-based method, based on random projections, for assessing the validity of clusterings discovered in high-dimensional post-genomic data. The reliability of the discovered k-clusterings may be estimated exploiting the distribution of the clustering pairwise similarities, and a χ 2 -based statistical test tailored to unsupervised model order selection. In the theoretical framework of randomized maps that satisfy the JL lemma, a principled approach to select the dimension of the projected data, and to approximately preserve the structure of the original data is given, thus yielding to the design of reliable stability indices for model order selection in bio-molecular data clusterings.
The χ 2 -based statistical test may be applied to any stability method that make use of the distribution of the similarity measures between pairs of clusterings.
Our experimental results with synthetic data and real gene expression data show that our proposed method is able to find significant structures, comprising multiple structures simultaneously present into bio-molecular data.
As an outgoing development, considering that the χ 2based test assumes that the random variables representing distributions for different number of clusters are normally distributed, we are developing a new distribution-independent approach based on the Bernstein inequality to assess the significance of the discovered k-clusterings.