- Open Access
Discovering multi-level structures in bio-molecular data through the Bernstein inequality
© Bertoni and Valentini; licensee BioMed Central Ltd. 2008
- Published: 26 March 2008
The unsupervised discovery of structures (i.e. clusterings) underlying data is a central issue in several branches of bioinformatics. Methods based on the concept of stability have been recently proposed to assess the reliability of a clustering procedure and to estimate the “optimal” number of clusters in bio-molecular data. A major problem with stability-based methods is the detection of multi-level structures (e.g. hierarchical functional classes of genes), and the assessment of their statistical significance. In this context, a chi-square based statistical test of hypothesis has been proposed; however, to assure the correctness of this technique some assumptions about the distribution of the data are needed.
To assess the statistical significance and to discover multi-level structures in bio-molecular data, a new method based on Bernstein's inequality is proposed. This approach makes no assumptions about the distribution of the data, thus assuring a reliable application to a large range of bioinformatics problems. Results with synthetic and DNA microarray data show the effectiveness of the proposed method.
Owing to its loose assumptions, the Bernstein test is more sensitive than the chi-square test in detecting multiple structures simultaneously present in the data. It is, however, less selective, that is, subject to more false positives. By adding independence assumptions, a more selective variant of the Bernstein inequality-based test is also presented. The proposed methods can be applied to discover multiple structures and to assess their significance in different types of bio-molecular data.
- Acute Myeloid Leukemia
- Acute Lymphoblastic Leukemia
- Chronic Lymphocytic Leukemia
- Random Projection
- Multiple Structure
Unsupervised cluster analysis of bio-molecular data is one of the main and well-established research lines in bioinformatics. Classes of co-expressed genes, classes of functionally related proteins, or subgroups of patients with malignancies differentiated at the bio-molecular level can be discovered through clustering algorithms, and several other tasks related to the analysis of bio-molecular data require the development and application of unsupervised clustering techniques [2–4]. In most bioinformatics problems, however, we also need to assess the reliability of the discovered clusters and to select the “natural” number of clusters underlying the data.
Recently, several methods based on the concept of stability have been proposed to estimate the “optimal” number of clusters [6, 7]: multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations. Different procedures may be applied to randomly perturb the data, ranging from bootstrapping techniques to noise injection into the data or random projections into lower dimensional subspaces [10, 11].
A major problem with stability-based methods is the detection of multi-level structures underlying the data (e.g. hierarchical subclasses of diseases, or hierarchical functional classes of genes). For instance, the data may exhibit a hierarchical structure, with subclusters inside other clusters, and we need to detect these multi-level structures, possibly estimating their reliability and statistical significance. A χ2-based statistical test of hypothesis has previously been proposed to assess the significance of the “optimal” number of clusters and to discover multiple structures simultaneously present in bio-molecular data. With this approach, however, some assumptions about the distribution of the similarity measures are needed to estimate the reliability of the obtained clusterings, and the test results depend on the choice of user-defined parameters.
In this contribution we propose a distribution-free approach that does not assume any “a priori” distribution of the similarity measures, and that does not require any additional user-defined parameter. The proposed approach is based on the classical Bernstein inequality and, owing to its loose assumptions about the distribution of the data, may in principle be applied to any unsupervised model order selection problem. More precisely, the proposed stability-based method may be applied to several tasks related to the unsupervised analysis of complex bio-molecular data: (a) the assessment of the reliability of a given clustering solution; (b) the clustering model order selection, that is the discovery of the “natural” number of clusters in the data; (c) the assessment of the statistical significance of a given clustering solution; (d) the discovery of multiple structures underlying the data, i.e. the detection of multiple reliable clustering solutions at a given significance level.
In the following sections we summarize the characteristics of the stability-based procedures for the assessment of the reliability of clusterings, and we introduce our proposed method based on the Bernstein inequality.
Model order selection through stability based procedures
Let C be a clustering algorithm, ρ(D) a given random perturbation procedure applied to a data set D, and sim a suitable similarity measure between two clusterings (e.g. the Jaccard similarity). Among the random perturbations we recall random projections from a high dimensional to a low dimensional subspace, or bootstrap procedures to sample a random subset of data from the original data set D. Fixing an integer k (the number of clusters), we define S_k (0 ≤ S_k ≤ 1) as the random variable given by the similarity between two k-clusterings obtained by applying the clustering algorithm C to a pair of data sets D1 and D2, obtained by randomly and independently perturbing the original data D.
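To make the similarity measure concrete, the following minimal sketch computes a pair-counting Jaccard similarity between two clusterings. It assumes that both clusterings are given as label lists over the same items (for subsampling-based perturbations this would be the items shared by the two samples); the function name `jaccard_similarity` and the convention for the degenerate case are illustrative choices, not taken from the original work.

```python
from itertools import combinations


def jaccard_similarity(labels_a, labels_b):
    """Pair-counting Jaccard similarity between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1      # pair co-clustered in both clusterings
        elif same_a:
            only_a += 1    # pair co-clustered only in the first clustering
        elif same_b:
            only_b += 1    # pair co-clustered only in the second clustering
    denom = both + only_a + only_b
    # Assumed convention: if no pair is co-clustered anywhere, treat the clusterings as identical.
    return 1.0 if denom == 0 else both / denom
```

The value lies in [0, 1] and does not depend on how the cluster labels are named, which is what a similarity between two independently computed k-clusterings requires.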
If g(k) is close to 0 then the values of the random variable S_k are close to 1 and hence the k-clustering is stable, while for larger values of g(k) the k-clustering is less reliable. This observation comes from the following fact:
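A sketch of this relation, assuming that g(k) denotes the integral over [0, 1] of the cumulative distribution function F_k of S_k (an assumption, since the definition is not restated in this text): because 0 ≤ S_k ≤ 1, the expected value of S_k can be written through its tail probabilities, which gives

```latex
g(k) = \int_0^1 F_k(s)\,ds ,
\qquad
E[S_k] = \int_0^1 \Pr\{S_k > s\}\,ds
       = \int_0^1 \bigl(1 - F_k(s)\bigr)\,ds
       = 1 - g(k).
```

Under this reading, g(k) close to 0 is equivalent to E[S_k] close to 1, and since S_k never exceeds 1, the values of S_k must then concentrate near 1.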
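The expected value E[S_k] can be estimated by the empirical mean ξ_k of the observed similarity values S_kj, where S_kj represents the similarity between two k-clusterings obtained through the application of the algorithm C to the j-th pair of perturbed data.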
The empirical means can then be sorted in decreasing order, ξ_p(1) ≥ ξ_p(2) ≥ … ≥ ξ_p(H), where p is an index permutation over the H clusterings considered. In this way we obtain an ordering of the clusterings, from the most to the least reliable one.
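The ordering step can be sketched in a few lines. The example assumes that the similarity values observed for each number of clusters k are already available as lists; the function name `rank_clusterings` and the numbers in the usage lines are purely illustrative.

```python
from statistics import mean


def rank_clusterings(similarities):
    """Return the empirical means and the numbers of clusters sorted by decreasing mean.

    similarities: dict mapping k to the list of observed similarity values for k.
    """
    xi = {k: mean(values) for k, values in similarities.items()}
    # order[0] plays the role of p(1), order[-1] the role of p(H).
    order = sorted(xi, key=lambda k: xi[k], reverse=True)
    return xi, order


# Illustrative values only, not taken from the experiments.
xi, order = rank_clusterings({2: [0.9, 1.0], 3: [0.8, 0.9], 4: [0.5, 0.6]})
```

Here `order` lists the numbers of clusters from the most to the least stable one according to the empirical means.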
Exploiting this ordering, we proposed a χ2-based statistical test to detect and to estimate the statistical significance of multiple structures discovered by clustering algorithms. The main drawbacks of this approach are an implicit normality assumption for the distribution of the S_k (the random variables that measure the similarity between two perturbed k-clusterings, see above), and a user-defined threshold parameter that determines when two k-clusterings can be considered similar and “stable”. Indeed, in general we have no guarantee that the S_k random variables are normally distributed; moreover, the “optimal” choice of the threshold parameter seems to be application dependent and may affect the overall test results.
In this contribution, to address these problems we propose a new statistical method that, adopting a stability-based approach, makes no assumptions about the distribution of the random variables and does not require any user-defined threshold parameter.
Hypothesis testing based on Bernstein inequality
We briefly recall the Bernstein inequality, since it is used to build our proposed hypothesis testing procedure.
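One standard statement of the inequality, written here in assumed notation (Y_i, μ, σ², Δ and n are symbols chosen for this sketch and may differ from those used in the numbered equations), is the following. If Y_1, …, Y_n are independent random variables with 0 ≤ Y_i ≤ 1, then for every Δ > 0:

```latex
\Pr\left\{ \frac{1}{n}\sum_{i=1}^{n} Y_i - \mu \ge \Delta \right\}
\le \exp\left( \frac{-n\Delta^{2}}{2\sigma^{2} + \tfrac{2}{3}\Delta} \right),
\qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} E[Y_i], \quad
\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Var}[Y_i].
```

The bound involves only the range and the variance of the variables, which is why a test built on it needs no assumption about the shape of their distribution.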
Using the Bernstein inequality, we want to assess whether, for a given r, 2 ≤ r ≤ H, there exists a statistically significant difference between the reliability of the best p(1) clustering and that of the p(r) clustering (eq. 3). In other words, we may state the null hypothesis H0 and the alternative hypothesis Ha in the following way:
H0: the p(1) clustering is not more reliable than the p(r) clustering, that is E[S_p(1)] ≤ E[S_p(r)]
Ha: the p(1) clustering is more reliable than the p(r) clustering, that is E[S_p(1)] > E[S_p(r)]
As in the previous case, if P_err(H − r) < α we reject the null hypothesis: a significant difference is detected between the reliability of the p(1) and p(H − r) clusterings, and we iteratively continue the procedure by estimating P_err(H − r − 1).
I) The null hypothesis is rejected up to r = H − 2, that is ∀r, 1 ≤ r ≤ H − 2, P_err(H − r) < α: all the possible null hypotheses have been rejected and the only reliable clustering at the α-significance level is the top ranked one, that is the p(1) clustering.
II) The null hypothesis cannot be rejected for some r ≤ H − 2, that is, ∃r, 1 ≤ r ≤ H − 2, P_err(H − r) ≥ α: in this case the clusterings that are significantly less reliable than the top ranked p(1) clustering are the p(r + 1), p(r + 2),…, p(H) clusterings.
Note that in this second case we cannot state that there is no significant difference between the first r top-ranked clusterings, since the upper bound provided by the Bernstein inequality is not guaranteed to be tight. To answer this question, we may apply the previously proposed χ2-based hypothesis testing to the remaining top ranked clusterings to establish which of them are significant at the α level, but in this case we need to assume that the similarity measures between pairs of clusterings are distributed according to a normal distribution.
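The iterative procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the mosclust implementation: the function names are hypothetical, the test statistic is taken to be the mean of the paired differences between the similarity values of the top-ranked clustering and those of the clustering under test, the unknown variance is replaced by its empirical estimate, and the deviation bound `m = 2.0` reflects the fact that a difference of two values in [0, 1] can deviate from its mean by at most 2.

```python
import math
from statistics import mean, pvariance


def bernstein_bound(n, delta, sigma2, m=1.0):
    """Bernstein upper bound on Pr{empirical mean - expected mean >= delta}.

    n: number of independent terms; sigma2: their average variance;
    m: bound on the absolute deviation of each term from its mean.
    """
    if delta <= 0:
        return 1.0  # the bound is uninformative for a non-positive gap
    return math.exp(-n * delta ** 2 / (2.0 * sigma2 + 2.0 * m * delta / 3.0))


def significant_structures(similarities, alpha, m=2.0):
    """Return the numbers of clusters not significantly less reliable than the best one.

    similarities: dict mapping k to a list of similarity values; the lists are
    assumed to have equal length and to be paired across different k.
    """
    xi = {k: mean(values) for k, values in similarities.items()}
    order = sorted(xi, key=xi.get, reverse=True)
    best = order[0]
    kept = list(order)
    for k in reversed(order[1:]):  # start from the least reliable clustering
        diffs = [a - b for a, b in zip(similarities[best], similarities[k])]
        # Plug-in estimate: the empirical variance stands in for the true one (assumption).
        p_bound = bernstein_bound(len(diffs), mean(diffs), pvariance(diffs), m)
        if p_bound >= alpha:
            break  # H0 cannot be rejected: stop and keep the remaining clusterings
        kept.remove(k)  # significantly less reliable than the top-ranked clustering
    return kept
```

Starting from the bottom of the ranking mirrors the text: clusterings are discarded one at a time while the bound stays below α, and the procedure stops at the first clustering for which the null hypothesis cannot be rejected.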
If we assume that the X_i random variables (eq. 5) are (at least approximately) independent, we can obtain a variant of the previous Bernstein inequality-based approach, which we name Bernstein ind. for brevity. With this approach we should in principle obtain lower p values, thus assuring lower false positive rates than the Bernstein test without independence assumptions.
With these independence assumptions the null hypothesis H0 and the alternative hypothesis Ha for the Bernstein ind. test can be formulated as follows:
H0: ∃i, 2 ≤ i ≤ r ≤ H such that E[S_p(1)] ≤ E[S_p(i)]: there exists at least one p(i)-clustering equally or more reliable than the first one in the group of the first r ordered clusterings.
Ha: ∀i, 2 ≤ i ≤ r ≤ H, E[S_p(1)] > E[S_p(i)]: all the clusterings in the group of the first r ordered clusterings are less reliable than the first one.
Starting from r = H, if P_err(r) < α we reject the null hypothesis: a significant difference is detected between the reliability of the p(1) clustering and that of the other clusterings among the first r, and we iteratively continue the procedure by estimating P_err(r − 1). As in the Bernstein test, the procedure is iterated until a single clustering remains (and this will be the only significant one), or until P_err(r) ≥ α; in the latter case we cannot reject the null hypothesis and the first r clusterings can be considered equally reliable. Note that, strictly speaking, in this case we can only say that at least one of the other clusterings among the first r is equally or more reliable than the first one.
In this section we apply the Bernstein test to synthetic and DNA microarray data analysis, and compare it to the previously proposed χ2-based test. For the experiments we used the mosclust R package, and all the data used in the experiments are available from the authors.
Analysis of hierarchical structures in synthetic data
Table 1. Synthetic data: comparison of the χ2 and Bernstein inequality-based tests, reporting the structures discovered at the 10^-5 significance level. Bernstein ind. stands for the Bernstein test with the assumption of independence between the random variables representing the empirical mean of the similarity measures.
Discovery of multi-level structures in DNA microarray data
As an example of the application of the Bernstein test to the discovery of multiple structures in bio-molecular data, we consider two classical DNA microarray data sets: Leukemia and Lymphoma. The Leukemia data set is composed of a group of 25 acute myeloid leukemia (AML) samples and another group of 47 acute lymphoblastic leukemia (ALL) samples, which can be subdivided into 38 B-Cell and 9 T-Cell subgroups, resulting in a two-level hierarchical structure.
We applied both resampling and random projections to lower dimensional subspaces to perturb the original data, using the R package mosclust, which implements the Bernstein-based test and the stability measures described in the section “Model order selection through stability based procedures”.
Leukemia data set: empirical means (ξ) and p values computed according to the Bernstein inequality.
Lymphoma data: comparison of the χ2 and Bernstein inequality-based tests, reporting the structures discovered at the 0.001 significance level; t represents the threshold level for the χ2-based test.
Considering the Leukemia and Lymphoma data sets, the proposed Bernstein test achieves results competitive with state-of-the-art stability methods proposed in the literature. Indeed, the Model Explorer algorithm, based on subsampling techniques, correctly detects only the 2-clustering structure in both Leukemia and Lymphoma. Another subsampling-based method (Figure of Merit) detects 2, 8 and 19-clusterings in Leukemia and 2 and 9-clusterings in Lymphoma. Stability methods that apply supervised algorithms to assess the quality of the discovered clusterings correctly detect only a 3-clustering in Leukemia and a 2-clustering in Lymphoma [6, 24]. Our previously proposed χ2-based test correctly detects both the 2- and 3-clusterings in both data sets if random projections are used as the perturbation method, but it fails to detect the 3-clustering in Lymphoma when subsampling techniques are applied. In contrast, the Bernstein test discovers the two-level structures in both Leukemia and Lymphoma, independently of the applied perturbation method.
The experimental results with both synthetic and gene expression data support the hypothesis that the Bernstein test is more sensitive to multiple structures underlying the data. Indeed, in the first experiment with synthetic data it also correctly predicts the third level of structure, that is the 12-clustering; on the other hand it is subject to false positives, as shown by the wrong discovery of a 7-clustering (Table 1). These results are confirmed by the fact that the Bernstein p values decrease more slowly than those of the χ2 test (Figs. 4 and 5): for a given significance level, the Bernstein test is therefore likely to select larger sets of structures underlying the data. The risk of an increased rate of false positives may be balanced by the assumption of independence between the random variables, yielding the proposed Bernstein ind. test (eq. 8), which is less subject to false positives but potentially less sensitive to multiple structures underlying the data.
In real applications to complex bio-molecular data, we suggest applying both the Bernstein-based and the χ2-based procedures: structures discovered by both tests are likely to be significant, and Bernstein-based tests can discover potential structures not detectable with the more selective χ2-based test. Moreover, the computational burden of the χ2 and Bernstein-based iterative procedures is negligible compared with the execution of the clustering algorithms.
We proposed a test of hypothesis based on the Bernstein inequality to assess whether there is a significant difference between the reliability of different clusterings performed on the same data. Our proposed method can be applied to discover multiple or hierarchical structures, using different clustering algorithms and different perturbation methods. Although in our experiments we applied the Bernstein test to the analysis of gene expression data, this approach may in principle be applied to discover multiple structures in any type of complex bio-molecular data. Indeed, no user-defined parameters are required, and very loose assumptions are made about the distribution of the data and the distribution of the similarity values used to estimate the stability of the discovered clusterings, thus assuring a reliable application of the method to a large range of bioinformatics problems.
Our experiments with synthetic and gene expression data show that Bernstein-based tests are more sensitive than χ2-based tests to multiple structures embedded in the data: in this way, structures that are not self-evident may also be detected, as well as subtle relationships between the data. A drawback of the Bernstein test is its larger expected rate of false positives; however, by assuming independence between the empirical means of the similarity values, a new test (Bernstein ind.), less subject to false positives, has been proposed.
Future developments of this work could consist of the adaptation and application of the proposed methods to large scale bioinformatics problems, to discover multiple structures underlying the data when a very large number of clusters is potentially involved.
We would like to thank the anonymous reviewers for their comments and suggestions. This work has been developed in the context of CIMAINA Center of Excellence, and it has been funded by the Italian COFIN project Linguaggi formali ed automi: metodi, modelli ed applicazioni.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 2, 2008: Italian Society of Bioinformatics (BITS): Annual Meeting 2007. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S2
- Dopazo J: Functional Interpretation of Microarray Experiments. OMICS 2006, 10(3):398–410. doi:10.1089/omi.2006.10.398
- Gasch P, Eisen M: Exploring the conditional regulation of yeast gene expression through fuzzy k-means clustering. Genome Biology 2002, 3(11).
- Dyrskjøt L, Thykjaer T, Kruhøffer M, Jensen J, Marcussen N, Hamilton-Dutoit S, Wolf H, Ørntoft T: Identifying distinct classes of bladder carcinoma using microarrays. Nature Genetics 2003, 33:90–96. doi:10.1038/ng1061
- Kaplan N, Friedlich M, Fromer M, Linial M: A functional hierarchical organization of the protein sequence space. BMC Bioinformatics 2004, 5:196. doi:10.1186/1471-2105-5-196
- Handl J, Knowles J, Kell D: Computational cluster validation in post-genomic data analysis. Bioinformatics 2005, 21(15):3201–3215. doi:10.1093/bioinformatics/bti517
- Lange T, Roth V, Braun M, Buhmann J: Stability-based Validation of Clustering Solutions. Neural Computation 2004, 16:1299–1323. doi:10.1162/089976604773717621
- Bertoni A, Valentini G: Model order selection for bio-molecular data clustering. BMC Bioinformatics 2007, 8(Suppl 2):S7. doi:10.1186/1471-2105-8-S2-S7
- Monti S, Tamayo P, Mesirov J, Golub T: Consensus Clustering: A Resampling-based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning 2003, 52:91–118. doi:10.1023/A:1023949509487
- McShane L, Radmacher D, Freidlin B, Yu R, Li M, Simon R: Method for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 2002, 18(11):1462–1469. doi:10.1093/bioinformatics/18.11.1462
- Smolkin M, Gosh D: Cluster stability scores for microarray data in cancer studies. BMC Bioinformatics 2003, 4:36. doi:10.1186/1471-2105-4-36
- Bertoni A, Valentini G: Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses. Artificial Intelligence in Medicine 2006, 37(2):85–109. doi:10.1016/j.artmed.2006.03.005
- Hoeffding W: Probability inequalities for sums of independent random variables. J. Amer. Statist. Assoc. 1963, 58:13–30. doi:10.2307/2282952
- Jain A, Murty M, Flynn P: Data Clustering: a Review. ACM Computing Surveys 1999, 31(3):264–323. doi:10.1145/331499.331504
- Achlioptas D: Database-friendly random projections. In Proc. ACM Symp. on the Principles of Database Systems, Contemporary Mathematics. Edited by Buneman P. New York, NY, USA: ACM Press; 2001:274–281.
- Ben-Hur A, Ellisseeff A, Guyon I: A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing. Volume 7. Edited by Altman R, Dunker A, Hunter L, Klein T, Lauderdale K. Lihue, Hawaii, USA: World Scientific; 2002:6–17.
- Valentini G: Mosclust: a software library for discovering significant structures in bio-molecular data. Bioinformatics 2007, 23(3):387–389. doi:10.1093/bioinformatics/btl600
- Valentini G: Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data. Bioinformatics 2006, 22(3):369–370. doi:10.1093/bioinformatics/bti817
- Ward J: Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963, 58:236–244. doi:10.2307/2282967
- Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley; 1990.
- Golub T, et al.: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 1999, 286:531–537. doi:10.1126/science.286.5439.531
- Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A, Boldrick J, Sabet H, Tran T, Yu X, Powell J, Yang L, Marti G, Moore T, Hudson J, Lu L, Lewis D, Tibshirani R, Sherlock G, Chan W, Greiner T, Weisenburger D, Armitage J, Warnke R, Levy R, Wilson W, Grever M, Byrd J, Botstein D, Brown P, Staudt L: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403:503–511. doi:10.1038/35000501
- Alizadeh A, et al.: The Lymphochip: a specialized cDNA microarray for genomic-scale analysis of gene expression in normal and malignant lymphocytes. In Cold Spring Harbor Symp. Quant. Biol. 2001.
- Levine E, Domany E: Resampling method for unsupervised estimation of cluster validity. Neural Computation 2001, 13(11):2573–2593. doi:10.1162/089976601753196030
- Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 2002, 3(7):RESEARCH0036. doi:10.1186/gb-2002-3-7-research0036
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.