A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis
- Raffaele Giancarlo^{1},
- Giosué Lo Bosco^{1}Email author,
- Luca Pinello^{2} and
- Filippo Utro^{3}
https://doi.org/10.1186/1471-2105-14-S1-S6
© Giancarlo et al.; licensee BioMed Central Ltd. 2013
Published: 14 January 2013
Abstract
Background
Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. Following Handl et al., it can be summarized as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research.
Results
A procedure is proposed for the assessment of the discriminative ability of a distance function. That is, the evaluation of the ability of a distance function to capture structure in a dataset. It is based on the introduction of a new external validation index, referred to as Balanced Misclassification Index (BMI, for short) and of a nontrivial modification of the well known Receiver Operating Curve (ROC, for short), which we refer to as Corrected ROC (CROC, for short). The main results are: (a) a quantitative and qualitative method to describe the intrinsic separation ability of a distance; (b) a quantitative method to assess the performance of a clustering algorithm in conjunction with the intrinsic separation ability of a distance function. The proposed procedure is more informative than the ones available in the literature due to the adopted tools. Indeed, the first one allows to map distances and clustering solutions as graphical objects on a plane, and gives information about the bias of the clustering algorithm with respect to a distance. The second tool is a new external validity index which shows similar performances with respect to the state of the art, but with more flexibility, allowing for a broader spectrum of applications. In fact, it allows not only to quantify the merit of each clustering solution but also to quantify the agglomerative or divisive errors due to the algorithm.
Conclusions
The new methodology has been used to experimentally study three popular distance functions, namely, Euclidean distance d_{2}, Pearson correlation d_{ r } and mutual information d_{ MI }. Based on the results of the experiments, we have that the Euclidean and Pearson correlation distances have a good intrinsic discrimination ability. Conversely, the mutual information distance does not seem to offer the same flexibility and versatility as the other two distances. Apparently, that is due to well known problems in its estimation. since it requires that a dataset must have a substantial number of features to be reliable. Nevertheless, taking into account such a fact, together with results presented in Priness et al., one receives an indication that d_{ MI } may be superior to the other distances considered in this study only in conjunction with clustering algorithms specifically designed for its use. In addition, it results that K-means, Average Link, and Complete link clustering algorithms are in most cases able to improve the discriminative ability of the distances considered in this study with respect to clustering. The methodology has a range of applicability that goes well beyond microarray data since it is independent of the nature of the input data. The only requirement is that the input data must have the same format of a "feature matrix". In particular it can be used to cluster ChIP-seq data.
Keywords
Background
Recently, medical and biological research has been deeply influenced by the advent of high throughput technologies such as microarrays and RNA-seq platforms. They enable the acquisition of data that are fundamental for research in several areas of the biological sciences such as understanding biological systems and diagnosis (e.g. [1]). A fundamental aspect of microarray data analysis consists of clustering gene expression data [2, 3]. However, its application to post-genomic data has revealed to be rather ad hoc. That is the reason why there is hardly any consensus on the best distance function and clustering algorithm to be used for the different types of post-genomic data. As a consequence, the common practice is to use several different clustering techniques to analyse a dataset, and to resort to visual inspection and prior biological knowledge to select what is considered the most "appropriate" result. Clearly, this data analysis is highly subjective, implying obvious risks. Those observations have motivated Handl et al. [4] to write a seminal paper with the intent to show to both bioinformatics researchers and end-users some of the fundamental aspects of the clustering methodology. The main message of that paper is that clustering can be considered as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm and (3) choice of a methodology to estimate the statistical significance of clustering solutions. Points (2) and (3) lead into two well established and rich research areas in data analysis ranging from statistics to Computer Science. Although computational methods for the analysis of microarray data have witnessed an exponential growth, a few contributions have been given in trying to assess their merits [5]. As a result, the need for a comprehensive evaluation of the whole analysis process for microarray data is being recognized and a few benchmarking studies have appeared [6–8]. Unfortunately, point (1) has been hardly investigated regarding this new type of data and very few results on this topic are available (see [2, 9, 10] and references therein).
In this paper, we address point (1) by introducing a new qualitative and quantitative method to describe and assess the discriminative ability of a distance function alone and in conjunction with a clustering algorithm. Moreover, the methodology is also able to give indications about the bias of clustering algorithms with respect to distances. It is worth recalling that very little is known about this latter point, one of the difficulties being a fair comparison between the performance of a distance function and a clustering algorithm measured in terms of their classification ability. This point is discussed in detail in the Methods section. The overall methodology that is introduced here makes use of the ROC plane and the ROC curve [11] in order to define the new external clustering validation index BMI and the new CROC curve. The net effect is the delivery of a methodology that rigorously uses external knowledge in order to assess the performance of a distance function while granting a fair comparison with clustering solutions generated by a clustering algorithm. It is worth mentioning that previous approaches to this problem presented the shortcomings of being based only on internal indices [10], i.e., homogeneity and separation: Indeed, it is well known that external validation is more accurate than the internal one [12]. The remainder of this paper is organized as follows. The experimental set-up we have used and the results are presented in the next section. Then, some conclusions and directions of future research are offered next. Finally, the Methods section describes in detail the new methodology to assess the intrinsic separation ability of three distance functions, and its use in conjunction with clustering algorithms.
Results and discussion
Experimental setup
Datasets
Technically speaking, a gold solution GS for a dataset is a partition of the data in a number of classes known a priori. Membership of a class is established by assigning the appropriate class label to each element. This means that the partition of the dataset in classes is based on some external knowledge that leaves no ambiguity on the actual number of classes and on their composition in terms of class memberships. Moreover, is also important to state that there exist two main kinds of gold solution datasets, i.e., (i) the ones for which an priori division in to classes of the dataset is known; (ii) and the ones for which high quality partitions have been inferred by analyzing the data. Dudoit and Fridlyand [13] elegantly make clear that difference in a related study and we closely follow their approach here.
Each dataset is a matrix, in which each row corresponds to an element to be clustered and each column to an experimental condition. The nine datasets, together with the acronyms used in this paper, are reported next. For conciseness, we mention only some relevant facts about them. The interested reader can find additional information in Dudoit and Fridlyand [13] for the Lymphoma and NCI60 datasets, Di Gesú et al. [14] for the CNS Rat, Leukemia and Yeast datasets and in Monti et al. [15], for the remaining ones.
CNS Rat: It is a 112 × 17 data matrix, obtained from the expression levels of 112 genes during a rat's central nervous system development. The dataset was studied by Wen et al. [16] and they suggested a partition of the genes into six classes, four of which are composed of biologically, functionally-related genes. This partition is taken as the gold solution, which is the same one used for the validation of FOM [17].
Gaussian3: It is a 60 × 600 data matrix. It is generated by having 200 distinctive features out of the 600 assigned to each cluster. There is a partition into three classes and that is taken as the gold solution. The data simulates a pattern whereby a distinct set of 200 genes is up-regulated in one of the three clusters, and down-regulated in the remaining two.
Gaussian5: It is a 500 × 2 data matrix. It represents the union of observations from 5 bivariate Gaussians, 4 of which are centered at the corners of the square of side length λ, with the 5th Gaussian centered at (λ/2, λ/2). A total of 250 samples, 50 per class, were generated, where two values of λ are used, namely, λ = 2 and λ = 3, to investigate different levels of overlapping between clusters. There is a partition into five classes and that is taken as the gold solution.
Leukemia: It is a 38 × 100 data matrix, where each row corresponds to a patient with acute leukemia and each column to a gene. The original microarray experiment consists of a 72 × 6817 matrix, due to Golub et al. [18]. In order to obtain the current dataset, Handl et al. [4] extracted from it a 38 × 6817 matrix, corresponding to the learning set in the study of Golub et al. and, via preprocessing steps, they reduced it to the current dimension by excluding genes that exhibited no significant variation across samples. The interested reader can find details of the extraction process in Handl et al.. For this dataset, there is a partition into three classes and that is taken as the gold solution. It is also worthy of mention that Leukemia has become a benchmark standard in the cancer classification community [19].
Lymphoma: It is a 80 × 100 data matrix, where each row corresponds to a tissue sample and each column to a gene. The dataset comes from the study of Alizadeh et al. [20] on the three most common adult lymphoma tumors. There is a partition into three classes and it is taken as the gold solution. The dataset has been obtained from the original microarray experiments, consisting of an 80 × 4682 data matrix, following the same preprocessing steps detailed in Dudoit and Fridlyand [13].
NCI60: It is a 57 × 200 data matrix, where each row corresponds to a cell line and each column to a gene. This dataset originates from a microarray study in gene expression variation among the sixty cell lines of the National Cancer Institute anti-cancer drug screen [21], which consists of a 61 × 5244 data matrix. There is a partition of the dataset into eight classes, for a total of 57 cell lines, and it is taken as the gold solution. The dataset has been obtained from the original microarray experiments as described by Dudoit and Fridlyand [13].
Novartis: It is a 103 × 1000 data matrix, where each row corresponds to a tissue sample and each column to a gene. The dataset comes from the study of Su et al. [22] on four distinct cancer types. There is a partition into four classes and we take that as the gold solution.
Simulated6: It is a 60 × 600 data matrix. It consists of a 600-gene by 60-sample dataset. It can be partitioned into 6 classes with 8, 12, 10, 15, 5, and 10 samples respectively, each marked by 50 distinct genes uniquely up-regulated for that class. In addition, a list of 300 noise genes (i.e., genes having the same distribution within all clusters) are included. In particular, such genes are generated with decreasing differential expression and increasing variation, following the same distribution. Finally, the first block of 50 genes of the list is assigned to cluster 1, the second block to cluster 2 and so on. This partition into 6 classes is taken as the gold solution.
Yeast: It is a 698 × 72 data matrix, studied by Spellman et al. [23] whose analysis suggests a partition of the genes into five functionally-related classes, which is taken as the gold solution and which has been used by Shamir and Sharan for a case study on the performance of clustering algorithms [24].
Distances
- 1.
δ(x, y) ≥ 0 (non-negativity);
- 2.
δ(x, y) = δ(y, x) (symmetry);
- 3.
δ(x, x) = 0;
In the case of microarray data, $X$ = ${\mathbb{R}}^{m}$, i.e. each data point $\overrightarrow{x}$ is a vector in m-dimensional space. Note that a dataset X is a finite subset of $X$, |X| = n. One can categorize distance functions according to three broad classes: geometric, correlation-based and information-based. Functions in the first class capture the concept of physical distance between two objects. They are strongly influenced by the magnitude of change in the measured components of vectors $\overrightarrow{x}$ and $\overrightarrow{y}$, making them sensitive to noise and outliers. Functions in the second class capture dependencies between the coordinates of two vectors. In particular, they usually have the benefit of capturing positive, negative and linear relationships between two vectors. Functions in the third class are defined via well known quantities in information theory such as entropy and mutual information [25]. They have the advantage of capturing statistical dependencies between two discrete data points, even if they are not linear. Unfortunately, when one tries to apply them to points in ${\mathbb{R}}^{m}$, a suitable discretization process must be carried out, known as binning, which usually poses some non-trivial challenges. For our experiments, we have considered the Euclidean distance, the Pearson correlation and Mutual Information since they are excellent representatives of the three categories described above. Indeed, they have been shown to be the most suitable for microarray data [9]. For the convenience of the reader, they are defined in the Methods section.
In what follows, we refer to distance and dissimilarity functions with the generic term distance functions.
Algorithms and hardware
In our experiments, we have chosen K-means among Partitional Methods, and Average Link, Complete Link and Single Link among the Hierarchical Methods clustering algorithms. The details of those algorithms are not reported here and the interested reader can find a detailed description of them in [26]. Of course, each of the above mentioned algorithms has already been used for data analysis of microarray data, e.g. [14, 27–29]. All experiments were performed on several state-of-the-art PCs.
Evaluating the performance of distance functions via the BMI index and the CROC curve
- (A)
Assessment of the intrinsic separation ability of a distance. That is, how well a distance discriminates independently of its use within a clustering algorithm.
- (B)
Assessment of the predictive clustering algorithm ability of a distance. That is, which distance function grants the best performance when used within a clustering algorithm.
- (C)
The interplay between (A) and (B).
A ROC plane is a plane where y = TPR and x = FPR = 1 - TNR, and it is useful to measure a classification in terms of TPR and FPR rates, once having established to represent with 0 the positive class. Note that, since a classifier assigns data items to classes, the TPR represents the percentage of item pairs correctly assigned to different classes, while the FPR is the percentage of item pairs incorrectly assigned to different classes. In the ROC plane, it is possible to define the ROC curve, which is a two-dimensional visualization of TPR versus FPR for increasing threshold values. Indeed, the area under this curve (AUC for short) is defined in the range [0,1], where a value of 0.5 corresponds to the performance of a classifier with a random assignment rule, while the closer is AUC to one, the better is the performance of the classifier. The CROC curve of a distance is a transformation of the ROC curve in which each point corresponds to a proper clustering solution.
- (C.1)
showing how to map a clustering solution into the ROC plane (see subsection Clustering solutions, ROC plane and the BMI index)
- (C.2)
introducing a distance between a clustering solution and GS (see subsection A procedure to compare distance functions and clustering algorithms via ROC analysis);
- (C.3)
showing how (C.1) and (C.2) can be used to fairly compare the intrinsic ability of distance functions and of a clustering algorithms to identify "structure" in a dataset (see subsection A procedure to compare distance functions and clustering algorithms via ROC analysis).
The BMI takes values in the range [0,1]. Moreover, the closer the value of the index is to zero, the better the agreement between a partition and the GS. Therefore, in contrast to the most popular external validation indices [27], which need to be maximized, BMI needs to be minimized.
Results
The BMI can be regarded as the core of the methodology, and here it has been used in multiple ways: to assess a distance, to assess an algorithm, or purely as an external validation index. In the following, we give the result details for each one of these uses.
BMI-values
d _{2} | d _{ r } | d _{ MI } | |
---|---|---|---|
CNS Rat | 0.6804 | 0.6875 | 0.6692 |
Gaussian3 | 0.7170 | 0 | 0.7102 |
Gaussian5 | 0.2358 | 0.5424 | - |
Leukemia | 0.3498 | 0.2559 | 0.3000 |
Lymphoma | 0.3509 | 0.3385 | 0.7028 |
NCI60 | 0.4699 | 0.4699 | 0.5643 |
Novartis | 0.4260 | 0.4240 | 0.4183 |
Simulated6 | 0.5022 | 0.8150 | 0.7456 |
Yeast | 0.6647 | 0.6750 | 0.6677 |
BMI vs R_{ A }
d _{2} | d _{ r } | d _{ MI } | |
---|---|---|---|
CNS Rat | -0.1397 | -0.2682 | -0.3476 |
Gaussian3 | -0.2207 | -0.997 | 0.3336 |
Gaussian5 | -0.9918 | -1 | - |
Leukemia | -0.9512 | -0.9830 | -0.9754 |
Lymphoma | -0.9498 | -0.9465 | -0.3409 |
NCI60 | -0.6241 | -0.6060 | -0.6485 |
Novartis | -0.8998 | -0.8787 | -0.8750 |
Simulated6 | -0.9249 | -0.9800 | -0.4720 |
Yeast | -0.5121 | -0.5106 | -0.6246 |
BMI vs F
d _{2} | d _{ r } | d _{ MI } | |
---|---|---|---|
CNS Rat | -0.4590 | -0.1684 | -0.5910 |
Gaussian3 | -0.5031 | -0.8714 | -0.5371 |
Gaussian5 | -0.5518 | -1 | - |
Leukemia | -0.8155 | -0.8246 | -0.8068 |
Lymphoma | -0.6329 | -0.5915 | -0.5896 |
NCI60 | -0.86139 | -0.8533 | -0.8529 |
Novartis | -0.793199 | -0.7194 | -0.8283 |
Simulated6 | -0.9419 | -0.9373 | -0.4966 |
Yeast | -0.4808 | -0.3151 | 0.5448 |
BMI vs FM
d _{2} | d _{ r } | d _{ MI } | |
---|---|---|---|
CNS Rat | -0.3408 | -0.0507 | -0.5335 |
Gaussian3 | -0.4027 | -0.91840 | -0.4383 |
Gaussian5 | -0.6297 | -1 | - |
Leukemia | -0.8701 | -0.8628 | -0.8453 |
Lymphoma | -0.6969 | -0.6624 | -0.5338 |
NCI60 | -0.6801 | -0.6429 | -0.7264 |
Novartis | -0.8194 | -0.7661 | -0.8230 |
Simulated6 | -0.9280 | -0.9255 | -0.4584 |
Yeast | -0.4183 | -0.2481 | -0.4887 |
Conclusions
In this paper we have presented a procedure to asses the discriminative ability of a distance for data clustering. Such procedure is based on the BMI, a new external validation index that has the versatility to be used to asses a distance, to asses an algorithm, or purely as an external validation index. We have applied the overall methodology on 9 datasets, in the case of the Euclidean, Pearson and Mutual Information distances. Some of the computed results agree with other state of the art external validation indices, but with respect to them our procedure is more informative since it can shed light on the bias of the clustering algorithm with respect to a distance. An important thing to stress about the proposed methodology is that, although it was validated in the context of gene expression data obtained by microarray technologies, it is worth pointing out that the proposed methodology is "generic", i.e, it can be applied to other kind of data (e.g. RNA-seq). As a future direction of investigation, we intend to extend the proposed methodology to study a very challenging problem in the field of data analysis, i.e., the quantification of the intrinsic complexity of a dataset, defined as: the difficulty for a clustering algorithm to find the correct partition of a dataset.
Methods
Definition of distance functions
We now formally define the distances used in this paper.
where $\overrightarrow{x}$ = (x_{1},...,x_{ m }), $\overrightarrow{y}$ = (y_{1},...,y_{ m }).
where $\overrightarrow{x}$ and $\overrightarrow{y}$ are the sample means of $\overrightarrow{x}$ and $\overrightarrow{y}$.
where p_{ i } = P(X = x_{ i }) and p_{ j } = P(Y = y_{ j } ) are the marginal probability mass functions (p.m.f. for short) and p_{ ij } = P(X = x_{ i },Y = y_{ j } ) the joint p.m.f. When dealing with such a distance, the problem is the estimation of the marginal and joint p.m.f., which involves a discretization of the data values, usually done by using binning and histogram based procedures [31].
Definition of external indices
Recall from [27] that an external index measures how well a clustering solution computed by an algorithm agrees with the gold solution for a given dataset. Formally, let C = {c_{1},...,c_{ r }} be the partition of the items in dataset X into r clusters, corresponding to the gold solution for that dataset. Let P = {p_{1},...,p_{ t }} be an analogous partition, possibly produced by a clustering algorithm.
An external index measures the level of agreement of the two partitions. External indices are usually defined via a r × t contingency table T, where T_{ ij } represents the number of items in both c_{ i } and p_{ j }, 1 ≤ i ≤ r and 1 ≤ j ≤ t.
where T_{ i. } = |c_{ i }| and T_{ .j } = |p_{ j } |.
Note that there is a little difference in the range of values of the three indices: while the FM and the F indices can assume a value in the range [0,1], the R_{ A } may be negative [32]. All three indices need to be maximized, that is, for each of them, the closer the index is to one, the better the agreement between the two partitions.
The BMI index and the CROC curve
The ROC plane can be used to estimate the similarity between a reference partition and a generic one as follows. The reference partition is mapped to the point (0, 1) in the ROC plane, corresponding to perfect classification. Analogously, the generic partition is mapped to a point in the ROC plane, depending on the number of "misclassified" elements with respect to the reference partition. Then, a distance measure between such a point and (0, 1) gives an indication about the similarity of the partitions. The BMI is the Euclidean distance between those two points. Moreover, the mapping of partitions into the ROC plane at the base of the BMI can be used to assess the intrinsic discriminative ability of a distance function for clustering by generating the CROC curve and by considering the closest point to (0, 1) on this curve as a clustering solution associated to the distance function. Then, the BMI between this point and (0,1) gives the required estimate. In the following subsections, we give details about the BMI and the CROC.
Clustering solutions, ROC plane and the BMI index
- 1.
Compute the connectivity matrix J_{ s } for s.
- 2.
Starting from J_{ s }, compute the confusion matrix with respect to GS.
- 3.
Use that confusion matrix to compute TPR and FPR for s. Those two variables naturally identify a point into the ROC plane, associated to s.
A few remarks are in order. The above approach naturally leads to measure a clustering solution in terms of TPR and FPR rates. As anticipated, the point into the ROC plane associated with GS is P_{ GS } = (0,1).
Given a clustering solution s, let P_{ s } = (x, y) be the point in the ROC plane corresponding to it.
The performance of s is proportional to the proximity of P_{ s } to P_{ GS }, as we now explain. Let E_{ m } be the Misclassification error rate defined as the sum between FPR (x) and False negative rate (FNR = 1 - y). That is, E_{ m } is the L1 metric (d_{1}) computed between P_{ s } and P_{ GS }, i.e., d_{1}(P_{ GS }, P_{ s }) = |x + 1 - y|. Then, the closer P_{ s } and P_{ GS } are with respect to d_{1}, the better the clustering solution with respect to E_{ m }.
It is worth pointing out that P_{ s } gives a measure of the agglomerative and divisive tendency of a generic clustering algorithm. Indeed, the greater the x value, the more divisive the clustering algorithm is. Analogously, the smaller the y value, the more agglomerative the clustering algorithm is. Indeed, we can actually devise an index that measures such a tendency.
where the weights α and β allow to tune the importance between balance and misclassification.
Among all the possible weight combinations, a natural choice for BMI is to set α and β in order to take into an equal account the misclassification error rate E_{ m } and the balancing error rate E_{ b }. This corresponds to the setting α = β = 0.5, and it is of interest and relevance here to notice that in this case BMI corresponds to d_{2}(P_{ GS },P_{ s }). That is, the L_{2} (Euclidean) metric between the points P_{ GS } and P_{ s }. This means that the closer P_{ s } and P_{ GS } are with respect to d_{2}, the better the clustering solution, in equal measure (α = β = 0.5) between misclassification error rate E_{ m } and balancing error rate E_{ b }.
Operationally, once fixed α = β = 0.5, if one wants to compute the BMI of a clustering algorithm producing a clustering solution with x = FPR and y = TPR, respectively, one needs only to compute the Euclidean distance between the points P_{ s } and P_{ GS } in the ROC plane. It is obvious that such a technique can also be used to compare the performance of several clustering solutions by considering the Euclidean distances between the associated points into the ROC plane and P_{ GS }.
A procedure to compare distance functions and clustering algorithms via ROC analysis
- 1.
Compute the ROC curve for a distance function d.
- 2.
Calculate the CROC curve starting from the ROC curve computed in the previous point.
- 3.
Find the best point into the CROC curve, i.e., the point with the lowest value of BMI, and mark it.
- 4.
Map one or more clustering solutions in the ROC plane (as described in subsection Clustering solutions, ROC plane and the BMI index) and mark the corresponding points.
- 5.
Rank the performance of each marked points in the ROC plane, as described in subsection Clustering solutions, ROC plane and the BMI index.
Declarations
The publication costs for this article were funded by the corresponding author's institution
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 1, 2013: Computational Intelligence in Bioinformatics and Biostatistics: new trends from the CIBB conference series. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S1.
Declarations
Authors’ Affiliations
References
- Stanford Microarray DataBase. [http://smd.stanford.edu/]
- D'haeseleer P: How does gene expression cluster work. Nat Biotechnol. 2005, 23: 1499-1501. 10.1038/nbt1205-1499.View ArticlePubMedGoogle Scholar
- Speed TP: Statistical analysis of gene expression microarray data. 2003, Chapman & Hall/CRCView ArticleGoogle Scholar
- Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005, 21: 3201-3212. 10.1093/bioinformatics/bti517.View ArticlePubMedGoogle Scholar
- Mehta T, Tanik M, Allison D: Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nature genetics. 2004, 36: 943-947. 10.1038/ng1422.View ArticlePubMedGoogle Scholar
- Freyhult E, Landfors M, Önskog J, Hvidsten T, Rydén P: Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering. BMC Bioinformatics. 2010, 11: 503-10.1186/1471-2105-11-503.PubMed CentralView ArticlePubMedGoogle Scholar
- Giancarlo R, Scaturro D, Utro F: Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer. BMC Bioinformatics. 2008, 9: 462-10.1186/1471-2105-9-462.PubMed CentralView ArticlePubMedGoogle Scholar
- Giancarlo R, Utro F: Speeding up the Consensus Clustering methodology for microarray data analysis. Algorithms for Molecular Biology. 2011, 6: 1-10.1186/1748-7188-6-1.PubMed CentralView ArticlePubMedGoogle Scholar
- Giancarlo R, Lo Bosco G, Pinello L: Distance functions, clustering algorithms and microarray data analysis. Lecture Notes in Computer Science, Volume 6073. 2010Google Scholar
- Priness I, Maimon O, Ben-Gal I: Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics. 2007, 8: 1-12. 10.1186/1471-2105-8-1.View ArticleGoogle Scholar
- JA Hanley BM: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982, 143: 29-36. 10.1148/radiology.143.1.7063747.View ArticleGoogle Scholar
- Jain A, Dubes R: Algorithms for Clustering Data. Englewood Cliffs: Prentice-Hall 1988Google Scholar
- Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology. 2002, 3: RESEARCH0036.PubMed CentralView ArticlePubMedGoogle Scholar
- Di Gesú V, Giancarlo R, Lo Bosco G, Raimondi A, Scaturro D: Genclust: a genetic algorithm for clustering gene expression data. BMC Bioinformatics. 2005, 6: 289-10.1186/1471-2105-6-289.PubMed CentralView ArticlePubMedGoogle Scholar
- Monti S, Tamayo P, Mesirov J, Golub T: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning. 2003, 52: 91-118. 10.1023/A:1023949509487.View ArticleGoogle Scholar
- Wen X, Fuhrman S, Michaels GS, Carr GS, Smith DB, Barker JL, Somogyi R: Large scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci USA. 1998, 95: 334-339. 10.1073/pnas.95.1.334.PubMed CentralView ArticlePubMedGoogle Scholar
- Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data. Bioinformatics. 2001, 17: 309-318. 10.1093/bioinformatics/17.4.309.View ArticlePubMedGoogle Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeeck M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.View ArticlePubMedGoogle Scholar
- Brunet JP, Tamayo P, Golub T, Mesirov J: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA. 2004, 101: 4164-4169. 10.1073/pnas.0308531101.PubMed CentralView ArticlePubMedGoogle Scholar
- Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A, Boldrick J, Sabet H, Tran T, Yu X, Powell J, Yang L, Marti G, Moore T, Hudson JJ, Lu L, Lewis D, Tibshirani R, Sherlock G, Chan W, Greiner T, Weisenburger D, Armitage J, Warnke R, Levy R, Wilson W, Grever M, Byrd J, Botstein D, Brown P, Staudt L: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403: 503-511. 10.1038/35000501.View ArticlePubMedGoogle Scholar
- NCI 60 cancer microarray project. [http://genome-www.stanford.edu/NCI60]
- Su A, Cooke M, Ching K, Hakak Y, Walker J, Wiltshire T, Orth A, Vega R, Sapinoso L, Moqrich A, Patapoutian A, Hampton G, Schultz P, Hogenesch J: Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci USA. 2002, 99: 4465-4470. 10.1073/pnas.012025199.PubMed CentralView ArticlePubMedGoogle Scholar
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9: 3273-3297. 10.1091/mbc.9.12.3273.PubMed CentralView ArticlePubMedGoogle Scholar
- Shamir R, Sharan R: Algorithmic approaches to clustering gene expression data. Current Topics in Computational Biology. Edited by: Jiang T, Smith T, Xu Y, Zhang MQ, Cambridge, Ma.: MIT Press. 2003, 120-161.Google Scholar
- Cover TM, Thomas JA: Elements of Information Theory. New York City: Wiley-Interscience, 1991.Google Scholar
- Jain AK, Murty MN, Flynn PJ: Data clustering: a review. ACM Computing Surveys. 1999, 31: 264-323. 10.1145/331499.331504.View ArticleGoogle Scholar
- Giancarlo R, Scaturro D, Utro F: A tutorial on computational cluster analysis with applications to pattern discovery in microarray data. Mathematics in Computer Science. 2008, 1: 655-672. 10.1007/s11786-007-0025-3.View ArticleGoogle Scholar
- Giancarlo R, Utro F: Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis. Theoretical Computer Science. 2012, 428: 58-79.View ArticleGoogle Scholar
- Xu Y, Olman V, Xu D: Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning tree. Bioinformatics. 2002, 18: 536-545. 10.1093/bioinformatics/18.4.536.View ArticlePubMedGoogle Scholar
- Giancarlo R, Lo Bosco G, Pinello L, Utro F: The three steps of clustering in the post-genomic era: a synopsis. Computational Intelligence Methods for Bioinformatics and Biostatistics, Volume 6685 of Lecture Notes in Computer Science. Edited by: Rizzo R, Lisboa P. 2011, Springer Berlin/Heidelberg, 13-30.Google Scholar
- BWSilverman: Density Estimation for Statistics and Data Analysis (Chapman & Hall/CRC Monographs on Statistics & Applied Probability). 1986, Chapman and Hall/CRCView ArticleGoogle Scholar
- Yeung KY: Cluster analysis of gene expression data. PhD thesis. 2001, University of WashingtonGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.