- Open Access
Merged consensus clustering to assess and improve class discovery with microarray data
© Simpson et al; licensee BioMed Central Ltd. 2010
- Received: 24 August 2010
- Accepted: 3 December 2010
- Published: 3 December 2010
One of the most commonly performed tasks when analysing high throughput gene expression data is to use clustering methods to classify the data into groups. There are a large number of methods available to perform clustering, but it is often unclear which method is best suited to the data and how to quantify the quality of the classifications produced.
Here we describe an R package containing methods to analyse the consistency of clustering results from any number of different clustering methods using resampling statistics. These methods allow the identification of the best supported clusters and additionally rank cluster members by their fidelity within the cluster. These metrics allow us to compare the performance of different clustering algorithms under different experimental conditions and to select those that produce the most reliable clustering structures. We show the application of this method to simulated data, canonical gene expression experiments and our own novel analysis of genes involved in the specification of the peripheral nervous system in the fruitfly, Drosophila melanogaster.
Our package enables users to apply the merged consensus clustering methodology conveniently within the R programming environment, providing both analysis and graphical display functions for exploring clustering approaches. It extends the basic principle of consensus clustering by allowing the merging of results between different methods to provide an averaged clustering robustness. We show that this extension is useful in correcting for the tendency of clustering algorithms to treat outliers differently within datasets. The R package, clusterCons, is freely available from CRAN and SourceForge under the GNU General Public Licence.
- Cluster Algorithm
- Cluster Structure
- Cluster Number
- Consensus Cluster
- Classify Gene Expression
The need to classify observations into groups based on shared properties is common to the analysis of many types of quantitative and qualitative biological data. One of the most common applications of classification is the segregation of high throughput gene expression measurements into groups based on specific criteria (e.g. co-expression, profile-shape over a time course, partitioning between patient categories). This is usually achieved by the application of clustering techniques in which the distances between features (e.g. genes) are calculated from the numerical data (e.g. gene expression values) and used to partition the data into discrete groups. An extensive range of methods has been developed for clustering data (for reviews see [1–4]). The plethora of both methods and parameters under which clustering can be performed presents a significant problem in selecting a clustering approach that is well suited to the data type. In addition, the data need to be processed in a way that provides the best opportunity to isolate well-defined and meaningful clusters. The importance of being able to assess both the method and the parameters in a systematic and comparable manner is significant medically (e.g. patient classification, drug efficacy testing and treatment), more generally in bioscience research (e.g. classifying gene and protein groupings, prioritising candidate lists, pathway and network topological analysis) and in many other fields (e.g. financial systems, network communications, demographics). When clustering we want to know how many clusters there are and how confident we are that the clusters and their members are those best supported by the data.
The clustering of biological data is most commonly performed in an unsupervised manner since the classes to which the features belong are not known in advance. Many studies have focused on high throughput gene expression data where the expression of tens of thousands of genes is measured simultaneously and compared across multiple conditions. Indeed there have been approximately 50,000 published microarray studies in the last decade. In these cases the high dimensionality, noise and small number of conditions in the data sets make cluster identification problematic. To aid in this task a range of metrics is used to assess the results of clustering experiments, mainly based on calculations of cluster compactness (intra-cluster variation) and cluster separation (inter-cluster variation), including external, internal and relative criteria and validity indices [7, 8]. These measurements generally assess how well defined and separated clusters are without considering their stability or the confidence with which members can be assigned to any one cluster. A range of resampling approaches has been developed to quantify clustering tendency, stability and validity both for entire sets of clusters and for members of clusters [9–20]. Of these, only the method of Monti et al. provides a generalised, model-independent, resampling-based methodology to assess cluster stability. This method, consensus clustering, can be applied to any clustering approach that produces a deterministic classification output.
We now report the development of an extended implementation of consensus clustering based on the methodology of Monti et al. in the widely used statistical programming language R. This extension allows the comparison and visualisation of the results of clustering experiments using any number of different clustering algorithms and parameters within a unified framework. In addition it provides methods to merge results to improve the quality of classifications. The premise of merging clustering results is that confidence in the classification is increased if similar results are produced using two or more methodologically different clustering algorithms (or by using the same algorithm under very different conditions). We demonstrate the use of this merge consensus clustering methodology in simulated gene expression cases, canonical gene expression data from the leukaemia data set of Golub et al. and temporal gene expression profiles captured from the development of the peripheral nervous system (PNS) of the fruitfly Drosophila melanogaster (unpublished data). We show that merge consensus clustering improves the quality of clustering results and provides quantitative measures of cluster and cluster membership robustness. These measures can be used to select the best methods and parameters for clustering a data set and allow the user to make informed decisions about the validity and composition of the resulting clusters. The clusterCons package has been developed to work with the clustering methods provided by the R package cluster, including agnes (agglomerative hierarchical), diana (divisive hierarchical), pam (partitional) and k-means. This is achieved via simple wrappers, which can be extended by the user to provide access to other clustering methods in R itself or in external applications called via R.
The clusterCons package is easy to use and allows the user to perform clustering, robustness quantification and visualisation in the R environment facilitating the simple integration of analyses and exploiting the statistical and visualisation power of R.
The process of consensus clustering begins by randomly selecting a proportion of rows from the data and then clustering the subset using the currently specified clustering algorithm and parameters. This sampling and clustering is repeated many times to test the effect of removing features on the clustering result. The clusters produced by each iteration are stored in connectivity and indicator matrices which are later used to calculate a consensus clustering result. Features that are commonly found in the same cluster are, by definition, reliable cluster members, whereas those whose co-clustering is less frequent or dependent on the presence of other features are less reliable. The consensus clustering result is used to calculate cluster and membership robustness. We extend the consensus clustering method of Monti et al. to merge consensus clustering, in which we perform consensus clustering with many different clustering algorithms and/or parameters. The resulting consensus matrices are then merged by weighted averaging to produce a merge consensus matrix. This matrix can be used as a distance matrix in subsequent clustering experiments and to re-calculate cluster and membership robustness. The advantage of the merge consensus matrix is that it compensates for the different classification properties of clustering algorithms, some of which are more susceptible to outliers or to particular types of data structure.
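As an illustration only, the resampling procedure described above can be sketched in a few lines of Python. This is not the clusterCons implementation: the names consensus_matrix and gap_cluster, the default parameter values and the toy one-dimensional clusterer are all assumptions introduced here. The consensus value for a pair of features is taken, following Monti et al., as the number of iterations in which the pair co-clustered (summed connectivity matrices) divided by the number of iterations in which both were sampled (summed indicator matrices).

```python
import random


def gap_cluster(values, k):
    """Toy stand-in clusterer (an assumption, not a clusterCons method):
    split sorted 1-D values at the k-1 largest gaps."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Positions in the sorted order at which a new cluster starts.
    cuts = set(sorted(range(1, len(order)),
                      key=lambda p: values[order[p]] - values[order[p - 1]],
                      reverse=True)[:k - 1])
    labels = [0] * len(values)
    current = 0
    for pos, idx in enumerate(order):
        if pos in cuts:
            current += 1
        labels[idx] = current
    return labels


def consensus_matrix(data, k, cluster_fn, iterations=100, proportion=0.8, seed=0):
    """Resample rows, cluster each subset, and average co-clustering frequency."""
    n = len(data)
    rng = random.Random(seed)
    together = [[0] * n for _ in range(n)]  # summed connectivity matrices
    sampled = [[0] * n for _ in range(n)]   # summed indicator matrices
    size = min(n, max(2, round(proportion * n)))
    for _ in range(iterations):
        subset = rng.sample(range(n), size)
        labels = cluster_fn([data[i] for i in subset], k)
        for a, i in enumerate(subset):
            for b, j in enumerate(subset):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    # Pairs that were never sampled together get a consensus of 0.
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Hypothetical data: two well separated groups of three features each.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
consensus = consensus_matrix(data, k=2, cluster_fn=gap_cluster)
```

Passing the clusterer as a function mirrors the idea that any deterministic clustering method can be plugged into the same resampling loop.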
Calculating the consensus clustering result
Calculating cluster and membership robustness
It is important to note that the consensus matrix itself is not sufficient to calculate cluster and membership robustness. We also need the index set I_k for every cluster k ∈ K, which we generate by clustering the whole dataset under the same conditions as for the re-sampled data. This is the reference clustering result whose robustness we are testing in the re-sampling procedure. It is also possible to use the consensus matrix itself as a distance matrix for clustering directly, which would generate a cluster structure from which the index set I_k could be derived for all clusters. In this paper we use the former approach as it allows us to explicitly define the distance method and clustering conditions used and to compare the re-sampled consensus directly to it.
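The way a reference clustering and a consensus matrix combine into robustness values can be sketched as follows. This is a hedged illustration that assumes cluster and membership robustness correspond to the cluster consensus and item consensus of Monti et al.: the mean consensus over all pairs within I_k, and the mean consensus between one feature and the (other) members of I_k. The function names and the example matrix are hypothetical.

```python
def cluster_robustness(consensus, members):
    """Mean consensus over all distinct pairs (i < j) within one cluster."""
    pairs = [(i, j) for a, i in enumerate(members) for j in members[a + 1:]]
    if not pairs:
        return float("nan")  # undefined for a singleton cluster (assumed convention)
    return sum(consensus[i][j] for i, j in pairs) / len(pairs)


def membership_robustness(consensus, members, item):
    """Mean consensus between `item` and the other members of the cluster.

    `item` need not belong to the cluster, so a feature can be scored
    against every cluster of the reference structure."""
    others = [j for j in members if j != item]
    if not others:
        return float("nan")
    return sum(consensus[item][j] for j in others) / len(others)


# Hypothetical 3-feature consensus matrix and reference clustering I_k.
consensus = [[1.0, 0.9, 0.1],
             [0.9, 1.0, 0.2],
             [0.1, 0.2, 1.0]]
reference = {1: [0, 1], 2: [2]}

cr_1 = cluster_robustness(consensus, reference[1])
mr_0_in_1 = membership_robustness(consensus, reference[1], 0)
mr_2_in_1 = membership_robustness(consensus, reference[1], 2)
```

Scoring each feature against every cluster, not only its own, is what allows robustness values for competing clusters to be compared directly.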
Merging consensus clustering results
Having developed methods for executing multi-algorithmic and multi-condition clustering for the consensus clustering approach of Monti et al., we wanted to explore the effects of combining the consensus clustering results to see whether these merged results compensated for problems that can be associated with some types of clustering algorithms. For example, hierarchical clustering algorithms are highly sensitive to outliers whereas partitional ones tend to be relatively insensitive. We implemented a straightforward approach, merged consensus clustering, whereby we used weighted averaging to combine the consensus matrices from clustering experiments using different algorithms and/or conditions while only merging results for experiments with the same number of clusters. This means that we produce a merged consensus matrix for each k value assessed during the resampling procedure. In the absence of additional prior information, equal weighting was applied to the combination of consensus matrices, but the weights can be specified directly by the user as a vector of values in the range [0, 1] if they wish to bias the merge towards any particular algorithm and/or parameter set. As with the original consensus clustering approach we need to provide a clustering structure in order to calculate cluster and membership robustness from a merged consensus matrix. We do this explicitly because each experimental condition can potentially produce a different clustering structure on which to assess robustness. When comparing a consensus result to a merge consensus result we use identical reference clustering structures to allow the effects of merging to be directly compared.
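The merging step amounts to an element-wise weighted average of consensus matrices that share the same k. A minimal sketch is given below; the function name merge_consensus, the normalisation by the sum of the weights and the two example matrices are assumptions made for illustration and are not taken from the package.

```python
def merge_consensus(matrices, weights=None):
    """Element-wise weighted average of equally sized consensus matrices.

    With weights=None every matrix contributes equally."""
    if not matrices:
        raise ValueError("at least one consensus matrix is required")
    if weights is None:
        weights = [1.0] * len(matrices)
    if len(weights) != len(matrices):
        raise ValueError("one weight per consensus matrix is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices)) / total
             for j in range(n)] for i in range(n)]


# Hypothetical consensus matrices for the same k from two algorithms:
# one rarely co-clusters the two features, the other usually does.
matrix_a = [[1.0, 0.2], [0.2, 1.0]]
matrix_b = [[1.0, 0.8], [0.8, 1.0]]

equal_merge = merge_consensus([matrix_a, matrix_b])
biased_merge = merge_consensus([matrix_a, matrix_b], weights=[0.75, 0.25])
```

In this toy case the equal-weight merge pulls the disputed pair towards an intermediate value, which is the behaviour that flags members on which the algorithms disagree.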
Estimating the cluster number
The true cluster number (k) of a data set can be estimated by finding the value of k at which there is the greatest change in area under the cumulative density function (CDF) calculated from the consensus matrix across a range of possible values of k.
In the AUC calculation, x_i is the current element of the CDF and m is the number of elements.
If every iteration of a consensus clustering experiment clusters the same features together (i.e. the clustering is perfectly consistent) then the consensus matrix elements will be either 1 or 0 and the resulting AUC = 1. By calculating the AUC for each consensus matrix we can quantitatively compare different clustering results and benchmark them against a perfectly consistent clustering result. We can extend this method to estimate the true cluster number k by consensus clustering, varying only the cluster number. We calculate the quantity ΔK, which is the change in AUC as we vary k, and define the optimal k value as that which coincides with a peak in ΔK.
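The AUC and ΔK calculations can be sketched as below. The sketch assumes the area formula of Monti et al., AUC = Σ (x_i − x_{i−1}) · CDF(x_i) over the sorted consensus values for i = 2..m, uses only the upper triangle of the matrix, and takes ΔK at k as the plain difference AUC(k) − AUC(k−1). All three choices, the function names and the AUC values in the example are assumptions for illustration.

```python
from bisect import bisect_right


def consensus_auc(consensus):
    """Area under the empirical CDF of the upper-triangle consensus values."""
    n = len(consensus)
    values = sorted(consensus[i][j] for i in range(n) for j in range(i + 1, n))
    m = len(values)
    auc = 0.0
    for idx in range(1, m):
        step = values[idx] - values[idx - 1]
        if step > 0:
            # Empirical CDF at x_i: fraction of elements <= x_i.
            cdf = bisect_right(values, values[idx]) / m
            auc += step * cdf
    return auc


def delta_k(auc_by_k):
    """Change in AUC between consecutive cluster numbers, keyed by the larger k."""
    ks = sorted(auc_by_k)
    return {k: auc_by_k[k] - auc_by_k[prev] for prev, k in zip(ks, ks[1:])}


# Perfectly consistent clustering of four features into {0, 1} and {2, 3}.
perfect = [[1.0, 1.0, 0.0, 0.0],
           [1.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 1.0],
           [0.0, 0.0, 1.0, 1.0]]
perfect_auc = consensus_auc(perfect)

# Made-up AUC values for k = 2..5, used only to show how the peak is picked.
deltas = delta_k({2: 0.60, 3: 0.70, 4: 0.95, 5: 0.96})
best_k = max(deltas, key=deltas.get)
```

For the perfectly consistent matrix the only step in the sorted values is from 0 to 1, where the CDF equals 1, so the area is 1 as stated in the text.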
General procedure of a clusterCons analysis run
The clusterCons package takes as input the data to be clustered as a numeric matrix where each row is labelled with a unique identifier (e.g. gene id) and each column with a unique condition identifier (e.g. patient id, time-point). The user then specifies the clustering algorithms to be used, either from a pre-defined set (agnes, diana, k-means, pam, hclust, apcluster) or user defined, and, optionally, customised running parameters such as cluster number range, iteration number and sampling proportion. The package then carries out consensus clustering and returns consensus matrix objects for each specified set of algorithms and parameters. These consensus matrix objects can be used directly as distance matrices or to quantify cluster and membership robustness. The user can also specify whether they would like to generate a merge consensus matrix for each value of k. This merge matrix is generated by averaging the consensus matrices, either un-weighted or weighted by an optional user-supplied vector of weights in the range [0, 1], and is designed to compensate for extremes in consensus values that can be created by the sensitivity of some algorithms to particular data structures. The merge consensus matrices can also be used as distance matrices themselves in new clustering experiments. They can also be used to re-calculate cluster and membership robustness using as reference the clustering structures produced by the original consensus clustering experiment.
When an estimation of the true cluster number is required, consensus clustering objects from a range of k values are used to calculate the AUC and ΔK values. The ΔK values are then plotted against k in a "delta-K" plot to identify the peaks visually and estimate the true value of k.
clusterCons has been implemented with R version 2.10.0 as a package and successfully tested on Linux, Windows and Mac OS workstations. Execution times are dependent on the size, complexity and range of consensus runs and the power of the computer. As a guide, an example run with 45 clustering conditions, 100 iterations and a data matrix of 500 elements executed in 30 minutes on an entry level workstation (3 GB RAM, dual-core 1.60 GHz processors). Where faster execution times are needed it is possible to run clusterCons from within a batch script, with each iteration farmed out as a separate process on a multi-processor facility. We routinely run larger experiments as batch arrays on the Edinburgh Compute and Data Facility (ECDF), a 1456 processor HPC compute cluster.
Simulated expression data
These opposing observations reveal a potential for significant errors in class discovery even with this simple data set. If we merge the consensus matrices resulting from the pam and agnes clusterCons runs into a single matrix by averaging them, we can then use the merge matrix to calculate cluster and membership robustness using either the pam or agnes clustering structures as a reference (Figure 2C 'merge' panels).
This effectively allows the user to blend the output of the two different clustering methodologies and provide a more balanced representation of the true robustness of the clustering schema and membership. We can see that without significantly affecting the membership robustness of the pam clustering (which correctly identified the 4 profiles) we now have a much improved estimate of membership robustness where all of the divergent spike profiles are weighted down (open triangles). Even more importantly, for the agnes clustering structure, cluster 3 (the fused cluster of profiles 3 and 4) is heavily penalised in terms of membership robustness. The ability to identify the minor outlier (only 1 of 4 measurements deviates from the true profile shape) in cluster 1 is retained, but diminished (as this is not identified by pam). This analysis shows that by applying a consensus merge methodology we can isolate outliers in a quantitative manner and assess how well different clustering algorithms partition the data.
In addition to calculating cluster and membership robustness, we used clusterCons to estimate the correct number of clusters in the data by creating a "delta-K" plot (Figure 2D). This plot was generated by running clusterCons with multiple algorithms over a range of possible cluster number values. Perfectly robust clustering generates consensus matrices with elements being either 1 or 0 as the same feature pairs are always found together in the same cluster. We created an empirical cumulative density plot from the value-sorted elements of the consensus matrix and then calculated the area under the curve (AUC), which for perfect clustering is equal to 1. By calculating the change in the AUC as cluster number varies we identified which cluster number coincided with the greatest improvement in AUC and thus best estimated the cluster number. Figure 2D shows the "delta-K" plot of the run for each of five algorithms and for the merged consensus matrix. The inability of agnes to correctly classify the profiles is revealed as a prediction of k = 5 for the optimal cluster number. In contrast, all of the other matrices, including the merge (which includes the agnes consensus matrix data), correctly predict k = 4.
We now apply our consensus clustering methodology to two biological problems: classifying leukaemia patients and identification of developmentally co-regulated genes using microarray data sets.
Classification of leukaemia patients
Classification and class discovery methods are commonly used to stratify patients into groups using either quantitative measures (e.g. gene expression, protein, metabolite levels) or indexed qualitative or semi-quantitative measures (e.g. symptoms, severity, treatment). Identifying disease associated genes provides an opportunity to improve diagnosis, treatment and understanding of the disease and has been widely used in oncology, neurology and cardiology. To test the utility of consensus merge clustering in patient class discovery using gene expression data we used the leukaemia gene expression data set of Golub et al., which contains profiles of patients suffering from either acute myeloid leukaemia (aml) or acute lymphoblastic leukaemia (all). This data set is publicly available and is easily obtainable as an R data object within the CRAN hopach package.
We next assessed the performance of merge consensus clustering by averaging the consensus matrices produced by each of the algorithms into a single merge consensus matrix and then calculating the adjusted membership robustness values for each algorithm using the corresponding clustering structures from the original experiments (Figure 3 'merge' panels). This allows the direct comparison of membership robustness values for each patient between the consensus and merge consensus values as the same clustering structure is used to make the calculations for both. As further illustration of the unsuitability of pam to the classification of this data set we now see that merge consensus clustering produces almost identical membership robustness values for both clusters for 19 out of the 27 all patients. In contrast, the merge robustness plots for both agnes and k-means are almost identical and in both cases patients 2 and 12 fail to be robustly placed into either class (i.e. their membership robustness values for clusters 1 and 2 are very close to each other). This is crucial when trying to determine whether the class of the patient has been reliably determined. Comparing the membership robustness values for patients 2 and 12 between merge and consensus plots we see that the averaging of the consensus matrices has pulled the values close to each other, which increases our uncertainty about the correct class for these patients. In the absence of merge clustering (and in a real life situation where the class is not known beforehand) we would have confidently assigned patients 2, 12 and 17 to the all group in the case of agnes and patients 2 and 12 to the aml group in the case of k-means.
Identifying expression modules during Drosophila peripheral nervous system development
Microarray gene expression experiments are increasingly being used to look at temporal expression patterns at the organism, tissue and cellular level during development [32–35] as well as other more general multi-condition scenarios. To illustrate the utility of applying a merge consensus clustering methodology to this type of problem we used clusterCons to classify gene expression profiles during the development of the peripheral nervous system (PNS) of the fruitfly, Drosophila melanogaster. We measured gene expression using the Affymetrix GeneChip Drosophila genome 2.0 array using mRNA isolated from flow sorted cell populations that were highly enriched in cells of the early PNS (TIS and APJ; data available from the Gene Expression Omnibus (GEO), accession GSE21520). The data set covers four experimental conditions: three stages of PNS development in wild type flies (conditions 1-3) and one in flies mutant for the transcription factor atonal, which is required for the normal formation of the PNS (condition 4). We wanted to identify groups of genes that shared discrete expression profiles across the four conditions as a starting point to discovering co-regulated genes. We first selected all probe-sets whose expression in cells of the early PNS was highly enriched compared to control cells in any of the conditions (FDR ≤ 1%, ratio ≥ 2).
This produced a list of 526 probe-sets. This list was further reduced to the 200 probe-sets that had the highest expression variance across conditions to maximise the chance of producing discrete and informative profiles. The final pre-classification transformation was to unitise the expression matrices so that the classification was made on the basis of the shape of the expression profile and not the magnitude.
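The two pre-classification steps can be sketched as follows. The text does not define "unitise", so this sketch assumes it means scaling each expression profile (row) to unit Euclidean length, which removes magnitude while preserving profile shape. The function names, the use of sample variance and the toy probe-set identifiers are likewise assumptions.

```python
import math
from statistics import variance


def top_variance_rows(rows, n_keep):
    """Keep the n_keep profiles with the highest variance across conditions."""
    ranked = sorted(rows, key=lambda name: variance(rows[name]), reverse=True)
    return {name: rows[name] for name in ranked[:n_keep]}


def unitise(profile):
    """Scale a profile to unit Euclidean length (assumed meaning of 'unitise')."""
    norm = math.sqrt(sum(v * v for v in profile))
    return [v / norm for v in profile] if norm else list(profile)


# Toy expression matrix: probe-set id -> values across four conditions.
rows = {
    "flat": [1.0, 1.0, 1.0, 1.0],
    "strong": [1.0, 5.0, 1.0, 5.0],
    "weak": [2.0, 3.0, 2.0, 3.0],
}
selected = top_variance_rows(rows, 2)
unit_rows = {name: unitise(values) for name, values in selected.items()}

# Profiles with the same shape but different magnitude become identical.
low = unitise([1.0, 2.0, 3.0, 4.0])
high = unitise([10.0, 20.0, 30.0, 40.0])
```

After this transformation a distance between two profiles reflects only the difference in their shapes, which is the stated aim of the step.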
These results allow us to rank clusters based on their stability during re-sampling, prioritise expression profiles and rank cluster members by their robustness values. We can prioritise genes to include for further investigation and identify those to eliminate based on low stability within the cluster, providing a quantitative method to remove cluster members and so reduce noise within the profile. We completed our analysis by casting the merge matrix produced from consensus clustering with all three algorithms onto the k-means clustering structure and calculating cluster and membership robustness. The merge clustering produced an improvement in the membership robustness of clusters 2, 4 and 6 (mr = 0.87, 0.85 and 0.79) and only a small reduction in those of clusters 1 and 5 (mr = 0.93 and 0.90), whereas cluster 3 was essentially unaffected (mr = 0.63). Refining the consensus clustering results by merging leads to significant changes in the membership robustness of several probe-sets (Figure 6B). Notably, in cluster 1 there is now an outlier with mr = 0.66 (probe-set 1638314_at, previous mr = 0.99). To determine the basis for this change we plotted the expression profiles for all probe-sets in cluster 1 (Figure 6B). In this plot, probe-set 1638314_at (black line, open triangle) has a very different relative expression level at condition 2. Similar analysis across the other merge clustering results revealed that merge clustering not only identified outliers when single algorithm consensus clustering did not, but also re-classified some probe-sets as not being outliers. For example, in cluster 5 (Figure 5A) the most extreme outlier had its membership robustness increase from mr = 0.69, with consensus clustering, to mr = 0.77, with merge consensus clustering, despite a corresponding drop in cluster robustness from cr = 0.97 to cr = 0.90.
Also for cluster 6 the two most extreme outliers showed decreases in membership robustness from mr = 0.54 to mr = 0.48 and mr = 0.70 to mr = 0.59 respectively with little change in cluster robustness (cr = 0.74 to cr = 0.79).
In this example we used merge consensus clustering to refine consensus clustering results to better represent the stability of the co-expressed gene clusters (improved estimation of cluster robustness) and to identify ill-fitting or 'noisy' members of the profiles (improved estimation of membership robustness).
Together these refinements provide the opportunity to quantify the performance of clusters and members using a hybrid approach that takes advantage of the classification features of different clustering algorithms. This allows for the prioritisation of clusters (profiles) and elements (probe-sets) in a quantitative rather than qualitative manner and is a framework for filtering clustering results to maximise the signal to noise ratio.
We have extended the consensus clustering approach of Monti et al. to allow merge consensus clustering and demonstrated its use with simulated and real gene expression data sets. We find that merge consensus clustering is effective in integrating consensus clustering results in a way that helps in the refinement of data classification and the identification of outliers. Crucially, this approach aids the selection of appropriate clustering algorithms and parameters and compensates for the differential sensitivities of clustering algorithms to different data structures. Although we have demonstrated the benefit of merge consensus clustering for classifying gene expression data, it can be used to classify any data that can be represented numerically and should prove useful in the refinement and quantitative assessment of classification problems in general.
We have implemented merge consensus clustering as an R package, clusterCons. The package is fully documented, simple to use, freely available from the Comprehensive R Archive Network (CRAN) and easy to install using the resident package handling tools of R. We also include the latest version of the software and a user guide as additional files with this article (see Additional files 1 and 2, respectively). It provides methods to perform consensus clustering using any number of clustering algorithms and parameters. The resulting consensus matrices can be used as corrected distance matrices, to calculate cluster and membership robustness, to estimate the optimal cluster number and to generate visualisations of the clustering structures. The merge consensus clustering approach extends the use of consensus matrices to integrate the results of consensus clustering experiments.
Project name: clusterCons
Project home page: http://sourceforge.net/projects/clustercons
Operating system(s): Platform independent
Programming language: R
License: GNU GPL
Any restrictions to use by non-academics: none
Funding was provided by the Wellcome Trust (TIS, JDA, APJ) under grant number 077266. This work has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF; http://www.ecdf.ed.ac.uk/). The ECDF is partially supported by the eDIKT initiative (http://www.edikt.org.uk).
- Gollub J, Sherlock G: Clustering Microarray Data. Methods in Enzymology 2006, 411: 194–213. 10.1016/S0076-6879(06)11010-1
- Kerr G, Ruskin HJ, Crane M, Doolan P: Techniques for clustering gene expression data. Computers in biology and medicine 2008, 38(3):283–293. 10.1016/j.compbiomed.2007.11.001
- Do JHH, Choi DK: Clustering approaches to identifying gene expression patterns from DNA microarray data. Molecules and cells 2008, 25(2):279–288.
- Frades I, Matthiesen R: Overview on techniques in cluster analysis. Methods in molecular biology 2010, 593: 81–107.
- Halkidi MV, Batistakis Y: On clustering validation techniques. Journal of Intelligent Information Systems 2001, 17(2–3).
- Dunn JC: Well-Separated Clusters and Optimal Fuzzy Partitions. Cybernetics and Systems 1974, 4: 95–104. 10.1080/01969727408546059
- Bertoni A, Valentini G: Model order selection for bio-molecular data clustering. BMC bioinformatics 2007, 8(Suppl 2). 10.1186/1471-2105-8-S2-S7
- Jain AK, Moreau JV: Bootstrap technique in cluster analysis. Pattern Recogn 1987, 20(5):547–568. 10.1016/0031-3203(87)90081-1
- Levine E, Domany E: Resampling method for unsupervised estimation of cluster validity. Neural Comput 2001, 13(11):2573–2593. 10.1162/089976601753196030
- Tibshirani R, Walther G, Hastie T: Estimating the Number of Clusters in a Dataset via the Gap Statistic. Journal of the Royal Statistical Society, Series B 2000, 63: 411–423.
- Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome biology 2002, 3(7). 10.1186/gb-2002-3-7-research0036
- Dudoit S, Fridlyand J: Bagging to improve the accuracy of a clustering procedure. Bioinformatics 2003, 19(9):1090–1099. 10.1093/bioinformatics/btg038
- Monti S, Tamayo P, Mesirov J, Golub T: Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning 2003, 52(1–2):91–118. 10.1023/A:1023949509487
- Tseng GC, Wong WH: Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics 2005, 61: 10–16. 10.1111/j.0006-341X.2005.031032.x
- Suzuki R, Shimodaira H: Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 2006, 22(12):1540–1542. 10.1093/bioinformatics/btl117
- Kim SYY, Lee JWW: Ensemble clustering method based on the resampling similarity measure for gene expression data. Statistical methods in medical research 2007, 16(6):539–564. 10.1177/0962280206071842
- Gana Dresen IM, Boes T, Huesing J, Neuhaeuser M, Joeckel KH: New resampling method for evaluating stability of clusters. BMC bioinformatics 2008, 9. 10.1186/1471-2105-9-42
- Milagre ST, Maciel CDD, Pereira JCC, Pereira AA: Fuzzy cluster stability analysis with missing values using resampling. International journal of bioinformatics research and applications 2009, 5(2):207–223. 10.1504/IJBRA.2009.024038
- Seiler M, Huang CC, Szalma S, Bhanot G: ConsensusCluster: a software tool for unsupervised cluster discovery in numerical data. Omics: a journal of integrative biology 2010, 14: 109–113. 10.1089/omi.2009.0083
- Computing, R: A Language And Environment For Statistical Computing [http://www.r-project.org]
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531–537. 10.1126/science.286.5439.531
- Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 2006, 22(19):2405–2412. 10.1093/bioinformatics/btl406
- ECDF - The Edinburgh Compute and Data Facility [http://www.ecdf.ed.ac.uk]
- Maechler M, Rousseeuw P, Struyf A, Hubert M: Cluster Analysis Basics and Extensions. [http://www.cran.r-project.org/web/packages/cluster/]
- Kaufman L, Rousseeuw P: Finding groups in data: an introduction to cluster analysis. Wiley series in probability and mathematical statistics. Applied probability and statistics, Wiley; 2005.
- Camp RL, Neumeister V, Rimm DL: A decade of tissue microarrays: progress in the discovery and validation of cancer biomarkers. Journal of clinical oncology 2008, 26(34):5630–5637. 10.1200/JCO.2008.17.3567
- Greenberg SA: DNA microarray gene expression analysis technology and its application to neurological disorders. 2001, 57(5):755–761.
- Sharma UC, Pokharel S, Evelo CT, Maessen JG: A systematic review of large scale and heterogeneous gene array data in heart failure. Journal of molecular and cellular cardiology 2005, 38(3):425–432. 10.1016/j.yjmcc.2004.12.016
- van der Laan MJ, S PK: A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. Journal of Statistical Planning and Inference 2003, 117(2):275–303. 10.1016/S0378-3758(02)00388-9
- MacQueen J: Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1967, 281–297.
- Shimogori T, Lee DA, Miranda-Angulo A, Yang Y, Wang H, Jiang L, Yoshida AC, Kataoka A, Mashiko H, Avetisyan M, Qi L, Qian J, Blackshaw S: A genomic atlas of mouse hypothalamic development. Nature neuroscience 2010, 13(6):767–775. 10.1038/nn.2545
- Zhang P, Zucchelli M, Bruce S, Hambiliki F, Stavreus-Evers A, Levkov L, Skottman H, Kerkelä E, Kere J, Hovatta O: Transcriptome profiling of human pre-implantation development. PloS one 2009, 4(11).
- Tzur G, Israel A, Levy A, Benjamin H, Meiri E, Shufaro Y, Meir K, Khvalevsky E, Spector Y, Rojansky N, Bentwich Z, Reubinoff BE, Galun E: Comprehensive gene and microRNA expression profiling reveals a role for microRNAs in human liver development. PloS one 2009, 4(10). 10.1371/journal.pone.0007511
- Leung YFF, Ma P, Link BA, Dowling JE: Factorial microarray analysis of zebrafish retinal development. Proceedings of the National Academy of Sciences of the United States of America 2008, 105(35):12909–12914. 10.1073/pnas.0806038105
- Jarman AP, Grau Y, Jan LY, Jan YN: atonal is a proneural gene that directs chordotonal organ formation in the Drosophila peripheral nervous system. Cell 1993, 73(7):1307–1321. 10.1016/0092-8674(93)90358-W
- clusterCons at CRAN [http://cran.r-project.org/web/packages/clusterCons/index.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.