 Methodology article
 Open Access
MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering
BMC Bioinformatics volume 10, Article number: 260 (2009)
Abstract
Background
Uncovering subtypes of disease from microarray samples has important clinical implications such as survival time and sensitivity of individual patients to specific therapies. Unsupervised clustering methods have been used to classify this type of data. However, most existing methods focus on clusters with compact shapes and do not reflect the geometric complexity of the high dimensional microarray clusters, which limits their performance.
Results
We present a cluster-number-based ensemble clustering algorithm, called MULTI-K, for microarray sample classification, which demonstrates remarkable accuracy. The method amalgamates multiple k-means runs by varying the number of clusters and identifies clusters that manifest the most robust co-memberships of elements. In addition to the original algorithm, we newly devised the entropy-plot to control the separation of singletons or small clusters. MULTI-K, unlike the simple k-means or other widely used methods, was able to accurately capture clusters with complex and high-dimensional structures. MULTI-K outperformed other methods, including a recently developed ensemble clustering algorithm, in tests with five simulated and eight real gene-expression data sets.
Conclusion
The geometric complexity of clusters should be taken into account for accurate classification of microarray data, and ensemble clustering applied to the number of clusters tackles the problem very well. The C++ code and the data sets tested are available from the authors.
Background
Groups that exhibit similar patterns in large-scale genomic data sets have provided primary biological information. In this regard, identification of natural clusters and their membership has excited a great deal of interest in functional genomics and clinical research. Indeed, unsupervised clustering methods applied to microarray data enabled predictions of unknown gene functions (if applied to genes) [1, 2] and suggested the existence of subtypes of disease (if applied to samples) [3–6]. The task of cluster identification heavily depends on the properties of the clusters that are of interest, e.g., compactness, connectedness, and spatial separation. Each clustering algorithm has pros and cons for different shapes of clusters, which in turn informs the choice of an appropriate clustering strategy [7].
We are interested in establishing subclasses among microarray samples that might enable specified clinical treatments. In this problem, the data points are distributed in a very high-dimensional (hundreds or thousands) space and the geometry of their clusters is mostly uncharacterized, which makes it difficult to choose an appropriate clustering method. However, the most widely used clustering methods for this problem have been the hierarchical agglomerative and k-means clustering algorithms [4, 5, 8, 9], which mainly focus on clusters with compact shapes.
When using these methods, various internal measures that represent compactness and spatial separation of clusters have been developed and compared to identify clusters and their members in data sets [9–12]. Each of the measures, however, is prone to specific biases [7], and their tests were mainly focused on the ability to identify the number of clusters and not on the accuracy of classification, which is mainly attributed to the clustering strategies used.
One important line of effort to improve the clustering strategy is the development of ensemble or consensus clustering techniques. These methods amalgamate multiple clustering runs to capture robust cluster structures by using multiple clustering algorithms [13, 14], perturbing data values with noise [15, 16], using different gene subsets [17–19], or choosing the number of clusters randomly [20, 21], and then extract consensus cluster structures. Among them, two of the methods [16, 19] have been tested intensively on microarray sample classification and compared favorably with previous methods.
In this article, we first apply a cluster-number-based ensemble technique to microarray sample classification and compare its performance with previously used methods. The advantage of this approach over single clustering or other methods is its ability to capture complex geometric structures. The rationale is that, because some large cluster numbers are chosen during the clustering process, co-memberships among detailed local structures are strengthened. See also [22] for developments of related algorithms. Specifically, we use the multiple k-means clustering of Ashlock et al. [20], dubbed MULTI-K, which provides the simplest and most intuitive procedure for partitioning data. In addition to the original algorithm, we newly devised an entropy-based analysis of clusters, called the entropy-plot, that monitors the distribution of cluster sizes during the partition process. The entropy-plot helps prevent singletons or very small clusters from forming separate clusters, which is of particular utility when analyzing high-dimensional and noisy real expression data.
MULTI-K, though it is the simplest among existing ensemble clustering methods, exhibited remarkable performance, surpassing previously used methods in our tests, which suggests that its ability to classify complex geometric structures is an important factor for microarray sample classification. In particular, MULTI-K demonstrated perfect classification for five (or six) of the eight gene expression data sets that we tested.
Results
Algorithm
We begin by describing the MULTI-K algorithm. MULTI-K is performed by applying the well-known k-means algorithm repeatedly. Euclidean distance is always used to measure the dissimilarity between two data points unless stated otherwise. Let S = {x_1, x_2,..., x_N} be the data set distributed in R^n. The algorithm constructs an edge-weighted graph from the output of multiple instances of the k-means algorithm. MULTI-K is largely composed of the following two steps.
Step 1
Apply the k-means algorithm on the data M times. In each instance, we randomly sample a number k_i, the number of clusters, from a predetermined distribution D. For each pair of nodes (edge) that belong to the same cluster, we add one to the edge weight (all weights are initialized to zero). After repeating the process M times, we obtain a weighted graph structure on the data.
Step 2
Now, we go in the reverse direction by unweighting the graph M times. In each iteration, we reduce a unit weight from all edges with positive weights simultaneously. Through this reverse process, the initial graph is divided into smaller and smaller subgraphs (clusters). At any point, the connected components of the graph are the clusters. The plot of the discrete (reverse) time normalized by M against the number of divided subgraphs, which we call the cut-plot, provides information on the natural number of clusters. If a flat region of the cut-plot is long, we regard the corresponding cluster structure as robust and hence natural. We choose the longest interval in the cut-plot, except for the one-cluster interval, for the predicted number of clusters. The weighted graph in Step 1 is equivalent to the averaged co-membership matrix used in other ensemble clustering algorithms [16, 19]. The convergence of MULTI-K (in probability) is addressed in Methods.
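For concreteness, the two steps above can be sketched in code. This is a minimal illustration, not the authors' C++ implementation: the helper names (`multik_weights`, `components`) and the small Lloyd-style `kmeans` are our own, and a real analysis would use M = 200 runs with D chosen as described under 'MULTI-K parameters' below.

```python
import numpy as np

def kmeans(X, k, iters=50, rng=None):
    # Plain Lloyd-style k-means with centers initialized at random
    # data points; returns a cluster label per row of X.
    if rng is None:
        rng = np.random.default_rng()
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def multik_weights(X, M=200, k_range=(5, 20), seed=0):
    # Step 1: run k-means M times with the cluster number drawn
    # uniformly from k_range, accumulating co-membership counts in W.
    rng = np.random.default_rng(seed)
    N = len(X)
    W = np.zeros((N, N), dtype=int)
    for _ in range(M):
        k = int(rng.integers(k_range[0], k_range[1] + 1))
        labels = kmeans(X, k, rng=rng)
        W += (labels[:, None] == labels[None, :])
    return W

def components(W, cut, M):
    # Step 2: number of connected components of the graph whose
    # edges are the pairs with normalized weight W/M above `cut`,
    # found with a simple union-find.
    N = len(W)
    parent = list(range(N))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(N):
        for j in range(i + 1, N):
            if W[i, j] / M > cut:
                pi, pj = find(i), find(j)
                if pi != pj:
                    parent[pi] = pj
    return len({find(i) for i in range(N)})
```

Sweeping `cut` from 0 to 1 and recording the component count at each value reproduces the cut-plot.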
Entropy-plot
Entropy can be used as a measure of randomness or information in a set of clusters [23] and is defined in terms of a discrete random event x with possible states 1,..., k as H(x) = −Σ_i P_i log2 P_i, where P_i = Prob(x = i-th state), i = 1, 2,..., k. Let S be a data set with N elements and X = {C_1, C_2,..., C_k} be a set of non-overlapping clusters. The empirical probability that C_i includes a given data point is |C_i|/N, and so the entropy of the clustering X is

H(X) = −Σ_{i=1}^{k} (|C_i|/N) log2(|C_i|/N).
This entropy measure informs us how the data points are distributed as clusters. The cut-plot summarizes the hierarchical structure of clusters that form as the cut-value (where-to-cut in the cut-plot) is changed. However, the cut-plot alone does not distinguish between roughly equal and substantially unequal divisions of clusters within the hierarchy. The entropy-plot is a more informative plot that better summarizes the cluster structure. This plot displays the Shannon entropy of the empirical distribution of points into clusters as a function of the cut-value. At each position where the cut-plot jumps, the entropy-plot jumps as well. The difference lies in the size of the jumps. In the cut-plot, any division of a cluster yields a jump of height one; the entropy-plot has variable-height jumps which give the relative informativeness of the partitioning of clusters. An even division of a large cluster is highly informative, while the separation of a single point is minimally informative. When working with clean and low-dimensional data, it is not too difficult to detect the separation of small clusters consisting of one or a few points by inspection. When dealing with noisy or high-dimensional data such as gene expression profiles, the entropy-plot is of substantial utility in highlighting the significant divisions within the cluster structure. As a convenience for the user, the partition of small clusters, those for which the increase in entropy is less than a threshold, may be suppressed. This yields a cleaner and more easily interpreted summary of the hierarchical cluster structure. Users that wish to see the unmodified cluster structure may reduce the threshold value that triggers suppression. However, we applied a fixed threshold in this paper to compare the performance of MULTI-K with other algorithms. See Additional file 1 for further explanation of the entropy-plot.
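A quick numerical illustration of why entropy jumps separate informative splits from singleton splits (the helper below is our own sketch, not part of MULTI-K's distributed code):

```python
import numpy as np

def partition_entropy(sizes):
    # Shannon entropy (base 2) of the empirical distribution of
    # points over clusters with the given sizes.
    p = np.asarray(sizes, dtype=float) / sum(sizes)
    return float(-(p * np.log2(p)).sum())

# An even split of 100 points raises the entropy by a full bit ...
even_jump = partition_entropy([50, 50]) - partition_entropy([100])      # 1.0
# ... while splitting off a singleton barely moves it, and such a jump
# would fall below a suppression threshold like the 0.1 used here.
singleton_jump = partition_entropy([99, 1]) - partition_entropy([100])  # ~0.08
```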
MULTI-K parameters
Since MULTI-K is an ensemble learning algorithm, it requires some parameters and thresholds. The following initial setting, suggested by our rough estimation and empirical tests, works well for analyzing real-world expression data. Although we suggest varying the parameters around the given values in an explorative study, we used the following setting throughout this study for the purpose of comparison.
The distribution for the cluster numbers, D, is simply chosen to be uniform on an interval between two integers in our study. We used the interval (min(5, [N/4]), min(20, [N/2])) for D, where N is the number of samples and [x] represents rounding x to the nearest integer. We applied the minimum function here because it may be unreasonable to expect five or more classes for a very small number of samples. The number of clustering runs M is fixed at 200 in this study; this number is sufficiently large for convergence of the algorithm. Lastly, we used a threshold for the entropy-plot such that if a jump in the entropy-plot is smaller than 0.1/(increased number of clusters), we suppress the separation of the corresponding cluster and merge the two adjacent intervals in the cut-plot. The 'increased number of clusters' in the denominator accounts for the case in which multiple singleton or small sets are separated simultaneously from a cluster.
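The rule for the endpoints of D can be written down directly (a sketch; the function name is ours, and note that Python's `round` ties to even at exact .5 boundaries, which may differ slightly from plain nearest-integer rounding):

```python
def cluster_number_range(N):
    # Endpoints of the uniform distribution D on cluster numbers:
    # (min(5, [N/4]), min(20, [N/2])), with [] rounding to nearest.
    return (min(5, round(N / 4)), min(20, round(N / 2)))
```

For example, 72 samples give the interval (5, 20), while 12 samples give (3, 6).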
Experiments
We compare the performance of four kinds of clustering algorithms, MULTI-K, hierarchical clustering (average linkage), k-means, and GCC, a gene-subset-based ensemble clustering [19], for classifying data points on various simulated and real expression data sets. We used code from the R package for the hierarchical and k-means methods as well as for computing the Silhouette Width and the gap statistic. We tested two versions of the GCC algorithm, denoted GCC-c and GCC-k, that employ correlation and k-means clustering, respectively. These are all distance-based clustering methods. Another important class of algorithms that we did not consider is model-based clustering [24–26]. Most model-based methods, however, are designed mainly for gene clustering and may not be reliable for sample clustering because, in most cases, the number of samples is not sufficient to fit very high-dimensional models. For example, EMMIX-GENE [25] reduces the number of genes when clustering samples, which suffers from significant information loss, and the class prediction is highly affected by the genes chosen.
We predict the number of classes in a data set as well as the cluster membership of the data points for each algorithm, and then assess the agreement between the predicted and the known partitions using the adjusted Rand index (ARI) [27, 28]. Let P_I = {P_1•, P_2•,..., P_l•} and P_J = {P_•1, P_•2,..., P_•m} be two partitions of a data set D. Let N_i• and N_•j be the number of elements in P_i• and P_•j, respectively, and [N_ij] be an l × m matrix where N_ij represents the number of elements in P_i• ∩ P_•j; then the ARI is computed as follows:

ARI = [Σ_{i,j} C(N_ij, 2) − Σ_i C(N_i•, 2) Σ_j C(N_•j, 2) / C(N, 2)] / [(Σ_i C(N_i•, 2) + Σ_j C(N_•j, 2))/2 − Σ_i C(N_i•, 2) Σ_j C(N_•j, 2) / C(N, 2)],

with C(n, 2) = n(n − 1)/2 the number of unordered pairs of n elements,
where N is the number of data points, i.e. the sum of the entries of [N_ij]. This index addresses the degree of agreement between two partitions with possibly different numbers of clusters. The ARI takes the value 1 for perfect agreement between two partitions and values near 0 for a random matching. The index is said to be 'adjusted' because its formulation compensates for the fact that, when a partition has more than two members, a majority of the pairs of data items fall in distinct classes. We applied the gap statistic and Silhouette Width to the hierarchical and k-means algorithms to predict the number of clusters, and used the inherent indicators for MULTI-K and GCC: the cut-plot and entropy-plot, and the modified Rand index, respectively.
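The ARI computation from the contingency table [N_ij] can be sketched as follows (our own implementation for illustration, written directly from the formula above):

```python
from math import comb

import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    # Build the contingency table N_ij between the two partitions.
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    for i, j in zip(a_idx, b_idx):
        table[i, j] += 1
    # Pair counts C(n, 2) over cells, row sums, and column sums.
    sum_ij = sum(comb(int(n), 2) for n in table.ravel())
    sum_i = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_j = sum(comb(int(n), 2) for n in table.sum(axis=0))
    total = comb(len(labels_a), 2)
    expected = sum_i * sum_j / total          # chance-agreement term
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions give ARI = 1 regardless of how the cluster labels are named, e.g. `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])`.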
Most existing clustering algorithms are good at finding compact clusters, but not clusters interwoven with others. By combining multiple k-means runs, MULTI-K aims to find connected components in a data set that are spatially separated from one another. Intuitive examples that characterize MULTI-K follow.
Comparison for geometrically complicated clusters
We considered three data sets composed of clusters with geometrically complicated structures and named them Donut & Ball, Horse Shoe, and Spiral. Their shapes and the corresponding cut-plots are shown in Figure 1. All three data sets have 1,000 data points. Although these data sets seem to have little relevance to gene expression data, they may abstract the geometric complexity of microarray data sets and clearly reveal the advantage of the MULTI-K algorithm. The test results are shown in Table 1. The optimal number of clusters by MULTI-K is determined at the longest interval of the cut-plot (except the k = 1 case), and the corresponding partitions are then naturally derived from Step 2. MULTI-K correctly predicted the number of clusters and classified the clusters accurately, with an ARI value of 1, in all three examples, while the other methods yielded poor predictions. Since most existing clustering algorithms and indicators for the optimal number of clusters focus on compactness of clusters, they could not identify complicated structures, such as the above examples, that are instead characterized by connectedness and spatial separation of clusters.
The ability of MULTI-K to trace complex geometric structure is reminiscent of the nonlinear dimension reduction technique ISOMAP [29], which uses the shortest paths between data points to approximate the geodesic distances between points, and may suggest the applicability of MULTI-K to nonlinear dimension reduction problems.
Comparison for high dimensional and noisy clusters
Now, we compare the clustering algorithms on high-dimensional and noisy synthetic data sets that imitate microarray samples. In the first model, called H2, we considered two clusters in 100 dimensions. The clusters have 25 and 50 observations, independently drawn from the normal distributions N(0_100, I_100) and N(0.5_100, I_100), respectively, where α_p denotes the 1 × p vector of α's and I_p denotes the p × p identity (covariance) matrix. The two clusters may overlap in each dimension. In the second model, called H3, we considered three clusters in 300 dimensions (genes). Each cluster had 50 observations. We divided the 300 genes into ten groups, each of which had exactly 30 genes. In each block (a gene group in a cluster), all the 30-dimensional samples were commonly drawn from a normal distribution N(α_30, I_30) independently, where α is randomly chosen from {−0.5, 0, 0.5} in each block. The block structures represent gene sets with co-expression patterns that are commonly up- or down-regulated under specific experimental conditions.
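The H2 model is easy to reproduce; a NumPy sketch follows (the seed is our own choice; although the clusters overlap in every single dimension, the population means are 0.5·√100 = 5 apart, so the clusters separate in the full space):

```python
import numpy as np

rng = np.random.default_rng(0)
# H2: two clusters in 100 dimensions, with 25 and 50 observations
# drawn from N(0_100, I_100) and N(0.5_100, I_100), respectively.
cluster1 = rng.normal(loc=0.0, scale=1.0, size=(25, 100))
cluster2 = rng.normal(loc=0.5, scale=1.0, size=(50, 100))
H2 = np.vstack([cluster1, cluster2])

# A per-dimension mean shift of only 0.5 accumulates over 100
# dimensions to a Euclidean separation of about 5 between means.
separation = np.linalg.norm(cluster2.mean(axis=0) - cluster1.mean(axis=0))
```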
In each model, we tested the algorithms on ten randomly generated data sets and averaged the ARI values. The test results are summarized in Table 2. In both models, MULTI-K showed the highest accuracy. The GCC methods yielded rather good predictions for the H3 model but performed very poorly for the H2 model. Hierarchical methods performed very poorly in all cases, which is mostly attributed to their failure to predict the correct number of clusters. The k-means algorithm demonstrated the second-best accuracy except for the case of the Silhouette indicator applied to the H3 model.
Classification tests for real expression data sets
We tested the clustering algorithms on eight microarray data sets, summarized in Table 3. All the mRNA samples in each data set are assigned class labels from laboratory analyses of the tumor samples. These labels establish the known (gold-standard) partitions on the data points. We chose the 300 genes with the highest variances in each data set for data clustering (or partitioning). For the randomized algorithms, MULTI-K, GCC, and k-means clustering, we ran the algorithms five times and used the median ARI and the corresponding clusters.
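Selecting a fixed number of high-variance genes, as done here, is straightforward; a sketch (the function name is ours, and `expr` is assumed to be a genes × samples matrix):

```python
import numpy as np

def top_variance_genes(expr, n_genes=300):
    # Keep the n_genes rows (genes) with the highest variance
    # across samples, ordered by decreasing variance.
    variances = expr.var(axis=1)
    keep = np.argsort(variances)[::-1][:n_genes]
    return expr[keep]
```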
Figure 2 shows the cut-plots and the entropy-plots for the tested data sets, demonstrating how entropy-plots can amend the predictions of cut-plots. Both plots are represented by monotonically increasing step functions and share the same jump points. Each jump in the plots represents separation of a cluster from the former cluster structure (a partition of data points). If a jump in the entropy-plot is smaller than a threshold (0.1/increased number of clusters), we regard the separation of a cluster as negligibly small and suppress the separation. Accordingly, we merge the corresponding adjacent intervals in the cut-plot. The cut-plots shown in Figure 2 are those before such modifications. After modifying the separation of small clusters, the final cluster number and memberships are uniquely determined at the longest cut-plot interval (except k = 1). This criterion is applied uniformly in comparing the performance of MULTI-K with other algorithms. However, in an explorative study, we recommend investigating the next-longest cut-plot intervals and the associated cluster structures as well.
The test results are summarized in Table 4. The ARI values in this table represent the agreement between the predicted and the gold-standard partitions. MULTI-K overall performed best in both predicting the number of clusters and classification accuracy. To our surprise, MULTI-K perfectly classified five of them. The cut-plot of the leukemia data [6] (Fig. 2(a)) had intermediately long flat intervals at two, three, and four clusters. However, at the second jump, the entropy-plot showed a very small increase (0.09), which was caused by the separation of a singleton set. Therefore, we suppressed the partitioning of the singleton set, which merged the second and third intervals. This consequently indicated two major subclasses, which perfectly matched the two known leukemia classes, ALL(47) and AML(25). At the third jump in the cut-plot, the ALL class was again clearly divided into the two known subtypes, ALL-B(38) and ALL-T(9). Although the modified cut-plot indicated two major subclasses, MULTI-K was able to clearly unveil further known subtypes.
We then tested MULTI-K on a randomized data set. We randomly permuted each gene's profile of the leukemia data independently. The resulting cut-plot and entropy-plot are shown in Figure 3, where no meaningful intervals or jumps are found. This permutation test demonstrates the existence of cluster structure in the real data.
In analyzing the lymphoma data set [30], the cut-plot had its longest interval at three clusters (except for the one-cluster interval) (Fig. 2(b)). The first jump caused a major increase in the entropy value (0.79) and the second jump a minor increase (0.21), both of which were meaningful values (>0.1/increased number of clusters). At the first jump, the 89 samples were clearly divided into two classes, the DLBCL(68) and the CLL(12) + FL(9) groups. At the second jump, the latter group was again clearly divided into the CLL and FL groups. The split of the small subgroup FL caused a relatively small increase in the entropy-plot.
While the entropy-plot modified the predictions of the number of clusters for the leukemia, St. Jude (Fig. 2(f)) [31], and normal II (Fig. 2(h)) [32] data sets, the cut-plot alone correctly identified the number of classes for the other data sets. However, the entropy-plot still provides important information on the impact of new subdivisions of the clusters. For example, the cut-plot of the thyroid I data [33] had two similarly long intervals at two and three clusters (even though the former is slightly longer) (Fig. 2(d)). Between them, however, the entropy-plot indicated a much higher impact for the former subdivision, and hence two major subclasses. Nevertheless, at the second jump, the cancer class was divided into two subsets with six and three elements, so the latter interval might indicate cancer subtypes.
The colon data had a large difference in the sizes of the two classes: the cancer class had 100 samples while the normal class had only five. Even in such a case, MULTI-K clearly separated the two classes. The hierarchical clustering algorithm showed good performance, but the other methods yielded very poor classification rates. The thyroid II data was the most clearly separated, so that MULTI-K and the k-means algorithm as well as the hierarchical-Silhouette methods clearly separated the classes. However, the GCC methods failed to indicate the correct numbers and hence yielded low classification rates. On the other hand, for the St. Jude data set, the GCC methods performed second best, while the usual clustering methods performed very poorly. The cut-plot of St. Jude initially indicated six classes but was modified to five after suppressing the separation of a singleton set, though the corresponding jump is too small to be recognized in the entropy graph.
For the last two examples, we chose two data sets with many classes. The normal I data [34] had twelve classes sampled from normal human tissues, each of which had exactly two samples. The twelve classes in this data set were relatively clearly separated from one another, so that MULTI-K, GCC-k, and the hierarchical-Silhouette method clearly identified the known twelve classes, and k-means also exhibited good classification rates. The normal II data had thirteen classes, and MULTI-K and GCC-k showed similarly good performances.
Tests for real expression data with fixed known number of classes
As seen in the leukemia example, the 'known' classes also had a hierarchical structure, so it is rather controversial to define 'gold standard subclasses' because they merely represent the current level of our knowledge. Moreover, since most algorithms other than MULTI-K failed to indicate the correct number of clusters in many data sets, it is difficult at this point to assess the accuracy of the clustering strategies themselves. In other words, the tested algorithms could have yielded better performance combined with other possible indicators for the number of clusters. For these reasons, we investigate the performance of the clustering strategies themselves by specifying the 'known' number of clusters in each algorithm. The test results are shown in Table 5. In this analysis, most clustering strategies improved their ARI values for some data sets, but MULTI-K was still the best method for all the data sets, perfectly classifying six of them. As in our earlier analysis of the leukemia data, MULTI-K perfectly classified the three known subclasses, while the other methods still misclassified some samples. The thyroid II data set was most clearly classified, such that all the algorithms identified the underlying two classes precisely.
The hierarchical method, if not perfect, has been known to classify the leukemia data quite well [7], but its ARI value was poor (0.4723) in this test. This was caused by the separation of a singleton set, and hence we tried four clusters and assigned the singleton set to the nearest of the other three clusters. This post-processing yielded a much better classification rate of 0.7680 for the leukemia data set. This also illustrates why data processing such as the entropy-plot is required. Some other data sets were also better classified by this process: the lymphoma and normal II data sets had improved ARI values of 1.0000 and 0.6653, respectively. However, the modified hierarchical clustering did not outperform MULTI-K on any data set.
Analysis of breast cancer data without gold-standard known subtypes
The subtypes of breast cancer have been frequently investigated using gene expression profiles. Different subtypes predicted from hierarchical clustering of expression profiles exhibited different clinical prognoses, which suggests breast cancer is separable into distinct diseases. Sorlie et al. [35] compiled 122 breast tissue samples as well as ~500 genes intrinsic to the disease to predict five cancer subtypes. They also extracted five core sample groups that are most highly correlated within each subgroup. We re-analyzed the breast cancer data set from Sorlie et al. [35]. Because the previous subtypes had been inferred over the 'intrinsic' genes, we used the core sample groups in each subtype as the 'silver' standard. Because such 'intrinsic genes' are not known for many kinds of disease, it is important to reproduce the previous result without functional information about genes. Therefore, we chose 300 high-variance genes in a fully unsupervised manner and compared the performance of hierarchical clustering and MULTI-K. To compare with the previous prediction, we chose four large entropy jumps to partition the data into five clusters. Four of the previous subtypes largely agreed with the MULTI-K clusters, but two of them (Luminal A and ERBB) were merged into one cluster, with overall ARI = 0.3704. On the other hand, hierarchical clustering completely failed to reproduce the previous subtypes and yielded ARI = 0.0803 at maximum (when we chose 25 clusters). In this case, very small clusters were continuously separated from one large cluster as we lowered the tree-cut value of the hierarchical clustering. See Additional file 2 for related data.
Detecting outliers in MULTI-K
MULTI-K basically assigns every point to a cluster. However, outliers in each cluster can be identified by computing the average distance of each data point to the other points in the cluster. For a cluster C_i = {x_1, x_2,..., x_c}, let d_j, j = 1,..., c, be the average distance of x_j to the other points in the cluster, and let mean_i and std_i be their mean and standard deviation. We regard a data point x_j as an outlier if d_j − mean_i > α·std_i, where α is a positive constant. Using this scheme, we analyzed the clusters for the St. Jude data set, which showed a relatively low ARI (0.8015) in our test. When we set α = 2 and 1.5, we identified nine and 22 outliers in total, respectively. Excluding those data points, we obtained increased ARI values of 0.8398 and 0.8670, respectively. This implies that class assignment by MULTI-K can be improved by removing outliers.
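The outlier rule above translates directly into code; a sketch under the stated definition (the helper name is ours; it handles one cluster at a time with Euclidean distances):

```python
import numpy as np

def cluster_outliers(points, alpha=2.0):
    # d_j: average distance of point j to the other points in its
    # cluster; flag j as an outlier if d_j - mean > alpha * std.
    diffs = points[:, None, :] - points[None, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=-1))
    avg = D.sum(axis=1) / (len(points) - 1)  # self-distance is zero
    return np.where(avg - avg.mean() > alpha * avg.std())[0]
```

With α = 2 this flags only points far from the bulk of their cluster; lowering α (e.g. to 1.5, as in the St. Jude analysis) flags more points.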
Overall, MULTI-K exhibited consistently good performance, while the performance of the other methods varied greatly depending on the data set, the clustering algorithm employed, or the indicator function chosen.
Discussion
Identifying subclasses of diseases using microarray data is a clinically important and computationally challenging problem. The basic assumption is that distinct subtypes, if any, are separated from one another in a high-dimensional sample space and hence can be identified through computational methods: although the differences in each dimension may be small, they achieve clear separations when accumulated over a very high-dimensional space. The simulation tests for H2 and H3 were designed from this perspective. Indeed, as shown in the test examples, the ordinary clustering methods successfully identified the known subclasses in some data sets. To improve performance, ensemble or resampling-based clustering techniques have been developed [16, 18, 20, 21, 36]. Ensemble learning techniques have been widely used in genomic data analysis, such as prediction of protein-protein interactions and alpha-membrane proteins [37], protein fold pattern recognition [38], and learning the structure of genetic networks from time-series expression data [39], as well as microarray data classification [36, 40].
In this article, we presented a cluster-number-based ensemble clustering algorithm, MULTI-K, and suggested using it for unsupervised classification of microarray samples. Unlike other widely used clustering methods, MULTI-K was able to identify clusters with complicated geometric structures as well as high-dimensional and noisy clusters. It demonstrated outstanding performance on various simulated and real expression data sets for subtype classification. We note that the Gustafson-Kessel (GK) clustering algorithm [41] also targets clusters with non-compact shapes, but the GK method mainly focuses on linear cluster structures and tends to cause numerical problems in computing the eigenstructure of covariance matrices for high-dimensional data. Moreover, the GK method itself does not suggest the optimal number of clusters.
The average-linkage hierarchical and k-means clustering methods are designed to capture compact or relatively simple clusters. However, the geometric features of microarray clusters are hardly characterized because they reside in a very high-dimensional space and are affected by various sources of noise as well as potential gene interactions. Our tests showed that previous clustering methods focused on compact clusters yielded poor predictions on many data sets. On the other hand, the suggested method exhibited significant superiority and perfectly classified five (or six) of the eight real expression data sets, while the other methods perfectly classified at most two. We infer that the flexibility of MULTI-K with respect to both geometric complexity and high dimensionality enabled the accurate classification of gene expression data.
MULTI-K provides two useful expositions of cluster structure, the cut-plot and the entropy-plot, which convey the hierarchical structure and the natural number of clusters. This pair of indicators is much more informative than other internal indicators, most of which suggest only the number of clusters. As shown in the above examples, the cut-plot and entropy-plot give a portrait of the overall cluster organization in a complementary manner, providing researchers with a rich source of information for deciding what the appropriate clusters are. Indeed, this pair of indicators outperformed other widely used indicators in various tests.
One possible weak point of MULTI-K is the existence of the free parameter D, the distribution on the number of clusters. However, the algorithm showed reliable performance with the rule suggested in this study. To the authors' knowledge, most ensemble learning methods include free parameters; their basic principle is that the ensemble methodology improves performance over a wide range of the free parameters.
An important related topic in sample classification is the gene selection problem. The performance of clustering usually varies to some extent with the gene subset chosen. We have commonly used a fixed number of genes with the highest variance. One possible extension in our ensemble context is to randomize the number of high-variance genes as well as the number of clusters when constructing the weighted graph. However, this approach did not yield meaningful improvements in our experiments (data not shown). Further extensive tests and investigation of the gene selection problem are required.
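The variance-based gene filter described above can be sketched as follows. The helper `topVarianceGenes` and its interface are hypothetical, written for illustration rather than taken from the MULTIK distribution.

```cpp
// Hypothetical helper: return the indices of the top-n genes ranked by
// sample variance, the filter commonly applied before clustering samples.
#include <algorithm>
#include <numeric>
#include <vector>

// expr[g] holds the expression values of gene g across all samples.
std::vector<int> topVarianceGenes(const std::vector<std::vector<double>>& expr, int n) {
    const int G = static_cast<int>(expr.size());
    std::vector<double> var(G, 0.0);
    for (int g = 0; g < G; ++g) {
        const std::vector<double>& x = expr[g];
        double mean = std::accumulate(x.begin(), x.end(), 0.0) / x.size();
        for (double v : x) var[g] += (v - mean) * (v - mean);
        var[g] /= x.size();                       // population variance of gene g
    }
    std::vector<int> idx(G);
    std::iota(idx.begin(), idx.end(), 0);         // 0, 1, ..., G-1
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return var[a] > var[b]; });
    idx.resize(std::min(n, G));                   // keep the n most variable genes
    return idx;
}
```

Randomizing `n` across ensemble runs, as discussed above, amounts to redrawing this index set before each k-means run.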
Conclusion
We found the geometric complexity is most important feature of clusters for accurate classification of microarray samples, which has been often overlooked by other clustering methods. MULTIK exploits the geometric information of clusters very well since it applies ensemble clustering by varying the number of clusters. With its high performance and simplicity, we expect MULTIK will become a useful method to uncover the subtypes of disease from expression profiles.
Methods
Formal statement of the MULTIK algorithm
Inputs:

1) A set S of N points in R^{n}.

2) A number M of clustering runs to perform.

3) A distribution D on the number of clusters.

Outputs:

1) A category function CAT: S → Z.

2) A cut-plot f_{cut}: [0, 1] → Z.
Details:
Initialize an N × N matrix W of pairwise connection strengths to all zeros.
Repeat M times
    Select an integer d from D
    k-means cluster S with d clusters
    For each pair (i, j) ∈ S × S with i and j in the same cluster
        Increment W[i][j]
    end For
end Repeat
Normalize W by dividing each entry W[i][j] by M
For l = 0 to M
    Construct the graph G with V(G) = S and E(G) = the pairs of points for which W[i][j] > l/M
    Compute the number c of connected components of G
    Add the point (l/M, c) to f_{cut}
    Compute the entropy H(l/M) of the sizes of the connected components
end For
For each x with l/M < x < (l+1)/M, set f(x) = f(l/M) and H(x) = H(l/M)
Build a new graph on S with edges where W[i][j] > C, where the threshold C is chosen so that the interval {x : f(x) = f(C) ≠ 1} has the longest length
Enumerate the connected components of this graph
CAT[i] is the number of the connected component containing i
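The steps above can be sketched in C++ (the language of the released implementation). This is a minimal illustration under our own simplifications, not the authors' distributed code: k-means uses plain Lloyd iteration with random seeding, D is modeled as a uniform choice over a small list of cluster numbers, and connected components are counted with a union-find; the function names (`kmeans`, `buildW`, `components`) are ours.

```cpp
// Minimal sketch of the MULTI-K ensemble procedure: accumulate a
// co-membership matrix W over M k-means runs with a varying number of
// clusters, then count connected components above each threshold l/M.
#include <cassert>
#include <cstdlib>
#include <functional>
#include <numeric>
#include <vector>

using Point = std::vector<double>;

static double sqDist(const Point& a, const Point& b) {
    double s = 0.0;
    for (size_t t = 0; t < a.size(); ++t) s += (a[t] - b[t]) * (a[t] - b[t]);
    return s;
}

// One run of Lloyd's k-means with random seeds; returns a label per point.
static std::vector<int> kmeans(const std::vector<Point>& S, int k) {
    const int N = static_cast<int>(S.size());
    std::vector<Point> c(k);
    for (int j = 0; j < k; ++j) c[j] = S[std::rand() % N];
    std::vector<int> label(N, 0);
    for (int iter = 0; iter < 50; ++iter) {
        for (int i = 0; i < N; ++i) {                 // assignment step
            int best = 0;
            for (int j = 1; j < k; ++j)
                if (sqDist(S[i], c[j]) < sqDist(S[i], c[best])) best = j;
            label[i] = best;
        }
        for (int j = 0; j < k; ++j) {                 // update step
            Point mean(S[0].size(), 0.0);
            int cnt = 0;
            for (int i = 0; i < N; ++i)
                if (label[i] == j) {
                    for (size_t t = 0; t < mean.size(); ++t) mean[t] += S[i][t];
                    ++cnt;
                }
            if (cnt > 0) { for (double& v : mean) v /= cnt; c[j] = mean; }
        }
    }
    return label;
}

// Unnormalized co-membership counts over M runs; the number of clusters d
// is drawn uniformly from dChoices (a stand-in for the distribution D).
std::vector<std::vector<int>> buildW(const std::vector<Point>& S, int M,
                                     const std::vector<int>& dChoices) {
    const int N = static_cast<int>(S.size());
    std::vector<std::vector<int>> W(N, std::vector<int>(N, 0));
    for (int m = 0; m < M; ++m) {
        int d = dChoices[std::rand() % dChoices.size()];
        std::vector<int> lab = kmeans(S, d);
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                if (lab[i] == lab[j]) ++W[i][j];
    }
    return W;
}

// Connected components of the graph with edges W[i][j] > thresh, via
// union-find; one call per threshold level yields a point of the cut-plot.
int components(const std::vector<std::vector<int>>& W, int thresh) {
    const int N = static_cast<int>(W.size());
    std::vector<int> parent(N);
    std::iota(parent.begin(), parent.end(), 0);
    std::function<int(int)> find =
        [&](int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); };
    for (int i = 0; i < N; ++i)
        for (int j = i + 1; j < N; ++j)
            if (W[i][j] > thresh) parent[find(i)] = find(j);
    int c = 0;
    for (int i = 0; i < N; ++i) if (find(i) == i) ++c;
    return c;
}
```

Calling `components(W, l)` for l = 0, …, M and plotting the result against l/M reproduces the cut-plot; thresholding at the longest plateau then yields the category function CAT.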
Convergence of the MULTIK algorithm
MULTIK is a stochastic algorithm, and the cut-plot is a random outcome of it. We show here the convergence of the algorithm and of the cut-plot. Suppose that i, j ∈ S. Let p_{ij} be the probability that i and j fall in the same cluster when we pick a number of clusters d from the distribution D and then cluster S into d clusters. Since D is a distribution on a finite set (more clusters than data items is nonsensical) and the number of possible outcomes of MULTIK is also finite, p_{ij} exists and is a well-defined probability. Let W* be the graph on vertex set S with edge weight p_{ij} between i and j, with all zero-weight edges removed. Let f*(x) be the cut-plot derived from W* in the same way a standard cut-plot is derived from W. Using the unnormalized version of W[i][j], we obtain lim_{M → ∞} W[i][j]/M = p_{ij}. Therefore, as we increase M, MULTIK creates graphs W that are successively better approximations of W*, and likewise the cut-plot f(x) approximates f*(x).
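Written out, the limit underlying this argument is the strong law of large numbers applied to the indicator of co-membership; here $W_M[i][j]$ denotes the unnormalized count after $M$ runs:

```latex
W_M[i][j] \;=\; \sum_{m=1}^{M} \mathbf{1}\{\, i \text{ and } j \text{ are co-clustered in run } m \,\},
\qquad
\lim_{M \to \infty} \frac{W_M[i][j]}{M} \;=\; p_{ij} \quad \text{almost surely},
```

since the $M$ clustering runs are independent draws from the same random procedure.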
References
 1.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273–3297.
 2.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95(25):14863–14868. 10.1073/pnas.95.25.14863
 3.
Andre F, Pusztai L: Molecular classification of breast cancer: implications for selection of adjuvant chemotherapy. Nat Clin Pract Oncol 2006, 3(11):621–632. 10.1038/ncponc0636
 4.
Perou CM, Sorlie T, Eisen MB, Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, et al.: Molecular portraits of human breast tumours. Nature 2000, 406(6797):747–752. 10.1038/35021093
 5.
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, et al.: Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 2003, 100(14):8418–8423. 10.1073/pnas.0932692100
 6.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531–537. 10.1126/science.286.5439.531
 7.
Handl J, Knowles J, Kell DB: Computational cluster validation in postgenomic data analysis. Bioinformatics 2005, 21(15):3201–3212. 10.1093/bioinformatics/bti517
 8.
Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM, et al.: The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 2002, 346(25):1937–1947. 10.1056/NEJMoa012914
 9.
Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 2002, 3(7):RESEARCH0036. 10.1186/gb-2002-3-7-research0036
 10.
Milligan G, Cooper M: An examination of procedures for determining the number of clusters in a data set. Psychometrika 1985, 50: 159–179. 10.1007/BF02294245
 11.
Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc B 2001, 63: 411–423. 10.1111/1467-9868.00293
 12.
Ben-Hur A, Elisseeff A, Guyon I: A stability based method for discovering structure in clustered data. Pac Symp Biocomput 2002, 7: 6–17.
 13.
Strehl A, Ghosh J: Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 2002, 3: 583–617. 10.1162/153244303321897735
 14.
Swift S, Tucker A, Vinciotti V, Orengo C, Liu X, Kellam P: Consensus clustering and functional interpretation of gene-expression data. Genome Biol 2003, 5: R94. 10.1186/gb-2004-5-11-r94
 15.
McShane LM: Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 2002, 18: 1462–1469. 10.1093/bioinformatics/18.11.1462
 16.
Monti S, Tamayo P, Mesirov J, Golub T: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 2003, 52: 91–118. 10.1023/A:1023949509487
 17.
Smolkin M, Ghosh D: Cluster stability scores for microarray data in cancer studies. BMC Bioinformatics 2003, 4: 36. 10.1186/1471-2105-4-36
 18.
Bertoni A, Valentini G: Ensembles based on random projections to improve the accuracy of clustering algorithms. Neural Nets (WIRN 2005), LNCS 2005, 3931: 31–37.
 19.
Yu Z, Wong HS, Wang H: Graph-based consensus clustering for class discovery from gene expression data. Bioinformatics 2007, 23(21):2888–2896. 10.1093/bioinformatics/btm463
 20.
Ashlock DA, Kim EY, Guo L: Multiclustering: avoiding the natural shape of underlying metrics. In ANNIE: 2005. Volume 15. ASME press; 2005:453–461.
 21.
Fred ALN, Jain AK: Robust data clustering. Proc IEEE CS Conf Computer Vision and Pattern Recognition 2003, 2: 128–133.
 22.
Kuncheva LI, Vetrov DP: Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE Trans Pattern Anal Mach Intell 2006, 28(11):1798–1808. 10.1109/TPAMI.2006.226
 23.
Shannon CE: A mathematical theory of communication. Bell Syst Tech J 1948, 27: 379–413.
 24.
Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression data. Bioinformatics 2002, 18(9):1194–1206. 10.1093/bioinformatics/18.9.1194
 25.
McLachlan GJ, Bean RW, Peel D: A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 2002, 18(3):413–422. 10.1093/bioinformatics/18.3.413
 26.
Qin ZS: Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics 2006, 22(16):1988–1997. 10.1093/bioinformatics/btl284
 27.
Hubert L, Arabie P: Comparing partitions. Journal of classification 1985, 2: 193–218. 10.1007/BF01908075
 28.
Milligan G, Cooper M: A study of comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research 1986, 21: 441–458. 10.1207/s15327906mbr2104_5
 29.
Tenenbaum JB, de Silva V, Langford JC: A global geometric framework for nonlinear dimensionality reduction. Science 2000, 290: 2319–2322. 10.1126/science.290.5500.2319
 30.
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403(6769):503–511. 10.1038/35000501
 31.
Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, Behm FG, Raimondi SC, Relling MV, Patel A, et al.: Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 2002, 1(2):133–143. 10.1016/S1535-6108(02)00032-6
 32.
Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, et al.: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 2001, 98(26):15149–15154. 10.1073/pnas.211566398
 33.
He H, Jazdzewski K, Li W, Liyanarachchi S, Nagy R, Volinia S, Calin GA, Liu CG, Franssila K, Suster S, et al.: The role of microRNA genes in papillary thyroid carcinoma. Proc Natl Acad Sci USA 2005, 102(52):19075–19080. 10.1073/pnas.0509603102
 34.
Yanai I, Benjamin H, Shmoish M, Chalifa-Caspi V, Shklar M, Ophir R, Bar-Even A, Horn-Saban S, Safran M, Domany E, et al.: Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics 2005, 21(5):650–659. 10.1093/bioinformatics/bti042
 35.
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, et al.: Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 2003, 100(14):8418–8423. 10.1073/pnas.0932692100
 36.
Dudoit S, Fridlyand J: Bagging to improve the accuracy of a clustering procedure. Bioinformatics 2003, 19: 1090–1099. 10.1093/bioinformatics/btg038
 37.
Martelli PL, Fariselli P, Casadio R: An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins. Bioinformatics 2003, 19: i205–i211. 10.1093/bioinformatics/btg1027
 38.
Shen HB, Chou KC: Ensemble classifier for protein fold recognition. Bioinformatics 2006, 22: 1717–1722. 10.1093/bioinformatics/btl170
 39.
Nam D, Yoon SH, Kim JF: Ensemble learning of genetic networks from time-series expression data. Bioinformatics 2007, 23: 3225–3231. 10.1093/bioinformatics/btm514
 40.
Qiu P, Wang ZJ, Liu KJR: Ensemble dependence model for classification and prediction of cancer and normal gene expression data. Bioinformatics 2005, 21: 3114–3121. 10.1093/bioinformatics/bti483
 41.
Kim DW, Lee KH, Lee D: Detecting clusters of different geometrical shapes in microarray gene expression data. Bioinformatics 2005, 21(9):1927–1934. 10.1093/bioinformatics/bti251
Acknowledgements
This work was supported by the Korea Research Council of Fundamental Science & Technology (KRCF), Grant No. CRESEARCH0809NIMS. The third author would like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) for this work. EYK and DN were also supported by KRF grants funded by the Korean government (MEST), No. 20090072588 and No. 20090077754, respectively.
Authors' contributions
EYK codeveloped the algorithm and codes and analyzed data. SYK codesigned the study and aided data analysis. DA codeveloped the algorithm and wrote the Additional file 1. DN codesigned and supervised the study, analyzed data, and drafted the manuscript. All the authors contributed to the writing of the main manuscript.
Keywords
 Cluster Structure
 Cluster Strategy
 Adjusted Rand Index
 Geometric Complexity
 Ensemble Cluster