Volume 9 Supplement 6
Symposium of Computations in Bioinformatics and Bioscience (SCBB07)
Genetic weighted k-means algorithm for clustering large-scale gene expression data
- Fang-Xiang Wu^{1, 2}
https://doi.org/10.1186/1471-2105-9-S6-S12
© Wu; licensee BioMed Central Ltd. 2008
Published: 28 May 2008
Abstract
Background
The traditional (unweighted) k-means algorithm is one of the most popular clustering methods for analyzing gene expression data. However, it suffers from three major shortcomings: it is sensitive to initial partitions, its result is prone to local minima, and it is only applicable to data with spherical-shaped clusters. The last shortcoming means that we must assume that gene expression data at different conditions follow independent distributions with the same variance. Nevertheless, this assumption does not hold in practice.
Results
In this paper, we propose a genetic weighted k-means algorithm (denoted GWKMA), which solves the first two problems and partially remedies the third. GWKMA is a hybrid of a genetic algorithm (GA) and a weighted k-means algorithm (WKMA). In GWKMA, each individual is encoded by a partitional string that uniquely determines a clustering, and three genetic operators (selection, crossover, mutation) together with a WKM operator derived from WKMA are employed. The superiority of GWKMA over the k-means is illustrated on a synthetic and two real-life gene expression datasets.
Conclusion
The proposed algorithm has general application to clustering large-scale biological data such as gene expression data and peptide mass spectral data.
Background
Clustering is the process of partitioning a set of objects (patterns) into a set of disjoint groups (clusters). Its goal is to reduce the amount of data by grouping similar data items together and thereby obtain useful information. Clustering methods can be divided into two basic types: hierarchical and partitional clustering [1]. Within each type there exists a wealth of subtypes and different algorithms. Hierarchical clustering proceeds successively either by merging smaller clusters into larger ones (bottom-up) or by splitting larger clusters into smaller ones (top-down). Hierarchical methods differ in the rules used to decide which two small clusters are merged or which large cluster is split. The final result of the algorithm is a binary tree of clusters called a dendrogram, which shows how the clusters are related to each other. By cutting the dendrogram at a desired level, a clustering of the objects in a dataset into disjoint groups is obtained.
On the other hand, partitional clustering – k-means, for example – attempts to divide a dataset directly into a number of disjoint groups. All partitional clustering algorithms take as input the number of clusters and a cost (criterion) function that defines the quality of a partition. A partitional clustering method optimizes the cost function so as to minimize the dissimilarity of the objects within each cluster while maximizing the dissimilarity between different clusters. In general, partitional clustering algorithms are iterative and hill-climbing, and thus sensitive to the choice of the initial partition. Furthermore, since the associated cost functions are nonlinear and multimodal, these algorithms usually converge to a local minimum. Algorithms based on combinatorial optimization such as integer programming, dynamic programming, and branch-and-bound methods are too expensive, since the number of partitions of n objects into k clusters is O(k^{ n }).
Genetic algorithms (GA) [2], inspired by the natural evolution of genes, offer heuristic solutions to optimization problems. A GA typically starts with a set of (randomly generated) solutions called the population and creates successive generations of the population by genetic operations such as natural selection, crossover, and mutation. Natural selection is performed based on the fitness (related to the cost function) of an individual: the better its fitness, the more chances an individual has to survive into the next generation. Crossover is performed according to a crossover rule, and mutation changes an individual with a user-specified mutation probability. The intuition underlying the approach is that each new population will be better than the previous one. Indeed, it has been proved [3] that a canonical GA converges to the global optimum with probability 1.
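The canonical GA loop described above can be sketched as follows. This is a generic illustration, not the paper's algorithm: the function names and the fitness-proportional selection scheme are our own choices.

```python
import random

def genetic_algorithm(init_pop, fitness, crossover, mutate, generations=50, seed=0):
    """Minimal canonical GA loop (illustrative sketch).

    init_pop: list of individuals; fitness maps an individual to a
    non-negative score (higher is better); crossover(a, b, rng) returns
    a list of children; mutate(c, rng) returns a mutated copy.
    """
    rng = random.Random(seed)
    pop = list(init_pop)
    best = max(pop, key=fitness)          # track the best individual found so far
    for _ in range(generations):
        # Natural selection: fitness-proportional (roulette-wheel) sampling.
        weights = [fitness(ind) for ind in pop]
        parents = rng.choices(pop, weights=weights, k=len(pop))
        # Crossover on consecutive parent pairs, then mutation of each child.
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            children.extend(crossover(a, b, rng))
        pop = [mutate(c, rng) for c in children]
        best = max(pop + [best], key=fitness)
    return best
```

Because the best-so-far individual is retained across generations, the returned solution is never worse than the best member of the initial population.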
A GA is highly dependent on the coding of the solutions (individuals). In the context of weighted k-means, a natural representation of a solution is a pair of variables (partitional string, cluster centroids). The partitional string gives, for each object, the index of the cluster to which it belongs. The cluster centroids are representative objects of the clusters, whose attributes are found by averaging the corresponding attributes of the objects in a particular cluster. These two variables depend on each other: if one of them is given, the other can be uniquely constructed. Since the cluster centroids are generally real numbers, it may be very difficult to encode them. On the other hand, directly encoding partitional strings is simple.
Genetic algorithms have previously been considered for clustering problems [4–11]. When genetic algorithms are not hybridized with k-means algorithms [5, 6, 9, 11], their rates of convergence are very slow. On the other hand, when a GA is hybridized with an unweighted k-means algorithm [7, 8, 10], the resultant algorithm inherits some drawbacks of the unweighted k-means, for example, that the resultant clusters are spherical-shaped. Thus, if the inherent structure of the clusters in the data is not spherical, such algorithms cannot give correct results.
In this paper, we propose a genetic weighted k-means algorithm (GWKMA), a hybrid of a GA and the weighted k-means algorithm (WKMA) [12, 13], which partially remedies the drawbacks of the earlier attempts [4–11]. GWKMA encodes solutions as partitional strings and employs three genetic operations – natural selection, crossover and mutation – and one WKM operation derived from WKMA.
Methods
WKMA
In a general sense, a k-partitioning algorithm takes as input a set D = {x_{1}, x_{2}, ⋯, x_{ n }} of n objects and an integer K, and outputs a partition of D into exactly K disjoint subsets D_{1}, ⋯, D_{ K }. Denote such a partition by Δ. Each of the subsets is a cluster, with objects in the same cluster being somehow more similar to each other than they are to objects in other clusters. One way to make the determination of Δ a well-defined problem is to define a cost function that measures the clustering quality of any partition of the dataset.
Obviously, given a partition Δ, the value of J_{ G }(Δ) changes when the weighting matrix G is multiplied by a scalar. Therefore the weighting matrix must be normalized. In this study the determinant of G is set to 1, i.e.
det(G) = 1 (4)
For fixed G = I in equation (1), condition (4) is satisfied automatically, and equations (1) and (3) become the cost function and optimal objective of a traditional k-means algorithm, respectively.
Equating the derivative to zero and using the auxiliary condition (4) lead to
W = λ(det(G))G^{-1} = λG^{-1} (5)
where $W={\displaystyle \sum _{k=1}^{K}{W}_{k}}$ and ${W}_{k}={\displaystyle \sum _{{x}_{i}\in {D}_{k}}({x}_{i}-{\overline{m}}_{k}{)}^{\prime}({x}_{i}-{\overline{m}}_{k})}$ is the within-group variance of cluster k (k = 1, ⋯, K), and
λ = (det(W))^{1/d} (6)
Finally, we have
G = W^{-1}(det(W))^{1/d} (7)
Note that W depends on the partition Δ. To avoid ambiguity, denote the W induced by Δ as W(Δ). Substituting (7) into (1) leads to J(Δ) = d(det(W(Δ)))^{1/d}. As d is a constant for a given dataset, the cost function of weighted k-means clustering reduces to
J_{ o }(Δ) = (det(W(Δ)))^{1/d} (8)
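The reduced cost (8) is straightforward to compute from a dataset and a partition. Below is a minimal NumPy sketch (the function name and array conventions are our own): rows of X are objects, and W is accumulated as the pooled within-cluster scatter matrix of equation (5).

```python
import numpy as np

def weighted_kmeans_cost(X, labels, K):
    """Cost (8): det(W(Delta))^(1/d), where W is the sum over the K
    clusters of the within-group variance matrices W_k."""
    n, d = X.shape
    W = np.zeros((d, d))
    for k in range(K):
        Xk = X[labels == k]
        if len(Xk) == 0:          # skip empty (singular) clusters
            continue
        C = Xk - Xk.mean(axis=0)  # center on the cluster centroid m_k
        W += C.T @ C              # within-group variance W_k
    return np.linalg.det(W) ** (1.0 / d)
```

A partition that separates well-separated groups correctly yields a smaller pooled scatter, and hence a smaller cost, than a shuffled assignment of the same objects.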
There are O(k^{ n }) different partitions of n objects into exactly k clusters [1], so an exhaustive search for the optimal clustering of a large gene expression dataset is impractical. To overcome this problem, a heuristic approach is usually adopted: randomly select an initial partition and then move objects between groups whenever such moves make J significantly smaller.
Now consider how the cost function J changes when an object x currently in cluster D_{ i } is tentatively moved to a different cluster D_{ j }. Let Δ = (D_{1}, ⋯, D_{ K }), Δ' = (D_{1}, ⋯, D_{ i }\{x}, ⋯, D_{ K }), and Δ" = (D_{1}, ⋯, D_{ i }\{x}, ⋯, D_{ j } ∪ {x}, ⋯, D_{ K }) (i ≠ j). Obviously the condition for successfully moving x from D_{ i } into D_{ j } is
det(W(Δ")) < det(W(Δ)) (10)
The determinants in condition (10) can be updated efficiently from det(W(Δ)), since det(A + βy'y) = det(A)(1 + βyA^{-1}y') for any invertible d × d matrix A, any d-dimensional row vector y, and any number β.
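The identity above can be verified, and exploited, numerically. The sketch below (our own naming) evaluates det(A + βy'y) from A via one linear solve rather than a fresh determinant, which is the point of using the lemma when testing tentative moves:

```python
import numpy as np

def det_rank_one(A, beta, y):
    """Matrix determinant lemma: det(A + beta * y'y) computed from
    det(A) and one linear solve, where y is a d-dimensional row vector
    and y'y is its d x d outer product."""
    # y @ A^{-1} @ y' is the scalar quadratic form in the lemma.
    return np.linalg.det(A) * (1.0 + beta * y @ np.linalg.solve(A, y))
```

In the clustering loop, A plays the role of W(Δ) and βy'y the rank-one change caused by moving one object, so condition (10) can be checked without recomputing det(W) from scratch.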
GWKMA
Encoding
In the literature [5, 6, 8], solutions (individuals) are encoded by the centers of clusters. Note that for general clustering tasks the centers of clusters are real numbers, and encoding real numbers in a GA is difficult and may degrade the accuracy of the solutions.
We use a partitional string to express a solution to a clustering. A partitional string is an integer string over the set {1,⋯, K}, in which each position corresponds to an object and the number in a position represents the cluster to which the corresponding object is assigned. Thus, the search space consists of all integer strings s_{Δ} of length n over the set {1,⋯, K}. A population is expressed by a set of partitional strings representing its individuals (solutions), denoted by ${\tilde{\Delta}}^{\ast}$ or Δ*. One may set additional conditions to refine the search space. For example, to avoid a singular clustering, one may impose the constraint that each element of the set {1,⋯, K} appears at least once in s_{Δ}.
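The mutual determinacy of the two representations can be illustrated with a small sketch (function names are our own): given a partitional string the centroids follow by averaging, and given centroids a string follows by nearest-centroid assignment.

```python
import numpy as np

def centroids_from_string(X, s, K):
    """Recover the K cluster centroids from a partitional string s
    (entries in 1..K) by averaging each cluster's objects."""
    s = np.asarray(s)
    return np.vstack([X[s == k].mean(axis=0) for k in range(1, K + 1)])

def string_from_centroids(X, M):
    """Recover a partitional string from centroid matrix M by assigning
    each object to its nearest centroid (1-based labels)."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1) + 1
```

For a partition whose clusters are well separated, composing the two maps reproduces the original string, which is why either representation suffices as a GA encoding.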
The advantage of encoding the centers of clusters is that the resultant clusters from GA clustering are convex. GWKMA instead encodes the solutions (individuals) as integer (partitional) strings. This simplifies the encoding while not degrading the accuracy of the solutions: since GWKMA includes the weighted k-means operator, the resultant clusters from GWKMA are still convex.
Selection operator
where N (an odd positive integer) is the number of individuals in a population, s_{Δi} is the partitional string of individual i, and F(s_{Δi}) represents the fitness value of individual i in the current population. Fitness here is defined as
F(s_{Δ}) = TJ - J(s_{Δ}) (15)
where $\overline{m}=\frac{1}{n}{\displaystyle \sum _{x\in D}x}$. It is evident that TJ ≥ J(s_{Δ}) for any s_{Δ} in the problem. Note that there are (N-1)/2+1 individuals in ${\tilde{\Delta}}^{\ast}$.
Crossover operator
Δ* = Crossover(${\tilde{\Delta}}^{\ast}$, N). The intention of the crossover operation is to create new (and hopefully better) individuals from two selected parent individuals. In GWKMA, one parent is always the first individual, i.e. the best individual found so far, and the other is one of the (N-1)/2 individuals selected from the parent population other than the first individual. In this paper, the crossover operator adopts the single-point crossover method for simplicity. Note that after the crossover operation the population Δ* again has N individuals.
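A sketch of this crossover scheme (our own naming; the paper gives no pseudocode): the best-so-far string is crossed with each of the (N-1)/2 selected parents at one random cut point, and each pair contributes two children, so 1 + 2·(N-1)/2 = N individuals result.

```python
import random

def single_point_crossover(best, others, rng=None):
    """Cross the best-so-far partitional string with each selected
    parent at one random cut point; each pair yields two children,
    restoring a population of N = 1 + 2*len(others) individuals."""
    rng = rng or random.Random()
    new_pop = [best]                       # the elite individual survives
    for p in others:
        cut = rng.randrange(1, len(best))  # single cut point in 1..n-1
        new_pop.append(best[:cut] + p[cut:])
        new_pop.append(p[:cut] + best[cut:])
    return new_pop
```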
Mutation operator
Δ* = Mutation(Δ*, Pm, K, N). Each position in a coding string is randomly selected with a user-set mutation probability Pm, and the number in a selected position is replaced by another integer drawn uniformly at random from the set {1,⋯, K}. In other work [14], such a mutation depends on the distance of the corresponding object from the cluster centroids. Such a complex strategy is not necessary here because the WKM operator is applied afterwards. To avoid a singular partition (one containing an empty cluster), the mutation operator also randomly assigns one object to any cluster that is empty after all genetic operations.
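A sketch of this mutation operator (our own naming; note that the simple empty-cluster repair shown here could in principle empty another cluster, which for n ≫ K is negligible):

```python
import random

def mutate(s, Pm, K, rng=None):
    """Flip each position of partitional string s to a uniformly chosen
    different label with probability Pm, then repair any empty cluster
    by reassigning one randomly chosen object to it."""
    rng = rng or random.Random()
    s = list(s)
    for i in range(len(s)):
        if rng.random() < Pm:
            s[i] = rng.choice([k for k in range(1, K + 1) if k != s[i]])
    for k in range(1, K + 1):      # repair singular partitions
        if k not in s:
            s[rng.randrange(len(s))] = k
    return s
```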
WKM operator
[Δ*, J(Δ*)] = WKM(Δ*, X, K, N): The WKM operator is obtained by calling WKMA for each individual s_{Δ} in population Δ*. It is sufficient to run the repeat loop in WKMA a few times. Note that J(Δ*) is an N-dimensional vector, each component of which is the value of the cost function for an individual in population Δ*. In other works [7–9], several different k-means operators were employed with functions similar to that of WKMA; however, those k-means algorithms are neither iteratively optimal nor weighted.
Evaluation
The term "evaluation of a clustering method" usually refers to the ability of a given method to recover true clusters in a dataset. There have been several attempts to evaluate a clustering method on theoretical grounds [14, 15]. Since a clustering result can be considered as a partition of objects into a number of groups, for evaluating a clustering method it is necessary to define a measure of agreement between two partitions of the same dataset. In the clustering literature, measures of agreement between partitions are referred to as external indices. Several such indices have been described [15, 16]. This paper adopts the adjusted Rand index (ARI).
Contingency table for two partitions of n objects
|  | v_{1} | v_{2} | ⋯ | v_{ s } | Total |
|---|---|---|---|---|---|
| u_{1} | n_{11} | n_{12} | ⋯ | n_{1s} | n_{1.} |
| u_{2} | n_{21} | n_{22} | ⋯ | n_{2s} | n_{2.} |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ |
| u_{ r } | n_{r1} | n_{r2} | ⋯ | n_{ rs } | n_{r.} |
| Total | n_{.1} | n_{.2} | ⋯ | n_{.s} | n_{..} = n |
The ARI [18] adjusts the Rand index so that its expected value is 0 when the two partitions are selected at random and its value is 1 when they match perfectly. Accordingly, a large value of ARI indicates that the two partitions are in strong agreement. To investigate the sensitivity of a partitional clustering method to initial partitions, the clustering method is run with numerous different initial partitions, and the average ARI (AARI) over all pairs of resultant clusterings is calculated. This AARI indicates the sensitivity of the clustering method to initial partitions: the larger the value of AARI, the more insensitive (better) the clustering method is to initial partitions.
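The ARI can be computed directly from the contingency table above. The following is a minimal sketch (our own implementation of the standard Hubert–Arabie formula, with n_{ij}, n_{i.}, n_{.j} as in the table):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(u, v):
    """Adjusted Rand index of two labelings u, v of the same n objects,
    computed from their contingency table."""
    n = len(u)
    nij = Counter(zip(u, v))   # cell counts n_ij
    a = Counter(u)             # row sums n_i.
    b = Counter(v)             # column sums n_.j
    sum_ij = sum(comb(c, 2) for c in nij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)   # expected index under randomness
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions, even under a relabeling of the clusters, score 1; unrelated partitions score near 0 (and can go slightly negative).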
To evaluate the quality of the clusters, we propose a measure of internal consistency based on the singular value decomposition (SVD) of each cluster. To define internal consistency, suppose we are given a partition of our n × m dataset X into K disjoint clusters, where m is the number of time points and n is the number of genes. For the j th cluster (j = 1,⋯, K) we have a matrix X_{ j } of microarray measurements, where the rows are genes and the columns are time points, so that X_{ j } is an n_{ j } × m matrix, where n_{ j } is the number of genes in the j th cluster. Using the SVD, we decompose ${X}_{j}={U}_{j}{S}_{j}{V}_{j}^{T}$, where U_{ j } and V_{ j } are orthogonal matrices and S_{ j } is a diagonal matrix whose entries describe the importance of the columns of U_{ j } and V_{ j }. The matrix X_{ j }V_{ j } = U_{ j }S_{ j } contains the projections of the rows (genes) of X_{ j } onto the basis V_{ j }, and the entries of S_{ j } (singular values) give the relative importance of the rows of V_{ j }. If the first entry of S_{ j } is much larger than the second, then most of the information in the rows of X_{ j } is captured by a single dimension. We thus define the internal consistency of the j th cluster to be the ratio of the first and second singular values in S_{ j }; it measures how well a single dimension can describe all genes in the cluster. We can then evaluate the quality of a clustering with K clusters by the average internal consistency (AICo) of the K clusters. A high value of AICo indicates good quality of the clusters.
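The internal consistency of a single cluster is then just the ratio of the first two singular values, which NumPy computes directly. A minimal sketch (the function name is our own):

```python
import numpy as np

def internal_consistency(Xj):
    """Ratio of the first to second singular value of a cluster's
    n_j x m expression matrix X_j. A high ratio means a single
    dimension captures most of the cluster's variation."""
    s = np.linalg.svd(Xj, compute_uv=False)  # singular values, descending
    return s[0] / s[1]
```

A cluster of genes sharing one expression profile (up to scaling and small noise) is nearly rank one and scores far higher than a matrix of unrelated profiles.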
Results
This section uses a synthetic and two real-life gene expression datasets to investigate the performance of GWKMA in terms of AARI and AICo, compared with the widely used k-means.
Synthetic dataset (SYN)
A synthetic dataset is generated by the sine function modeling the cyclic behaviour of genes employed by Yeung et al. [19]. Let x_{ ij } be the simulated expression level of gene i at time point j, modeled by x_{ ij } = λ_{ j } · φ(i, j)(1 + α_{ ij }), where φ(i, j) = sin(2πj/8 - w_{k(i)} + ε_{ ij }). λ_{ j } is the amplitude control at time j, chosen according to the standard normal distribution. φ(i, j) models the cyclic behaviour of genes; each cycle spans eight time points. Different clusters are represented by different phase shifts, and w_{k(i)} is the phase shift for gene i in cluster k, chosen according to the uniform distribution on the interval [0, 2π]. The random variable ε_{ ij } represents gene synchronization noise, chosen according to the normal distribution with mean zero and standard deviation 0.3. α_{ ij } represents the measurement error for gene i at time j, chosen according to the normal distribution with mean zero and standard deviation 0.4. Using this model, a synthetic dataset is generated consisting of the expression levels of 600 genes at 12 time points. These 600 genes belong to six clusters, each containing 100 genes.
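The generating model can be sketched as follows (the function and argument names and the NumPy realization are our own; the paper's experiments were run in Matlab):

```python
import numpy as np

def synthetic_dataset(n_clusters=6, genes_per_cluster=100, T=12, seed=0):
    """Cyclic synthetic data after Yeung et al.:
    x_ij = lambda_j * sin(2*pi*j/8 - w_k(i) + eps_ij) * (1 + alpha_ij)."""
    rng = np.random.default_rng(seed)
    lam = rng.standard_normal(T)                  # amplitude control per time point
    w = rng.uniform(0, 2 * np.pi, n_clusters)     # one phase shift per cluster
    n = n_clusters * genes_per_cluster
    labels = np.repeat(np.arange(n_clusters), genes_per_cluster)
    j = np.arange(T)
    eps = rng.normal(0, 0.3, (n, T))              # synchronization noise
    alpha = rng.normal(0, 0.4, (n, T))            # measurement error
    X = lam * np.sin(2 * np.pi * j / 8 - w[labels, None] + eps) * (1 + alpha)
    return X, labels
```

With the defaults this yields a 600 × 12 matrix in six clusters of 100 genes, matching the dataset described above.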
Two real-life datasets
The first real-life dataset, denoted BAC in this paper, is a subset of gene expression profiles over 11 time points collected during the bacterial cell cycle [20]; it contains 431 gene expression profiles with standard deviation greater than 0.5 and no missing data points. The second dataset, denoted SPO, is a subset of gene expression profiles over 7 time points collected during the developmental program of sporulation in budding yeast [21]; it contains 529 gene expression profiles with standard deviation greater than 1.0 and no missing data points. Both original datasets are publicly available from the Stanford microarray database [22] at http://genome-www5.stanford.edu/.
In the experiments conducted in this study, the number of generations is set to GEN = 15, the population size to N = 21, and the mutation probability to Pm = 0.10. The Matlab™ software package was used to conduct the experiments. Both AARI and AICo are computed over a range of numbers of clusters and over multiple runs of both GWKMA and the traditional k-means.
The superior quality of clustering from GWKMA can be explained as follows. The k-means method assumes that 1) all attributes (data at time points) of the objects (genes) are independent and 2) the standard deviations of all attributes over all objects are equal. Real gene expression data typically violate both assumptions, whereas the weighted cost function adapts the weighting matrix G to the covariance structure of the data.
Conclusion
In this study, a genetic weighted k-means algorithm (GWKMA) is proposed which is a hybrid of the weighted k-means algorithm and a genetic algorithm. GWKMA was run on one synthetic and two real-life gene expression datasets. The results of the computational experiments show that GWKMA performs better than the k-means in terms of both cluster quality (AICo) and sensitivity to initial partitions (AARI).
In real-life datasets, the assumptions underlying the k-means are typically not satisfied. The weighted k-means does not need these assumptions. However, like the k-means, the weighted k-means is sensitive to initial partitions. The proposed GWKMA possesses the merits of both the genetic algorithm and the weighted k-means algorithm, and thus overcomes the disadvantages of the k-means and the weighted k-means. In addition, the proposed algorithm is generic and could be applied to clustering large-scale biological data such as gene expression data and peptide mass spectral data.
Declarations
Acknowledgements
This work is supported by Natural Sciences and Engineering Research Council of Canada (NSERC). The author benefited greatly from discussions with Drs W.J. Zhang and A.J. Kusalik at the early stage of this research.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S6.
References
- Hartigan J: Clustering Algorithms. 1975, New York: Wiley Press
- Reeves CR, Rowe JE: Genetic Algorithms: Principles and Perspectives: A Guide To GA Theory. 2003, Boston: Kluwer Academic Publishers
- Rudolph G: Convergence analysis of canonical genetic algorithms. IEEE Transactions on Neural Networks. 1994, 5: 86-101. 10.1109/72.265964.
- Krishna K, Murty MM: Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics. 1999, 29: 433-439. 10.1109/3477.764879.
- Hall LO, Ozyurt IB, Bezdek JC: Clustering with a genetically optimized approach. IEEE Transactions on Evolutionary Computation. 1999, 3: 103-112. 10.1109/4235.771164.
- Maulik U, Bandyopadhyay S: Genetic algorithm-based clustering technique. Pattern Recognition. 2000, 33: 1455-1456. 10.1016/S0031-3203(99)00137-5.
- Franti P, Kivijärvi J, Kaukoranta T, Nevalainen O: Genetic algorithms for large-scale clustering problems. The Computer Journal. 1997, 40: 547-554. 10.1093/comjnl/40.9.547.
- Scheunders P: A genetic c-means clustering algorithm applied to color image quantization. Pattern Recognition. 1997, 30: 859-866. 10.1016/S0031-3203(96)00131-8.
- Al-Sultan KS, Khan MM: Computational experience on four algorithms for the hard clustering problem. Pattern Recognition Letters. 1996, 17: 295-308. 10.1016/0167-8655(95)00122-0.
- Wu FX, Zhang WJ, Kusalik AJ: A genetic k-means clustering algorithm applied to gene expression data. Proceedings of the Sixteenth Canadian Conference on Artificial Intelligence, Halifax, Canada. 2003, 520-526.
- Tseng LY, Yang SB: A genetic clustering algorithm for data with non-spherical-shape clusters. Pattern Recognition. 2000, 33: 1251-1259. 10.1016/S0031-3203(99)00105-3.
- Duda RO, Hart PE, Stork DG: Pattern Classification. 2001, New York: Wiley Press
- Spath H: Cluster Analysis Algorithms for Data Reduction and Classification of Objects. 1975, West Sussex: Ellis Horwood Limited
- Estivill-Castro V: Why so many clustering algorithms – A position paper. SIGKDD Explorations. 2002, 4: 65-73. 10.1145/568574.568575.
- Theodoridis S, Koutroumbas K: Pattern Recognition. 1999, San Diego: Academic Press
- Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol. 2002, 3 (7): RESEARCH0036. 10.1186/gb-2002-3-7-research0036.
- Hubert L, Arabie P: Comparing partitions. Journal of Classification. 1985, 2: 193-218. 10.1007/BF01908075.
- Rand WM: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971, 66: 846-850. 10.2307/2284239.
- Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Model-based clustering and data transformations for gene expression data. Bioinformatics. 2001, 17: 977-987. 10.1093/bioinformatics/17.10.977.
- Laub MT, McAdams HH, Feldblyum T, Fraser CM, Shapiro L: Global analysis of the genetic network controlling a bacterial cell cycle. Science. 2000, 290: 2144-2148. 10.1126/science.290.5499.2144.
- Chu S, DeRisi J, Eisen MB, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science. 1998, 282: 699-705. 10.1126/science.282.5389.699.
- Sherlock G, Hernandez-Boussard T, Kasarskis A, Binkley G, Matese JC, Dwight SS, Kaloper M, Weng S, Jin H, Ball CA, Eisen MB, Spellman PT, Brown PO, Botstein D, Cherry JM: The Stanford Microarray Database. Nucleic Acids Research. 2001, 29: 152-155. 10.1093/nar/29.1.152.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.