Incremental genetic K-means algorithm and its application in gene expression data analysis
Yi Lu^{1}, Shiyong Lu^{1}, Farshad Fotouhi^{1}, Youping Deng^{2} and Susan J Brown^{3}
DOI: 10.1186/1471-2105-5-172
© Lu et al; licensee BioMed Central Ltd. 2004
Received: 10 March 2004
Accepted: 28 October 2004
Published: 28 October 2004
Abstract
Background
In recent years, clustering algorithms have been effectively applied in molecular biology for gene expression data analysis. With the help of clustering algorithms such as K-means, hierarchical clustering and SOM, genes are partitioned into groups based on the similarity between their expression profiles, and functionally related genes are thereby identified. As the amount of laboratory data in molecular biology grows exponentially each year due to advanced technologies such as microarrays, new efficient and effective clustering methods must be developed to process this growing amount of biological data.
Results
In this paper, we propose a new clustering algorithm, the Incremental Genetic K-means Algorithm (IGKA). IGKA is an extension of our previously proposed clustering algorithm, the Fast Genetic K-means Algorithm (FGKA), and outperforms FGKA when the mutation probability is small. The main idea of IGKA is to calculate the objective value, the Total Within-Cluster Variation (TWCV), and the cluster centroids incrementally whenever the mutation probability is small. IGKA inherits FGKA's salient feature of always converging to the global optimum. A C program is freely available at http://database.cs.wayne.edu/proj/FGKA/index.htm.
Conclusions
Our experiments indicate that, while the IGKA algorithm has a convergence pattern similar to FGKA's, it has better time performance once the mutation probability decreases below a certain point. Finally, we used IGKA to cluster a yeast data set and found that it increased the enrichment of genes of similar function within clusters.
Background
In recent years, clustering algorithms have been effectively applied in molecular biology for gene expression data analysis (see [1] for an excellent survey). With the advancement of microarray technology, it is now possible to observe the expression levels of thousands of genes simultaneously while cells experience specific conditions or undergo specific processes. Clustering algorithms are used to partition genes into groups based on the similarity between their expression profiles, and functionally related genes are thereby identified. As the amount of laboratory data in molecular biology grows exponentially each year due to advanced technologies such as microarrays, new efficient and effective clustering methods must be developed to process this growing amount of biological data.
Among the various clustering algorithms, K-means [2] is one of the most popular in gene expression data analysis due to its high computational performance. However, K-means is well known to converge to local optima, and its result depends on the initialization process, which randomly generates the initial clustering. In other words, different runs of K-means on the same input data might produce different solutions.
A number of researchers have proposed genetic algorithms [3–6] for clustering. The basic idea is to simulate the evolutionary process of nature and evolve solutions from one generation to the next. In contrast to K-means, which might converge to a local optimum, these genetic algorithms are insensitive to initialization and eventually converge to the global optimum. However, they are usually computationally expensive, which impedes their wide application in practice, for example in gene expression data analysis.
Recently, Krishna and Murty proposed a new clustering method called the Genetic K-means Algorithm (GKA) [7], which hybridizes a genetic algorithm with the K-means algorithm. This hybrid approach combines the robustness of the genetic algorithm with the high performance of K-means. As a result, GKA always converges to the global optimum, and does so faster than other genetic algorithms.
In [8], we proposed a faster version of GKA, FGKA, which features several improvements over GKA, including an efficient evaluation of the objective value TWCV (Total Within-Cluster Variation), the avoidance of illegal-string elimination overhead, and a simplification of the mutation operator. These improvements make FGKA run 20 times faster than GKA [9]. In this paper, we propose an extension to FGKA, the Incremental Genetic K-means Algorithm (IGKA), which inherits all the advantages of FGKA, including convergence to the global optimum, and outperforms FGKA when the mutation probability is small. The main idea of IGKA is to calculate the objective value TWCV and the cluster centroids incrementally. We then propose a Hybrid Genetic K-means Algorithm (HGKA) that combines the benefits of FGKA and IGKA. We show that clustering microarray data with IGKA tends more strongly to group genes of the same functional category into the same cluster.
Results
Our experiments were conducted on a Dell PowerEdge 400SC PC with a 2.24 GHz CPU and 512 MB of RAM. The three algorithms, FGKA, IGKA and HGKA, were implemented in C. GKA has a convergence pattern similar to FGKA and IGKA, but its time performance is worse than FGKA's; see [9] for more details. In the following, we first compare the time performance of FGKA and IGKA across different mutation probabilities, then compare the convergence of four algorithms, IGKA, FGKA, K-means and SOM (Self-Organizing Map), and finally examine how IGKA and FGKA can be combined to obtain better performance.
Data sets
The two data sets used in our experiments are the serum data, fig2data, introduced in [11], and the yeast data, chodata, introduced in [2]. The fig2data set contains expression data for 517 genes; each gene has 19 expression values, ranging from 15 minutes to 24 hours, so the number of features D is 19. According to [11], the 517 genes can be divided into 10 groups. The chodata set is a yeast data set composed of expression data for 2907 genes; the expression data for each gene range from 0 minutes to 160 minutes, which means the number of features D is 15. According to the description in [2], the genes can be divided into 30 groups. Since IGKA is a stochastic algorithm, for each experiment in this study we obtain results by averaging 10 independent runs of the program. The mutation probability, the number of generations and the population size all affect the performance and convergence of FGKA and IGKA; a detailed discussion of the parameter settings can be found in [8]. In this paper we simply adopt the settings from [8]: the population size is set to 50 and the number of generations to 100. These settings are sufficient to guarantee that the algorithms converge to the optimum.
Comparison of IGKA with FGKA on time performance
Comparison of IGKA with FGKA, K-means and SOM on convergence
Comparison of different algorithms on TWCV convergence with two data sets. Four algorithms, IGKA, FGKA, K-means and SOM, were run on the two data sets, fig2data and chodata. The TWCVs of the IGKA and FGKA algorithms were obtained by averaging 10 individual runs, with the number of generations set to 100, the population size set to 50, and the mutation probability set to 0.005 for fig2data and 0.0005 for chodata. The TWCV of the K-means algorithm was obtained by averaging 20 individual runs. The TWCV of SOM was obtained from 8 individual runs with different settings of the X and Y dimensions. The IGKA and FGKA algorithms show better TWCV convergence than K-means and SOM.
Algorithm  Fig2data  Chodata
IGKA (average of 10 individual runs; generation 100, population 50, mutation probability 0.005 for fig2data and 0.0005 for chodata)  4991.53889  16995.7
FGKA (same settings as IGKA)  4992.13889  16995.4
K-means (average of 20 individual runs)  5154.21434  17374.6758
SOM (average of 8 individual runs with different settings)  24805.3661  21660.9049
Combination of IGKA with FGKA
Discussion
Distribution of ORF function categories in the clusters. The chodata set was clustered using the IGKA algorithm, and we identified the distribution of genes of different functional categories across the clusters. The function categories were assigned according to MIPS (Mewes et al., 2000). The total number of ORFs in each function category is indicated in parentheses. The cluster to which the genes were grouped is given in the "Cluster" column; the number of ORFs in each cluster is given in the "Total" column; the number of ORFs within each functional category is given in the "Function ORFs" column; and the percentage of the ORFs within the functional category relative to the total number of ORFs in the cluster is given in the "Percentage (%)" column.
Cluster  MIPS functional category                          Total  Function ORFs  Percentage (%)
1        Mitotic cell cycle and cell cycle control (352)   86     24             27.9
         Budding, cell polarity, filament formation (170)         8              9.3
3        Organization of mitochondrion (366)               156    111            71.2
         Respiration (88)                                         10             6.4
         Nitrogen and sulphur metabolism (67)                     9              5.6
16       Organization of nucleus (774)                     133    50             37.6
17       Ribosome biogenesis (215)                         88     50             56.8
         Organization of cytoplasm (554)                          31             35.2
18       Organization of mitochondrion (366)               184    105            57.1
25       DNA synthesis and replication (94)                164    23             14.0
         DNA recombination and DNA repair (153)                   11             6.7
         Lipid and fatty isoprenoid metabolism (213)              9              5.5
29       Organization of nucleus/chromosome (44)           93     14             15.0
         Amino acid metabolism (204)                              12             12.9
30       TCA pathway or Krebs cycle (25)                   92     7              7.6
         C-compound and carbohydrate metabolism (415)             14             15.2
Most interestingly, we found a remarkable enrichment of ORFs in the functional category organization of mitochondrion. They are mainly located in two clusters, cluster 3 and cluster 18. Cluster 3 has 156 ORFs in total, of which 111 belong to this category, a very high percentage of 71.2%. Cluster 18 has 184 ORFs in total, of which 105 belong to the category, a percentage of 57.1%. The percentage of ORFs within the same function category was only 18.8% in the previous paper [2]. It appears that our IGKA method tends to increase the degree of enrichment of genes within functional categories, making the clusters more biologically meaningful. We also found a new function category, lipid and fatty isoprenoid metabolism, distributed in cluster 25, which was not listed in Tavazoie's paper [2].
Conclusions
In this paper, we propose a new clustering algorithm called the Incremental Genetic K-means Algorithm (IGKA). IGKA is an extension of FGKA, which in turn was inspired by the Genetic K-means Algorithm (GKA) proposed by Krishna and Murty. IGKA inherits the advantages of FGKA and outperforms it when the mutation probability is small. Since FGKA and IGKA can each outperform the other, a hybrid approach that combines the benefits of both is desirable. Our experimental results show not only that the time performance of our algorithm is improved, but also that clustering gene expression data with it leads to interesting biological findings.
Methods
The input to the problem of clustering gene expression data consists of N genes and their corresponding N patterns. Each pattern is a vector of D dimensions recording the expression levels of a gene under each of the D monitored conditions or at each of the D time points. The goal of the IGKA algorithm is to partition the N patterns into a user-defined number K of groups such that the partition minimizes the Total Within-Cluster Variation (TWCV, also called the square-error in the literature).
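The formal definition of the TWCV did not survive in this version of the text; the following is a sketch of the standard GKA/FGKA form, written to be consistent with the notation above, where X_{nd} denotes the d-th feature of pattern X_n, G_k is the set of patterns assigned to cluster k, Z_k = |G_k|, and SF_{kd} is the sum of the d-th features of the patterns in cluster k:

```latex
\mathrm{TWCV}
  \;=\; \sum_{k=1}^{K} \sum_{X_n \in G_k} \sum_{d=1}^{D}
        \left( X_{nd} - \frac{SF_{kd}}{Z_k} \right)^{2}
  \;=\; \sum_{n=1}^{N} \sum_{d=1}^{D} X_{nd}^{2}
        \;-\; \sum_{k=1}^{K} \frac{1}{Z_k} \sum_{d=1}^{D} SF_{kd}^{2}.
```

The second, expanded form is what enables both the fast evaluation in FGKA and the incremental update in IGKA: the TWCV depends on the clusters only through the counts Z_k and the sums SF_{kd}.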
IGKA maintains a population (a set) of Z coded solutions, where Z is a parameter specified by the user. Each solution, also called a chromosome, is coded as a string a_1...a_N of length N, where each a_n, called an allele, corresponds to one gene expression pattern and takes a value from {1, 2, ..., K} representing the number of the cluster to which the corresponding pattern belongs. For example, a_1a_2a_3a_4a_5 = "33212" encodes a partition of 5 patterns in which patterns X_1 and X_2 belong to cluster 3, patterns X_3 and X_5 belong to cluster 2, and pattern X_4 belongs to cluster 1.
Definition (Legal strings, Illegal strings)
Given a partition S_z = a_1...a_N, let e(S_z) be the number of non-empty clusters in S_z divided by K; e(S_z) is called the legality ratio. We say that the string S_z is legal if e(S_z) = 1, and illegal otherwise.
Hence, an illegal string represents a partition in which some clusters are empty. For example, given K = 3, the string a_{1}a_{2}a_{3}a_{4}a_{5} = "23232" is illegal because cluster 1 is empty.
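As an illustration, the legality ratio can be computed in a few lines of C (a sketch; the function name and the K ≤ 64 bound are assumptions, not part of the authors' code):

```c
#include <string.h>

/* Legality ratio e(S_z): the number of non-empty clusters divided by K.
 * A string is legal iff e(S_z) == 1. Alleles take values 1..K as in the
 * paper; this sketch assumes K is at most 64. */
double legality_ratio(const int *alleles, int n, int k)
{
    int used[64];
    int nonempty = 0;
    memset(used, 0, sizeof used);
    for (int i = 0; i < n; i++) {
        if (!used[alleles[i] - 1]) {
            used[alleles[i] - 1] = 1;  /* first pattern seen in cluster */
            nonempty++;
        }
    }
    return (double)nonempty / (double)k;
}
```

For the paper's example "23232" with K = 3, cluster 1 is empty, so e = 2/3 and the string is illegal; "33212" uses all three clusters and is legal.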
Selection operator
We use so-called proportional selection for the selection operator, in which the population of the next generation is determined by Z independent random experiments. Each experiment randomly selects a solution from the current population (S_1, S_2, ..., S_Z) according to the probability distribution (p_1, p_2, ..., p_Z) defined by p_z = F(S_z) / (F(S_1) + ... + F(S_Z)) for z = 1, ..., Z, where F(S_z) denotes the fitness value of solution S_z with respect to the current population and will be defined in the next paragraph.
Various fitness functions have been defined in the literature [10], in which the fitness value of each solution in the current population reflects its merit to survive into the next generation. In our context, the objective is to minimize the Total Within-Cluster Variation (TWCV). Therefore, solutions with smaller TWCVs should have higher probabilities of survival and should be assigned greater fitness values. In addition, illegal strings are less desirable, should have lower probabilities of survival, and thus should be assigned lower fitness values. We define the fitness value of solution S_z, F(S_z), as
F(S_z) = 1.5 × TWCV_max − TWCV(S_z) if S_z is legal, and F(S_z) = e(S_z) × F_min otherwise, where TWCV_max is the maximum TWCV that has been encountered up to the present generation, and F_min is the smallest fitness value of the legal strings in the current population if any exist; otherwise F_min is defined as 1. The definition of the fitness function in the GKA paper [7] inspired our definition, but we incorporate the idea of permitting illegal strings by defining fitness values for them.
The intuition behind this fitness function is that each solution has a chance to survive by being assigned a positive fitness value, but a solution with a smaller TWCV has a greater fitness value and hence a higher probability of survival. Illegal solutions are also allowed to survive, but with lower fitness values than all legal solutions in the current population; illegal strings with more empty clusters are assigned smaller fitness values and hence have lower probabilities of survival. We still allow illegal solutions to survive, with low probability, because an illegal solution may mutate into a good one and the cost of maintaining illegal solutions is very low.
We assume that the TWCV of each solution S_z (denoted S_z.TWCV) and the maximum TWCV (denoted TWCV_max) have already been calculated before the selection operator is applied.
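The fitness assignment and the proportional selection above can be sketched in C as follows; the function names and the GKA-style 1.5 multiplier are illustrative assumptions, not the authors' exact code:

```c
/* Fitness: legal solutions (legality == 1) get larger values for smaller
 * TWCV; illegal ones are scaled below the smallest legal fitness f_min. */
double fitness(double twcv, double legality, double twcv_max, double f_min)
{
    if (legality >= 1.0)
        return 1.5 * twcv_max - twcv;  /* smaller TWCV => larger fitness */
    return legality * f_min;           /* more empty clusters => smaller */
}

/* Roulette-wheel selection: given fitness values f[0..z_count-1] and a
 * uniform random number r01 in [0,1), return index z with probability
 * f[z] / sum(f). Passing r01 in keeps the function deterministic. */
int select_one(const double *f, int z_count, double r01)
{
    double total = 0.0;
    for (int z = 0; z < z_count; z++)
        total += f[z];
    double threshold = r01 * total;
    double acc = 0.0;
    for (int z = 0; z < z_count; z++) {
        acc += f[z];
        if (threshold < acc)
            return z;
    }
    return z_count - 1;  /* guard against floating-point round-off */
}
```

With fitness values (1, 3, 6), for example, the third solution is selected whenever the random draw falls in the top 60% of the wheel.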
Mutation operator
Given a solution (chromosome) encoded by a_1...a_N, the mutation operator mutates each allele a_n (n = 1, ..., N) to a new value a_n' (which might be equal to a_n) with probability MP, independently for each allele, where 0 < MP < 1 is a user-specified parameter called the mutation probability. The mutation operator is very important in helping the search reach better solutions. From the perspective of evolutionary theory, offspring produced by mutation might be superior to their parents. More importantly, the mutation operator shakes the algorithm out of local optima and moves it towards the global optimum.
Recall that in a solution a_1...a_N, each allele a_n corresponds to a pattern X_n, and its value indicates the number of the cluster to which X_n belongs. During mutation, we replace each allele a_n by a_n' for n = 1, ..., N simultaneously, where a_n' is a number randomly selected from {1, ..., K} with a probability distribution (p_1, p_2, ..., p_K) that assigns higher probability to clusters whose centroids are closer to X_n (see [8] for the precise definition).
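The exact mutation distribution did not survive in this version; the following C sketch uses a GKA-style weighting in which the probability of cluster k decreases with the distance d[k] from the pattern to centroid k. The 1.5 constant is an assumption, and the exact bias terms follow [7, 8]:

```c
/* Distance-biased mutation distribution: given the Euclidean distances
 * d[0..k_count-1] from a pattern to each centroid, fill p[] with a
 * probability distribution over the K clusters. GKA-style weighting
 * (1.5 * d_max - d[k]) is assumed here. */
void mutation_probs(const double *d, int k_count, double *p)
{
    double d_max = 0.0, total = 0.0;
    for (int k = 0; k < k_count; k++)
        if (d[k] > d_max)
            d_max = d[k];
    for (int k = 0; k < k_count; k++) {
        p[k] = 1.5 * d_max - d[k];  /* closer centroid => larger weight */
        total += p[k];
    }
    for (int k = 0; k < k_count; k++)
        p[k] /= total;              /* normalize to a distribution */
}
```

Note that every cluster, including the farthest one, keeps a positive probability, which is what allows mutation to repopulate empty clusters of an illegal string.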
Kmeans operator
In order to speed up convergence, one step of the classical K-means algorithm, which we call the K-means operator (KMO), is introduced. Given a solution encoded by a_1...a_N, we replace a_n by a_n' for n = 1, ..., N simultaneously, where a_n' is the number of the cluster whose centroid is closest to the corresponding pattern X_n in Euclidean distance. More formally, a_n' = argmin_{k ∈ {1, ..., K}} d(X_n, c_k), where c_k denotes the centroid of cluster k and d is the Euclidean distance.
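A minimal sketch of the KMO reassignment step in C (the function name and the flat centroid array layout are illustrative):

```c
/* One K-means step (KMO): return the 1-based number of the cluster whose
 * centroid is nearest to pattern x in squared Euclidean distance.
 * centroids holds K centroids of `dims` features each, row-major. */
int nearest_cluster(const double *x, const double *centroids,
                    int k_count, int dims)
{
    int best = 1;
    double best_d = -1.0;
    for (int k = 0; k < k_count; k++) {
        double dist = 0.0;
        for (int j = 0; j < dims; j++) {
            double diff = x[j] - centroids[k * dims + j];
            dist += diff * diff;  /* squared distance; sqrt unnecessary */
        }
        if (best_d < 0.0 || dist < best_d) {
            best_d = dist;
            best = k + 1;         /* clusters are numbered 1..K */
        }
    }
    return best;
}
```

Comparing squared distances avoids the square root while preserving the argmin, a standard K-means optimization.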
In the following, we first present the FGKA algorithm proposed in [9]. We then describe the motivation for IGKA, based on the idea of incrementally calculating the TWCV and the centroids. Finally, we present a hybrid approach that combines the benefits of FGKA and IGKA.
Fast Genetic KMeans Algorithm (FGKA)
FGKA shares the same flowchart as IGKA, given in Figure 1. It starts by initializing a population P_0 of Z solutions. For each generation P_i, we apply the three operators, selection, mutation and the K-means operator, sequentially, generating the populations P_i', P_i'' and P_{i+1} respectively. This process is repeated for G iterations, each corresponding to one generation of solutions. The best solution observed so far is recorded in S_o before the selection operator is applied, and S_o is returned as the output solution when FGKA terminates.
Incremental Genetic KMeans Algorithm (IGKA)
Similarly, in order to obtain the new TWCV, we can maintain a difference value TWCV^Δ that denotes the difference between the old and the new TWCV of a solution. TWCV^Δ accumulates the differences between the new and old WCV_k of each cluster k whose membership changed; the WCV_k of such a cluster has to be recalculated because its centroid changes. In this way, TWCV can be updated incrementally as well. Since the calculation of TWCV dominates each iteration, our incremental update of TWCV performs better when the mutation probability is small (which implies that only a few alleles change). However, if the mutation probability is large, too many alleles change their cluster membership; the maintenance of the per-cluster difference values Z_k^Δ and the associated cluster sums then becomes expensive, and IGKA becomes inferior to FGKA in performance, as confirmed in our experimental study.
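The incremental bookkeeping described above can be sketched as follows, maintaining per-cluster sizes and feature sums so that moving one pattern touches only the two affected clusters. The struct and function names are hypothetical, not the authors' implementation:

```c
/* Per-cluster statistics sufficient to compute WCV_k via the identity
 * WCV_k = (sum of |X_n|^2 over the cluster) - (1/Z_k) * sum_d SF_kd^2. */
enum { D = 2 };  /* feature dimension, fixed small for this sketch */

typedef struct {
    int    z;      /* Z_k: number of patterns in the cluster      */
    double sf[D];  /* SF_kd: per-dimension sums over the cluster  */
    double sq;     /* sum of squared norms |X_n|^2 in the cluster */
} ClusterStats;

double wcv(const ClusterStats *c)
{
    if (c->z == 0)
        return 0.0;  /* empty cluster contributes nothing */
    double s = 0.0;
    for (int d = 0; d < D; d++)
        s += c->sf[d] * c->sf[d];
    return c->sq - s / c->z;
}

/* Move pattern x from cluster `from` to cluster `to` and return the
 * resulting change in TWCV; only these two WCVs are recomputed. */
double move_pattern(ClusterStats *from, ClusterStats *to, const double *x)
{
    double before = wcv(from) + wcv(to);
    double xsq = 0.0;
    for (int d = 0; d < D; d++)
        xsq += x[d] * x[d];
    from->z--;  to->z++;
    from->sq -= xsq;  to->sq += xsq;
    for (int d = 0; d < D; d++) {
        from->sf[d] -= x[d];
        to->sf[d]   += x[d];
    }
    return wcv(from) + wcv(to) - before;  /* contribution to TWCV^delta */
}
```

Summing the returned deltas over all mutated alleles yields TWCV^Δ, so the cost of the update scales with the number of changed alleles rather than with N.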
Hybrid Genetic KMeans Algorithm (HGKA)
The above discussion presents a dilemma: neither FGKA nor IGKA dominates the other. When the mutation probability is smaller than some threshold, IGKA outperforms FGKA; otherwise, FGKA outperforms IGKA.
The key idea of HGKA is to combine the benefits of FGKA and IGKA. However, it is very difficult to derive the threshold value, which is dataset-dependent. In addition, the running times of the iterations vary as the solutions converge to the optimum. We therefore propose the following scheme: we periodically run one iteration of FGKA followed by one iteration of IGKA while monitoring their running times, and then run the winning algorithm for the subsequent iterations until the next competition point is reached.
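The competition scheme can be sketched as follows; the function names and the idea of a fixed competition period are illustrative assumptions, since the text leaves the exact scheduling open:

```c
/* HGKA scheduling policy sketch: at each competition point, time one
 * iteration of FGKA and one of IGKA, then keep the faster algorithm
 * until the next competition point. Timings are supplied by the caller
 * (e.g. measured with clock()). */
typedef enum { USE_FGKA, USE_IGKA } Winner;

/* Pick the algorithm whose trial iteration ran faster. */
Winner compete(double fgka_seconds, double igka_seconds)
{
    return (igka_seconds < fgka_seconds) ? USE_IGKA : USE_FGKA;
}

/* Re-run the competition every `period` iterations, because the
 * relative speed of the two algorithms shifts as solutions converge. */
int is_competition_point(int iteration, int period)
{
    return iteration % period == 0;
}
```

Because the winner is re-evaluated periodically, HGKA adapts automatically when the effective cost balance between the incremental and from-scratch updates changes during the run.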
Comparison of different algorithms on performance, convergence and stability. Five approaches are compared on time performance, convergence and stability. The K-means algorithm has better time performance than any of the genetic algorithms, but it suffers from convergence to local optima and dependence on initialization. Among the four genetic clustering approaches, the hybrid approach always has good time performance, while FGKA performs well when the mutation probability is large and IGKA performs well when it is small. IGKA and FGKA outperform GKA. The four genetic algorithms have similar convergence behavior, and all four are independent of the initialization.
                  K-means   GKA     FGKA                                         IGKA                                         Hybrid
Time performance  Fastest   Slow    Good when the mutation probability is large  Good when the mutation probability is small  Good
Convergence       Worse     Good    Good                                         Good                                         Good
Stability         Unstable  Stable  Stable                                       Stable                                       Stable
Availability and requirements
The IGKA algorithm is available at http://database.cs.wayne.edu/proj/FGKA/index.htm. The source code and database schema are freely distributed to academic users upon request to the authors.
List of abbreviations
WCV: Within-Cluster Variation
TWCV: Total Within-Cluster Variation
IGKA: Incremental Genetic K-means Algorithm
FGKA: Fast Genetic K-means Algorithm
HGKA: Hybrid Genetic K-means Algorithm
ORF: Open Reading Frame
Declarations
Acknowledgements
We thank Mr. Jun Chen for helping us in dividing the gene function categories. The project described was supported by NIH grant P20 RR16475 from the BRIN Program of the National Center for Research Resources.
References
Shamir R, Sharan R: Approaches to clustering gene expression data. In Current Topics in Computational Biology. Edited by Jiang T, Smith T, Xu Y, Zhang MQ. MIT Press; 2001.
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22:281-285. doi:10.1038/10343
Bhuyan JN, Raghavan VV, Elayavalli VK: Genetic algorithm for clustering with an ordered representation. San Mateo, CA, USA; 1991.
Hall LO, Ozyurt IB, Bezdek JC: Clustering with a genetically optimized approach. IEEE Trans on Evolutionary Computation 1999, 3:103-112. doi:10.1109/4235.771164
Maulik U, Bandyopadhyay S: Genetic algorithm-based clustering technique. Pattern Recognition 2000, 1455-1465. doi:10.1016/S0031-3203(99)00137-5
Jones D, Beltramo M: Partitioning problems with genetic algorithms. San Mateo, CA, USA; 1991.
Krishna K, Murty M: Genetic K-means algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 1999, 29:433-439. doi:10.1109/3477.764879
Lu Y, Lu S, Fotouhi F, Deng Y, Brown S: FGKA: a fast genetic K-means algorithm. March 2004.
Lu Y, Lu S, Fotouhi F, Deng Y, Brown S: Fast genetic K-means algorithm and its application in gene expression data analysis. Detroit: Wayne State University; 2003.
Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM, Staudt LM, Hudson JJ, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO: The transcriptional program in the response of human fibroblasts to serum. Science 1999, 283:83-87. doi:10.1126/science.283.5398.83
de Hoon MJ, Imoto S, Nolan J, Miyano S: Open source clustering software. Bioinformatics 2004, 20:1453-1454. doi:10.1093/bioinformatics/bth078
Mewes HW, Frishman D, Gruber C, Geier B, Haase D, Kaps A, Lemcke K, Mannhaupt G, Pfeiffer F, Schuller C, Stocker S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2000, 28:37-40. doi:10.1093/nar/28.1.37
Goldberg D: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley; 1989.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.