 Methodology article
 Open access
Incremental genetic K-means algorithm and its application in gene expression data analysis
BMC Bioinformatics volume 5, Article number: 172 (2004)
Abstract
Background
In recent years, clustering algorithms have been effectively applied in molecular biology for gene expression data analysis. With the help of clustering algorithms such as K-means, hierarchical clustering, and SOM, genes are partitioned into groups based on the similarity between their expression profiles. In this way, functionally related genes are identified. As the amount of laboratory data in molecular biology grows exponentially each year due to advanced technologies such as microarrays, new efficient and effective clustering methods must be developed to process this growing amount of biological data.
Results
In this paper, we propose a new clustering algorithm, the Incremental Genetic K-means Algorithm (IGKA). IGKA is an extension of our previously proposed clustering algorithm, the Fast Genetic K-means Algorithm (FGKA). IGKA outperforms FGKA when the mutation probability is small. The main idea of IGKA is to calculate the objective value, the Total Within-Cluster Variation (TWCV), and the cluster centroids incrementally whenever the mutation probability is small. IGKA inherits the salient feature of FGKA of always converging to the global optimum. A C implementation is freely available at http://database.cs.wayne.edu/proj/FGKA/index.htm.
Conclusions
Our experiments indicate that, while IGKA has a convergence pattern similar to that of FGKA, it runs faster once the mutation probability falls below a dataset-dependent threshold. Finally, we used IGKA to cluster a yeast dataset and found that it increased the enrichment of genes of similar function within a cluster.
Background
In recent years, clustering algorithms have been effectively applied in molecular biology for gene expression data analysis (see [1] for an excellent survey). With the advancement in Microarray technology, it is now possible to observe the expression levels of thousands of genes simultaneously when the cells experience specific conditions or undergo specific processes. Clustering algorithms are used to partition genes into groups based on the similarity between their expression profiles. In this way, functionally related genes are identified. As the amount of laboratory data in molecular biology grows exponentially each year due to advanced technologies such as Microarray, new efficient and effective methods for clustering must be developed to process this growing amount of biological data.
Among the various clustering algorithms, K-means [2] is one of the most popular methods used in gene expression data analysis due to its high computational performance. However, it is well known that K-means might converge to a local optimum, and its result depends on the initialization process, which randomly generates the initial clustering. In other words, different runs of K-means on the same input data might produce different solutions.
A number of researchers have proposed genetic algorithms [3–6] for clustering. The basic idea is to simulate the evolution process of nature and evolve solutions from one generation to the next. In contrast to K-means, which might converge to a local optimum, these genetic algorithms are insensitive to the initialization process and always converge to the global optimum eventually. However, these algorithms are usually computationally expensive, which impedes their wide application in practice, for example in gene expression data analysis.
Recently, Krishna and Murty proposed a new clustering method called the Genetic K-means Algorithm (GKA) [7], which hybridizes a genetic algorithm with the K-means algorithm. This hybrid approach combines the robust nature of the genetic algorithm with the high performance of the K-means algorithm. As a result, GKA will always converge to the global optimum faster than other genetic algorithms.
In [8], we proposed a faster version of GKA, FGKA, which features several improvements over GKA, including an efficient evaluation of the objective value TWCV (Total Within-Cluster Variation), the avoidance of illegal string elimination overhead, and a simplification of the mutation operator. As a result of these improvements, FGKA runs 20 times faster than GKA [9]. In this paper, we propose an extension to FGKA, the Incremental Genetic K-means Algorithm (IGKA), which inherits all the advantages of FGKA, including convergence to the global optimum, and outperforms FGKA when the mutation probability is small. The main idea of IGKA is to calculate the objective value TWCV and the cluster centroids incrementally. We then propose a Hybrid Genetic K-means Algorithm (HGKA) that combines the benefits of FGKA and IGKA. We show that clustering microarray data with IGKA tends to group genes of the same functional category into the same cluster.
Results
Our experiments were conducted on a Dell PowerEdge 400SC PC with a 2.24 GHz CPU and 512 MB of RAM. Three algorithms, FGKA, IGKA and HGKA, were implemented in C. GKA has a convergence pattern similar to FGKA and IGKA, but its time performance is worse than that of FGKA; see [9] for more details. In the following, we compare the time performance of FGKA and IGKA across different mutation probabilities, and then compare the convergence of four algorithms: IGKA, FGKA, K-means and SOM (Self-Organizing Map). Finally, we examine how IGKA and FGKA can be combined to obtain better performance.
Data sets
The two data sets used in our experiments are the serum data, fig2data, introduced in [11], and the yeast data, chodata, introduced in [2]. The fig2data data set contains expression data for 517 genes. Each gene has 19 expression measurements, taken at time points ranging from 15 minutes to 24 hours. In other words, the number of features D is 19. According to [11], the 517 genes can be divided into 10 groups. The chodata data set is a yeast data set composed of expression data for 2907 genes, with 15 measurements per gene spanning 0 to 160 minutes, so the number of features D is 15. According to the description in [2], the genes can be divided into 30 groups. Since IGKA is a stochastic algorithm, the results of each experiment in this study are averaged over 10 independent runs of the program. The mutation probability, the number of generations and the population size all affect the performance and convergence of FGKA and IGKA. A detailed discussion of the parameter settings can be found in [8]. In this paper, we simply adopt the settings from [8]: the population size is set to 50 and the number of generations is set to 100. These settings are conservative enough for the algorithms to converge to the optimum.
Comparison of IGKA with FGKA on time performance
As indicated in the Methods section, the mutation probability has a great impact on IGKA. We examine its impact on running time in this section and on convergence in the next. Figure 2 shows the time performance of the two algorithms. As the mutation probability increases, the running time increases for both algorithms. However, when the mutation probability is smaller than some threshold (0.005 for fig2data and 0.0005 for chodata), IGKA performs better. Figure 2 also indicates that the threshold varies from one data set to another: for IGKA to outperform FGKA, the mutation probability needs to be as small as 0.0005 on the larger data set chodata, whereas 0.005 suffices on the smaller data set fig2data. In general, the threshold depends on the number of patterns and the number of features in the data set. The performance gain of IGKA depends mainly on how many patterns change their cluster memberships, and in a large data set even a small mutation probability may cause many patterns to change membership.
Comparison of IGKA with FGKA, K-means and SOM on convergence
Figures 3(A) and 3(B) show the convergence of IGKA versus FGKA across different mutation probabilities on fig2data and chodata, respectively. The two algorithms have similar convergence results. Within the range shown in Figure 3, changing the mutation probability has little impact on the convergence of either algorithm, except when the mutation probability is too large. This makes it possible to choose IGKA for its better running time without losing the convergence benefit.
We also compare IGKA with FGKA, K-means and SOM on TWCV convergence. We treat each algorithm as a black box. The two data sets, fig2data and chodata, are fed into the algorithms, and the clustering results are exported as text files. We then use an in-house program to calculate the TWCV of each result. The experiments on K-means and SOM were conducted with open source software [12]. As Table 2 shows, IGKA and FGKA have nearly identical convergence results, which are much better than those of K-means. The TWCV convergence of SOM is much worse than that of the others, although all four algorithms use Euclidean distance as their measure. We do not include another popular clustering algorithm, hierarchical clustering, because it is hard to define the boundaries among its nested clusters, which means we cannot simply fix the number of clusters before running the program.
Combination of IGKA with FGKA
Figure 4 compares the three algorithms, IGKA, FGKA and HGKA, based on their running times over 100 iterations. The mutation probability is set to 0.0001 for all three algorithms. Clearly, the running time per iteration of FGKA is much more stable than that of the others. The running time of IGKA is much higher than that of FGKA at the beginning, because a large number of patterns change their cluster membership during the K-means operator, which costs IGKA a lot of computation time. However, the running time per iteration of IGKA decreases sharply in later iterations. HGKA combines the advantages of the two algorithms. The turning point at which HGKA switches from FGKA to IGKA as its workhorse is highly data dependent. In this particular case, we check the computation time every 15 iterations. The results show that HGKA can substantially improve performance when the mutation probability is small.
Discussion
The clustering results for chodata obtained with IGKA were evaluated according to the gene classification scheme of the MIPS Yeast Genome Database [13]. We found that genes of similar function were grouped into the same cluster. Table 3 shows 8 main clusters covering 16 functional categories of genes. The results are comparable to those of [2]. The absolute number of ORFs in a functional category within a cluster is not always higher than in Tavazoie's result, but the percentage of a cluster's ORFs that fall within the functional category is higher than Tavazoie's in most cases. For example, they found 40 genes in the functional category of nuclear organization in their cluster 2, which contains 186 ORFs, a percentage of 21.5%. We found 50 genes of the same functional category in our cluster 16, which contains only 133 ORFs, a percentage of 37.6%, which is considerably higher than 21.5%.
Most interestingly, we found a remarkable enrichment of ORFs for the functional category of organization of mitochondria. They are mainly located in two clusters: cluster 3 and cluster 18. Cluster 3 has 156 ORFs in total, of which 111 belong to the category, resulting in a very high percentage, 71.2%. Cluster 18 has 184 ORFs in total, of which 105 belong to the category, a percentage of 57.1%. The percentage of ORFs within the same functional category is only 18.8% in the previous paper. It appears that IGKA tends to increase the degree of enrichment of genes within functional categories and to produce clusters that make more biological sense. We also found a new functional category, lipid and fatty isoprenoid metabolism, distributed in cluster 25, which was not listed in Tavazoie's paper.
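The enrichment percentages quoted in this discussion are simple ratios of the ORF counts given in the text. As a quick arithmetic check, the following minimal Python sketch recomputes them; the function name `enrichment_percent` is hypothetical and is not part of the authors' software.

```python
def enrichment_percent(orfs_in_category, orfs_in_cluster):
    """Percentage of a cluster's ORFs that fall in one functional category."""
    return round(100.0 * orfs_in_category / orfs_in_cluster, 1)


# (ORFs in category, ORFs in cluster) pairs taken from the text.
counts = {
    "nuclear organization, cluster 2 of [2]": (40, 186),
    "nuclear organization, IGKA cluster 16": (50, 133),
    "organization of mitochondria, IGKA cluster 3": (111, 156),
    "organization of mitochondria, IGKA cluster 18": (105, 184),
}

for name, (in_category, in_cluster) in counts.items():
    print(name, enrichment_percent(in_category, in_cluster))
```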
Conclusions
In this paper, we propose a new clustering algorithm called the Incremental Genetic K-means Algorithm (IGKA). IGKA is an extension of FGKA, which in turn was inspired by the Genetic K-means Algorithm (GKA) proposed by Krishna and Murty. IGKA inherits the advantages of FGKA, and it outperforms FGKA when the mutation probability is small. Since each of FGKA and IGKA can outperform the other, depending on the mutation probability, a hybrid approach that combines the benefits of both is very desirable. Our experimental results show that the algorithm not only improves running time but also yields clustering results on gene expression data that lead to interesting biological findings.
Methods
The problem of clustering gene expression data involves N genes and their corresponding N patterns. Each pattern is a vector of D dimensions recording the expression levels of a gene under each of the D monitored conditions or at each of the D time points. The goal of the IGKA algorithm is to partition the N patterns into K user-defined groups such that the partition minimizes the Total Within-Cluster Variation (TWCV, also called square-error in the literature), which is defined as follows.
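Let $X_1, X_2, \ldots, X_N$ be the N patterns, and let $X_{nd}$ denote the dth feature of pattern $X_n$ ($n = 1, \ldots, N$). Each partition is represented by a string, a sequence of numbers $a_1 \ldots a_N$, where $a_n$ is the number of the cluster that pattern $X_n$ belongs to in this partition. Let $G_k$ denote the kth cluster and $Z_k$ denote the number of patterns in $G_k$. The centroid $c_k = (c_{k1}, c_{k2}, \ldots, c_{kD})$ of cluster $G_k$ is defined as $c_{kd} = SF_{kd} / Z_k$ ($d = 1, 2, \ldots, D$), where $SF_{kd} = \sum_{X_n \in G_k} X_{nd}$ is the sum of the dth features of all the patterns in $G_k$, and we use $SF_k = (SF_{k1}, SF_{k2}, \ldots, SF_{kD})$ to denote the vector of sums of all patterns in cluster $G_k$. The Within-Cluster Variation of cluster $G_k$ is $WCV_k = \sum_{X_n \in G_k} \sum_{d=1}^{D} (X_{nd} - c_{kd})^2$, and the Total Within-Cluster Variation is $TWCV = \sum_{k=1}^{K} WCV_k$.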
IGKA maintains a population (set) of Z coded solutions, where Z is a parameter specified by the user. Each solution, also called a chromosome, is coded by a string $a_1 \ldots a_N$ of length N, where each $a_n$, called an allele, corresponds to a gene expression data pattern $X_n$ and takes a value from {1, 2, ..., K} representing the number of the cluster to which the pattern belongs. For example, $a_1 a_2 a_3 a_4 a_5$ = "33212" encodes a partition of 5 patterns in which patterns $X_1$ and $X_2$ belong to cluster 3, patterns $X_3$ and $X_5$ belong to cluster 2, and pattern $X_4$ belongs to cluster 1.
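To make the encoding concrete, the following minimal Python sketch computes the cluster sizes, feature sums, centroids and the total within-cluster variation (the usual square-error) for a string such as "33212". The function names and the toy two-dimensional patterns are illustrative assumptions; the authors' implementation is in C and is not reproduced here.

```python
def cluster_stats(patterns, labels, k):
    """Return (sizes, sums, centroids) for clusters numbered 1..k."""
    d = len(patterns[0])
    sizes = [0] * k                        # Z_k: number of patterns per cluster
    sums = [[0.0] * d for _ in range(k)]   # SF_k: per-feature sums per cluster
    for x, a in zip(patterns, labels):
        sizes[a - 1] += 1
        for j in range(d):
            sums[a - 1][j] += x[j]
    # An empty cluster has no centroid; it is represented by None here.
    centroids = [
        [s / sizes[c] for s in sums[c]] if sizes[c] else None
        for c in range(k)
    ]
    return sizes, sums, centroids


def twcv(patterns, labels, k):
    """Sum of squared distances of every pattern to its cluster centroid."""
    _, _, centroids = cluster_stats(patterns, labels, k)
    total = 0.0
    for x, a in zip(patterns, labels):
        c = centroids[a - 1]
        total += sum((xj - cj) ** 2 for xj, cj in zip(x, c))
    return total


# Toy data (hypothetical): five 2-D patterns encoded by the string "33212".
patterns = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [9.0, 9.0], [5.0, 7.0]]
labels = [3, 3, 2, 1, 2]
print(cluster_stats(patterns, labels, 3)[0], twcv(patterns, labels, 3))
```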
Definition (Legal strings, Illegal strings)
Given a partition $S_z = a_1 \ldots a_N$, let $e(S_z)$ be the number of nonempty clusters in $S_z$ divided by K; $e(S_z)$ is called the legality ratio. We say string $S_z$ is legal if $e(S_z) = 1$, and illegal otherwise.
Hence, an illegal string represents a partition in which some clusters are empty. For example, given K = 3, the string $a_1 a_2 a_3 a_4 a_5$ = "23232" is illegal because cluster 1 is empty.
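The legality ratio is straightforward to compute from a string. The short Python sketch below is one way to express the definition; the function names are hypothetical.

```python
def legality_ratio(labels, k):
    """Number of nonempty clusters divided by K, for cluster numbers 1..k."""
    nonempty = {a for a in labels if 1 <= a <= k}
    return len(nonempty) / k


def is_legal(labels, k):
    """A string is legal when every one of the K clusters is nonempty."""
    return legality_ratio(labels, k) == 1.0


print(is_legal([2, 3, 2, 3, 2], 3))  # the illegal example "23232"
print(is_legal([3, 3, 2, 1, 2], 3))  # the legal example "33212"
```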
Figure 1 gives the flowchart of IGKA. It starts with the initialization phase, which generates the initial population $P_0$. The population in the next generation, $P_{i+1}$, is obtained by applying genetic operators to the current population $P_i$. The evolution continues until a terminating condition is reached. The following genetic operators are used in IGKA: the selection, the mutation and the K-means operator.
Selection operator
We use the so-called proportional selection for the selection operator, in which the population of the next generation is determined by Z independent random experiments. Each experiment randomly selects a solution from the current population $(S_1, S_2, \ldots, S_Z)$ according to the probability distribution $(p_1, p_2, \ldots, p_Z)$ defined by $p_z = F(S_z) / \sum_{z'=1}^{Z} F(S_{z'})$ ($z = 1, \ldots, Z$), where $F(S_z)$ denotes the fitness value of solution $S_z$ with respect to the current population and is defined in the next paragraph.
Various fitness functions have been defined in the literature [10], in which the fitness value of each solution in the current population reflects its merit to survive in the next generation. In our context, the objective is to minimize the Total Within-Cluster Variation (TWCV). Therefore, solutions with smaller TWCVs should have higher probabilities of survival and should be assigned greater fitness values. In addition, illegal strings are less desirable and should have lower probabilities of survival, and thus should be assigned lower fitness values. We define the fitness value of solution $S_z$, $F(S_z)$, as

$$F(S_z) = \begin{cases} 1.5 \times TWCV_{max} - S_z.TWCV, & \text{if } S_z \text{ is legal} \\ e(S_z) \times F_{min}, & \text{otherwise} \end{cases}$$

where $TWCV_{max}$ is the maximum TWCV that has been encountered up to the present generation, and $F_{min}$ is the smallest fitness value of the legal strings in the current population if any exist; otherwise $F_{min}$ is defined as 1. The definition of the fitness function in the GKA paper [7] inspired our definition, but we incorporate the idea of permitting illegal strings by defining fitness values for them.
The intuition behind this fitness function is that each solution is assigned a positive fitness value and therefore has some probability of surviving, but a solution with a smaller TWCV has a greater fitness value and hence a higher probability of surviving. Illegal solutions are allowed to survive too, but with lower fitness values than all legal solutions in the current population. Illegal strings that have more empty clusters are assigned smaller fitness values and hence have lower probabilities of survival. We still allow illegal solutions to survive with low probability because an illegal solution may mutate into a good solution and the cost of maintaining it is very low.
We assume that the TWCV of each solution $S_z$ (denoted by $S_z.TWCV$) and the maximum TWCV (denoted by $TWCV_{max}$) have already been calculated before the selection operator is applied.
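The selection step can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' C code: legal strings receive a fitness that decreases with TWCV, illegal strings receive the legality ratio times the smallest legal fitness, and the next population is drawn by Z independent weighted picks. The function names are hypothetical, and the scaling constant `c` (default 1.5) is exposed as a parameter and should be treated as an assumption of this sketch.

```python
import random


def fitness_values(twcvs, ratios, twcv_max, c=1.5):
    """Fitness per solution, given its TWCV and legality ratio e(S_z)."""
    legal = [c * twcv_max - t for t, e in zip(twcvs, ratios) if e == 1.0]
    f_min = min(legal) if legal else 1.0  # F_min is 1 when no legal string exists
    return [
        c * twcv_max - t if e == 1.0 else e * f_min
        for t, e in zip(twcvs, ratios)
    ]


def proportional_selection(population, fitness, rng):
    """Z independent draws, each picking S_z with probability F(S_z) / sum F."""
    return rng.choices(population, weights=fitness, k=len(population))


# Toy example: two legal solutions and one illegal one (one of 3 clusters empty).
population = ["S1", "S2", "S3"]
fit = fitness_values([10.0, 20.0, 5.0], [1.0, 1.0, 2 / 3], twcv_max=20.0)
print(fit, proportional_selection(population, fit, random.Random(0)))
```

Because every fitness value is positive as long as `c` is greater than 1, every solution, including an illegal one, keeps a nonzero chance of being selected.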
Mutation operator
Given a solution (chromosome) encoded by $a_1 \ldots a_N$, the mutation operator mutates each allele $a_n$ ($n = 1, \ldots, N$) to a new value $a_n'$ ($a_n'$ might be equal to $a_n$) with probability MP, independently of the other alleles, where 0 < MP < 1 is a user-specified parameter called the mutation probability. The mutation operator is very important in helping to reach better solutions. From the perspective of evolutionary theory, offspring produced by mutations might be superior to their parents. More importantly, the mutation operator shakes the algorithm out of a local optimum and moves it towards the global optimum.
Recall that in solution $a_1 \ldots a_N$, each allele $a_n$ corresponds to a pattern $X_n$ and its value indicates the number of the cluster to which $X_n$ belongs. During mutation, we replace allele $a_n$ by $a_n'$ for $n = 1, \ldots, N$ simultaneously, where $a_n'$ is a number randomly selected from {1, ..., K} with the probability distribution $(p_1, p_2, \ldots, p_K)$ defined by:

$$p_k = \frac{1.5 \times d_{max}(X_n) - d(X_n, c_k) + 0.5}{\sum_{k'=1}^{K} \left(1.5 \times d_{max}(X_n) - d(X_n, c_{k'}) + 0.5\right)}$$

where $d(X_n, c_k)$ is the Euclidean distance between pattern $X_n$ and the centroid $c_k$ of the kth cluster, and $d_{max}(X_n) = \max_{k} d(X_n, c_k)$. If the kth cluster is empty, then $d(X_n, c_k)$ is defined as 0. The bias 0.5 is introduced to avoid a divide-by-zero error in the case that all patterns are equal and are assigned to the same cluster in the given solution. Our definition of the mutation operator is similar to the one defined in the GKA paper [7]. However, we account for illegal strings, which are not allowed in the GKA algorithm.
The above mutation operator is defined such that (1) $X_n$ might be reassigned randomly to each cluster with a positive probability; (2) the probability of changing allele value $a_n$ to a cluster number k is greater if $X_n$ is closer to the centroid of the kth cluster $G_k$; and (3) empty clusters are viewed as the closest clusters to $X_n$. The first property ensures that an arbitrary solution, including the global optimum, might be generated by mutation from the current solution with a positive probability; the second property encourages each $X_n$ to move towards a closer cluster with a higher probability; the third property raises the probability of converting an illegal solution into a legal one. These properties are essential to guarantee that IGKA will eventually converge to the global optimum quickly.
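The three properties above can be illustrated with a small Python sketch of a distance-biased mutation. It is a sketch under stated assumptions, not the authors' C implementation: empty clusters are passed as `None` and treated as being at distance 0, the weight of cluster k is `c * d_max - d_k + bias`, and the values `c = 1.5` and `bias = 0.5` are parameters of the sketch. All function names are hypothetical.

```python
import math
import random


def mutation_distribution(x, centroids, c=1.5, bias=0.5):
    """Probability of moving pattern x to each cluster 1..K."""
    # An empty cluster (None) counts as distance 0, i.e. as the closest cluster.
    dists = [0.0 if ck is None else math.dist(x, ck) for ck in centroids]
    d_max = max(dists)
    weights = [c * d_max - d + bias for d in dists]  # always positive
    total = sum(weights)
    return [w / total for w in weights]


def mutate(patterns, labels, centroids, mp, rng):
    """Mutate each allele independently with probability mp."""
    new_labels = list(labels)
    clusters = range(1, len(centroids) + 1)
    for n, x in enumerate(patterns):
        if rng.random() < mp:
            probs = mutation_distribution(x, centroids)
            new_labels[n] = rng.choices(clusters, weights=probs)[0]
    return new_labels


# Toy example: cluster 3 is empty, so it is as attractive as the nearest cluster.
centroids = [[3.0, 4.0], [0.0, 0.0], None]
print(mutation_distribution([0.0, 0.0], centroids))
```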
K-means operator
In order to speed up the convergence process, one step of the classical K-means algorithm, which we call the K-means operator (KMO), is introduced. Given a solution encoded by $a_1 \ldots a_N$, we replace $a_n$ by $a_n'$ for $n = 1, \ldots, N$ simultaneously, where $a_n'$ is the number of the cluster whose centroid is closest to $X_n$ in Euclidean distance. More formally,

$$a_n' = \arg\min_{k \in \{1, \ldots, K\}} d(X_n, c_k)$$

To accommodate illegal strings, we define $d(X_n, c_k) = +\infty$ if the kth cluster is empty. This definition differs from the one used in the mutation operator, in which we defined $d(X_n, c_k) = 0$ if the kth cluster is empty. The motivation for this new definition is that we want to avoid reassigning all patterns to empty clusters. Therefore, an illegal string will remain illegal after the application of KMO.
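A minimal Python sketch of the K-means operator follows. It assumes that centroids have already been computed for the current string and that an empty cluster is passed as `None`; the function name is hypothetical.

```python
import math


def kmeans_operator(patterns, centroids):
    """Reassign every pattern to the cluster with the nearest centroid."""
    new_labels = []
    for x in patterns:
        # An empty cluster is infinitely far away, so it is never chosen.
        dists = [math.inf if ck is None else math.dist(x, ck) for ck in centroids]
        new_labels.append(dists.index(min(dists)) + 1)  # clusters are 1-based
    return new_labels


# Toy example: cluster 2 is empty and stays empty after the operator.
patterns = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
centroids = [[0.5, 0.0], None, [9.0, 9.0]]
print(kmeans_operator(patterns, centroids))
```

Note that all patterns are reassigned against the same set of centroids, which matches the "simultaneously" in the description: centroids are not updated until the whole string has been rewritten.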
In the following, we first present the FGKA algorithm proposed in [9]. We then describe the motivation for IGKA based on the idea of incremental calculation of TWCV and centroids. Finally, we present a hybrid approach that combines the benefits of FGKA and IGKA.
Fast Genetic K-Means Algorithm (FGKA)
FGKA shares the flowchart of IGKA given in Figure 1. It starts with the initialization of population $P_0$ with Z solutions. For each generation $P_i$, we apply the three operators, selection, mutation and the K-means operator, sequentially, which generate two intermediate populations and then $P_{i+1}$, respectively. This process is repeated for G iterations, each of which corresponds to one generation of solutions. The best solution so far is observed and recorded in $S_o$ before the selection operator. $S_o$ is returned as the output solution when FGKA terminates.
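The generation loop described above can be sketched as a small Python driver that takes the three operators as plain functions. This is only an outline under assumptions: the operators, the TWCV function and the legality test are supplied by the caller, only legal strings are considered as candidates for $S_o$, and the final population is inspected once more after the last generation. These choices, and all names, are assumptions of the sketch rather than details confirmed by the text.

```python
def run_generations(population, generations, select, mutate, kmo, twcv, is_legal):
    """Apply selection, mutation and the K-means operator for G generations."""
    best = None  # S_o: best solution observed so far

    def record_best(pop):
        nonlocal best
        for s in pop:
            if is_legal(s) and (best is None or twcv(s) < twcv(best)):
                best = list(s)

    for _ in range(generations):
        record_best(population)                       # observe S_o before selection
        population = select(population)               # first intermediate population
        population = [mutate(s) for s in population]  # second intermediate population
        population = [kmo(s) for s in population]     # P_{i+1}
    record_best(population)  # assumption: also check the final population
    return best, population


# Toy run with identity operators and a stand-in objective (sum of labels).
pop = [[1, 2, 2], [2, 1, 1], [1, 1, 1]]
best, final = run_generations(
    pop, 3,
    select=lambda p: list(p),
    mutate=lambda s: list(s),
    kmo=lambda s: list(s),
    twcv=lambda s: float(sum(s)),
    is_legal=lambda s: len(set(s)) == 2,
)
print(best)
```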
Incremental Genetic K-Means Algorithm (IGKA)
Although FGKA outperforms GKA significantly, it suffers from a potential disadvantage. If the mutation probability is small, then the number of allele changes will be small, and the cost of calculating centroids and TWCV from scratch can be much higher than the cost of calculating them incrementally. As a simple example, if a pattern $X_n$ is reassigned from cluster k to cluster k', then only the centroids and WCVs of these two clusters need to be recalculated. Furthermore, the centroids of these two clusters can be calculated incrementally, since the memberships of the other patterns have not changed. The TWCV can be calculated incrementally as well, since the WCVs of the other clusters have not changed. In the following, we describe how TWCV and cluster centroids can be calculated incrementally.
In order to obtain the new centroid $c_k$, we maintain the difference values $Z_k^{\Delta}$ and $SF_k^{\Delta}$ between the old solution and the new solution when alleles change. With these two values, $Z_k$ and $SF_k$ can be updated incrementally as $Z_k = Z_k + Z_k^{\Delta}$ and $SF_k = SF_k + SF_k^{\Delta}$. The new centroid of the new solution is then obtained as $c_k = SF_k / Z_k$.
Similarly, in order to obtain the new TWCV, we maintain a difference value $TWCV^{\Delta}$ that denotes the difference between the old TWCV and the new TWCV of a solution. $TWCV^{\Delta}$ is the sum, over the clusters k whose membership changed, of the difference between the new $WCV_k$ and the old $WCV_k$. However, $WCV_k$ of such a cluster has to be calculated from scratch, since $c_k$ has changed. In this way, TWCV can be updated incrementally as well. Since the calculation of TWCV dominates the cost of each iteration, the incremental update of TWCV performs better when the mutation probability is small (which implies a small number of allele changes). However, if the mutation probability is large, too many alleles change their cluster membership, the maintenance of $Z_k^{\Delta}$ and $SF_k^{\Delta}$ becomes expensive, and IGKA becomes inferior to FGKA in performance, as confirmed in the experimental study.
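The incremental bookkeeping can be illustrated with a small Python sketch for a single reassignment. It keeps the cluster sizes, the feature sums and a cached within-cluster variation per cluster, and when one pattern moves it touches only the two affected clusters. The dictionary layout and function names are assumptions of this sketch, not the data structures of the authors' C code.

```python
def wcv_from_scratch(state, c):
    """Within-cluster variation of cluster c (1-based), recomputed in full."""
    size = state["sizes"][c - 1]
    if size == 0:
        return 0.0
    centroid = [s / size for s in state["sums"][c - 1]]
    return sum(
        sum((xj - cj) ** 2 for xj, cj in zip(x, centroid))
        for x, a in zip(state["patterns"], state["labels"])
        if a == c
    )


def build_state(patterns, labels, k):
    """Compute sizes, feature sums, per-cluster WCV and TWCV from scratch."""
    d = len(patterns[0])
    state = {
        "patterns": patterns,
        "labels": list(labels),
        "sizes": [0] * k,
        "sums": [[0.0] * d for _ in range(k)],
        "wcvs": [0.0] * k,
    }
    for x, a in zip(patterns, labels):
        state["sizes"][a - 1] += 1
        for j, xj in enumerate(x):
            state["sums"][a - 1][j] += xj
    for c in range(1, k + 1):
        state["wcvs"][c - 1] = wcv_from_scratch(state, c)
    state["twcv"] = sum(state["wcvs"])
    return state


def reassign(state, n, new):
    """Move pattern n to cluster `new`, updating only the two affected clusters."""
    old = state["labels"][n]
    if old == new:
        return
    x = state["patterns"][n]
    state["sizes"][old - 1] -= 1   # size deltas of -1 and +1
    state["sizes"][new - 1] += 1
    for j, xj in enumerate(x):     # feature-sum deltas of -x and +x
        state["sums"][old - 1][j] -= xj
        state["sums"][new - 1][j] += xj
    state["labels"][n] = new
    for c in (old, new):           # WCV of a changed cluster is recomputed
        new_wcv = wcv_from_scratch(state, c)
        state["twcv"] += new_wcv - state["wcvs"][c - 1]
        state["wcvs"][c - 1] = new_wcv


patterns = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [9.0, 9.0], [5.0, 7.0]]
state = build_state(patterns, [3, 3, 2, 1, 2], 3)
reassign(state, 4, 1)  # move the fifth pattern from cluster 2 to cluster 1
print(state["twcv"])
```

In this sketch the saving comes from skipping the clusters whose membership did not change; when many alleles change in one generation, almost every cluster is touched and the bookkeeping no longer pays off, which is the trade-off described above.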
Hybrid Genetic K-Means Algorithm (HGKA)
The above discussion presents a dilemma: neither FGKA nor IGKA is always faster. When the mutation probability is smaller than some threshold, IGKA outperforms FGKA; otherwise, FGKA outperforms IGKA.
The key idea of HGKA is to combine the benefits of FGKA and IGKA. However, it is very difficult to derive the threshold value, which is dataset dependent. In addition, the running time of an iteration varies as the solutions converge to the optimum. We propose the following solution: we periodically run one iteration of FGKA followed by one iteration of IGKA while monitoring their running times, and then run the winning algorithm for the following iterations until we reach the next competition point.
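The competition scheme can be sketched as a small scheduler in Python. The sketch assumes that one iteration of each algorithm is available as a zero-argument callable and that both operate on the same shared population; the function name, the `period` parameter (15 mirrors the interval used in the experiments) and the injectable `clock` are assumptions of this illustration.

```python
import time


def hybrid_run(fgka_step, igka_step, generations, period=15, clock=time.perf_counter):
    """Run `generations` iterations, re-timing both algorithms every `period`."""
    schedule = []  # which algorithm ran at each iteration
    winner_name, winner_step = "FGKA", fgka_step
    g = 0
    while g < generations:
        if g % period == 0 and g + 2 <= generations:
            # Competition point: one FGKA iteration, then one IGKA iteration.
            t0 = clock()
            fgka_step()
            t1 = clock()
            igka_step()
            t2 = clock()
            if t2 - t1 < t1 - t0:
                winner_name, winner_step = "IGKA", igka_step
            else:
                winner_name, winner_step = "FGKA", fgka_step
            schedule += ["FGKA", "IGKA"]
            g += 2
        else:
            winner_step()  # run the current winner until the next competition
            schedule.append(winner_name)
            g += 1
    return schedule


# Toy usage with stand-in steps that only burn time.
print(hybrid_run(lambda: time.sleep(0.002), lambda: time.sleep(0.001), 6, period=3))
```

Timing a single iteration of each algorithm is noisy, so a real implementation might average over a few iterations at each competition point; the text itself only describes the one-iteration comparison.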
It has been proved in [8] that FGKA will eventually converge to the global optimum. By using the same flowchart and operators, IGKA and HGKA will also converge to the global optimum. We summarize the comparison of various clustering algorithms in Table 1.
Availability and requirements
The IGKA algorithm is available at http://database.cs.wayne.edu/proj/FGKA/index.htm. The source code and database schema are freely distributed to academic users upon request to the authors.
Abbreviations
WCV: Within-Cluster Variation
TWCV: Total Within-Cluster Variation
IGKA: Incremental Genetic K-means Algorithm
FGKA: Fast Genetic K-means Algorithm
HGKA: Hybrid Genetic K-means Algorithm
ORF: Open Reading Frame
References
[1] Shamir R, Sharan R: approaches to clustering gene expression data. In Current Topics in Computational Biology. Edited by: Jiang T, Smith T, Xu Y, Zhang MQ. MIT Press; 2001.
[2] Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22: 281–285. 10.1038/10343
[3] Bhuyan JN, Raghavan VV, Elayavalli VK: Genetic algorithm for clustering with an ordered representation. San Mateo, CA, USA; 1991.
[4] Hall LO, B. OI, Bezdek JC: Clustering with a genetically optimized approach. IEEE Trans on Evolutionary Computation 1999, 3: 103–112. 10.1109/4235.771164
[5] Maulik U, Bandyopadhyay S: Genetic algorithm based clustering technique. Pattern Recognition 2000, 1455–1465. 10.1016/S0031-3203(99)00137-5
[6] Jones D, Beltramo M: partitioning problems with genetic algorithms. San Mateo, CA, USA; 1991.
[7] Krishna K, Murty M: Genetic K-means algorithm. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics 1999, 29: 433–439. 10.1109/3477.764879
[8] Lu Y, Lu S, Fotouhi F, Deng Y, Brown S: FGKA: A Fast Genetic K-means Algorithm. March 2004.
[9] Lu Y, Lu S, Fotouhi F, Deng Y, Brown S: Fast genetic K-means algorithm and its application in gene expression data analysis. Detroit, Wayne State University; 2003.
[10] Goldberg D: Genetic Algorithms in Search: Optimization and Machine Learning. MA, Addison-Wesley; 1989.
[11] Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM, Staudt LM, Hudson JJ, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO: The transcriptional program in the response of human fibroblasts to serum. Science 1999, 283: 83–87. 10.1126/science.283.5398.83
[12] de Hoon MJ, Imoto S, Nolan J, Miyano S: Open source clustering software. Bioinformatics 2004, 20: 1453–1454. 10.1093/bioinformatics/bth078
[13] Mewes HW, Frishman D, Gruber C, Geier B, Haase D, Kaps A, Lemcke K, Mannhaupt G, Pfeiffer F, Schuller C, Stocker S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2000, 28: 37–40. 10.1093/nar/28.1.37
Acknowledgements
We thank Mr. Jun Chen for helping us in dividing the gene function categories. The project described was supported by NIH grant P20 RR16475 from the BRIN Program of the National Center for Research Resources.
Authors' contributions
YL carried out the study and drafted the manuscript. SL and FF designed the algorithms. YD designed the whole project, participated in analyzing gene functional data and wrote part of the manuscript. SJB corrected the English and helped to interpret the data analysis results.
Lu, Y., Lu, S., Fotouhi, F. et al. Incremental genetic K-means algorithm and its application in gene expression data analysis. BMC Bioinformatics 5, 172 (2004). https://doi.org/10.1186/1471-2105-5-172