 Proceedings
 Open access
 Published:
Biclustering of microarray data with MOSPO based on crowding distance
BMC Bioinformatics volumeÂ 10, ArticleÂ number:Â S9 (2009)
Abstract
Background
Highthroughput microarray technologies have generated and accumulated massive amounts of gene expression datasets that contain expression levels of thousands of genes under hundreds of different experimental conditions. The microarray datasets are usually presented in 2D matrices, where rows represent genes and columns represent experimental conditions. The analysis of such datasets can discover local structures composed by sets of genes that show coherent expression patterns under subsets of experimental conditions. It leads to the development of sophisticated algorithms capable of extracting novel and useful knowledge from a biomedical point of view. In the medical domain, these patterns are useful for understanding various diseases, and aid in more accurate diagnosis, prognosis, treatment planning, as well as drug discovery.
Results
In this work we present the CMOPSOB (Crowding distance based Multiobjective Particle Swarm Optimization Biclustering), a novel clustering approach for microarray datasets to cluster genes and conditions highly related in subportions of the microarray data. The objective of biclustering is to find submatrices, i.e. maximal subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities over a subset of conditions. Since these objectives are mutually conflicting, they become suitable candidates for multiobjective modelling. Our approach CMOPSOB is based on a heuristic search technique, multiobjective particle swarm optimization, which simulates the movements of a flock of birds which aim to find food. In the meantime, the nearest neighbour search strategies based on crowding distance and Ïµdominance can rapidly converge to the Pareto front and guarantee diversity of solutions. We compare the potential of this methodology with other biclustering algorithms by analyzing two common and public datasets of gene expression profiles. In all cases our method can find localized structures related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The mined patterns present a significant biological relevance in terms of related biological processes, components and molecular functions in a speciesindependent manner.
Conclusion
The proposed CMOPSOB algorithm is successfully applied to biclustering of microarray dataset. It achieves a good diversity in the obtained Pareto front, and rapid convergence. Therefore, it is a useful tool to analyze large microarray datasets.
Background
With the advent of DNA microarray technology, it is allowable for simultaneously measuring the expression level of thousands of genes under different conditions in a single experiment. In this way the scientific community can collect a huge amount of gene expression datasets. In recent decade years, microarray technique has been widely used in several contexts such as tumour profiling, drug discovery and temporal analysis of cell behaviour [1, 2]. Applications of these microarray data contain the study of gene expression in yeast under different environmental stress conditions and the comparisons of gene expression profiles for tumour from cancer patients. In addition to the enormous scientific potential of DNA microarrays to help in understanding gene regulation and interactions, microarrays have important applications in pharmaceutical and clinical research [1]. By comparing gene expression in normal and disease sells, microarrays may be used to identify disease genes and targets for therapeutic drugs.
Mining patterns from those microarray dataset is an important research problem in bioinformatics and clinical research. These patterns relate to disease diagnosis, drug discovery, protein network analysis, gene regulate, as well as function prediction.
Clustering techniques have been widely applied in gene expression analysis. It can identify set of genes with similar profiles. However, clustering methods assume that related genes have the similar expression patterns across all conditions, which is not reasonable especially when the dataset contains many heterogeneous conditions. These algorithms such as kmeans [2], hierarchical clustering [3], self organizing maps [4] work in the full dimension space, which consider the value of each point in all the dimensions and try to group the similar points together. However, relevant genes are not necessarily related to every condition, so fail to group subset of genes that have similar expression over some but not all conditions. So biclustering is proposed for grouping simultaneously genes set and condition set over which the gene subset exhibit similar expression patterns. Cheng and Church [5] introduce first biclustering to mine genes clusters with respect to a subset of the conditions from microarray data. Up to date, a number of biclustering algorithms for microarray data analysis have been developed such as Î´biclustering [5], pClustering [6], statisticalalgorithmic method for biclustering analysis (SAMBA) [7], spectral biclustering [8], Gibbs sampling biclustering [9], simulated annealing biclustering [10], etc. See [11] for a good survey.
Among the various clustering approaches, many methods are based on local search to generate suboptimal solutions. In recent years heuristics optimization has become a very popular research topic. To order to escape from local minima, many evolutionary algorithms (EA) have been proposed in [12â€“14] to discover global optimal solutions in gene expression data. These methods apply singleobjective EA to find optimal solutions. If a single objective is optimized, the global optimum solution can be found. But in the realworld optimization problem, there are several objectives in conflict with each other to be optimized. These problems with two or more objective functions are called multiobjective optimal problem and require different mathematical and algorithmic tools to solve it. MOEA generates a set of Paretooptimal solutions [12] which is suitable to optimize two or more conflicting objectives such as NSGAII [13], PAES [14] and SPEA2 [15].
When mining biclusters from microarray data, we must optimize simultaneously several objectives in conflict with each other, for example, the size and the homogeneity of the clusters. In this case multiobjective evolutionary algorithms (MOEAs) are proposed to discover efficiently global optimal solution. Among many MOEA proposed the relaxed forms of Pareto dominance has become a popular mechanism to regulate convergence of an MOEA, to encourage more exploration and to provide more diversity. Among these mechanisms, Ïµdominance has become increasingly popular [16], because of its effectiveness and its sound theoretical foundation. Ïµdominance can control the granularity of the approximation of the Pareto front obtained to accelerate convergence and guarantee optimal distribution of solutions. At present, several algorithms [17, 18] adopt MOEAs to discover biclusters from microarray data.
Recently particle swarm optimization (PSO) proposed by Kebnnedy and Eberhart [19, 20] is a heuristicsbased optimization approach simulating the movements of a bird flock finding food. The most attractive of PSO is that there are very few parameters to adjust. So it has been successfully used for both continuous nonlinear and discrete binary singleobjective optimization. With the rapid convergence and relative simplicity, PSO becomes very suitable to solve multiobjective optimization named as multiobjective PSO (MOPSO). In recent years many multiobjective PSO (MOPSO) approaches such as [21, 22] has proposed. The strategy of Ïµdominance and crowding distance [13] are introduced into MOPSO speeding up the convergence and attaining good diversity of solutions [23â€“28]. There are currently over twenty five different proposals of MOPSOs presented in [29]. We propose the algorithm MOPSOB [30] to mine biclusters.
In this paper, we modify the fully connected flight model and incorporate Ïµdominance strategies and crowding distance into MOPSO, and propose a novel MOPSO biclustering framework to find one or more significant biclusters of maximum size from microarray data. Three objectives, the size, homogeneity and row variance of biclusters, are satisfied simultaneously by applying three fitness function in optimization framework. A low mean squared residue (MSR) score of biclusters denotes that the expression levels of each gene within the biclusters are similar over the range of conditions. Using the row variance as fitness function can guarantee that the found biclusters capture the subset of genes exhibiting fluctuating yet coherent trends under subset of conditions, therefore reject trivial biclusters. Therefore, we focus on finding biclusters of maximum size, with mean squared residue lower than a given Î´, with a relatively high genedimension variance.
Results
To determine whether the proposed methodology is able to mining better biclusters from microarray data, we have used two common gene expression datasets. In the next sections we describe an overview of the methodology and the detailed results of its application to the analysis of two real datasets.
CMOPSOB algorithm
In this paper, we incorporate Ïµdominance, crowding distance and the nearest neighbour search approach into MOPSO framework, and propose CMOPSOB algorithm to mine biclusters from the microarray datasets. In the solution space, after the initialization of the particle swarm, each particle keeps track of its position which is associated with the best solution achieved so far. The personal best solution of a particle is denoted by pbest and the best neighbour of a particle by nbest. The global optimal solution of the particle swarm is the best location obtained so far by any particle in the population and is named as gbest. The proposed algorithm consists of iteratively changing the velocity of each particle toward its pbest, nbest and gbest positions. The external archive (denoted as A) records nondominated set of the particle swarm (PS) that is the final optimal solution set. Our algorithm is given in the following different steps.
Initialization of the algorithm
We implement the search of optimal solutions in a discrete binary space inspired by [20]. The value of a particle on each dimension (e.g. x_{ id }presents the value of particle i on dimension d) is only set to zero or one. We define the velocity of particle as the probability which a binary bit changes. For example each v_{ id }represents the probability of bit x_{ id }being the value 1. Therefore v_{ id }must be assigned to the interval [0, 1]. Personal best position of each particle i found so far is maintains in pbest_{ i }whose value of each dimension d (pbest_{ id }) is integers in {0, 1}. Initialization process first initializes the location and velocity of each particle, and then external archive is initialized. Lastly we initialize global bests (gbest) of each particle.
Step1 Initialize the particle swarm (PS) with size S
The particle swarm is initialized with a population of random solutions
For i = 1 to S
For d = 1 to N (the number of dimension)
Initialized x_{ id }and v_{ id }randomly
Endfor
Evaluate the ith particle x_{ i }
pbest_{ i }= x_{ i }(the personal bests for x_{ i }is initialized to be the original position)
nbest_{ i }= x_{ i }(the best neighbours of x_{ i }is initialized to be the original position)
Endfor
Step2 Initialize external archive and the global bests (gbest) of each particle
Nondominated set of initialized PS is constructed depending on Ïµdominance relation, which is reserved in the external archive (A). Then global bests (gbest) for each particle in the PS is selected randomly from A. Lastly, the crowding distance of each particle in A is computed.
Iterative update operation of the algorithm
Each iteration consists of the following three processes. The first is evaluation of each particle. Secondly, the velocity v_{ id }of each particle is updated based on particle i's best previous position (pbest_{ id }), the best neighbour of particle i and the best previous position of all particle (gbest_{ id }). Lastly each particle flies its new best position, and global bests of each particle and external archive is updated.
Step3 Update velocity and location of each particle
In discrete search space, a particle may fly to nearer and farther position of the hypercube by changing various numbers of bits. Thus update process of the particle is implemented to generate new swarm as the following rule.
Where c_{1}, c_{2} and c_{3} are three constants which are used to bias the influence between pbest, nbest and gbest, and we assume c_{1} = c_{2} = c_{3} = 1. The parameter w is inertia weight, and we set w = 0.5. Two parameters r_{1}, r_{2} and r_{3} are random numbers in the range [0, 1]. The parameter pbest_{ id }, nbest_{ id }and gbest_{ id }are integers in {0, 1}, and v_{ id }(as a probability) must be constrained to the interval [0, 1]. The function S (v) is a logistic transformation and rand () is a quasirandom number selected from a uniform distribution in [0, 1].
Step4 Evaluate and update each particle in PS
Each particle in PS has a new location, if the current location is dominated by its personal best, then the previous location is kept, otherwise, the current location is set as the personal best location. If they are mutually nondominated, we select the location with least crowding distance.
Step5 Compute crowding distance and update external archive
Based on Ïµdominance relation, the nondominated set of PS is constructed and combined into the current external archive and then get a bigger set leader. After computing crowding distance in leaders, a new external archive is got by selecting the S particles with least crowding distance.
Step6 Update global bests of each particle
Update the global bests of each particle in PS that are selected randomly from A which mainly aim at searching in whole space for global optimization solutions.
The algorithm iteratively updates position of the particle until userdefined number of generations are generated and lastly converges to the optimal solution, or else, implements iteration go to step3.
Step7 Return the set of biclusters
The particles in external archive A are the optimal solutions that present the set of biclusters.
Testing
Our algorithm CMOPSOB is implemented on two wellknown datasets, yeast and human Bcells expression datasets. To compare the performance of the proposed algorithm with MOEA [17] and MOPSOB [30], a criteria the coverage is defined as the total number of cells in microarray data matrices covered by the found biclusters.
Yeast dataset
Table 1 shows the information of ten biclusters out of the one hundred biclusters found on the yeast dataset. The one hundred biclusters cover 75.2% of the genes, 100% of the conditions and in total 53.8% cells of the expression matrix, while Ref [17] and Ref [30] report an average coverage of 51.34% and 52.4% cells respectively.
Figure 1 depicts sample gene expression profiles for small biclusters (bicluster 69) for the yeast dataset. It shows that 24 genes present a similar behaviour across 15 experimental conditions.
Human Bcells expression dataset
Table 2 shows the information of ten biclusters out of the one hundred found on the human dataset. The one hundred biclusters found on the human dataset cover 38.3% cells of dataset(48.1%of the genes and 100% of the conditions), whereas an average of 20.96% and 36.9% cells are covered in [17] and [30], respectively.
Comparative analysis
Among the different MOEAs algorithms, NSGA2 are the best multiobject optimization algorithm. Mitra and Banka [17] incorporates NSGA2 with local search strategies to solve biclustering problem denoted as NSGA2B(NSGA2 biclustering).
In this section, we compare the proposed CMOPSOB with two well known MOEA biclustering and MOPSOB [30] on the yeast data and the human Bcells expression data. Parameters are set to same as MOPSOB [30]. We use a crossover probability of 0.75 and a mutation probability of 0.03 for NSGA2B. For MOPSOB and CMOPSOB, we set Îµ = 0.01. Comparative analysis of three algorithms is shown in Table 3.
Table 3 shows that the biclusters found by CMOPSOB is characterized by a slightly lower squared residue and a higher bicluster size than those by NSGA2B and MOPSOB on both yeast dataset and human dataset. However when comparing MOPSOB with NSGA2B, we find the biclusters found by MOPSOB have better quality than those found by NSGA2B.
In total it is clear from the above results that the proposed CMOPSOB algorithm performs best in maintaining diversity, achieving convergence.
Analysis of biological annotation enrichment of gene clusters
The gene ontology (GO) project http://www.geneontology.org provides three structured, controlled vocabularies that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a speciesindependent manner. The enrichment of functional annotations in genes contained in biclusters is evaluated using OntoExpress tool [31]. To determine the biological relevance of the biclusters found by CMOPSOB on the yeast dataset in terms of the statistically significant GO annotation database, we feed genes in each bicluster to OntoExpress http://vortex.cs.wayne.edu/Projects.html and obtain a hierarchy of functional annotations in terms of Gene Ontology for each bicluster. Here only categories with pvalues less than 0.01 were considered statistically significant.
The degree of enrichment is measured by pvalues which use a cumulative hypergeometric distribution to compute the probability of observing the number of genes from a particular GO category (function, process and component) within each bicluster. For example, the probability p for finding at least k genes from a particular category within a bicluster of size n is given in (3).
Where m is the total number of genes within a category and g is the total number of genes within the genome [32]. The pvalues are calculated for each functional category in each bicluster to denote how well those genes match with the corresponding GO category.
Table 4 lists the significant shared GO terms (or parent of GO terms) used to describe the set of genes in each bicluster for the process, function and component ontology. For example for cluster C_{16}, we find that the genes are mainly involved in lipid transport. The tuple (n = 23, p = 0.00013) means that out of 96 genes in cluster C_{16}, 23 genes belong to lipid transport process, and the statistical significance is given by the pvalue of 0.00013. Those results mean that the proposed CMOPSO biclustering approach can find biologically meaningful clusters.
Discussion
Although there is the amount of information in large and diverse databases of genomewide expression profiles, mining of meaningful biological knowledge from those datasets remains enormous challenges. Multiobjective evolutionary biclustering is a global search heuristic approach, and demonstrate better performance as compared to existing various greedy biclustering methods proposed in the literature [17]. But this approach need spend too much computation time in order to achieve better convergence and diversity. PSO is initialized with a population of random solutions which the point is same as genetic algorithm (GA). The difference is, however, each potential solution of PSO named as particle is also assigned a randomized velocity, and then flown to the optimal solution in the solution space. Another important difference is the fact that PSO allows individuals to benefit from their past experiences whereas in an evolutionary algorithm, normally the current population is only retained solution of the individuals. This paper introduces a new global search framework for biclustering based on MOPSO approach. Because PSO method does not use the filtering operation such as crossover and mutation and the whole swarm population maintains constant during the search process. So in addition to attaining better convergence and diversity, our approach proposed here offers great advantages over evolutionary methods of biclustering. Our method can also speed up the process of search. In the future, we will adapt various types of biological methods such as immune system to mining biclusters from microarray datasets. At the same time, we will combine the advantage of those evolutionary computations to propose hybrid biclustering methods for biclustering of microarray dataset.
Conclusion
In this work, we have provided a novel multiobjective PSO framework for mining biclusters from microarray datasets. We focus on finding maximum biclusters with lower mean squared residue and higher row variance. Those three objectives are incorporated into the framework with three fitness functions. We apply MOPSO to quicken convergence of the algorithm, and Ïµdominance and crowding distance update strategy to improve the diversity of the solutions. The results on the yeast microarray dataset and the human Bcells expression dataset verify the good quality of the found biclusters, and comparative analysis show that the proposed CMOPSOB is superior to NSGA2B and MOPSOB in terms of diversity, convergence.
Methods
Biclusters
Given a gene expression data matrix D = G Ã— C = {d_{ ij }} (here i âˆˆ [1, n], j âˆˆ [1, m]) is a realvalued n Ã— m matrix, here G is a set of n genes {g_{ 1 }, g_{ 2 }, â‹¯, g_{ n }}, C a set of m biological conditions {c_{ 1 }, c_{ 2 }, â‹¯, c_{ n }}. Entry d_{ ij }means the expression level of gene g_{ i }under condition c_{ j }.
Definition 1 Bicluster
Given a gene expression dataset D = G Ã— C = {d_{ ij }}, if there is a submatrix B = g Ã— c, where g âŠ‚ G, c âŠ‚ C, to satisfy certain homogeneity and minimal size of the cluster, we say that B is a bicluster.
Definition 2 Maximal bicluster
A bicluster B = g Ã— c is maximal if there exists not any other biclusters B' = g' Ã— c' such that g' âŠ‚ G and c' âŠ‚ C,
Definition 3 Dimension mean
Given a bicluster B = g Ã— c, with subset of genes g âŠ‚ G, subset of conditions c âŠ‚ C, d_{ ij }is the value of gene g_{ i }under condition c_{ j }in the dataset D. We denote by d_{ ic }the mean of the ith gene in B, d_{ gj }the mean of the jth condition in B. We also denote by d_{ gc }the mean of all entries in B. These values are defined as follows, where Size (g, c) = gc presents the size of bicluster B.
Definition 4 Residue and mean square residue
Given a bicluster B = g Ã— c, to assess the difference between the actual value of an element d_{ ij }and its expected value, we define by r(d_{ ij }) the residue of d_{ ij }in bicluster B in (7). Therefore the mean squared residue (MSR) of B is defined as the sum of the squared residues to assess overall quality of a bicluster B in (8).
r(d_{ ij }) = d_{ ij } d_{ ic }d_{ gi }+ d_{ gc }
Definition 5 Row variance
Given a bicluster B = g Ã— c, the ith gene variance in B is defined by RVAR (i, c) and the overall genedimensional variance is defined as the sum of all genes variance as follows.
Our target is mining good quality biclusters of maximum size, with mean square residue (MSR) smaller than a userdefined threshold Î´ > 0, which presents the maximum allowable dissimilarity within the biclusters, and with a greater row variance.
Bicluster encoding
Each bicluster is encoded as a particle of the population. Each particle is represented by a binary string of fixed length n+m, where n and m are the number of genes and conditions of the microarray dataset, respectively. The first n bits are associated to n genes, the following m bits to m conditions. If a bit is set to 1, it means that the responding gene or condition belongs to the encoded bicluster; otherwise it does not. This encoding presents the advantage of having a fixed size, thus using simply of standard variation operations [33]. For example a biclusters consists of 16 bits (the first 8 bits corresponding to 8 genes and the following 8 bits to 8 conditions). Therefore, the following binary string:
1010001010100110
presents the individual encoding a bicluster with 3 genes and 4 conditions, and then its size is 3 Ã— 4 = 12. Where  is a symbol used to delimit the bits relative to the rows from the bits relative to the columns.
Nearest neighbour flight
Most MOPSO algorithms adopt the fully connected flight model to propel the swarm particles towards the Pareto optimal front. Zhang et al. [34] includes the lattice model to escape the local optimal. This paper introduces nearest neighbour flight model. Two particles are nearest neighbours if and only if the binary encodes of two particles are only different in one bit. That is to say, two biclusters is only different in a gene or a condition. During search process, each particle searches and flies to the best position (named as nbest) of its nearest neighbours.
Ïµdominance
In this section we define relevant concepts of dominance and Pareto sets. The algorithms presented in this paper assume that all objectives are to be minimized. Objective vectors are compared according to the dominance relation defined below.
Definition 6 Dominance relation
Let f, g âˆˆ R^{m}. Then f is said to dominate g (denoted as f â‰» g), iff
(i) âˆ€ i âˆˆ {1, ...., m}: f_{i} â‰¤ g_{i}
(ii) âˆƒ j âˆˆ {1, ...., m}: f_{j} < g_{j}
Definition 7 Pareto set
Let F âˆˆ R ^{m}be a set of vectors. Then the Pareto set F* of F is defined as follows:
F* contains all vectors g âˆˆ F which are not dominated by any vector f âˆˆ F, i.e.
F : = {g âˆˆ F  âˆ„ f âˆˆ F : f â‰» g}
Vectors in F* are called Pareto vectors of F. The set of all Pareto sets of F is denoted as P*(F).
Definition 8 Ïµdominance
Let f, g âˆˆ R^{m}. Then f is said to Ïµdominate g for some Ïµ > 0, denoted as f â‰»_{Ïµ} g, iff for all i âˆˆ {1, ...., m }
(1 + Ïµ) f_{ i }â‰¤ g_{ i }.
Definition 9 Ïµapproximate Pareto set
Let F âŠ‡ R^{m}be a set of vectors and Ïµ > 0. Then a set F_{Ïµ} is called a Ïµapproximate Pareto set of F, if any vector g âˆˆ F is Ïµdominated by at least one vector f âˆˆ F_{Ïµ}, i.e.
âˆ€ g âˆˆ F : âˆƒ f âˆˆ F_{Ïµ} such that f â‰»_{Ïµ} g
The set of all Ïµapproximate Pareto sets of F is denoted as P_{Ïµ} (F).
Definition 10 ÏµPareto set
Let F âŠ† R^{m}be a set of vectors and Ïµ > 0. Then a set {F}_{\mathrm{\xcf\mu}}^{\xe2\u02c6\u2014} âŠ† F is called an ÏµPareto set of F if

(i)
{F}_{\mathrm{\xcf\mu}}^{\xe2\u02c6\u2014} is an Ïµapproximate Pareto set of F, i.e. {F}_{\mathrm{\xcf\mu}}^{\xe2\u02c6\u2014} âˆˆ P_{Ïµ} (F), and

(ii)
{F}_{\mathrm{\xcf\mu}}^{\xe2\u02c6\u2014} contains Pareto points of X only, i.e. {F}_{\mathrm{\xcf\mu}}^{\xe2\u02c6\u2014}âˆˆ F *
The set of all ÏµPareto set of F is denoted as {P}_{\mathrm{\xcf\mu}}^{\xe2\u02c6\u2014} (F).
Crowding distance
The crowding distance of a particles can estimate the density of particles including the particle [13]. The computation of the crowding distance of particle i is reached by estimating the size of the largest cube surrounding particle i without including any other particle [25]. The calculation of crowding distance of each particle is achieved as the following steps.
Step 1 Calculate the number of nondominated particles in external archive A, s = A.
Step 2 Initialize the distance of each particle i to zero, A [i].distance = 0.
Step 3 Computes the distance of each particle. For each objective m, the following steps are implemented.
Step 3.1 sorting A [] in ascending objective m function values of each particle.
Step 3.2 Set the maximum distance to the boundary points so that they are always selected A [1].distance = A [s].distance= maximum distance.
Step 3.3 The distance of all other particles i = 2 to s1 are the average distance of its two neighbouring solutions computed as follows:
A [i].distance = A [i].distance + (A [i+1].m  A [i1].m)
Here A [i1].m refers to the mth objective function value of the ith individual in the set A.
Fitness function
Our hope is mining biclusters with low mean squared residue, with high volume and genedimensional variance, and those three objectives in conflict with each other are well suited for multiobjective to model. To achieve these aims, we use the following fitness functions.
Where G and C are the total number of genes and conditions in microarray datasets respectively. Size(x), MSR(x) and RVAR(x) denotes the size, mean squared residue and row variance of bicluster encoded by the particle Ã— respectively. Î´ is the userdefined threshold for the maximum acceptable mean squared residue. Our algorithm minimizes those three fitness functions.
Update of ÏµPareto set of the particle
In order to guarantee the convergence and maintain diversity in the population at the same time, we implement updating of ÏµPareto set of the population during selection operation similar to [16]. The following steps conclude a general scheme of the updating algorithm.
Step 1. For each particle x in the swarm X, the set Y contains particles x' which meet the condition that lnx dominate lnx'. Here function f = ln(x) is computed as y_{ i }= âŒŠln x_{ i }/(1 + âˆˆ)âŒ‹ which x_{ i }presents the value of the solution Ã— in objective i, and y denotes an m Ã— 1 vector.
Step 2. From the particle swarm X, exclude Y and those particles x' which meet (i) lnx' = lnx, (ii) x dominate x'.
Datasets and data preprocessing
We apply the proposed multiobjective PSO biclustering algorithm to mine biclusters from two well known datasets. The first dataset is the yeast Saccharomyces cerevisiae cell cycle expression data [35], and the second dataset is the human Bcells expression data [36].
Yeast dataset
The yeast dataset collects expression level of 2,884 genes under 17 conditions. All entries are integers lying in the range of 0â€“600. Out of the yeast dataset there are 34 missing values. The 34 missing values are replaced by random number between 0 and 800, as in [5].
Human Bcells expression dataset
The human Bcells expression dataset is collection of 4,026 genes and 96 conditions, with 12.3% missing values, lying in the range of integers 750650. Like in [5], the missing values are replaced by random numbers between 800800. However, those random values affect the discovery of biclusters [37]. For providing a fair comparison with existing methods we set the same parameter for Î´ as [5], i.e., for the yeast data Î´ = 300, for the human Bcells expression data Î´ = 1200. The two gene expression dataset are taken from [5].
References
Su A, Cooke M, Ching K, Hakak Y, Walker J, Wiltshire T, Orth A, Vega R, Sapinoso L, Moqrich A: Largescale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci U S A. 2002, 99: 44654470.
De Smet F, Mathys J, Marchal K, Thijs G, De Moor B, Moreau Y: Adaptive qualitybased clustering of gene expression profiles. Bioinformatics. 2002, 18: 735746.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci U S A. 1998, 95: 1486314868.
Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA. 1999, 96: 29072912.
Cheng Y, Church GM: Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol. 2000, 8: 93103.
Wang H, Wang W, Yang J, Yu PS: Clustering by pattern similarity in large data sets. Proceedings of the 2002 ACM SIGMOD international conference on Management of data. 2002, 394405.
Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics. 2002, 18 (Suppl 1): S136144.
Dhillon IS: Coclustering documents and words using bipartite spectral graph partitioning. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. 2001, 269274.
Sheng Q, Moreau Y, De Moor B: Biclustering microarray data by Gibbs sampling. Bioinformatics. 2003, 19 (Suppl 2): ii196205.
Bryan K, Cunningham P, Bolshakova N, Coll T, Dublin I: Biclustering of expression data using simulated annealing. ComputerBased Medical Systems, 2005. Proceedings. 18th IEEE Symposium on. 2005, 383388.
Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform. 2004, 1: 2445.
Deb K: MultiObjective Optimization Using Evolutionary Algorithms. 2001, John Wiley & Sons, Chichester
Deb K, Pratap A, Agarwal S, Meyarivan T: A fast and elitist multiobjective genetic algorithm: NSGAII. Evolutionary Computation, IEEE Transactions on. 2002, 6: 182197.
Knowles J, Corne D: Approximating the Nondominated Front Using the Pareto Archived Evolution Strategy. Evolutionary Computation. 2000, 8: 149172.
Zitzler E, Laumanns M, Thiele L: SPEA2: Improving the Strength Pareto Evolutionary Algorithm. EUROGEN. 2001, 95100.
Laumanns M, Thiele L, Deb K, Zitzler E: Combining Convergence and Diversity in Evolutionary Multiobjective Optimization. Evolutionary Computation. 2002, 10: 263282.
Mitra S, Banka H: Multiobjective evolutionary biclustering of gene expression data. Pattern Recognition. 2006, 39: 24642477.
Divina F, AguilarRuiz JS: A multiobjective approach to discover biclusters in microarray data. Proceedings of the 9th annual conference on Genetic and evolutionary computation. 2007, 385392.
Kennedy J, Eberhart RC: Particle swarm optimization. Proceedings, IEEE International Conference on Neural Networks. 1995, 19421948.
Kennedy J, Eberhart RC: A discrete binary version of the particle swarm algorithm. Proceedings of IEEE Conference on Systems, Man, and Cybernetics. 1997, 41044109.
Mostaghim S, Teich J: Strategies for finding good local guides in multiobjective particle swarm optimization (MOPSO). Swarm Intelligence Symposium, 2003. SIS'03. Proceedings of the 2003 IEEE. 2003, 2633.
Parsopoulos KE, Vrahatis MN: Particle swarm optimization method in multiobjective problems. Proceedings of the 2002 ACM Symposium on Applied Computing (SAC'2002). 2003, 603607.
Mostaghim S, Teich J: The Role of dominance in Multi Objective Particle Swarm Optimization Methods. Proceedings of the 2003 Congress on Evolutionary Computation (CEC 2003), Camberra Australia, IEEE Press. 2003, 17641771.
Sierra MR, Coello CAC: Improving PSOBased Multiobjective Optimization Using Crowding, Mutation and eDominance. Proceedings of the Third International Conference on Evolutionary MultiCriterion Optimization (EMO 2005). 2005, 505519.
Raquel C, Naval P: An effective use of crowding distance in multiobjective particle swarm optimization. Proceedings of the 2005 conference on Genetic and evolutionary computation. 2005, 257264.
Srinivisan D, Seow TH: Particle swarm inspired evolutionary algorithm (PSEA) for multiobjective optimization problem. Proceedings of Congress on Evolutionary Computation (CEC2003). 2003, 22922297.
Ray T, Liew K: A Swarm Metaphor for Multiobjective Design Optimization. Engineering Optimization. 2002, 34: 141153.
Li X: A Nondominated Sorting Particle Swarm Optimizer for Multiobjective Optimization. LECTURE NOTES IN COMPUTER SCIENCE. 2003, 3748.
ReyesSierra M, Coello CAC: MultiObjective Particle Swarm Optimizers: A Survey of the StateoftheArt. International Journal of Computational Intelligence Research. 2006, 2: 287308.
Liu JW, Li ZJ, Liu FF, Chen YM: MultiObjective Particle Swarm Optimization Biclustering of Microarray Data. IEEE International Conference on Bioinformatics and Biomedicine(BIBM 2008). 2008, 363366.
Khatri P, Draghici S, Ostermeier G, Krawetz S: Profiling Gene Expression Using OntoExpress. Genomics. 2002, 79: 266270.
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nature Genetics. 1999, 22: 281285.
Divina F, AguilarRuiz JS: Biclustering of expression data with evolutionary computation. IEEE transactions on knowledge and data engineering. 2006, 18: 590602.
Zhang Xh, Meng Hy, Jiao Lc: Intelligent particle swarm optimization in multiobjective optimization. Congress on Evolutionary Computation (CEC'2005). 2005, IEEE Press, Edinburgh, Scotland, UK, 714719.
Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ: A GenomeWide Transcriptional Analysis of the Mitotic Cell Cycle. Molecular Cell. 1998, 2: 6573.
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X: Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling. Nature. 2000, 403: 503511.
Yang J, Wang W, Wang H, Yu P: Î´clusters: capturing subspace correlation in a large dataset. Data Engineering, 2002. Proceedings. 18th International Conference on. 2002, 517528.
Acknowledgements
This work is supported by Scientific Research Fund of Central South University of Forestry & Technology Grants No.07026B and National Science Foundation of China under Grants No. 60573057.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 4, 2009: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2008. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/10?issue=S4.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
JL proposed to use MOPSO methods to mining biclusters from gene expression data and drafted the manuscript. ZL and XH were involved in study design and coordination and revised the manuscript. YC conducted the algorithm design.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Liu, J., Li, Z., Hu, X. et al. Biclustering of microarray data with MOSPO based on crowding distance. BMC Bioinformatics 10 (Suppl 4), S9 (2009). https://doi.org/10.1186/1471210510S4S9
Published:
DOI: https://doi.org/10.1186/1471210510S4S9