A formal concept analysis approach to consensus clustering of multi-experiment expression data
 Anna Hristoskova^{1},
 Veselka Boeva^{2} and
 Elena Tsiporkova^{3}
https://doi.org/10.1186/1471-2105-15-151
© Hristoskova et al.; licensee BioMed Central Ltd. 2014
Received: 17 January 2014
Accepted: 6 May 2014
Published: 19 May 2014
Abstract
Background
Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. An approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution which increases the confidence in the common features of all the datasets and reveals the important differences among them.
Results
We propose a novel generic consensus clustering technique that applies a Formal Concept Analysis (FCA) approach to the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group, resulting in one clustering solution per group.
These solutions are pooled together and further analysed by employing FCA, which allows extracting valuable insights from the data and generating a gene partition over all the experiments. In order to validate the FCA-enhanced approach, two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from a multi-experiment study examining the global cell-cycle control of fission yeast. The FCA results derived from both methods demonstrate that, although the two algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological signals.
Conclusions
The proposed FCA-enhanced consensus clustering technique is a general approach to the combination of clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce a good-quality clustering solution that is representative of the whole set of expression matrices.
Background
DNA microarray technology offers the ability to screen the expression levels of thousands of genes in parallel under different experimental conditions or their evolution in discrete time points. All these measurements contain information on several aspects of gene regulation and function, ranging from understanding the global cell-cycle control of microorganisms [1], to cancer in humans [2, 3]. Gene clustering is one of the most frequently used analysis methods for gene expression data. Clustering algorithms are used to divide genes into groups according to the degree of their expression similarity. These groups suggest the correlation and/or co-regulation of the respective genes that possibly share common biological roles.
The combination of data from multiple microarray studies addressing a similar biological question has gained importance in recent years [4–7] due to the ever-increasing number and complexity of the available gene expression datasets. The integration and evaluation of multiple datasets yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. A method for the integrated analysis of data from multiple experiments is the aggregation of their clustering results into a consensus clustering which emphasizes the common organization in all the datasets and reveals the significant differences among them.
In this work, we present and validate a novel generic approach to consensus clustering based on Formal Concept Analysis (FCA), in which microarray data obtained under different experimental conditions is integrated into a representative consensus clustering solution. It initially divides the available microarray experiments into groups of related datasets with respect to a predefined criterion and then a consensus clustering algorithm is applied to each group of experiments separately. The rationale behind this is that if the experiments are closely related to one another, then they produce a more accurate and robust clustering solution. Next, the clustering solutions produced by the different groups are pooled together and further analyzed by employing FCA, which allows extracting valuable insights from the data and generating a gene partition over the whole experimental compendium. FCA produces a concept lattice where each concept represents a subset of genes that belongs to a number of clusters. The concepts compose the final disjoint clustering partition.
The proposed general FCA-enhanced consensus clustering approach is experimentally validated. For this purpose, two consensus clustering methods are adapted to incorporate an FCA analysis step. These methods are quite different: the first (Integrative) integrates the partitioning results derived from multiple microarray datasets through a weighted aggregation process, while the second employs a Particle Swarm Optimization (PSO) approach to cluster gene expression data across multiple experiments. The FCA results derived from both methods are analysed with respect to cluster consistency and biological relevance. It is shown that although the two algorithms optimize different clustering characteristics, FCA is able to construct similar consensus clustering solutions that are representative of the whole set of experiments.
In summary, the main contribution of the introduced FCA-enhanced consensus clustering technique is that it proposes a general approach to the combination of clustering algorithms with Formal Concept Analysis (FCA) for deriving clustering solutions from multiple gene expression matrices. In addition, the approach is demonstrated to be independent of the selected clustering algorithm. In this way one can use a customized algorithm optimized for the specific characteristics of each group of experiments. The subsequent use of FCA allows a further data analysis, which provides useful insights into the biological role of genes contained in the same FCA concepts.
Related work
Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. However, as emphasized in [7], the investigations of gene expression levels have also generated controversy because of the probabilistic nature of the conclusions and the discrepancies between the results of the studies addressing the same biological question. Subsequently, the authors proposed data analysis and visualization tools for estimating the degree to which the findings of one study are reproduced by others and for integrating multiple studies in a single analysis. These tools were described in the context of studies of breast cancer and it was illustrated that it is possible to identify a substantial biologically relevant subset of the human genome within which the expression levels are reliable. The latter suggests that important biological signals are often preserved or enhanced by multiple experiments.
Another approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering which increases the confidence in the common features in all the datasets and reveals the important differences among them [8]. Methods for the combination of clustering results derived for each dataset separately have been considered in [9–11]. The algorithm proposed in [9] first generates local cluster models and then combines them into a global cluster model of the data. The study in [10] focuses on clustering ensembles, i.e. seeking a combination of multiple partitions that provides improved overall clustering of the given data. The combined partition is found as a solution to the corresponding maximum likelihood problem using the Expectation-Maximization (EM) algorithm in [12]. The authors in [11] consider the problem of combining multiple partitions of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitions. The cluster ensemble problem is formalized as a combinatorial optimization problem in terms of shared mutual information.
In contrast to the foregoing approaches, the generic solution proposed in this paper applies FCA in order to construct a consensus clustering that is representative of all the datasets and, in addition, is independent of the applied clustering algorithm. In order to validate the proposed FCA-enhanced approach, two previously developed consensus clustering algorithms, Integrative [13] and PSO-based [14], are used in the validation process. Note that a preliminary FCA-enhanced version of the PSO-based algorithm was initially considered in [15].
In [13] we study two microarray data integration techniques that can be applied to the problem of deriving clustering results from a set of microarray experiments. A cluster integration approach is considered, which combines the information contained in multiple microarray experiments at the level of expression or distance matrices and then applies a clustering algorithm on the combined matrix. Furthermore, a technique for the combination of partitioning results derived from multiple microarray datasets, referred to as Integrative consensus clustering, is introduced. It uses a traditional aggregation schema in order to integrate the different partitioning results into a final partition matrix.
The PSO-based approach is used to cluster gene expression data across multiple experiments. In this algorithm, referred to as PSO-based consensus clustering, each experiment (dataset) defines a particle which is initialized with a set of k cluster centroids obtained after performing k-means clustering on the experiment. The final (optimal) clustering solution is found by updating the particles using the information on the best clustering solution obtained by each experiment and the entire set of experiments.
In this article, the Integrative and PSO-based consensus clustering approaches are extended by incorporating a final FCA step. The experimental results from their validation suggest that although the two algorithms optimize different clustering characteristics, FCA is able to construct similar consensus clustering solutions.
Methods
Partitioning algorithms
Three partitioning algorithms are commonly used for the purpose of dividing data objects into k disjoint clusters [16]: k-means, k-medians and k-medoids clustering. All three methods start by initializing a set of k cluster centres, where k is determined in advance. Subsequently, each object of the dataset is assigned to the cluster whose centre is the nearest, and the cluster centres are recomputed. This process is repeated until the objects inside every cluster become as close to the centre as possible and no further object reassignments take place. The EM algorithm in [12] is commonly used for that purpose, i.e. to find the optimal partitioning into k groups. The three partitioning methods in question differ in how the cluster centre is defined. In k-means, the cluster centre is defined as the mean data vector averaged over all objects in the cluster. Instead of the mean, k-medians calculates the median for each dimension in the data vector. Finally, in k-medoids [17], which is a robust version of k-means, the cluster centre is defined as the object which has the smallest sum of distances to the other objects in the cluster, i.e. the most centrally located point in a given cluster.
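The difference between the three centre definitions can be illustrated with a short Python sketch. It is only a minimal illustration on a made-up cluster of four two-dimensional points; the function names and the use of the Euclidean distance are assumptions, not part of the cited methods.

```python
from statistics import mean, median

def euclidean(p, q):
    """Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def mean_centre(points):
    """k-means centre: mean of every dimension."""
    return [mean(col) for col in zip(*points)]

def median_centre(points):
    """k-medians centre: median of every dimension."""
    return [median(col) for col in zip(*points)]

def medoid_centre(points):
    """k-medoids centre: the member with the smallest sum of distances to the others."""
    return min(points, key=lambda p: sum(euclidean(p, q) for q in points))

# Made-up cluster with one outlying point.
cluster = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [9.0, 4.0]]
print(mean_centre(cluster))    # [3.0, 1.0]
print(median_centre(cluster))  # [1.5, 0.0]
print(medoid_centre(cluster))  # [1.0, 0.0]
```

The outlier pulls the mean away from the bulk of the points, whereas the median and the medoid stay close to it, which is why k-medoids is described as the more robust variant.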
Particle swarm optimization
Particle swarm optimization (PSO) is an evolutionary computation method introduced in [18]. In order to find an optimal or near-optimal solution to the problem, PSO updates the current generation of particles (each particle is a candidate solution to the problem) using the information on the best solution obtained by each particle and the entire population.
The hybrid algorithm proposed in [14] combines k-means and PSO for deriving a clustering result from a group of n related microarray datasets M_{1},M_{2},…,M_{ n }. Each dataset contains the gene expression levels of m genes in n_{ i } different experimental conditions or time points. In this context, each matrix i is used to generate k cluster centers, which are considered to represent a particle, i.e. the particle is treated as a set of points in an n_{ i }-dimensional space. The final (optimal) clustering solution is found by updating the particles using the information on the best clustering solution obtained by each data matrix and the entire set of matrices.
Assume that the ith particle is initialized with a set of k cluster centers^{a} ${C}_{i}=\left\{{C}_{1}^{i},{C}_{2}^{i},\dots ,{C}_{k}^{i}\right\}$ and a set of velocity vectors^{b} ${V}_{i}=\left\{{V}_{1}^{i},{V}_{2}^{i},\dots ,{V}_{k}^{i}\right\}$ using gene expression matrix M_{ i }. Thus each cluster center is a vector ${C}_{j}^{i}=\left({c}_{j1}^{i},{c}_{j2}^{i},\dots ,{c}_{j{n}_{i}}^{i}\right)$ and each velocity vector is a vector ${V}_{j}^{i}=\left({v}_{j1}^{i},{v}_{j2}^{i},\dots ,{v}_{j{n}_{i}}^{i}\right)$, i.e. each particle i is a matrix (or a set of points) in the k×n_{ i } dimensional space.
The variables φ_{1} and φ_{2} are uniformly generated random numbers in the range [0,1], c_{1} and c_{2} are called acceleration constants, and w is called the inertia weight, as defined in [19]. The first part of equation (1) represents the inertia of the previous velocity, the second part is the cognition part that identifies the personal experience of the particle, and the third part represents the cooperation among particles and is therefore named the social component. The acceleration constants c_{1}, c_{2} and the inertia weight w are predefined by the user. Note that the cognition part in the above equation has a modified interpretation. Namely, it represents the private ‘thinking’ (opinion) of the particle based on its own source of information (dataset). Consequently, we adapted the social part (see equation (2)), since each particle matrix has a different number of columns (n_{ i }) owing to the different number of experimental points in each dataset. It was demonstrated in [19] that when w is in the range [0.9,1.2] PSO will have the best chance to find the global optimum within a reasonable number of iterations. Furthermore, w=0.72 and c_{1}=c_{2}=1.49 were found in [20] to ensure good convergence.
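As a rough illustration of the three parts described above, the following Python sketch implements the standard PSO update with an inertia weight (inertia, cognition and social terms). It does not reproduce equations (1) and (2), which are not shown in this text. In particular, `social_target` is a hypothetical argument standing for the global best solution expressed in the particle's own n_i-dimensional space, and drawing φ_1 and φ_2 once per cluster centre is an assumption.

```python
import random

def update_particle(centres, velocities, personal_best, social_target,
                    w=0.72, c1=1.49, c2=1.49, rng=random):
    """One standard PSO step for a particle made of k cluster centres.

    All four inputs are lists of k vectors of equal length.
    social_target is a hypothetical stand-in for the global best,
    already mapped to this particle's dimensionality (assumption).
    """
    new_centres, new_velocities = [], []
    for C, V, P, G in zip(centres, velocities, personal_best, social_target):
        phi1, phi2 = rng.random(), rng.random()  # uniform in [0, 1)
        V_new = [w * v                      # inertia of the previous velocity
                 + c1 * phi1 * (p - c)      # cognition part (personal best)
                 + c2 * phi2 * (g - c)      # social part (global best)
                 for c, v, p, g in zip(C, V, P, G)]
        new_velocities.append(V_new)
        new_centres.append([c + v for c, v in zip(C, V_new)])
    return new_centres, new_velocities

# A particle that already sits on both bests keeps only its damped inertia.
centres = [[1.0, 2.0]]
velocities = [[0.5, -0.5]]
moved, damped = update_particle(centres, velocities, centres, centres,
                                rng=random.Random(0))
```

The default values w=0.72 and c_1=c_2=1.49 are the ones quoted above from [20].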
 1. Initialize each particle with k cluster centers obtained as a result of applying the k-means algorithm to the corresponding data matrix.
 2. Initialize the personal best clustering solution of each matrix with the corresponding clustering solution found in Step 1.
 3. for iteration = 1 to maxiteration do
    (a) for i = 1 to n do (i.e. for all datasets)
        i. for j = 1 to m do (i.e. for all genes in the current dataset)
           A. Calculate the distance of gene g_{ j } to all cluster centers.
           B. Assign g_{ j } to the cluster whose center is nearest to g_{ j }.
        ii. end for
        iii. Calculate the fitness function for the clustering solution C_{ i }.
        iv. Update the personal best clustering solution P_{ i }.
    (b) end for
    (c) Find the global best solution P_{ g }.
    (d) Update the cluster centers according to the velocity updating formula proposed in equation (1).
 4. end for
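The inner part of the loop above (steps A and B, followed by the fitness evaluation and the personal-best update) can be sketched in Python as follows. The fitness function is not specified in this text, so the mean gene-to-centre distance used here is purely a hypothetical choice, as are all function names and the toy data.

```python
def euclidean(p, q):
    """Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def assign_genes(profiles, centres):
    """Steps A and B: label every gene with the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda c: euclidean(g, centres[c]))
            for g in profiles]

def fitness(profiles, centres, labels):
    """Hypothetical fitness (assumption): mean distance of each gene to
    its assigned centre; lower values are better."""
    total = sum(euclidean(g, centres[l]) for g, l in zip(profiles, labels))
    return total / len(profiles)

def update_personal_best(best, candidate):
    """Keep whichever (fitness, centres) pair has the lower fitness."""
    return candidate if best is None or candidate[0] < best[0] else best

# Toy expression profiles (made up) and two candidate centres.
profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
centres = [[0.0, 0.0], [5.0, 5.0]]
labels = assign_genes(profiles, centres)
print(labels)  # [0, 0, 1, 1]
personal_best = update_personal_best(None, (fitness(profiles, centres, labels), centres))
```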
The PSO-based clustering algorithm was first introduced in [21], where it was shown to outperform k-means and a few other state-of-the-art clustering algorithms. In this method, each particle represents a possible set of k cluster centroids. The authors in [22] hybridized the approach in [21] with the k-means algorithm for clustering general datasets. A single particle of the swarm is initialized with the result of the k-means algorithm while the rest of the swarm is initialized randomly. In [23] a new approach based on the combination of PSO and Self-Organizing Maps is proposed and applied to clustering gene expression data, obtaining promising results. Further, the study in [24] considers a dynamic clustering approach based on PSO and a genetic algorithm. The main advantage of this algorithm is that it can automatically determine the optimal number of clusters and simultaneously cluster the data set with minimal user interference. The downside of all the foregoing approaches is that they are not suitable for consolidating multiple partitions, as the clustering analysis they conduct is based on a single expression matrix.
Formal concept analysis
Formal Concept Analysis (FCA) [25] is a mathematical formalism for deriving a concept lattice from a formal context consisting of a set of objects O, a set of attributes A, and a binary relation defined on the Cartesian product O×A. The context is described as a table whose rows correspond to objects and whose columns correspond to attributes or properties; a cross in a table cell means that “an object possesses a property”. FCA is used for a number of purposes, among which are knowledge formalization and acquisition, ontology design, and data mining.
Relying on the subsumption relation ≺ between concepts, the set of all concepts extracted from a context is organized within a complete lattice, called the concept lattice. That means that for any set of concepts there is a smallest superconcept and a largest subconcept.
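To make these notions concrete, the sketch below enumerates the formal concepts of a tiny, made-up context in Python. It relies on the standard FCA definition of a concept as a pair (extent, intent) in which the extent is exactly the set of objects sharing all attributes of the intent, and vice versa. The brute-force closure over all attribute subsets is exponential and is meant for toy contexts only; all names are illustrative assumptions.

```python
from itertools import combinations

def extent(context, attrs):
    """Objects that possess every attribute in attrs."""
    return frozenset(o for o, has in context.items() if attrs <= has)

def intent(context, objs, all_attrs):
    """Attributes shared by every object in objs."""
    return frozenset(a for a in all_attrs if all(a in context[o] for o in objs))

def concepts(context):
    """All formal concepts, found by closing every attribute subset."""
    all_attrs = frozenset().union(*context.values())
    found = set()
    for r in range(len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), r):
            X = extent(context, frozenset(subset))
            found.add((X, intent(context, X, all_attrs)))
    return found

def is_subconcept(c1, c2):
    """Subsumption order: c1 lies below c2 when its extent is contained in c2's."""
    return c1[0] <= c2[0]

# Toy context: object -> set of attributes (made up for illustration).
context = {"o1": {"a", "b"}, "o2": {"a"}, "o3": {"b"}}
lattice = concepts(context)
print(len(lattice))  # 4
```

For this context the four concepts are the top concept (all objects, no shared attribute), the two concepts generated by the attributes a and b, and the bottom concept ({o1}, {a, b}).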
The FCA or concept lattice approach has been applied for extracting local patterns from microarray data [26, 27] or for performing microarray data comparison [28, 29]. For example, the FCA method proposed in [28] builds a concept lattice from the experimental data together with additional biological information. Each vertex of the lattice corresponds to a subset of genes that are grouped together according to their expression values and some biological information related to the gene function. It is assumed that the lattice structure of the gene sets might reflect biological relationships in the dataset. In [30], an FCA-based method is proposed for extracting groups or classes of co-expressed genes. A concept lattice is constructed where each concept represents a set of co-expressed genes in a number of situations. A serious drawback of the method is the fact that the expression matrix is transformed into a binary table (the input for the FCA step), which may introduce biases or lead to information loss. Thus the authors further propose and compare two FCA-based methods for mining gene expression data and show that they are equivalent [31]. The first one relies on interordinal scaling, encoding all possible intervals of attribute values in a formal context that is processed with classical FCA algorithms. The second one relies on pattern structures without a prior transformation, and is shown to be more computationally efficient and to provide more readable results. Notice that all the mentioned FCA-based methods focus solely on optimizing the clustering of a single expression matrix and, consequently, are not suited for the consolidation of multiple partitions.
FCA-enhanced consensus clustering algorithm
The problem of deriving clustering results from a set of gene expression matrices can be approached in two different ways: 1) information contained in different datasets may be combined at the level of expression (or similarity) matrices and afterwards clustered; 2) given multiple clustering solutions, one per dataset, find a consensus (combined) clustering. In this section, a general FCA-enhanced consensus clustering algorithm for deriving a clustering result from multiple microarray datasets is proposed which adopts the second approach.
Assume that a particular biological phenomenon is monitored in several high-throughput experiments under n different conditions. Each experiment i (i=1,2,…,n) is supposed to measure the gene expression levels of m_{ i } genes in n_{ i } different experimental conditions or time points. Thus, a set of n different data matrices M_{1},M_{2},…,M_{ n } will be produced, one per experiment.
In contrast to the consensus clustering algorithms discussed in the foregoing sections, where some partitioning algorithm (e.g. Integrative or PSO-based clustering) is applied to the entire set of experiments in order to produce the final clustering solution, the algorithm proposed herein initially divides the available microarray datasets into groups of related (similar) experiments with respect to a predefined criterion. The rationale behind this is that if the experiments are closely related to one another, then these experiments produce a more accurate and robust clustering solution. Thus, the selected consensus clustering algorithm is applied to each group of experiments separately. This produces a list of different clustering solutions, one per group. Subsequently, these solutions are pooled together and further analyzed by employing FCA, which allows extracting valuable insights from the data and generating a gene partition over the whole experimental compendium. FCA produces a concept lattice where each concept represents a subset of genes that belong to a number of clusters. The different concepts compose the final disjoint clustering partition.
 1. the clustering uses all data by potentially allowing each group of related experiments to have a different set of genes, i.e. the total set of studied genes is not restricted to those contained in all datasets;
 2. it is better tuned to each experimental condition by identifying the initial number of clusters for each group of related experiments separately depending on the number, composition and quality of the gene profiles;
 3. the problem with ties is avoided (i.e. a case when a gene is randomly assigned to a cluster because it belongs to more than one cluster) by employing FCA in order to analyze together all the partitioning results and find the final clustering solution representative of the whole experimental compendium.
The distinctive steps of the FCA-enhanced clustering algorithm, visualized in Figure 1, are explained in detail below.
Initialization step
Let us consider the aforementioned expression matrices M_{1},M_{2},…,M_{ n } monitoring N different genes in total. The initialization step is the most variable part of the algorithm since it closely depends on the concrete clustering algorithm employed.
The available gene expression matrices are divided into r groups of related (similar) datasets with respect to some predefined criterion, e.g. the synchronization method used or the expression similarity between the matrices. For each group the set of studied genes needs to be restricted to those contained in all datasets of the group, i.e. the overlapping genes found across all datasets of the group. Then the number of cluster centres is identified for each group of experiments i (i=1,2,…,r) separately. As discussed in [32, 33], this can be performed by running k-means or another clustering algorithm on each data matrix for a range of different numbers of clusters. Subsequently, the quality of the obtained clustering solutions needs to be assessed in some way in order to identify the clustering scheme which best fits the datasets in question. Some commonly used validation measures are the Silhouette Index and Connectivity, presented in the Cluster validation measures section, which are able to identify the best clustering scheme. Finally, the prevailing number of clusters within the concrete group of experiments is selected as representative for the whole group.
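The last two steps of this selection can be sketched in Python as follows: pick the best-scoring number of clusters for each dataset, then take the most frequent value within the group. The validation scores below are invented for illustration, and breaking ties in favour of the smaller k is an assumption that the text does not address.

```python
from collections import Counter

def best_k(scores_by_k, higher_is_better=True):
    """Number of clusters with the best validation score for one dataset.
    Use higher_is_better=True for the Silhouette Index and False for
    Connectivity, which should be minimized."""
    pick = max if higher_is_better else min
    return pick(scores_by_k, key=scores_by_k.get)

def prevailing_k(best_ks):
    """Most frequent best k within a group; ties go to the smaller k
    (tie-breaking rule is an assumption)."""
    counts = Counter(best_ks)
    top = max(counts.values())
    return min(k for k, c in counts.items() if c == top)

# Invented Silhouette scores for three datasets of one group.
group_scores = [
    {3: 0.41, 4: 0.55, 5: 0.48},
    {3: 0.39, 4: 0.52, 5: 0.50},
    {3: 0.44, 4: 0.47, 5: 0.51},
]
group_k = prevailing_k([best_k(s) for s in group_scores])
print(group_k)  # 4
```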
Clustering step
The selected consensus clustering algorithm (e.g. Integrative, PSO-based or other) is applied to each group of related experiments i (i=1,2,…,r) separately. This generates a list of r different clustering solutions, one per group. The result is that K (K=k_{1}+…+k_{ r }) different clusters are produced by the different groups. This clustering solution is disjoint in terms of the gene expression profiles produced in the different experiments. However, it is not disjoint in terms of the different participating genes, i.e. there will be genes which belong to more than one cluster.
FCA-based analysis step
As discussed above, the N studied genes are grouped during the Clustering step into K clusters that are not guaranteed to be disjoint. This overlapping partition is further analysed and refined into a disjoint one by applying FCA. As mentioned above, FCA is a principled way of automatically deriving a hierarchical conceptual structure from a collection of objects and their properties. The approach takes as input a matrix (referred to as the formal context) specifying a set of objects and the properties thereof, called attributes. In our case, a formal context consists of the set G of the N studied genes (objects), the set of clusters C={C_{1},C_{2},…,C_{ K }} produced by the clustering step (attributes), and an indication of which genes belong to which clusters. Thus, the context is described as a matrix, with the genes corresponding to the rows and the clusters corresponding to the columns of the matrix, and a value of 1 in cell (i, j) whenever gene i belongs to cluster C_{ j }. Subsequently, a formal concept for this context is defined to be a pair (X, Y) such that

X⊆G & Y⊆C & every gene in X belongs to every cluster in Y

for every gene in G that is not in X, there is a cluster in Y that does not contain that gene

for every cluster in C that is not in Y, there is a gene in X that does not belong to that cluster.
The family of these concepts obeys the mathematical axioms defining a concept lattice. The constructed lattice consists of concepts where each one represents a subset of genes belonging to a number of clusters. The set of all concepts partitions the genes into a set of disjoint clusters.
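The construction of the formal context and of the resulting disjoint partition can be sketched in Python. The text does not spell out how the concepts are turned into disjoint gene groups; the sketch assumes one plausible reading, namely that each gene is placed in the concept whose intent is exactly the set of clusters the gene belongs to, so that genes with identical cluster memberships end up together. Gene names and labels are made up.

```python
from collections import defaultdict

def formal_context(group_partitions):
    """Map each gene to the set of global cluster ids that contain it.

    group_partitions holds one dict per group (gene -> local label).
    Local labels are assumed to run from 0 to k-1 within each group,
    so they are shifted by the number of clusters of the previous groups.
    """
    context = defaultdict(set)
    offset = 0
    for partition in group_partitions:
        for gene, label in partition.items():
            context[gene].add(offset + label)
        offset += max(partition.values()) + 1
    return dict(context)

def disjoint_partition(context):
    """Group genes by their full cluster-membership set (assumed reading)."""
    blocks = defaultdict(set)
    for gene, clusters in context.items():
        blocks[frozenset(clusters)].add(gene)
    return dict(blocks)

# Two groups of experiments; g4 is only measured in the second group.
group1 = {"g1": 0, "g2": 0, "g3": 1}
group2 = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
blocks = disjoint_partition(formal_context([group1, group2]))
for clusters, genes in sorted(blocks.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(clusters), sorted(genes))
```

Because every gene has exactly one membership set, the resulting blocks are disjoint by construction, even when a gene is missing from some group.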
Validation setup
Microarray datasets
 1. elutriation: three independent biological repeats (elu1, elu2, elu3);
 2. cdc25 block-release: two independent biological repeats, of which one in two dye-swapped technical replicates (cdc25-1, cdc25-2.1, cdc25-2.2) and, in addition, one experiment in a sep1 mutant background (cdc25-sep1);
 3. a combination of both methods: elutriation and cdc25 block-release (elu-cdc25) as well as elutriation and cdc10 block-release (elu-cdc10).
Thus, nine different expression test sets are available. In the preprocessing phase the rows with more than 25% missing entries are filtered out from each expression matrix and any other missing expression entries are imputed by the DTWimpute algorithm presented in [34]. In this way nine complete matrices are obtained.
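The filtering rule of the preprocessing phase can be sketched in Python as below. Missing entries are represented by `None`, which is an assumption about the data encoding, and the gene names and values are made up. The subsequent imputation with DTWimpute [34] is not shown.

```python
def filter_rows(matrix, max_missing=0.25):
    """Drop genes whose profile has more than max_missing missing entries.

    matrix maps a gene name to its expression profile (a list of values);
    a missing measurement is encoded as None (assumption).
    """
    kept = {}
    for gene, profile in matrix.items():
        missing = sum(1 for x in profile if x is None)
        if missing / len(profile) <= max_missing:
            kept[gene] = profile
    return kept

# Toy matrix with four time points per gene.
matrix = {
    "geneA": [0.1, 0.4, 0.3, 0.2],    # complete, kept
    "geneB": [0.1, None, 0.3, 0.2],   # exactly 25% missing, kept
    "geneC": [None, None, 0.3, 0.2],  # 50% missing, removed
}
print(sorted(filter_rows(matrix)))  # ['geneA', 'geneB']
```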
The authors in [1] identified 407 genes as cell-cycle regulated and subjected them to clustering, which resulted in the formation of 4 separate clusters. Subsequently, the time expression profiles of these genes are extracted from the complete data matrices and thus nine new matrices are constructed. Note that some of these 407 genes are removed from the original matrices during the preprocessing phase, i.e. each dataset may have a different set of genes. Thus a set of 376 different genes is present in the nine preprocessed datasets in total.
 1. The genes that are not present in the intersection of the nine preprocessed datasets are removed. This produces a subset of 267 genes. Subsequently, the time expression profiles of these genes are extracted from the complete data matrices and thus nine new matrices, which form our test corpus 1, are constructed.
 2. The initial complete datasets are divided into three groups with respect to the synchronization method used. The overlapping genes within each group are as follows: a subset of 286 common genes in the elutriation datasets, a subset of 350 common genes in the cdc25 block-release datasets and a subset of 374 common genes in the datasets synchronized by the combination of both methods. For each of the three groups only these common genes are retained. As a result, nine new matrices, which form our test corpus 2, are built. Notice that the nine different datasets contain 374 different genes in total.
The benchmark datasets are normalized by applying a data transformation method as proposed in [35].
Consensus clustering algorithms
In order to validate the proposed general FCA-enhanced approach, we applied two different consensus clustering methods: 1) an algorithm integrating multiple partitioning results and 2) a PSO-based clustering method.
These two consensus clustering methods differ in how they initialize the cluster centres and produce the final clustering partition. Both methods initially restrict the studied genes for each group to those contained in all datasets of the group.
The first algorithm, referred to as Integrative, initializes the cluster centres for each group of experiments using the information contained in the datasets of the group in an integrated manner. This step is performed using a matrix constructed by concatenating the expression matrices in each group. The k-means algorithm is then applied to each expression matrix in the group to generate a set of partition matrices for each group. Utilizing information on the quality of the microarrays, weights are assigned to the experiments and are further used in the integration process in order to obtain a more realistic overall partition for each group. The data transformation method proposed in [35] is used to evaluate the quality of the considered microarrays. It is applied to each expression matrix and the number of standardized genes is considered as a quality measure for this matrix. This step results in the assignment of a weight to each expression matrix in the group, i.e. the different experiments in the group will contribute to the final partitioning result to a different extent. A detailed explanation of the Integrative consensus clustering algorithm can be found in [13].
The second consensus clustering method, referred to as PSO-based, employs a PSO approach to cluster gene expression data across multiple experiments. Each experiment (dataset) defines a particle which is initialized with a set of k cluster centroids obtained by applying the k-means clustering algorithm to the experiment. The final (optimal) clustering solution for each group of experiments is found by updating the particles using the information on the best clustering solution obtained by each experiment and the entire set of experiments in the group. A detailed explanation of the PSO-based consensus clustering algorithm can be found in [14].
Cluster validation measures
 1. Connectivity: for assessing connectedness;
 2. Silhouette Index (SI): for assessing compactness and separation properties of a partitioning.
Connectivity
The Connectivity has a value between zero and infinity and should be minimized.
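The formula for Connectivity is not reproduced in this text. The Python sketch below implements the commonly used nearest-neighbour definition, under the assumption that it is the one intended here: for every gene, each of its L nearest neighbours that falls into a different cluster adds a penalty of 1/j, where j is the rank of that neighbour. The parameter name `n_neighbours`, the Euclidean distance and the toy data are illustrative choices.

```python
def euclidean(p, q):
    """Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def connectivity(points, labels, n_neighbours=2):
    """Sum of 1/rank penalties over nearest neighbours placed in another cluster."""
    total = 0.0
    for i, p in enumerate(points):
        # Other points ordered from nearest to farthest.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: euclidean(p, points[j]))
        for rank, j in enumerate(others[:n_neighbours], start=1):
            if labels[j] != labels[i]:
                total += 1.0 / rank
    return total

# Two well-separated pairs of points on a line.
points = [[0.0], [1.0], [10.0], [11.0]]
print(connectivity(points, [0, 0, 1, 1], n_neighbours=1))  # 0.0
print(connectivity(points, [0, 1, 0, 1], n_neighbours=1))  # 4.0
```

A partition that keeps nearest neighbours together scores zero, while one that separates every gene from its nearest neighbour is penalized once per gene.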
Silhouette index
The Silhouette Index of a partition of m genes is computed as $SI=\frac{1}{m}\sum_{i=1}^{m}\frac{{b}_{i}-{a}_{i}}{\max\left\{{a}_{i},{b}_{i}\right\}}$, where a_{ i } represents the average distance of gene i to the other genes of the cluster to which the gene is assigned, and b_{ i } represents the minimum of the average distances of gene i to genes of the other clusters.
The values of the Silhouette Index vary from −1 to 1, and a higher value indicates better clustering results.
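A minimal Python sketch of the Silhouette Index follows, using the standard per-gene score (b_i − a_i)/max(a_i, b_i) averaged over all genes. The Euclidean distance, the score of 0 given to genes in singleton clusters and the toy data are assumptions made for illustration.

```python
def euclidean(p, q):
    """Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette_index(points, labels):
    """Average silhouette score; requires at least two clusters
    and at least two distinct points."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, p in enumerate(points):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score 0 by convention (assumption)
            continue
        # a: average distance to the other members of the gene's own cluster
        a = sum(euclidean(p, points[j]) for j in own) / len(own)
        # b: smallest average distance to the members of any other cluster
        b = min(sum(euclidean(p, points[j]) for j in members) / len(members)
                for label, members in clusters.items() if label != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_index(points, [0, 0, 1, 1])
bad = silhouette_index(points, [0, 1, 0, 1])
print(good > bad)  # True
```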
Results and discussion
 1. two consensus partitions over test corpus 1 using respectively the original versions of the Integrative and the PSO-based consensus clustering algorithms;
 2. two times three consensus partitions, one for each group of experiments from test corpus 2, using respectively the grouped versions of the Integrative and the PSO-based consensus clustering algorithms as specified in the foregoing section;
 3. two concept lattices derived by applying FCA on the two sets of group partitions (see above) produced respectively from the grouped versions of the Integrative and the PSO-based methods.
Clustering performance
In this section, we evaluate and compare the clustering performance of the two consensus clustering algorithms discussed in the foregoing section on the benchmark datasets described above by using two cluster validation measures: Silhouette Index and Connectivity.
1. elutriation datasets: elu1, elu2, elu3;
2. cdc25 block-release datasets: cdc25-1, cdc25-2.1, cdc25-2.2, cdc25-sep1;
3. datasets synchronized by the combination of both methods: elu-cdc10, elu-cdc25.
Afterwards, the number of cluster centres is identified for each group using the Connectivity measure. The selected optimal number of clusters for the three groups of experiments is as follows: elutriation datasets: k=4; cdc25 block-release datasets: k=6; and the combined ones: k=5. As a result, each of the two consensus clustering methods produces 15 different clusters in total (elutriation: clusters 0-3, cdc25 block-release: clusters 4-9, and combination of both: clusters 10-14).
Cluster consistency
In this way, d_{ i j } will be equal to 100 in case of full overlap between the clusters i and j, 0 in case of no overlap and between 0 and 100 otherwise.
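The sketch below shows one way to compute such a percentage overlap matrix between two partitions of the same genes. The normalization used here (shared genes divided by the size of the union of the two clusters) is an assumption chosen because it gives 100 for identical clusters and 0 for disjoint ones; the definition of d_{ i j } used in the paper may differ. The function name `overlap_matrix` is likewise hypothetical.

```python
def overlap_matrix(labels_a, labels_b):
    """Percentage overlap between every cluster pair of two partitions.

    labels_a[g] and labels_b[g] are the cluster ids of gene g under the
    two algorithms. Returns {(cluster_a, cluster_b): percentage}.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must cover the same genes")

    def members(labels):
        # Map each cluster id to the set of gene indices assigned to it.
        groups = {}
        for gene, cluster in enumerate(labels):
            groups.setdefault(cluster, set()).add(gene)
        return groups

    groups_a, groups_b = members(labels_a), members(labels_b)
    # Assumed normalization: intersection size over union size, times 100.
    return {(i, j): 100.0 * len(a & b) / len(a | b)
            for i, a in groups_a.items()
            for j, b in groups_b.items()}
```

The best-matching pairs can then be read off by taking, for each cluster of one algorithm, the cluster of the other algorithm with the highest percentage.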
Figures 5(b), (c), (d) depict the overlap between the consensus clustering assignments of the PSO-based versus the Integrative clustering algorithms on test corpus 2 for each group of experiments separately (respectively elutriation, cdc25 block-release and the combined). The best overlap is recorded for the following pairs:

elutriation: (PSO 0, Integrative 1), (PSO 2, Integrative 0), (PSO 3, Integrative 3),

cdc25 block-release: (PSO 4, Integrative 5), (PSO 4, Integrative 4), (PSO 5, Integrative 6), (PSO 6, Integrative 9), (PSO 8, Integrative 9), (PSO 9, Integrative 7),

combined: (PSO 10, Integrative 14), (PSO 12, Integrative 14), (PSO 13, Integrative 10) and (PSO 14, Integrative 13).
The high degree of pairwise overlap between the different gene clusters generated by the Integrative and the PSO-based consensus clustering algorithms suggests that there is a certain consistency in the resulting clustering solutions. The next section elaborates on this effect by applying FCA on both consensus clustering solutions consolidating the different groups of experiments.
Results from the FCA-enhanced step
The gene partitions produced by the grouped versions of the Integrative and PSO-based consensus clustering algorithms on test corpus 2 are further analysed by applying FCA.
Table 1. All concepts consisting of three clusters with support above 0.03

| PSO-based | Integrative |
|---|---|
| {1, 6, 10} | {0, 9, 14} |
| {0, 5, 13} | {1, 6, 14} |
| {1, 6, 12} | {0, 8, 14} |
| {2, 4, 10} | {2, 9, 14} |
| {0, 5, 10} | {1, 9, 14} |
| {1, 8, 12} | |
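Because each gene belongs to exactly one cluster per group of experiments, a concept consisting of three clusters corresponds to the set of genes that share the same triple of cluster labels. The following minimal sketch enumerates these triples and keeps those above a support threshold. It assumes that support is the fraction of all genes contained in the concept, which is not spelled out in this section, and the function name `three_cluster_concepts` is hypothetical; a full FCA tool builds the complete concept lattice, not only this bottom layer.

```python
from collections import Counter

def three_cluster_concepts(elutriation, cdc25, combined, min_support=0.03):
    """Return {(c1, c2, c3): support} for label triples above min_support.

    Each argument lists, per gene, the cluster id assigned within one group
    of experiments. Support is assumed to be the share of genes per triple.
    """
    if not (len(elutriation) == len(cdc25) == len(combined)):
        raise ValueError("all three partitions must cover the same genes")
    n_genes = len(elutriation)
    # Count how many genes carry each (group 1, group 2, group 3) triple.
    counts = Counter(zip(elutriation, cdc25, combined))
    return {triple: count / n_genes
            for triple, count in counts.items()
            if count / n_genes > min_support}
```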
Table 2. Percentage minimum overlap between the concepts from Table 1

| Integrative \ PSO-based | {1, 6, 10} | {1, 6, 12} | {0, 5, 10} | {1, 8, 12} |
|---|---|---|---|---|
| {0, 9, 14} | 35% | 35% | | 25% |
| {1, 6, 14} | 20% | 20% | 40% | |
| {0, 8, 14} | 20% | 20% | | |
| {2, 9, 14} | 20% | 20% | | 20% |
| {1, 9, 14} | 25% | 25% | | 25% |
Only 6 of the 9 FCA concepts are assigned GO categories by the BiNGO tool:

Integrative{0, 9, 14} contains 41 genes annotated to 26 GO categories (25 have total frequency >0.0%), most of which refer to the regulation of protein kinase activity, regulation of cell-cycle process or regulation of metabolic process;

Integrative{0, 8, 14} contains 22 genes annotated to 25 GO categories (22 have total frequency >0.0% and cluster frequency >10.0%), dominated by sister chromatid segregation and beta-glucan process regulation categories;

Integrative{1, 6, 14} contains 21 genes annotated to 21 GO categories (15 have total frequency >0.0%), most of which refer to regulation of mRNA stability;

PSO{1, 6, 10} contains 29 genes connected with 19 GO categories (10 have total frequency >0.0%), most of which refer to cell-cycle control, regulation of DNA replication or sister chromatid segregation;

PSO{1, 6, 12} contains 18 genes connected with 5 GO categories (only 3 have total frequency >0.0%), all referring to the regulation of sister chromatid cohesion and segregation;

PSO{1, 8, 12} contains 12 genes annotated to 22 GO categories (16 have total frequency >0.0%) dominated by RNA metabolic processing related categories.
It can be observed that the correspondences between the PSO-based and Integrative concepts presented in Table 2 are also supported by the above GO categories, e.g. sister chromatid segregation, DNA and cell-cycle regulation, metabolic processing, etc. These correspondences suggest that, although the two algorithms optimize different clustering characteristics (SI and Connectivity indices in Figure 4), FCA is able to construct similar consensus clustering lattices that are representative of all the datasets. The proposed FCA-enhanced approach therefore appears to be a generic consensus clustering technique that does not depend on the applied clustering algorithm. This means that one can use a customized algorithm suited to the specific characteristics of each dataset group and consolidate the resulting clustering solutions of the involved groups by using FCA.
Conclusions
In this paper we introduced a novel consensus clustering technique which proposes a general approach to the combination of clustering algorithms with Formal Concept Analysis (FCA) for deriving representative clustering solutions from multiple gene expression matrices. This approach involves three distinct steps: (i) the studied microarray experiments are partitioned into groups of related datasets with respect to a predefined criterion, (ii) a consensus clustering algorithm is applied to each group of experiments separately, (iii) the clustering solutions produced by the different groups are pooled together and further analyzed by employing FCA. The performance of the proposed consensus clustering algorithm is evaluated on a test set of nine time series expression datasets obtained from a study examining the global cell-cycle control of gene expression in the fission yeast Schizosaccharomyces pombe. In addition, in step (ii) of the proposed approach two different consensus clustering algorithms (Integrative and PSO-based) are applied for the validation process. The presented experimental results demonstrate that the proposed FCA-enhanced clustering algorithm is a robust data integration technique able to produce a good-quality clustering solution that is representative of the whole set of experiments. In addition, the employment of FCA allows a subsequent data analysis to be performed, which provides useful insights into the biological role of genes contained in the same FCA concepts. Our future work will focus on a more exhaustive analysis of the composition of, and relationships between, the different FCA concepts. Moreover, our long-term aim is to further evaluate the generalisability of the FCA-enhanced consensus clustering technique by conducting experiments with other clustering algorithms and microarray datasets.
Implementation and availability
The cluster validation measures used and the Integrative consensus clustering algorithm have been implemented in C++. The PSO-based clustering algorithm has been implemented in Java; this implementation uses the publicly available open source machine learning software WEKA [39] for the particle initialization and for the gene assignment to the different clusters. Finally, FCA is performed by using publicly available tools [40].
Endnotes
^{a} The number of clusters k is initially identified by analyzing the quality of the clustering solutions generated on the involved datasets for a range of different numbers of clusters.
^{b} The velocity vectors are initialized to zero.
Declarations
Acknowledgements
Anna Hristoskova would like to thank the Special Research Fund of Ghent University for her PhD grant.
References
1. Rustici G, Mata J, Kivinen K, Lió P, Penkett CJ, Burns G, Hayles J, Brazma A, Nurse P, Bähler J: Periodic gene expression program of the fission yeast cell cycle. Nat Genet. 2004, 36 (8): 809-817. 10.1038/ng1377.
2. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, et al: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403 (6769): 503-511. 10.1038/35000501.
3. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531-537. 10.1126/science.286.5439.531.
4. Gilks WR, Tom BDM, Brazma A: Fusing microarray experiments with multivariate regression. Bioinformatics. 2005, 21 (suppl 2): 137-143.
5. Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling inter-study variation. Bioinformatics. 2003, 19 (suppl 1): 84-90. 10.1093/bioinformatics/btg1010.
6. Zhou XJ, Kao MCJ, Huang H, Wong A, Nunez-Iglesias J, Primig M, Aparicio OM, Finch CE, Morgan TE, Wong WH: Functional annotation and network reconstruction through cross-platform integration of microarray data. Nat Biotechnol. 2005, 23 (2): 238-243. 10.1038/nbt1058.
7. Garrett-Mayer E, Parmigiani G, Zhong X, Cope L, Gabrielson E: Cross-study validation and combined analysis of gene expression microarray data. Biostatistics. 2008, 9 (2): 333-354.
8. Filkov V, Skiena S: Integrating microarray data by consensus clustering. Int J Artif Intell Tools. 2004, 13 (4): 863-880. 10.1142/S0218213004001867.
9. Johnson E, Kargupta H: Collective, hierarchical clustering from distributed, heterogeneous data. Large-Scale Parallel KDD Syst. 1999, 1759: 221-244.
10. Topchy A, Jain AK, Punch W: Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell. 2005, 27 (12): 1866-1881.
11. Strehl A, Ghosh J: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J Mach Learn Res. 2003, 3: 583-617.
12. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc B. 1977, 39 (1): 1-38.
13. Kostadinova E, Boeva V, Lavesson N: Clustering of multiple microarray experiments using information integration. Inf Technol Bio- and Med Inform. 2011, 6865: 123-137. 10.1007/978-3-642-23208-4_12.
14. Boeva V, Hristoskova A, Tsiporkova E: Clustering of multiple DNA microarrays through combination of particle swarm intelligence and k-means. 6th International Conference on Computational Intelligence and Bioinformatics: Modelling, Identification, and Simulation. 2011, Pittsburgh, USA: ACTA Press, 32-38.
15. Hristoskova A, Boeva V, Tsiporkova E: An integrative clustering approach combining particle swarm optimization and formal concept analysis. Proceedings of Information Technology in Bio- and Medical Informatics. 2012, Vienna, Austria: Springer Berlin Heidelberg, 84-98.
16. MacQueen J: Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1: 21 June - 18 July 1965 and 27 December 1965 - 7 January 1966; California, USA. 1967, Berkeley, Calif.: University of California Press, 281-297.
17. Kaufman L, Rousseeuw PJ: Fitting groups in data: an introduction to cluster analysis. J Am Stat Ass. 1991, 86 (415): 830-832.
18. Kennedy J, Eberhart R: Particle swarm optimization. IEEE International Conference on Neural Networks, vol. 4. 1995, IEEE, 1942-1948.
19. Shi Y, Eberhart R: A modified particle swarm optimizer. IEEE International Conference on Evolutionary Computation. 1998, IEEE, 69-73.
20. Omran M, Engelbrecht AP, Salman A: Particle swarm optimization method for image clustering. Int J Pattern Recogn Artif Intell. 2005, 19 (3): 297-322. 10.1142/S0218001405004083.
21. Omran M, Salman A, Engelbrecht A: Image classification using particle swarm optimization. Proceedings of the 4th Asia-Pacific Conference on Simulated Evolution and Learning, vol. 1: 18-22 November 2002; Orchid Country Club, Singapore. 2002, Singapore: [Nanyang Technological University, School of Electrical & Electronic Engineering], 370-374.
22. Van der Merwe D, Engelbrecht A: Data clustering using particle swarm optimization. IEEE Congress on Evolutionary Computation, vol. 1. 2003, Canberra, Australia: IEEE, 215-220.
23. Xiao X, Dow ER, Eberhart R, Miled ZB, Oppelt RJ: Gene clustering using self-organizing maps and particle swarm optimization. 17th International Symposium on Parallel and Distributed Processing. 2003, Nice, France: IEEE, 1021.
24. Kuo RJ, Syu YJ, Chen ZY, Tien FC: Integration of particle swarm optimization and genetic algorithm for dynamic clustering. Inform Sci. 2012, 195: 124-140.
25. Ganter B, Stumme G, Wille R: Formal Concept Analysis: Foundations and Applications, vol. 3626. 2005, Berlin, Heidelberg: Springer.
26. Besson J, Robardet C, Boulicaut JF: Constraint-based mining of formal concepts in transactional data. Adv Knowl Discov Data Mining. 2004, 3056: 615-624.
27. Besson J, Robardet C, Boulicaut JF, Rome S: Constraint-based concept mining and its application to microarray data analysis. Intell Data Anal. 2005, 9 (1): 59-82.
28. Choi V, Huang Y, Lam V, Potter D, Laubenbacher R, Duca K: Using formal concept analysis for microarray data comparison. J Bioinform Comput Biol. 2008, 6 (1): 65. 10.1142/S021972000800328X.
29. Potter DP: A combinatorial approach to scientific exploration of gene expression data: an integrative method using formal concept analysis for the comparative analysis of microarray data. PhD thesis. Citeseer; 2005.
30. Kaytoue-Uberall M, Duplessis S, Napoli A: Using formal concept analysis for the extraction of groups of co-expressed genes. Model Computat Optimization Inf Syst Manage Sci. 2008, 14: 445-455.
31. Kaytoue M, Kuznetsov S, Napoli A, Duplessis S: Mining gene expression data with pattern structures in formal concept analysis. Inf Sci. 2011, 181 (10): 1989-2001. 10.1016/j.ins.2010.07.007.
32. Halkidi M, Batistakis Y, Vazirgiannis M: On clustering validation techniques. J Intell Inform Syst. 2001, 17 (2): 107-145.
33. Theodoridis S, Koutroubas K: Pattern Recognition. 1999, New York: Academic Press.
34. Tsiporkova E, Boeva V: Two-pass imputation algorithm for missing value estimation in gene expression time series. J Bioinform Comput Biol. 2007, 5 (5): 1005-1022. 10.1142/S0219720007003053.
35. Boeva V, Tsiporkova E: A multi-purpose time series data standardization method. Intell Syst Theory Pract. 2010, 299: 445-460. 10.1007/978-3-642-13428-9_22.
36. Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005, 21 (15): 3201-3212. 10.1093/bioinformatics/bti517.
37. Rousseeuw PJ: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987, 20: 53-65.
38. Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics. 2005, 21 (16): 3448-3449. 10.1093/bioinformatics/bti551.
39. Weka: data mining software in Java. [http://www.cs.waikato.ac.nz/ml/weka/]
40. Galicia: Galois lattice interactive constructor. [http://www.iro.umontreal.ca/galicia/features.html]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.