Partitioning algorithms
Three partitioning algorithms are commonly used for dividing data objects into k disjoint clusters [16]: k-means, k-medians and k-medoids clustering. All three methods start by initializing a set of k cluster centres, where k is specified in advance. Each object in the dataset is then assigned to the cluster whose centre is nearest, and the cluster centres are recomputed. This process is repeated until the clusters stabilize, i.e. no further object reassignments take place. The EM algorithm in [12] is commonly used for this purpose, i.e. to find an optimal partitioning into k groups. The three partitioning methods differ in how the cluster centre is defined. In k-means, the cluster centre is the mean data vector averaged over all objects in the cluster. Instead of the mean, k-medians takes the median in each dimension of the data vector. Finally, in k-medoids [17], which is a robust version of k-means, the cluster centre is the object with the smallest sum of distances to the other objects in the cluster, i.e. the most centrally located point in the cluster.
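To make the three centre definitions concrete, the following minimal Python sketch (illustrative only; the toy data and function names are not from [16, 17]) computes each centre type for a single cluster:

```python
import numpy as np

def mean_centre(X):
    # k-means: centre is the per-dimension mean of the cluster members.
    return X.mean(axis=0)

def median_centre(X):
    # k-medians: centre is the per-dimension median.
    return np.median(X, axis=0)

def medoid_centre(X):
    # k-medoids: centre is the member with the smallest sum of distances
    # to all other members, i.e. the most centrally located point.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return X[dists.sum(axis=1).argmin()]

cluster = np.array([[1.0, 2.0], [2.0, 1.0], [9.0, 9.0]])
print(mean_centre(cluster), median_centre(cluster), medoid_centre(cluster))
```

Note how the outlier at (9, 9) pulls the mean centre away from the bulk of the cluster, while the medoid remains an actual data point, which is what makes k-medoids the robust variant.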
Particle swarm optimization
Particle swarm optimization (PSO) is an evolutionary computation method introduced in [18]. To find an optimal or near-optimal solution to a problem, PSO iteratively updates the current generation of particles (each particle is a candidate solution) using information about the best solution found by each individual particle and by the population as a whole.
The hybrid algorithm proposed in [14] combines k-means and PSO for deriving a clustering result from a group of n related microarray datasets $M_1, M_2, \ldots, M_n$. Each dataset $M_i$ contains the gene expression levels of m genes in $n_i$ different experimental conditions or time points. In this context, each matrix $M_i$ is used to generate k cluster centres, which are considered to represent a particle, i.e. the particle is treated as a set of k points in an $n_i$-dimensional space. The final (optimal) clustering solution is found by updating the particles using information about the best clustering solution obtained by each data matrix and by the entire set of matrices.
Assume that the i-th particle is initialized with a set of k cluster centres $X_i = \{x_{i1}, \ldots, x_{ik}\}$ and a set of velocity vectors $V_i = \{v_{i1}, \ldots, v_{ik}\}$ using gene expression matrix $M_i$. Thus each cluster centre is a vector $x_{ij} = (x_{ij1}, \ldots, x_{ijn_i})$ and each velocity vector is a vector $v_{ij} = (v_{ij1}, \ldots, v_{ijn_i})$, i.e. each particle i is a matrix (or a set of points) in the $k \times n_i$-dimensional space.
Next, assume that $P_g = \{P_{g1}, P_{g2}, \ldots, P_{gk}\}$ is the set of cluster centres in an $n_g$-dimensional space representing the best clustering solution found so far within the entire set of matrices, and $P_i = \{P_{i1}, P_{i2}, \ldots, P_{ik}\}$ is the set of centroids of the best solution discovered so far by the corresponding matrix $M_i$. The update equation for the d-th dimension of the j-th velocity vector of the i-th particle is defined as follows:
$$v_{ijd} \leftarrow w\, v_{ijd} + c_1 \varphi_1 \left(P_{ijd} - x_{ijd}\right) + c_2 \varphi_2 \left(\hat{P}_{gjd} - x_{ijd}\right) \qquad (1)$$
where $i = 1, \ldots, n$; $j = 1, \ldots, k$; $d = 1, \ldots, n_i$ and
$$\hat{P}_{gjd} = \begin{cases} P_{gjd}, & \text{if } d \le n_g \\ \dfrac{1}{n_g} \sum_{l=1}^{n_g} P_{gjl}, & \text{otherwise} \end{cases} \qquad (2)$$
The variables $\varphi_1$ and $\varphi_2$ are uniformly distributed random numbers in the range [0,1], $c_1$ and $c_2$ are called acceleration constants, and w is the inertia weight defined in [19]. The first part of Equation (1) represents the inertia of the previous velocity, the second part is the cognition part reflecting the personal experience of the particle, and the third part represents the cooperation among particles and is therefore named the social component. The acceleration constants $c_1$, $c_2$ and the inertia weight w are predefined by the user. Note that the cognition part in the above equation has a modified interpretation: it represents the private 'thinking' (opinion) of the particle based on its own source of information (dataset). Due to this, we adapted the social part (see equation (2)), since each particle matrix may have a different number of columns ($n_i$), reflecting the different number of experimental points in each dataset. It was demonstrated in [19] that when w is in the range [0.9,1.2] PSO has the best chance of finding the global optimum within a reasonable number of iterations. Furthermore, w=0.72 and $c_1 = c_2 = 1.49$ were found in [20] to ensure good convergence.
The clustering algorithm combining PSO and k-means can be summarized as follows:
1. Initialize each particle with k cluster centres obtained by applying the k-means algorithm to the corresponding data matrix.
2. Initialize the personal best clustering solution of each matrix with the corresponding clustering solution found in Step 1.
3. for iteration = 1 to max-iteration do
   (a) for i = 1 to n do (i.e. for all datasets)
       i. for j = 1 to m do (i.e. for all genes in the current dataset)
          A. Calculate the distance of gene $g_j$ to all cluster centres.
          B. Assign $g_j$ to the cluster whose centre is nearest to $g_j$.
       ii. end for
       iii. Calculate the fitness function for the clustering solution $C_i$.
       iv. Update the personal best clustering solution $P_i$.
   (b) end for
   (c) Find the global best solution $P_g$.
   (d) Update the cluster centres according to the velocity update formula in equation (1).
4. end for
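A compact sketch of this loop in Python follows, reusing update_velocity from the previous sketch. The helpers kmeans_centres, assign_to_nearest and the WCSS-based fitness are illustrative stand-ins, not the paper's exact choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_centres(M, k):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(M).cluster_centers_

def assign_to_nearest(M, centres):
    # Steps A-B: each gene profile (row of M) goes to its nearest centre.
    return np.linalg.norm(M[:, None, :] - centres[None, :, :], axis=2).argmin(axis=1)

def fitness(M, centres, labels):
    # Step iii (stand-in): negative within-cluster sum of squared distances.
    return -sum(((M[labels == j] - c) ** 2).sum() for j, c in enumerate(centres))

def pso_kmeans(matrices, k, max_iteration=100):
    particles = [kmeans_centres(M, k) for M in matrices]     # step 1
    velocities = [np.zeros_like(x) for x in particles]
    p_best = [x.copy() for x in particles]                   # step 2
    best_score = [fitness(M, p, assign_to_nearest(M, p))
                  for M, p in zip(matrices, p_best)]

    for _ in range(max_iteration):                           # step 3
        for i, M in enumerate(matrices):                     # step (a)
            labels = assign_to_nearest(M, particles[i])      # steps A-B
            score = fitness(M, particles[i], labels)         # step iii
            if score > best_score[i]:                        # step iv
                p_best[i], best_score[i] = particles[i].copy(), score
        g = int(np.argmax(best_score))                       # step (c)
        for i in range(len(matrices)):                       # step (d)
            velocities[i] = update_velocity(
                velocities[i], particles[i], p_best[i], p_best[g])
            particles[i] = particles[i] + velocities[i]
    return p_best
```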
The PSO-based clustering algorithm was first introduced in [21], where it was shown to outperform k-means and a few other state-of-the-art clustering algorithms. In this method, each particle represents a possible set of k cluster centroids. The authors of [22] hybridized the approach in [21] with the k-means algorithm for clustering general datasets: a single particle of the swarm is initialized with the result of the k-means algorithm, while the rest of the swarm is initialized randomly. In [23], a new approach based on the combination of PSO and Self-Organizing Maps is proposed and applied to clustering gene expression data, with promising results. Further, the study in [24] considers a dynamic clustering approach based on PSO and a genetic algorithm. The main advantage of this algorithm is that it can automatically determine the optimal number of clusters and simultaneously cluster the dataset with minimal user interference. The downside of all the foregoing approaches is that they are not suitable for consolidating multiple partitions, as the clustering analysis is based on a single expression matrix.
Formal concept analysis
Formal Concept Analysis (FCA) [25] is a mathematical formalism for deriving a concept lattice from a formal context constituted of a set of objects O, a set of attributes A, and a binary relation $I \subseteq O \times A$. The context is described as a table whose rows correspond to objects and whose columns correspond to attributes or properties; a cross in a table cell means that "the object possesses the property". FCA is used for a number of purposes, among which knowledge formalization and acquisition, ontology design, and data mining.
The concept lattice is composed of formal concepts, or simply concepts, organized into a hierarchy by a partial ordering (a subsumption relation allowing concepts to be compared). Intuitively, a concept is a pair (X,Y), where $X \subseteq O$, $Y \subseteq A$, and X is the maximal set of objects sharing all attributes in Y, and vice versa. The set X is called the extent and the set Y the intent of the concept (X,Y). The subsumption (or subconcept-superconcept) relation between concepts is defined as follows:
$$(X_1, Y_1) \preceq (X_2, Y_2) \iff X_1 \subseteq X_2 \quad (\text{equivalently, } Y_2 \subseteq Y_1) \qquad (3)$$
Relying on this subsumption relation $\preceq$, the set of all concepts extracted from a context is organized within a complete lattice, called the concept lattice. This means that any set of concepts has both a unique smallest common superconcept and a unique largest common subconcept.
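For intuition, the following toy Python sketch (the context and its gene/cluster names are invented for illustration) enumerates all formal concepts of a small context by computing the closure of every attribute subset:

```python
from itertools import combinations

objects = ["g1", "g2", "g3"]
attributes = ["C1", "C2", "C3"]
incidence = {("g1", "C1"), ("g1", "C2"), ("g2", "C2"),
             ("g3", "C2"), ("g3", "C3")}

def extent(attrs):
    # Objects possessing every attribute in attrs.
    return {o for o in objects if all((o, a) in incidence for a in attrs)}

def intent(objs):
    # Attributes shared by every object in objs.
    return {a for a in attributes if all((o, a) in incidence for o in objs)}

# Every concept's extent is the extent of some attribute subset, so closing
# all attribute subsets enumerates the full set of concepts (naive method).
concepts = set()
for r in range(len(attributes) + 1):
    for attrs in combinations(attributes, r):
        X = extent(set(attrs))
        concepts.add((frozenset(X), frozenset(intent(X))))

for X, Y in sorted(concepts, key=lambda c: len(c[0])):
    print(set(X) or "{}", set(Y) or "{}")
```

Ordering these concepts by extent inclusion, as in equation (3), yields exactly the concept lattice of the toy context.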
The FCA or concept lattice approach has been applied for extracting local patterns from microarray data [26, 27] and for performing microarray data comparison [28, 29]. For example, the FCA method proposed in [28] builds a concept lattice from the experimental data together with additional biological information. Each vertex of the lattice corresponds to a subset of genes that are grouped together according to their expression values and some biological information related to gene function. It is assumed that the lattice structure of the gene sets may reflect biological relationships in the dataset. In [30], an FCA-based method is proposed for extracting groups or classes of co-expressed genes: a concept lattice is constructed in which each concept represents a set of genes co-expressed in a number of situations. A serious drawback of the method is that the expression matrix is transformed into a binary table (the input for the FCA step), which may introduce biases or information loss. The authors therefore propose and compare two further FCA-based methods for mining gene expression data and show that they are equivalent [31]. The first relies on interordinal scaling, encoding all possible intervals of attribute values in a formal context that is processed with classical FCA algorithms. The second relies on pattern structures without a prior transformation, and is shown to be more computationally efficient and to provide more readable results. Notice that all the mentioned FCA-based methods focus solely on optimizing the clustering of a single expression matrix and are consequently not suited for the consolidation of multiple partitions.
FCA-enhanced consensus clustering algorithm
The problem of deriving clustering results from a set of gene expression matrices can be approached in two different ways: 1) the information contained in the different datasets may be combined at the level of expression (or similarity) matrices and the combined data clustered afterwards; 2) given multiple clustering solutions, one per dataset, a consensus (combined) clustering may be sought. In this section, a general FCA-enhanced consensus clustering algorithm for deriving a clustering result from multiple microarray datasets is proposed, which adopts the second approach.
Assume that a particular biological phenomenon is monitored in n different high-throughput experiments. Each experiment i (i=1,2,…,n) measures the gene expression levels of $m_i$ genes in $n_i$ different experimental conditions or time points. Thus, a set of n different data matrices $M_1, M_2, \ldots, M_n$ will be produced, one per experiment.
The FCA-enhanced consensus clustering consists of three distinctive steps, shown in Figure 1: 1) the expression datasets are divided into several smaller groups using some predefined criterion; 2) a consensus clustering algorithm (e.g. Integrative, PSO-based or other) is applied to each group of datasets separately, which produces a list of different clustering solutions, one per group; 3) these clustering solutions are transformed into a single clustering result by employing FCA.
In contrast to the consensus clustering algorithms discussed in the foregoing sections, where a partitioning algorithm (e.g. Integrative or PSO-based clustering) is applied to the entire set of experiments in order to produce the final clustering solution, the algorithm proposed herein initially divides the available microarray datasets into groups of related (similar) experiments with respect to a predefined criterion. The rationale is that closely related experiments produce a more accurate and robust clustering solution. The selected consensus clustering algorithm is thus applied to each group of experiments separately, producing a list of different clustering solutions, one per group. Subsequently, these solutions are pooled together and further analysed by employing FCA, which allows valuable insights to be extracted from the data and a gene partition to be generated over the whole experimental compendium. FCA produces a concept lattice in which each concept represents a subset of genes that belong to a number of clusters. The different concepts compose the final disjoint clustering partition.
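In outline, the pipeline can be expressed as the following Python skeleton; the three callables are placeholders for the grouping criterion, the chosen consensus algorithm and the FCA analysis detailed in the subsections below:

```python
def fca_enhanced_consensus(matrices, group_datasets, consensus_cluster, fca_partition):
    groups = group_datasets(matrices)                           # step 1
    solutions = [consensus_cluster(group) for group in groups]  # step 2
    return fca_partition(solutions)                             # step 3
```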
The proposed FCA-enhanced consensus clustering approach has the following characteristics:
1. the clustering uses all the data by allowing each group of related experiments to have a potentially different set of genes, i.e. the total set of studied genes is not restricted to those contained in all datasets;
2. it is better tuned to each experimental condition by identifying the initial number of clusters for each group of related experiments separately, depending on the number, composition and quality of the gene profiles;
3. the problem with ties is avoided (i.e. the case when a gene is randomly assigned to a cluster because it belongs to more than one cluster) by employing FCA to analyse all the partitioning results together and find the final clustering solution representative of the whole experimental compendium.
The distinctive steps of the FCA-enhanced clustering algorithm, visualized in Figure 1, are explained in detail below.
Initialization step
Let us consider the aforementioned expression matrices $M_1, M_2, \ldots, M_n$, monitoring N different genes in total. The initialization step is the most variable part of the algorithm since it closely depends on the concrete clustering algorithm employed.
The available gene expression matrices are divided into r groups of related (similar) datasets with respect to some predefined criterion, e.g. the synchronization method used or the expression similarity between the matrices. For each group, the set of studied genes needs to be restricted to those contained in all datasets of the group, i.e. to the overlapping genes found across all datasets of the group. Then the number of cluster centres is identified for each group of experiments i (i=1,2,…,r) separately. As discussed in [32, 33], this can be performed by running k-means or another clustering algorithm on each data matrix for a range of different numbers of clusters. Subsequently, the quality of the obtained clustering solutions needs to be assessed in order to identify the clustering scheme that best fits the datasets in question. Some commonly used validation measures able to identify the best clustering scheme are the Silhouette Index and Connectivity, presented in the Cluster validation measures section. Finally, the prevailing number of clusters within the given group of experiments is selected as representative for the whole group.
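A hedged sketch of this per-group model selection, using scikit-learn's k-means and Silhouette score (the candidate range of k values is an arbitrary choice for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(M, k_range=range(2, 11)):
    # Run k-means for each candidate k and keep the best-scoring scheme.
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(M)
        scores[k] = silhouette_score(M, labels)
    return max(scores, key=scores.get)

def group_k(matrices):
    # Representative k for a group: the prevailing (modal) k over its matrices.
    ks = [best_k(M) for M in matrices]
    return max(set(ks), key=ks.count)
```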
Clustering step
The selected consensus clustering algorithm (e.g. Integrative, PSO-based or other) is applied to each group of related experiments i (i=1,2,…,r) separately. This generates a list of r different clustering solutions, one per group. As a result, $K$ ($K = k_1 + \cdots + k_r$) different clusters are produced across the groups. This clustering solution is disjoint in terms of the gene expression profiles produced in the different experiments. However, it is not disjoint in terms of the participating genes, i.e. there will be genes that belong to more than one cluster.
FCA-based analysis step
As discussed above, the N studied genes are grouped during the Clustering step into K clusters that are not guaranteed to be disjoint. This overlapping partition is further analysed and refined into a disjoint one by applying FCA. As mentioned above, FCA is a principled way of automatically deriving a hierarchical conceptual structure from a collection of objects and their properties. The approach takes as input a matrix (referred to as the formal context) specifying a set of objects and their properties, called attributes. In our case, the formal context consists of the set G of the N studied genes (objects), the set of clusters $C = \{C_1, C_2, \ldots, C_K\}$ produced by the Clustering step (attributes), and an indication of which genes belong to which clusters. Thus, the context is described as a matrix whose rows correspond to the genes and whose columns correspond to the clusters, with a value of 1 in cell (i, j) whenever gene i belongs to cluster $C_j$. A formal concept for this context is then defined to be a pair (X, Y) such that:
- $X \subseteq G$, $Y \subseteq C$, and every gene in X belongs to every cluster in Y;
- for every gene in G that is not in X, there is a cluster in Y that does not contain that gene;
- for every cluster in C that is not in Y, there is a gene in X that does not belong to that cluster.
The family of these concepts obeys the mathematical axioms defining a concept lattice. The constructed lattice consists of concepts, each of which represents a subset of genes belonging to a number of clusters. The set of all concepts partitions the genes into a set of disjoint clusters.
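Under this reading, the final disjoint partition can be obtained by grouping genes whose intents coincide, i.e. genes belonging to exactly the same set of clusters form the extent of one concept. A minimal sketch, reusing the hypothetical build_context from the Clustering step:

```python
from collections import defaultdict

def disjoint_partition(context):
    # Genes sharing an identical set of cluster memberships (the same
    # intent) are merged into one extent, yielding disjoint clusters.
    groups = defaultdict(list)
    for gene, clusters in context.items():
        groups[frozenset(clusters)].append(gene)
    return list(groups.values())
```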