Cluster analysis, or clustering, is an unsupervised technique that aims to group a set of patterns into homogeneous groups, or clusters [4, 5]. Hierarchical Clustering (HC) is one of several available clustering techniques and seeks to build a hierarchy of clusters. It can be of two types: agglomerative, where each sample starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy, or divisive, where all samples start in one cluster and splits are performed recursively as one moves down the hierarchy [6]. Thus, HC aims to group similar objects into clusters, where the endpoint is a set of clusters such that each cluster is distinct from the others and the objects within each cluster are broadly similar to each other. HC can be performed either on a distance matrix or on raw data. Agglomerative HC starts by treating each observation as a separate cluster and then repeatedly executes the following two steps: (1) identify the two clusters that are closest together, and (2) merge these two most similar clusters. This process continues until all the clusters are merged together.
The main output of HC is a dendrogram, which shows the hierarchical relationships between clusters and the distances at which they merge. Many distance metrics have been developed, and the choice should be made based on theoretical considerations from the domain of study.
Next, it is necessary to determine how the distance between clusters is computed, i.e., the linkage criterion (e.g., single-linkage, complete-linkage, average-linkage). As with distance metrics, the choice of linkage criterion should be based on theoretical considerations from the application domain.
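As a minimal illustration of these two choices (metric and linkage), the SciPy sketch below runs agglomerative HC on synthetic data; the data, seed, and parameter values are illustrative assumptions, not taken from this work.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))  # 10 illustrative samples with 4 features

# The distance metric is a domain-driven choice (here: Euclidean).
d = pdist(X, metric="euclidean")

# The linkage criterion is likewise a domain-driven choice:
# "single", "complete", and "average" match the criteria named above.
Z = linkage(d, method="average")

# Build the dendrogram structure (no_plot avoids a matplotlib dependency).
tree = dendrogram(Z, no_plot=True)
```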
In non-fuzzy clustering (or hard clustering), data are divided into distinct clusters and each data point can belong to exactly one cluster. In fuzzy clustering, data points can potentially belong to multiple clusters. For example, in hard clustering, given some parameters, a “symptom” can be (in a mutually exclusive way) present or absent (red or blue), whereas, in fuzzy clustering, that “symptom” could (simultaneously) be red to some grade and blue to some other grade. In Fig. 2, a comparison between hard and fuzzy categorisation is shown; a toy numerical contrast is also sketched after the list below. The reader can refer to [7] for a recent comparison between hard and fuzzy clustering. In this work, we introduce a data integration methodology based on fuzzy concepts. In particular, we associate a dendrogram to a fuzzy equivalence relation (i.e., a Łukasiewicz-valued fuzzy similarity), so that a consensus matrix in a multi-view clustering, that is, the representative information of all dendrograms, can be obtained from multiple hierarchical agglomerations [8, 9]. The main steps of fuzzy agglomeration can be summarised as follows:
- Characterisation of membership functions;
- Computation of a fuzzy similarity matrix (or dendrogram) for all models, at a given time;
- Construction of a consensus matrix for all hierarchical agglomerations.
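To make the hard/fuzzy distinction above concrete, the toy membership matrices below contrast a one-hot hard assignment with graded fuzzy memberships; the numbers are purely illustrative and not from this work.

```python
import numpy as np

# Hard clustering: one-hot membership; each sample is in exactly one cluster.
hard = np.array([
    [1, 0],   # sample 1: "red" only
    [0, 1],   # sample 2: "blue" only
])

# Fuzzy clustering: graded membership in [0, 1]; a sample can be partly
# "red" and partly "blue" at the same time.
fuzzy = np.array([
    [0.8, 0.2],   # sample 1: mostly "red", slightly "blue"
    [0.3, 0.7],   # sample 2: mostly "blue"
])
```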
Membership functions
When dealing with clustering tasks, Fuzzy Logic (FL) makes it possible to obtain a soft clustering of the data instead of a hard one [10]. Specifically, data points can belong to more than one cluster simultaneously. The fundamental concept in FL, upon which all the subsequent theory is constructed, is the notion of a fuzzy set, a generalisation of a crisp set from classical set theory.
A fuzzy set generalises a crisp set by allowing its characteristic function, i.e., its membership function, to assume values in the interval [0,1] rather than in the set {0,1}. In this way, a given item belongs to the fuzzy set with a degree of truth ranging from not belonging at all (i.e., its membership function assumes the value 0) to belonging completely (i.e., the membership function assumes the value 1). In FL applications, fuzzy sets make it possible to represent qualitative (non-numeric) values (i.e., linguistic variables such as High, Medium, Low) for approximate reasoning, inference, or fuzzy control systems. Linguistic variables are represented by fuzzy sets through a transformation step called fuzzification, which is achieved by using different types of membership functions representing the degree of truth to which a given input sample belongs to a fuzzy set (see “Membership Functions” section in Supplementary Material).
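As an illustration of fuzzification, the sketch below maps a numeric feature onto degrees of membership in the linguistic terms Low/Medium/High using triangular membership functions; the shapes and breakpoints are our assumptions for illustration, not the exact choices of this work (those are in the Supplementary Material).

```python
import numpy as np

def triangular(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

x = np.array([0.1, 0.5, 0.9])          # feature values scaled to [0, 1]
low    = triangular(x, -0.5, 0.0, 0.5)
medium = triangular(x,  0.0, 0.5, 1.0)
high   = triangular(x,  0.5, 1.0, 1.5)
# Each value now has a degree of truth in [0, 1] for every linguistic term.
```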
Fuzzy similarity matrix
A measure of similarity or dissimilarity defines the resemblance between two samples or objects. Similarity measures are an important means for quantifying uncertain information. A fuzzy similarity measure describes the closeness among fuzzy sets and has been used to address problems in pattern recognition and cluster analysis.
A binary fuzzy relation that is reflexive, symmetric, and transitive is known as a similarity relation. Fuzzy similarity relations generalise equivalence relations from binary crisp relations to binary fuzzy relations. In detail, a fuzzy similarity relation can be used to group elements into crisp sets whose members are similar to each other to some specified grade; it is a generalisation of the classical equivalence relation, as described in the “Fuzzy Similarity” section in the Supplementary Material. In order to introduce fuzzy similarity, in the following we focus on the properties of the Łukasiewicz t-norm (tL) and the bi-residuum. In this way we obtain a fuzzy equivalence relation that can be used for building a dendrogram. For more details on the derivation of these results, see the “Fuzzy Similarity” section in the Supplementary Material.
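For intuition, a fuzzy similarity matrix can be sketched from fuzzified data using the bi-residuum of the Łukasiewicz t-norm, which for a, b ∈ [0,1] reduces to 1 − |a − b|. Aggregating the per-feature similarities with the minimum is one common choice that preserves tL-transitivity; the exact construction used in this work is given in the Supplementary Material, so the function below is a sketch under these assumptions.

```python
import numpy as np

def lukasiewicz_similarity(Y):
    """Y: (n_samples, n_features) array with entries in [0, 1] (fuzzified data).
    Per-feature similarity is the Lukasiewicz bi-residuum 1 - |a - b|;
    features are aggregated with the minimum (assumed aggregation rule)."""
    n = Y.shape[0]
    S = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = np.min(1.0 - np.abs(Y[i] - Y[j]))
    return S
```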
Dendrogram and consensus matrix
If a similarity relation is min-transitive (i.e., t = min), then it implies the existence of a dendrogram (see the “Dendrogram and Consensus Matrix” section in the Supplementary Material for details). The min-transitive closure of a relation matrix R can be easily computed, and the overall process is described in Algorithm 1.
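A possible NumPy implementation in the spirit of Algorithm 1 (the paper's exact pseudocode may differ) is to repeatedly combine R with its max-min composition until a fixed point is reached:

```python
import numpy as np

def min_transitive_closure(R):
    """R: square fuzzy relation matrix with entries in [0, 1]."""
    C = R.copy()
    while True:
        # Max-min composition: (C o C)[i, j] = max_k min(C[i, k], C[k, j]).
        comp = np.max(np.minimum(C[:, :, None], C[None, :, :]), axis=1)
        new = np.maximum(C, comp)   # keep the original entries as well
        if np.allclose(new, C):     # fixed point reached: closure found
            return new
        C = new
```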
The last ingredient needed to accomplish an agglomerative clustering is a dissimilarity relation. Here we consider the following result [11]:
Lemma 1
Let R be a similarity relation with elements R⟨x,y⟩ ∈ [0,1], and let D be a dissimilarity relation obtained from R by

$$ D(x,y) = 1 - R \langle x,y \rangle \qquad (1) $$

then D is ultrametric iff R is min-transitive.
In other words, we have a one-to-one correspondence between min-transitive similarity matrices and dendrograms, and between ultrametric dissimilarity matrices and dendrograms. Finally, once the dendrograms have been obtained, a consensus matrix, i.e., the representative information of all dendrograms, is obtained by combining the transitive closures (i.e., via the max-min operation) [11]. The overall approach is described in Algorithm 2. The overall workflow of the proposed approach is summarised in Fig. 3. In particular, for each omic data set Xi a fuzzification step is adopted to obtain the new data set Yi (see Supplementary Material). Subsequently, by adopting a fuzzy similarity measure, the similarity matrix Si is computed and, to guarantee the transitive closure of the matrix, a new matrix Ci is computed (see Algorithm 1). Finally, all the Ci matrices are collected to obtain the consensus matrix A and the overall final dendrogram (see Algorithm 2).
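One plausible reading of this consensus step is sketched below; the element-wise aggregator (the mean) is our assumption for illustration, since the exact combination rule of Algorithm 2 is given in the Supplementary Material. The final dendrogram then follows from Lemma 1: if A is min-transitive, D = 1 − A is ultrametric, so single linkage recovers the corresponding hierarchy exactly. The sketch reuses `min_transitive_closure` from the earlier example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def consensus_matrix(closures):
    """closures: list of (n, n) min-transitive similarity matrices C_i."""
    A = np.mean(closures, axis=0)     # element-wise aggregation (assumed)
    return min_transitive_closure(A)  # re-close; see the earlier sketch

def consensus_dendrogram(A):
    """By Lemma 1, D = 1 - A is ultrametric when A is min-transitive,
    so single linkage reproduces the corresponding dendrogram."""
    D = 1.0 - A
    np.fill_diagonal(D, 0.0)
    return linkage(squareform(D, checks=False), method="single")
```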
In Fig. 4, we show an example that summarises a realistic agglomeration result. In Fig. 4a-c we plot three input hierarchies, obtained on different datasets, that should be combined. In this case, four sequences of patients are considered, namely s1, s2, s3 and s4. In Fig. 4d, we show the final result obtained by agglomerating the dendrograms. We observe that the output hierarchy contains the clusters (s1,s2,s3) and (s1,s2,s3,s4) at different levels, and each of these clusters (e.g., (s1,s2,s3)) is repeated in at least two of the three input dendrograms. Moreover, it is worth stressing that the proposed approach, based on the agglomeration of dendrograms, can also be applied with commonly used metrics (e.g., the Euclidean distance). In Fig. 5, we show a comparison between the dendrograms obtained by using a Euclidean metric and a similarity-based approach (i.e., the Łukasiewicz t-norm), respectively. In this realistic example, we simulate three omic data sets with 10 rows (i.e., patients) and 100 columns (i.e., features). We split each dataset into two partitions (or clusters) such that the first 5 rows are random samples from a zero-mean normal distribution with variance 1 and the other 5 rows are samples from the same distribution with variance 0.5, obtaining a partial overlap between the two groups. We observe that both methods find two separated clusters, but the similarity-based approach in Fig. 5b yields a perfect separation of the source partitions.
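The simulation set-up just described can be sketched as follows. Only the data generation and the Euclidean baseline are shown; the seed and the average linkage are our assumptions, and the separation reported for Fig. 5 is the paper's result, not something this sketch guarantees.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)     # seed chosen for illustration
omics = []
for _ in range(3):                  # three simulated omic data sets
    top = rng.normal(0.0, 1.0, size=(5, 100))              # variance 1
    bottom = rng.normal(0.0, np.sqrt(0.5), size=(5, 100))  # variance 0.5
    omics.append(np.vstack([top, bottom]))                 # 10 x 100 each

# Euclidean baseline on one data set, analogous to Fig. 5a
# (average linkage assumed; the paper does not state the linkage used).
Z_euclidean = linkage(pdist(omics[0], metric="euclidean"), method="average")
```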