 Research
 Open access
 Published:
Data integration by fuzzy similaritybased hierarchical clustering
BMC Bioinformatics volume 21, Article number: 350 (2020)
Abstract
Background
High throughput methods, in biological and biomedical fields, acquire a large number of molecular parameters or omics data by a single experiment. Combining these omics data can significantly increase the capability for recovering finetuned structures or reducing the effects of experimental and biological noise in data.
Results
In this work we propose a multiview integration methodology (named FHClust) for identifying patient subgroups from different omics information (e.g., Gene Expression, Mirna Expression, Methylation). In particular, hierarchical structures of patient data are obtained in each omic (or view) and finally their topologies are merged by consensus matrix. One of the main aspects of this methodology, is the use of a measure of dissimilarity between sets of observations, by using an appropriate metric. For each view, a dendrogram is obtained by using a hierarchical clustering based on a fuzzy equivalence relation with Łukasiewicz valued fuzzy similarity. Finally, a consensus matrix, that is a representative information of all dendrograms, is formed by combining multiple hierarchical agglomerations by an approach based on transitive consensus matrix construction. Several experiments and comparisons are made on real data (e.g., Glioblastoma, Prostate Cancer) to assess the proposed approach.
Conclusions
Fuzzy logic allows us to introduce more flexible data agglomeration techniques. From the analysis of scientific literature, it appears to be the first time that a model based on fuzzy logic is used for the agglomeration of multiomic data. The results suggest that FHClust provides better prognostic value and clinical significance compared to the analysis of singleomic data alone and it is very competitive with respect to other techniques from literature.
Background
Nowadays, high throughput methods, in biological and biomedical fields, acquire a large number of molecular parameters by a single experiment [1]. In particular, such measured parameters are collected in “omics” datasets (e.g., genomics, transcriptomics, methylomics). Among multiple measured parameters, DNA genome sequence, RNA expression and DNA methylation are representative instances. For individually analysing such data, several methodologies have been introduced in literature, even though, recently, a number of studies pointed out the best performance coming from the integration of multiomics data. For instance, analysing each omic (or view in the machine learning jargon), set separately, fundamental patterns can be detected from data, however some finetuned structures, such as cancer subtypes, can be highlighted by both gene expression and DNA methylation information, so that multiomics analysis can reduce the effects of experimental and biological noise in data [2]. From literature, three kinds of integration methodologies emerge:

early integration, builds a single featurebased matrix by concatenating each omic dataset (i.e., view) and applies a singleomic analysis;

intermediate integration, builds a joint representation of data given the views;

late integration, each omic is analysed separately and the solutions are integrated.
In general, late integration methods, and in particular clustering, are preferred when the analysis combines continuous and discrete data together. For a review of integration approaches and their comparisons, the reader may refer to [3]. In this work, a multiview clustering methodology, named FHClust, is introduced (see Fig. 1 for its general schema) for identifying patient subgroups from different omics information or datasets (e.g., Gene Expression, Mirna Expression, Methylation). Specifically, for each omic dataset a fuzzybased hierarchical clustering approach is adopted and finally the results are merged by consensus matrix. The idea behind the proposed approach comes from observing that a hierarchical clustering dendrogram can be associated with a Łukasiewicz fuzzy similaritybased equivalence relation, so that a consensus matrix, that is the representative information of all dendrograms, is derived by combining multiple hierarchical agglomerations following an approach based on transitive consensus matrix construction.
Methods
Cluster analysis or clustering is an unsupervised technique that aims at agglomerating a set of patterns in homogeneous groups or clusters [4, 5]. Hierarchical Clustering (HC) is one of several different available techniques for clustering which seeks to build a hierarchy of clusters, and it can be of two types, namely agglomerative, where each sample starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy, or divisive, where all samples start in one cluster, and splits are performed recursively as one moves down the hierarchy [6]. Thus, HC aims at grouping similar objects into a cluster, and were the endpoint is a set of clusters where each cluster is distinct from each other, and the objects within each cluster are broadly similar to each other. HC can be performed either on a distance matrix or raw data. Agglomerative HC starts by treating each observation as a separate cluster, and it repeatedly executes the following two steps: (1) identifies the two clusters that are closest together, and (2) merges the two most similar clusters. This process continues until all the clusters are merged together.
The main output of HC is a dendrogram, which shows the hierarchical relationship between the clusters distances. Many distance metrics have been developed and the choice should be made based on theoretical concerns from the domain of study.
Later on, it is necessary to determine how the distance is computed (e.g., singlelinkage, completelinkage, averagelinkage). As with distance metrics, the choice of linkage criteria should be based on theoretical considerations from the application domain.
In nonfuzzy clustering (or hard clustering) data is divided into distinct clusters and each data point can only belong to exactly one cluster. In fuzzy clustering, data points can potentially belong to multiple clusters. For example, in hard clustering, given some parameters, a “symptom” can be (in a mutually exclusive way) present or absent (red or blue) whereas, in fuzzy clustering, that “symptom” could (simultaneously) be of some grade red and some other grade blue. In Fig. 2, a comparison between hard and fuzzy categorisation is shown. The reader can refer to [7] for a recent comparison between hard and fuzzy clustering. In this work, we introduce a data integration methodology based on fuzzy concepts. In particular, we associate a dendrogram to a fuzzy equivalence relation (i.e., Łukasiewicz valued fuzzy similarity), so that a consensus matrix in a multiview clustering, that is the representative information of all dendrograms, can be obtained from multiple hierarchical agglomerations [8, 9]. The main steps of fuzzy agglomeration can be summarised as follows:

Characterisation of membership functions;

Computation of a fuzzy similarity matrix (or dendrogram) for all models, at a given time;

Construction of a consensus matrix for all hierarchical agglomerations.
Membership functions
When dealing with clustering tasks, Fuzzy Logic (FL) permits to obtain a soft clustering instead of an hard clustering of data [10]. Specifically, data points can belong to more than one cluster simultaneously. The fundamental concept in FL, upon which all the subsequent theory is constructed, is the notion of fuzzy set, a generalisation of a crisp set from classical set theory.
A fuzzy set generalises a crisp set by allowing its characteristic function, i.e., its membership function, assuming values in the interval [0,1] rather than in the set {0,1}. In this way, a given item belongs to the fuzzy set with a degree of truth ranging from do not belong at all (i.e., its membership function assumes value 0) to completely belong (i.e., the membership function assumes value 1). In FL applications, fuzzy sets make it possible to represent qualitative (nonnumeric) values (i.e., linguistic variables such as High, Medium, Low) for approximate reasoning, inference or fuzzy control systems. Linguistic variables can be represented by fuzzy sets through a transformation step called fuzzification, and it is achieved by using different types of membership functions representing the degree of truth to which a given input sample belongs to a fuzzy set (see “Membership Functions” section in Supplementary Material).
Fuzzy similarity matrix
A measure of similarity or dissimilarity defines the resemblance between two samples or objects. Similarity measure is a significant means for measuring uncertain information. Fuzzy similarity measure is a measure that depicts the closeness among fuzzy sets and has been used for dealing issues of pattern recognition and clustering analysis.
A binary fuzzy relation that is reflexive, symmetric, and transitive is known as a similarity relation. Fuzzy similarity relations are the generalisation of equivalence relations, in binary crisp relations, to binary fuzzy relations. In details, a fuzzy similarity relation can be considered to effectively group elements into crisp sets whose members are similar to each other to some specified grade and it is a generalization of classical equivalence relation as described in “Fuzzy Similarity” section in Supplementary Material. In order to introduce the fuzzy similarity, in the following, we focus on the properties of the Łukasiewicz tnorm (t_{L}) and the biresiduum. In this way we obtain a fuzzy equivalence relation that can be used for building dendrogram. For more details in the derivation of these results see “Fuzzy Similarity” section in Supplementary Material.
Dendrogram and consensus matrix
If a similarity relation is mintransitive (i.e., t= min) then it implies the existence of the dendrogram (see “Dendrogram and Consensus Matrix” section in SupplementaryMaterial for details). The mintransitive closure of a relation matrix R can be easily computed and the overall process is described in Algorithm 1.
The last ingredient to accomplish an agglomerative clustering is a dissimilarity relation. Here we considered the following result [11]:
Lemma 1
Letting R be a similarity relation with the elements R〈x,y〉∈[0,1] and letting D be a dissimilarity relation, which is obtained from R by
then D is ultrametric iif R is mintransitive.
In other words, we have a onetoone correspondence between mintransitive similarity matrices and dendrogram and between ultrametric dissimilarity matrices and dendrograms. Finally, after the dendrograms have been obtained each time, a consensus matrix, i.e., the representative information of all dendrograms is obtained by combining the transitive closures (i.e., maxmin operation) [11]. The overall approach is described in Algorithm 2. The overall workflow of the proposed approach is summarised in Fig. 3. In particular, for each omic data set X_{i} a fuzzification step is adopted for obtaining the new data set Y_{i} (see Supplementary Material). Successively, adopting a fuzzy similarity measure the similarity matrix S_{i} is computed and to guarantee the transitive closure of the matrix a new matrix C_{i} is computed (see Algorithm 1). Finally, all the C_{i} matrices are collected for obtaining the consensus matrix A and the overall final dendrogram (see Algorithm 2).
In Fig. 4, we show an example that summarize a realistic agglomeration result. We plot in Figs. 4abc three input hierarchies obtained on datasets that should be combined. In this case, four sequences of patients are considered, namely s_{1},s_{2},s_{3} and s_{4}, respectively. In Fig. 4d, we show the final result by agglomerating dendrograms. We observe that the output hierarchy contains clusters (s_{1},s_{2},s_{3}) and (s_{1},s_{2},s_{3},s_{4}) at different levels and each of these clusters (e.g., (s_{1},s_{2},s_{3})) are repeated at least in two out of the three input dendrograms. Moreover, it is worth stressing that the proposed approach, based on the agglomeration of dendrograms, can also be applied with commonly used metrics (e.g., Euclidean distance). In Fig. 5, we show a comparison between the dendrograms obtained by using an Euclidean metric and a similarity based approach (i.e., Łukasiewicz tnorm), respectively. In this realistic example, we simulate three omic data sets with 10 rows (i.e., number of patients) and 100 columns (i.e., features). We split the single datasets in two partitions (or clusters) such that the first 5 rows are random samples from a standard normal distribution with variance 1 and the other 5 rows have the same distribution with variance 0.5, obtaining a sort of an overlap. We observe that both methods find two separated clusters, but the similarity based approach in Fig. 5b, permits to obtain a perfect separation of the source partitions.
Results and discussion
In the following we describe the behaviour of the proposed methodology on multiomics benchmark datasets.
Omics datasets
We consider 10 multiomics cancer datasets available from The Cancer Genome Atlas (TCGA) [3]. TCGA is a large multiomic repository of data on thousands of cancer patients. All datasets contain three omics: gene expression, miRNA expression and DNA methylation^{Footnote 1}. In Table 1 are summarised the main properties of the datasets, namely, Acute Myeloid Leukemia (AML), Breast Invasive Carcinoma (BIC), Colon Adenocarcinoma (COAD), Glioblastoma Multiforme (GBM), Kidney Renal Clear Cell Carcinoma (KIRC), Liver Hepatocellular Carcinoma (LIHC), Lung Squamous Cell Carcinoma (LUSC), Skim Cutaneous Melanoma (SKCM), Ovarian serous cystadenocarcinoma (OV), Sarcoma (SARC). The number of patients ranges from 170 for AML to 621 for BIC.
Multiview clustering algorithms
For validating the effectiveness of our model, we compared it against several categories of multiview clustering algorithms^{Footnote 2}:

Kmeans and Spectral Clustering techniques [3];

LRACluster [12]: It is a lowrank approximation based integrative probabilistic model to fast find the shared principal subspace across multiple data types;

PINS [13]: Perturbation clustering for data integration and disease subtyping (PINS) is able to address subtype discovery, as well as integration of multiple data types. The algorithm is built upon the resilience of patient connectivity and cluster ensembles to ensure robustness against noise and bias;

SNF [14]: Similarity network fusion (SNF) allows for discovery of disease subtypes through integration of several types of highthroughput data on a genomic scale. SNF creates a fused network of patients using a metric fusion technique and then partitions the data using spectral clustering. SNF appears to be the state of the art in this area and has proven to be very powerful. However, the unstable nature of kernelbased clustering makes the algorithm sensitive to small changes in molecular measurements or in its parameter settings.

MCCA [15]: Multi Canonical Correlation Analysis (MCCA), which extends the application of Canonical Correlation Analysis (CCA) to more than two views, is one of the most widely used dimension reduction method for finding linear relations between two or more multidimensional random variables.
Evaluation metrics
In order to assess the performance of each method, we adopt three evaluation metrics that are: the logrank test, the enrichment of clinical labels in the clusters and the methods runtime [3]. The logrank test assumes that if clusters of patients have significantly different survival, they are different in a biologically meaningful way. For the enrichment of clinical labels in clusters, six clinical labels are considered: gender, age at diagnosis, pathologic tumor, pathologic metastases, pathologic lymph nodes and pathologic stage. The four latter parameters are discrete pathological parameters, measuring the progression of the tumor, metastases and cancer in lymph nodes, and the total progression (pathologic stage). Enrichment for discrete parameters was calculated using the χ^{2} test for independence, and for numeric parameters using KruskalWallis test. Not all clinical parameters were available for all cancer types, so a total of 41 clinical parameters were available for testing. To derive a pvalue for the logrank test, the χ^{2} test for independence, the KruskalWallis test and the statistic for these three tests is assumed to have χ^{2} distribution [3].
Preprocessing
TCGA datasets were preprocessed as follows: patients and features with more than 20% missing values were removed, and missing values were imputed using knearest neighbor imputation. The sequence features were logtransformed. The 2000 features with highest variance from geneexpression and methylation omics were selected. In the miRNA omic, features with zero variance were filtered. All features were then normalized to have zero mean and standard deviation 1. For methylation, we selected the 5000 features with maximal variance in each dataset and also adopted the standard pipeline proposed in [16], whose procedure filters out the probes from the X and Y chromosomes or probes that are known to have common SNPs at the CpG site.
A further unsupervised variable selection step has been performed by using the Mean Decrease Gini [17] based on Random Forest [18]. The Mean Decrease in Gini is the average of a variable total decrease in node impurity, weighted by the proportion of samples reaching that node in each individual decision tree in the forest. This is effectively a measure of how important a variable is for estimating the value of the target variable across all of the trees that make up the forest. A higher Mean Decrease in Gini indicates higher variable importance, therefore the most important variables to the model is the highest in the plot with the largest Mean Decrease in Gini Values, conversely, the least important variable is the lowest in the plot with the smallest Mean Decrease in Gini values. By following this strategy, we cutoff all those variables whose importance is zero. The number of variable cutoff at each step is summarised in Table 1.
Experimental results
In the experiments, for all methods, the number of searched clusters is selected in the range [2−15]. To determine the number of clusters for a method we used the “elbow method”. To automatically pick out the optimal elbow rather than choose it manually, we used as approximation the second derivative of a vector v
In particular, we consider the index i that brings this expression to a maximum or minimum (depending on whether v increases or decreases). For all methods, we adhered to the guidelines for usage and parameter selection given by the developers. In some cases, where no information was provided by the authors, we devised parameter selection methods. We performed the same process pipeline used in [3] for evaluating the performance of our method. All methods were run on a 24 multicore Intel(R) Xeon(R) CPU E52620 v3 @ 2.40GHz with 64 GB RAM. In the following, the obtained experimental results are described.
Figure 6 shows the average performance for multiomics data and for each singleomic separately, across all cancer types, and Fig. 7 shows the performance on the different cancer datasets. All algorithms show quite similar performance in either differential survival or enriched clinical parameters. With respect to survival, our FHClust method achieved the overall best prognostic value (sum of −log10pvalues =15.77), while PINS (15.35) and MCCA (15.11) ranked, second and third, respectively.
In Table 2 the differential survival between clusters (mean −log10 of logrank’s test pvalue) are reported. Spectral achieved the highest total number of significant clinical parameters, with 13 parameters. FHClust, along with LRAcluster and Kmeans placed themselves second with 11 parameters. SNF achieved the third position with 10 parameters.
With respect to survival (Table 3), FHClust outperformed its competitors achieving 16 parameters. MCCA, PINS and SNF have achieved good results with 15, 15 and 14 enriched parameters, respectively.
We also counted the number of datasets for which a method solution obtains significantly different survival. These results are reported in Table 4. All methods that were developed for multiomics data had at least four cancer types with significantly different survival. In this case, FHClust and PINS had 5 different cancer subtypes for which its clustering had significantly different prognosis. FHClust, Spectral clustering and MCCA had enrichment in 8 cancer types.
On average, FHClust, PINS and MCCA had better prognostic value, but found less enriched clinical labels as compared to spectral clustering method.
The number of clusters found for each dataset are presented in Table 5, ranging from 2 to 4. Because of the good methods performance in the previous analysis, partitioning the data into a relatively high number of clusters could indicate that clustering cancer patients into more clusters improves prognostic value and clinical significance.
Concerning with methods computational burden, their run times are reported in Table 6. FHClust takes, on average, 110 seconds per dataset, while Spectral and SNF got lower timing. The worst method takes roughly 18 minutes per dataset (see Fig. 8).
Finally, Fig. 9 shows the benchmarked methods performance for singleomic data, moreover, for each dataset and method, the single omic that gave the best results for survival and clinical enrichment are also shown. These results suggest that FHClust provides better prognostic value and clinical significance on multiomics data compared to the analysis of singleomic data used separately. Nevertheless, the interested reader may refer to the supplementary material for details on additional results concerning singleomics. We also stress that the proposed method, differently from other methods, such as SNF, does not need any hyperparameter tuning. Moreover, clustering is embedded in the data integration (and vice versa), and the use of fuzzy concepts (i.e., tnorms), from one hand, permits to obtain a generalisation of the clustering approaches whereas, on the other hand, gives the possibility to apply an inference system (e.g., Mamdani) for a quantitative and qualitative measure (e.g., “High”, “Medium”, “Low”) in cancer risk assessment.
Conclusions
In this work, we proposed a multiview clustering methodology for identifying patient subgroups from different omics data. In biological and biomedical fields, combining these omics data can significantly increase data mining capabilities. One of the main aspects of this methodology is the use of a measure of dissimilarity between sets of observations, by using an appropriate metric and a consensus matrix, that is a representative agglomerate information of all the dendrograms. As emerged from the analysis of the scientific literature, to the best of our knowledge our work concerns for the first time a model based on fuzzy logic used for the agglomeration of multiomic data. The use of fuzzy logic allows us to introduce more flexible data mining features also related to approximate reasoning. Several experiments and comparisons have been made on real data (e.g., Glioblastoma, Prostate Cancer) to assess the proposed methodology. The results suggest that FHClust provides better prognostic value and clinical significance compared to analysis of singleomic data alone. Fuzzy Logic concepts, and in particular membership functions, permits to develop a fuzzy inference model (i.e., Mamdani, Fuzzy Cognitive Maps) for easily obtaining a model for a quantitative and qualitative risk assessment of the cancer. The model, based on approximate reasoning, can be particularly useful for embedded devices.
In future work, it could be possible to improve results for multiomics analysis, in a number of ways. For instance, more accurate feature selection[19] algorithms could be adopted for improving the overall performance. On one hand, the integration of labelled data could improve the feature selection step. On the other hand, some specific feature extraction strategies could be adopted, indeed approaches based on the signal analysis of gene expression data (e.g., nonlinear Principal Component Analysis, Compressive Sensing), could possibly further improve the performance [20, 21]. In future, it is possible to foresee a different weight for each omic data, in order to obtain a more robust similarity, and parametric similarity measures can be adopted (e.g., uninorm) for generalizing the concept of AND and OR connections between clusters.
Availability of data and materials
Code and data of the proposed approach are available on: MultiOmicsCancerBenchmarkGitHubrepository .
Notes
Abbreviations
 FHClust:

Fuzzyhierarchical CLUSTering
 DNA:

DeoxyriboNucleic acid
 RNA:

RiboNucleic acid
 Hierarchical Clustering:

HC
 Crisp HC:

Crisp hierarchical clustering
 FL:

Fuzzy logic
 TCGA:

The cancer genome atlas
 AML:

Acute myeloid leukemia
 BIC:

Breast invasive carcinoma
 COAD:

Colon adenocarcinoma
 GBM:

Glioblastoma multiforme
 KIRC:

Kidney renal clear cell carcinoma
 LIHC:

Liver hepatocellular carcinoma
 LUSC:

Lung squamous cell carcinoma
 SKCM:

Skim Cutaneous Melanoma
 OV:

Ovarian serous cystadenocarcinoma
 SARC:

Sarcoma
 PINS:

Perturbation Clustering for data INtegration and disease Subtyping
 LRACluster:

Low rank approximation based multiomics data clustering
 SNF:

Similarity Network Fusion
 MCCA:

Multi Canonical Correlation Analysis
References
Camastra F, Di Taranto MD, Staiano A. Statistical and computational methods for genetic diseases: An overview. Comput Math Meth Med. 2015; 2015(Article ID 954598):1–8.
Serra A, Fratello M, Fortino V, Raiconi G, Tagliaferri R, Greco D. Mvda: a multiview genomic data integration methodology. BMC Bioinformatics. 2015; 16(261):1–13.
Rappoport N, Shamir R. Multiomic and multiview clustering algorithms: review and cancer benchmark. Nucleic Acids Res. 2018; 46(20):10546–62.
Reddy CK, Aggarwal CC. Data Clustering. Boca Raton: Chapman and Hall/CRC; 2016.
Camastra F, Ciaramella A, Son LH, Riccio A, Staiano A. Fuzzy similaritybased hierarchical clustering for atmospheric pollutants prediction. LNCS. 2019; 11291:123–33.
Ciaramella A, Staiano A. On the role of clustering and visualization techniques in gene microarray data. Algorithms. 2019; 12(16):123.
Bora DJ, Gupta AK. Int J Emerg Trends Technol Comput Sci. 2014; 10(2):108–13.
Napolitano F, Pinelli M, Raiconi G, Tagliaferri R, Ciaramella A, Staiano A, Miele G. Clustering and visualization approaches for human cell cycle gene expression data analysis. Int J Approx Reason. 2008; 47:70–84.
Ciaramella A, Cocozza S, Iorio F, Miele G, Napolitano F, Pinelli M, Raiconi G, Tagliaferri R. Interactive data analysis and clustering of genomic data. Neural Netw. 2008; 21:368–78.
Turunen E. Mathematics Behind Fuzzy Logic. Heidelberg: Advances in Soft Computing, SpringerVerlag; 1999.
Mirzaei A, Rahmati M. Hierarchicalclusteringcombination scheme based on fuzzysimilarity relations. IEEE Trans Fuzzy Syst. 2010; 18(1):27–39.
Wu D, Wang D, Zhang MQ, Gu J. Fast dimension reduction and integrative clustering of multiomics data using lowrank approximation: Application to cancer molecular classification. BMC Genomics. 2015; 16(1):1–10.
Nguyen T, Tagett R, Diaz D, Draghici S. A novel approach for data integration and disease subtyping. Genome Res. 2017; 27(12):2025–39.
Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, HaibeKains B, A. G. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014; 11:333–7.
Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol. 2009; 8(1):1–27.
Maksimovic J, Phipson B, Oshlack A. A crosspackage bioconductor workflow for analysing methylation array data. F1000Research. 2016; 5:1281.
Han H, Guo X, Yu H. Variable selection using mean decrease accuracy and mean decrease gini based on random forest. In: 2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS). Beijing: IEEE: 2016. p. 219–224.
Staiano A, Di Taranto MD, Bloise E, D’Agostino MN, D’Angelo A, Marotta G, Gentile M, Jossa F, Iannuzzi A, Rubba P, et al. Investigation of single nucleotide polymorphisms associated to familial combined hyperlipidemia with random forests. In: Neural Nets and Surroundings vol. 19. Berlin, Heidelberg: Springer: 2013. p. 169–78.
Nardone D, Ciaramella A, Staiano A. A sparsemodeling based approach for class specific feature selection. PeerJ Comput Sci. 2019; 5:237.
Ciaramella A, Gianfico M, Giunta G. Compressive sampling and adaptive dictionary learning for the packet loss recovery in audio multimedia streaming. Multimed Tools Appl. 2016; 75(24):17375–92.
Ciaramella A, Giunta G. Packet loss recovery in audio multimedia streaming by using compressive sensing. IET Communications. 2016; 10(4):387–92.
Funding
Publication costs are funded by a grant from the Dipartimento di Scienze e Tecnologie Università degli Studi di Napoli “Parthenope” Tecniche di Machine Learning e Soft Computing per l’elaborazione di dati multivariati, SOFTMULAN, PI: Ciaramella).
Author information
Authors and Affiliations
Contributions
AC originally designed the methodology. AC and DN worked on the developing of the method and the design of the experiments. AC, DN and AS contributed for interpreting and for analysing the results. All author(s) contributed for writing the manuscript, read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
No ethics approval was required for the study.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Supplementary Material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ciaramella, A., Nardone, D. & Staiano, A. Data integration by fuzzy similaritybased hierarchical clustering. BMC Bioinformatics 21 (Suppl 10), 350 (2020). https://doi.org/10.1186/s12859020035676
Published:
DOI: https://doi.org/10.1186/s12859020035676