DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data

Mrukwa, Grzegorz; Polanska, Joanna

doi:10.1186/s12859-022-05093-z

Software
Open access
Published: 12 December 2022

DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data

BMC Bioinformatics volume 23, Article number: 538 (2022) Cite this article

2186 Accesses
5 Citations
5 Altmetric
Metrics details

Abstract

Background

Investigating molecular heterogeneity provides insights into tumour origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible—therefore, automated unsupervised learning approaches are utilised for discovering tissue heterogeneity. However, automated analyses require experience setting the algorithms’ hyperparameters and expert knowledge about the analysed biological processes. Moreover, feature engineering is needed to obtain valuable results because of the numerous features measured.

Results

We propose DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for segmenting high-dimensional datasets. The algorithm is compared to the optional solutions (regular k-means, spatial and spectral approaches) combined with different feature engineering techniques (None, PCA, EXIMS, UMAP, Neural Ions). Three quality indices: Dice Index, Rand Index and EXIMS score, focusing on the overall composition of the clustering, coverage of the tumour region and spatial cluster consistency, are used to assess the quality of unsupervised analyses. Algorithms were validated on mass spectrometry imaging (MSI) datasets—2D human cancer tissue samples and 3D mouse kidney images. DiviK algorithm performed the best among the four clustering algorithms compared (overall quality score 1.24, 0.58 and 162 for d(0, 0, 0), d(1, 1, 1) and the sum of ranks, respectively), with spectral clustering being mostly second. Feature engineering techniques impact the overall clustering results less than the algorithms themselves (partial \(\eta ^2\) effect size: 0.141 versus 0.345, Kendall’s concordance index: 0.424 versus 0.138 for d(0, 0, 0)).

Conclusions

DiviK could be the default choice in the exploration of MSI data. Thanks to its unique, GMM-based local optimisation of the feature space and deglomerative schema, DiviK results do not strongly depend on the feature engineering technique applied and can reveal the hidden structure in a tissue sample. Additionally, DiviK shows high scalability, and it can process at once the big omics data with more than 1.5 mln instances and a few thousand features. Finally, due to its simplicity, DiviK is easily generalisable to an even more flexible framework. Therefore, it is helpful for other -omics data (as single cell spatial transcriptomic) or tabular data in general (including medical images after appropriate embedding). A generic implementation is freely available under Apache 2.0 license at https://github.com/gmrukwa/divik.

Background

Mass Spectrometry Imaging (MSI) is widely used for discovering molecular profiles, as it provides unparalleled insight into the metabolomics of tissue samples [1,2,3]. Applied to a tumour specimen, MSI allows investigating heterogeneities that could indicate functional differences across tissue regions, as well as varying cancer subtypes that require dedicated treatment [4,5,6,7,8,9,10].

Technically, MSI is an excellent example of big biological data, with the following characteristics:

Volume: the spatial resolution of \(5-100 \mu m\) leads to 10,000—4,000,000 spectra potentially acquired from a \(1cm^2\) tissue sample with Time-of-Flight (ToF) spectrometers and becomes an even more severe problem for 3D MSI data [11];
Velocity: the last three years have brought more than 4500 MSI datasets uploaded to the METASPACE database [12, 13];
Variety: a dataset may consist of more than 200,000 mass channels (features) per single spectrum (observation), representing proteins, peptides, and/or metabolites;
Veracity: the acquisition methods and various biological phenomena introduce heavy duplication of information, thus capturing the most relevant nuances becomes complicated with dominating high-level patterns amplified [14]. On the other hand, feature importance may be diverse across separate regions of interest.

Efficient analyses of thousands of MSI spectra usually require careful feature space adaptation, irrespective of the extensive preprocessing pipeline [15,16,17,18]. In a fully unsupervised setup, there are the following groups of methods: filtering, linear (e.g. Principal Components Analysis), and non-linear (e.g. Universal Manifold Approximation and Projection) [19]. Filtering removes features according to some threshold (often constant). Linear methods produce a new set of features, a linear combination of the input features. Non-linear methods aim to find a low-dimensional representation that best approximates distances between pairs of points. All these methods are applied once globally to the entire dataset.

Many clustering methods are known, and one can divide them into a few significant groups based on their approach: centroid-based (e.g. k-means), connectivity-based (e.g. hierarchical clustering), distribution-based (e.g. Gaussian Mixture Modeling), and density-based (e.g. graph-cuts clustering). In addition, sometimes, a method is proposed that exercises more than a single property of the data (e.g. both connectivity and centroids) and such a method is called hybrid clustering.

The clustering quality can be captured via metrics like Dunn’s index [20] or Adjusted Rand Index [21]. These were introduced for low-dimensional data clustering, but studies [22, 23] compare their usefulness for high-dimensional data and subspace clustering. Hundreds of features that clustering algorithms process are already seen as a high-dimensionality problem, but it is still at least an order of magnitude less severe than in MSI data analyses. Therefore domain-related work must be investigated [18].

The earliest solutions for Mass Spectrometry Imaging take advantage of Principal Components Analysis (PCA), and agglomerative clustering [24]. PCA transforms the data, and components explaining 70% of the variance are selected. The authors use the Euclidean metric to capture spectra similarity and Ward linkage to provide the most meaningful results. The approach is semi-supervised, as it requires the manual setting of the number of clusters based on histological examination.

Another method features high-dimensional data clustering (HDDC) [25]—a hybrid approach based on the Gaussian Mixture Model (GMM). The fitting process yields locally relevant features, so the domain is independently adapted for each estimated component. The observations’ similarity is considered Euclidean metric in some subspaces of the original feature space. For MSI data, a combination of HDDC with edge-preserving denoising of m/z-images is applied [26]. Oppositely to smoothing the clustering results, the authors propose an approach oriented on m/z images, so the denoising provides better quality data for the clustering method. The idea of denoising is extended [27]. The Fast Map algorithm adapts the domain, and K-means clustering follows, which is more straightforward than HDDC. The denoising stage contains the main advancements, which we will omit here. The overlap of clusters and actual structures is the primary quality metric. However, both HDDC and K-means approaches [26, 27] also compare the visual perception of cluster consistency.

Region consistency is the main idea behind EXIMS [28]—a modern PCA-oriented feature engineering approach. First, histogram equalisation enhances the contrast of a molecular image. Then, the algorithm proposes a scoring method to assess whether structures exist in the enhanced molecular image. The random distribution of counts over the image leads to a low score value. Next, typical structure types like regions, curves, gradients and islets are recognised with a grey level co-occurrence matrix, resulting in a higher score value. However, the definition of the score yields unbounded values. Thus there is no clear threshold that can help discern informative peaks. Finally, the informative peaks are PCA-transformed and clustered with the fuzzy C-means algorithm.

K-means and spectral clustering are further investigated in a two-step scenario to speed up computations [29]. In the first step of clustering, both methods consider several random subsets of input data independently, and then cluster representatives are grouped to form the final clusters. Algorithms use cosine distance to measure spectrum similarity. Authors manually perform hyperparameter selection (i.e. selecting the number of eigenvectors of the connectivity graph) to provide the best results (visually). Both methods isolate 7 clusters per prior settings.

Feature engineering techniques also address the increased dataset volume. For example, Hierarchical Stochastic Neighbor Embedding (HSNE) has been claimed to be superior to classical non-linear techniques [30]. Reduced feature space is constructed hierarchically, with the approach Overview-First, Details-on-Demand. Firstly, characteristic points of the dataset (landmarks) are embedded to provide an overview. Then, the algorithm creates a new embedding for local neighbourhoods. Finally, each point is assigned the likelihood of being represented by a specific landmark. The authors discuss using maps of the likelihood for molecular segmentation given a landmark. However, they propose clusters of landmarks manually.

Another scalable feature engineering example is the Universal Manifold Approximation and Projection (UMAP) algorithm. It is more scalable than t-SNE and preserves more of the global structure [31]. Moreover, no restriction is made on the embedding dimensions. Thus, it may be used as a more general feature engineering method in MSI than the visualizing-only capabilities of t-SNE [32].

The recent years introduced a new category of unsupervised segmentation approaches based on deep learning. They form groups that correspond to dual interpretation of MSI data:

Spectrum oriented: neural network creates a low-dimensional embedding of spectra [33,34,35] that is further analysed with classical clustering methods or visualised.
Ion-image oriented: molecular images are analysed to find similarities in molecule distribution and enhance biologically relevant information. The neural approach extends the idea from EXIMS [28] and classical community detection algorithms [36]. It mimics a human recognising detailed molecular patterns and clusters the ion images based on spatial expressions [37]. This way, new molecular features are obtained, simultaneously representing whole groups of ion images. These neural ion images are subject to a classical unsupervised analysis pipeline, including pixel-wise approaches.

The unsupervised analyses of MSI data via clustering outlined above follow a single schema: some domain adaptation is applied globally, and a basic clustering algorithm is applied. This approach has several drawbacks.

PCA applied to the whole dataset can capture the most variance on a high level, but at the same time, it discards all nuances. These nuances might be crucial for hierarchical analyses. Other domain adaptation methods may partially resolve the issue, but the disproportion between low- and high-level detail is significant with thousands of dimensions.
Non-hierarchical approaches (e.g. one-step, two-step) cannot provide insights into the internal structures unless the molecular differences dominate the whole dataset.
Only hierarchical approaches describe the relation of subgroups as a clustering tree that captures context like parent, child, or sibling. This context provides additional insight into molecular diversity. For example, the parent-child relation of clusters could describe different functional subregions of the tumour region, while sibling clusters could represent other types of tumours.
Agglomerative or bisecting deglomerative hierarchical algorithms provide limited information about siblings since only two objects are merged or divided on each level. Therefore, many siblings are mistreated as an artificial parent-child relation. Such mistreatment requires an additional post-processing step.

We propose a hybrid framework that directly answers the above drawbacks: Divisive intelligent K-means (DiviK). It is a stepwise clustering procedure with feature engineering applied locally at each step (see Fig. 1). We implemented local feature engineering as filtering based on GMM decomposition of the feature variance across the subregion. The clustering algorithm implemented in our approach is a standard K-means algorithm with a tailored procedure for initial condition selection. One can easily substitute GMM filtering and K-means with other algorithms. Information criteria decide the number of clusters in the optimised feature space. The decision on the cluster number is two-stage: due to its numerical complexity, we use GAP at the first stage when assessing the data’s homogeneity. We use the approximate Dunn’s index to select the best local partition in the next step.

The strength of the proposed solution lies in optimising the feature space locally for data clustering according to the variability of the analysed subregion. DiviK considers separate regions independently on each step, starting from the entire input dataset feature space. It provides biologically relevant results and sustains scalability for large datasets, contrary to sophisticated approaches.

DiviK addresses the drawbacks mentioned above in the following way:

It has a hierarchical nature to provide the most comprehensive insight.
During the analysis, local domain adaptation is performed separately for each cluster/subcluster to avoid the excessive influence of dominating global patterns.
Before the cluster/subcluster analysis, a check is performed on whether there is evidence that diverse groups are present in the cluster/subcluster.

Implementation

Three main components of the DiviK algorithm: feature engineering and filtering technique, stop condition definition, and clustering algorithm, constitute its framework. All of these can be easily changed to adapt to the specificity of the analysed data.

Feature space optimisation

We propose to use a feature filtering procedure based on GMM decomposition [38, 39]. First, we average abundance and calculate abundance variance across the analysed subcluster for each peak. Then, empirical distributions of average abundances and abundance variances across all peaks, presented on the logarithmic scale (upper panel of Fig. 2), are decomposed into Gaussian mixtures, as shown in the middle panel of Fig. 2. We use the maximum conditional probability criterion to classify a particular peak into one of the mixture components. The crossing points between Gaussian components serve as the thresholds.

We discarded the GMM components containing less than 1% of peaks to avoid numeric artefacts.

We filter out all the peaks with the average abundance below the first crossing point of the GM model. For the abundance variance, we persisted only the peaks with the variance above the topmost crossing point, assuming it results in no less than 1% of the available features, as is shown in the bottom right panel of Fig. 2. If the 1% condition is not fulfilled, the second left or the second topmost threshold is used.

As we aim for a hands-free pipeline, Bayesian Information Criterion is used to select the optimal number of GMM components for both the model of average abundance distribution for global noise filtering and the model of local abundance variance distribution for discrimination of locally the most informative features.

Stop condition

Having a feature space optimised for cluster discovery, we must assess whether there are any clusters. There exist many indices that can validate the quality of unsupervised segmentation. These take into account cluster separability, compactness or probabilistic measures. However, only a few allow for a comparison between a multi-cluster partition and no further partition (one cluster only). For the K-means algorithm, the GAP statistic [40] is one of the options. It relates a data-specific partition to partitions obtained via the same method over random datasets within the same bounds (see Fig. 3). We stop splitting of particular ROI if GAP statistic suggests region homogeneity.

Clustering

In this work, we focus only on the K-means-based implementation due to its promising initial results [9, 41] and low computational complexity. With a cluster molecular diversity confirmed, we can proceed with the clustering. For MSI data analysis, we adjusted the distance metric [41] and the initialisation method.

The K-means algorithm requires the number of clusters to be specified before the analysis occurs. Since the actual number of molecularly homogeneous regions is unknown, we use unsupervised quality metrics to guide the computations. The current implementation includes Dunn’s index [20] and GAP statistic [40].

Initialisation method

Since the DiviK algorithm aims to identify biologically meaningful heterogeneity in the data, we decided to implement the deterministic initialisation procedure (see Fig. 4). Although there are plenty of well-known algorithms for random initialisation, the deterministic approaches are usually customised to the data structure, allowing to achieve the clustering goal. As mentioned above, in the case of MSI data, we seek the most heterogeneous tissue regions independently of the cluster imbalance. To address the expected redundancy in MSI data and decrease the computational time, we compress the data by the KD-tree algorithm (with the compression factor equal to 0.01 for 2D data and 0.001 for 3D data). The obtained “boxes of spectra” are represented by their centroids and weighting factors related to the number of spectra assigned to the particular box. The next step requires computing the distances from the global dataset centroid to the box-specific centroids. We approximate the distances’ cumulative distribution function (cdf) and choose the centroid containing the k-th percentile. It serves as the first initial seed for the clustering algorithm. The procedure mentioned above repeats to obtain the second seed, but we substitute the global centroid with the first seed and the box containing that seed is not considered in the calculations. Having two initial seeds, we can search for the third one (if required). We estimate the cdf of the distance between the box-specific centroids and the line defined by already chosen two seeds to find the box containing k-th percentile. The centroid of that box is considered the third seed. One can identify the subsequent seeds similarly, substituting the line by the hyperplane defined by the already found seeds. We perform the initial seed setting in the already locally optimised feature space. The comprehensive study on the initialisation procedures (not shown in the manuscript) demonstrated the power and robustness of our approach when compared to the commonly used initialisation algorithms.

Scalability

DiviK is an original stepwise deglomerative algorithm that provides scalability for big data in several ways:

It uses a locally optimised K-means algorithm iteratively. K-means algorithm is known as one of the few highly scalable clustering algorithms.
Filtration-based feature engineering, used in the local optimisation of the feature space step, requires conducting neither PCA nor UMAP. The Gaussian Mixture Model approach supports threshold identification necessary for the feature filtration (see Fig. 2). It is computationally efficient, as it requires only computing the within-cluster mean and the variance of each feature, followed by an automated data-driven threshold selection.
To ensure computational complexity is linear with the number of samples, we developed a procedure of deterministic initialisation based on a KD-Tree algorithm.
The automatic selection of the number of clusters in the K-means algorithm is computationally expensive by default. Therefore, we split the decision process into two steps. First, we use the sampled distribution of GAP statistics to decide if the cluster is heterogeneous (requires further splitting) or homogenous (the splitting process is finished). In the case of a heterogeneous cluster, we use Dunn’s Index to decide on the number of subclusters, which keeps the process scalable.

Technology

DiviK was written mainly in Python, and it is distributed cross-platform under the Apache 2.0 license through Python Package Index (https://pypi.org/project/divik/) and Docker Hub (https://hub.docker.com/r/gmrukwa/divik). It is designed to be simple, efficient, accessible to a non-expert, and highly reusable. Thus, DiviK is accessible in Python directly and via a command-line interface (CLI). The Python API follows the scikit-learn [42] design patterns and similar package organisation conventions to provide reusable building blocks. The CLI allows the construction of highly-flexible processing pipelines owing to its injection-based configuration system.

Results

In order to compare the effectiveness of the DiviK framework and the existing approaches, we evaluated the tool against two MSI datasets of different characteristics:

Oral Squamous Cell Carcinoma (OSCC). It consists of two individual slices collected from different patients suffering from OSCC, annotated by an experienced pathologist;
Mouse kidney. It consists of 75 serial slices from a single 3D-allocated volume from a mouse kidney without annotation.

We compared DiviK to the combinations of state-of-the-art MSI data clustering and global feature engineering methods. We selected K-means, spectral and spatial clustering [27] as representatives of clustering algorithms validated for MSI. All were used in a fully unsupervised setting with the number of clusters selected using the GAP statistic. For global feature engineering, we used one of the following options: no global feature engineering, PCA with the knee-based selection of the number of components [43], PCA on top of EXIMS-preselected features [28], UMAP [31], and neural ion images obtained with Xception network [37].

All the experiments were carried out in the Polyaxon environment [44].

Oral squamous cell carcinoma

The OSCC dataset [41] is a two-dimensional MALDI-ToF MSI dataset. The original dataset consists of 45,738 raw spectra with 109,568 mass channels. After pre-processing and spectra modelling, 3714 peaks/features were obtained. The spectra are provided with annotations of the tissue type, allowing for distinguishing regions of interest (ROI) of the tumour, epithelium, and healthy areas (Fig. 5). Therefore, the OSCC dataset can be a benchmark for classification and clustering problems. There are diverse sources of variability, including:

inter-patient sample heterogeneity;
tissue type heterogeneity;
intra-tissue heterogeneity.

We selected a subset consisting of two patient samples with top tissue variety (19,874 spectra and 3714 peaks in total) and segmented the subset using all the combinations of the methods discussed in the previous section. The obtained clusters of spectra usually consisted of spectra originally annotated to different tissue subtypes, so the annotation translation from spectrum to cluster level is required. The developed reannotation procedure maximises the multi-cluster Dice similarity index related to the primary tissue subtypes (Region of Interests ROI) as presented in Fig. 6. First, we sort clusters according to the percentage of their area covered by the consecutive ROIs. Then, we select clusters sequentially to maximise the Dice index for a given ROI collectively. We assign one for each (cluster, ROI) pair if the cluster is included in the set maximising Dice index for a given ROI, zero otherwise. If only one pair \((C, ROI_i)\) is assigned 1 for a given cluster C, the assignment is considered final. Otherwise, we evaluate each pair (assignment scenario) in terms of the Rand index.

We used this dataset to compare ROI reconstruction capabilities across different clustering algorithms and feature engineering techniques. With clusters reassigned to the pathologist-defined ROIs, we considered the following quality metrics:

Rand Index—to capture global multi-ROI reconstruction capabilities;
Dice Index—to capture tumour reconstruction capabilities;
EXIMS score—to capture the spatial consistency of the clusters.

The EXIMS score is a measure with unbounded values, useful in comparative analyses, but its magnitude is hard to interpret. Therefore we scaled and clipped the values so that the highest multi-cluster result was 1.

DiviK sweeps for up to 10 clusters on each segmentation level. It operates with correlation distance, as researchers have already proven it to work well for MSI data [29, 32]. The minimal number of local features required to preserve is 1%. The minimal size of the cluster was set to 200 spectra. The leaf size of KD-Tree was 1%, and the algorithm started from the leaf containing the 99th percentile of the distance. We sampled ten times 1000 spectra each to compute Dunn’s and GAP indices to keep the computational complexity of quality estimation bounded. Standalone K-means clustering swept for up to 50 clusters, with the same criteria for computing the GAP index as DiviK and the correlation distance. Spatial clustering was launched with a radius of 7. Spectral clustering was used with cosine metrics during the embedding. The embedding generated the number of components equal to 1% of the number of global features (information capacity equivalent to DiviK filtering). UMAP was run with 30 neighbours, correlation distance, 500 epochs and a negative sample rate of 70 to obtain three components. We used the Xception network with a patch size of \(71\times 71\) pixels and an upsample rate of 4 to ensure a patch edge of 1.775mm.

Table 1 Values of quality indices measured for OSCC data

Full size table

We presented the visualisation of the performance of the DiviK algorithm and remaining combinations in Fig. 7. Dimensions of the presented cube are the quality indices mentioned at the beginning of this section, namely Dice Index, Rand Index, and EXIMS Score. The exact values of quality metrics are available in Table 1. In Fig. 8 one can find a normalised map of the clusters obtained in the process. We compare the combinations of global feature engineering methods and clustering algorithms exhaustively. The proposed DiviK approach uses no global feature engineering method, yet we conducted such experiments and presented their results for comparison. Sample spatial ion heatmaps, correlated with the pathologist’s tissue subtypes and the clusters found by DiviK, are presented in Fig. 9. The averaged results for the used clustering methods can be seen in Table 2 whereas Table 3 presents results obtained with the used global feature engineering methods for d-based aggregations. Tables 4 and 5 present the rank-based summarisation results.

Table 2 Average quality indices for clustering methods with OSCC data

Full size table

Mouse kidney 3D

The mouse kidney dataset [45] is a three-dimensional MALDI-ToF MSI dataset. Therefore, the dataset has a high volume and can be a benchmark for the algorithm’s scalability. The sources of variability are limited to a single source specimen. It consists of 75 sections from the central part of a mouse kidney, 1,362,830 spectra in total, and 7680 data points per spectrum. The dataset was already subject to Gaussian spectral smoothing and baseline reduction with the Top Hat algorithm.

We conduct clustering in the same scenarios as previously, except feature engineering based on neural ion images, as it lacks generalisation for 3D data. No expert-defined labels are available for the spectra; thus, one must assess the capabilities of heterogeneity detection visually. Clusters are subject to a similar reannotation procedure as depicted in Fig. 6 to indicate the similarities between results.

The minimal number of local features that DiviK is required to preserve is set to 0.5%. We did not split the cluster further if it already was 50,000 spectra or fewer, as this dataset cannot confirm detailed internal tissue diversity. KD-Tree leaf size during the initialisation was 0.1%, and the algorithm started from the leaf containing the 95th percentile of the distance. To compute the Dunn’s and GAP indices, we sampled ten times 5000 spectra each to keep the computational complexity of quality estimation bounded.

In Fig. 10, we present volumes with clusters marked. We matched cluster colours to enable visual comparison, although we used no cluster normalisation procedure, as the ground truth labels are nonexistent. Unfortunately, spectral clustering is not included in this study, as the computational complexity of the basic approach did not allow for enough scalability. On the other hand, a two-step approach [29] did not lead to convergence. Additionally, spatial clustering [27] does not consider modified spatial strides in the third dimension. The same clusters are presented in Fig. 11 on each 6th consecutive slice.

Discussion

Oral squamous cell carcinoma

As shown in Fig. 8, it is possible to observe varying capabilities for heterogeneity detection depending on the selected methods for clustering and global feature selection. For example, UMAP combined with spatial clustering did not create clusters overlapping with biological structures (Dice index 0.00, Rand index 0.00, see Table 1), even though both spatial clustering and UMAP tend to exhibit the potential for capturing the necessary detail (see Fig. 8). One can observe similar results for the Xception-based feature engineering combined with spectral clustering.

In contrast, spatial clustering with EXIMS-based global feature engineering approximated the tumour region with the highest Dice index and the top ROIs composition expressed via Rand index. Neural ion images with spatial clustering yield the second-highest Dice index and the third-highest Rand index. However, these are not the only criteria used by medical experts during the quality assessment.

A visual comparison between clusters (Fig. 8) and pathologist-annotated ROIs (Fig. 5) shows that DiviK yields relatively stable regions regardless of the global feature engineering method. Table 2 seems to support that statement, as the standard deviation of the Dice index and adjusted Rand index is the lowest of all clustering methods. Spectral clustering with the most basic features (no global feature engineering or knee PCA) seems to discover similar regions.

Xception-based feature engineering method resulted in much less consistent regions discovered with K-means and DiviK. Two hypotheses seem probable explanations for this phenomenon:

Neural ion images amplified noise. Grouping of ion images deduplicates features; however, it increases the proportional representation of the noise-related mass channels compared to the original dataset. The results of spatial clustering seem to support this hypothesis, as its embedded bilateral filter reduces noise.
The original ion image normalisation based on winsorisation and scaling is not sufficient for this dataset. The contrast of the patches analysed by the Xception network is too small to generate useful embedding. Since the approach proposed by the authors [37] formalises the visual similarity of ion maps, it could probably be useful to apply classical ion image normalisation [46] to enhance the contrast first.

The comparison of global feature engineering methods is much less conclusive than the comparison of clustering methods. There does not seem to be any simple rule that could yield outstanding stability or performance against varying clustering methods. While EXIMS PCA and no global feature engineering seem to demonstrate increased tumour reconstruction capabilities (as expressed through the Dice index), at the same time, these yield a medium EXIMS score.

Finally, the EXIMS score and the overall quality concept allow us to assess the trade-off between the agreement measures and the visual consistency of clusters. We defined the overall quality in two ways:

overall quality d(0, 0, 0)—it is the length of the vector in 3D space defined by the values of quality indices and the origin of the coordinate system (see Fig. 7). The better the clustering result, the higher the value.
overall quality d(1, 1, 1)—it is the length of the vector in 3D space defined by the values of quality indices and their maximal theoretical value in point (1, 1, 1). The better the clustering result, the lower the value.

Naturally, the most consistent segmentations achieve the top EXIMS score: cases that yield a single cluster (spatial clustering with UMAP) or completely miss one of the ROIs (K-means with PCA). Since these are not relevant from the medical point of view, we bind their relative EXIMS score at 1.0 for further consideration. The DiviK clustering without a global feature engineering technique achieved the next top EXIMS score. That renders its overall quality d(0, 0, 0) the top one. We obtained the following two best results in terms of overall quality d(0, 0, 0) using Spatial clustering with EXIMS PCA-based feature engineering and Spectral clustering with PCA. At the same time, the overall quality d(1, 1, 1) indicates Spatial clustering with EXIMS PCA-based feature engineering as the top result. Spatial clustering over Xception-generated features and DiviK clustering with UMAP feature engineering yield the next two best scores.

Table 1 ranks the individual configurations’ scores (tumour coverage, overall composition and cluster consistency) for a qualitative comparison. The lower the rank, the better the result. The lowest ranks appear consecutively for Spatial clustering with EXIMS PCA-based feature engineering, Spatial clustering over neural ion images, and DiviK with no feature engineering. Table 4 continues the ranking for clustering methods. DiviK exhibits the lowest ranks for tumour coverage and overall composition, while K-means exhibits the lowest for cluster consistency.

One can find DiviK with no feature engineering method scores within the top three in each of the aggregated overall quality measures. Sums of ranks ranged from 16 to 53. DiviK with no feature engineering was assigned a sum of ranks equal to 20.5, which is around 12% of the observed quality measure range. Overall quality d(0, 0, 0) ranged from 0.8122 to 1.3560, with a top value of 1.3560 for DiviK. Overall quality d(1, 1, 1) ranged from 0.4438 to 1.4142. DiviK with UMAP feature engineering was around 7% of the observed quality measure range.

Table 3 Average quality indices for global feature engineering methods with OSCC data

Full size table

Similarly as for the basic quality measures, Table 2 and 3 summarize also both variants of overall quality. On average, DiviK appears to be the most distant from the origin (mean \(d(0,0,0)=1.236\)) and the closest to the theoretical maximum of quality measures (mean \(d(1,1,1)=0.576\)) as its good clustering results do not depend so much on the feature engineering techniques applied. The local subcluster-driven feature space adaptation allows reducing the domain significantly by filtering the less informative and noise-level signals. The second top is the spatial clustering with mean \(d(0,0,0)=1.192\) and mean \(d(1,1,1)=0.706\). The Cohen’s effect size in differences between these two approaches is close to medium and equals to 0.435 for d(0, 0, 0) and 0.451 for d(1, 1, 1). While the feature engineering techniques for other than DiviK methods are a concern, the first choice should be EXIMS PCA as it outperforms the other techniques tested. In the case of DiviK, no feature engineering is recommended.

We conducted a qualitative evaluation of the effect of the feature engineering method and clustering algorithm on the clustering results quality. Table 6 shows the partial \(\eta ^2\) effect size and Kendall’s W concordance coefficient on the considered quality indices. The partial \(\eta ^2\) effect size values exceed the 0.14 large effect size threshold. The choice of the clustering method influences the Dice and Rand indices more than the feature engineering method. The remaining quality measure (EXIMS score) is, as expected, stronger influenced by the feature engineering method, and the clustering method is less relevant. While an overall quality is considered, choice of clustering algorithm is more important than feature engineering techniques applied.

Mouse kidney 3D

Published benchmark datasets [45] are oriented explicitly on high-scale computations. The risk of missing the crucial details increases significantly for such a scale. Moreover, it renders many methods useless due to their computational complexity. Therefore, additional modifications are sometimes required [29], yet insufficient.

As shown in Fig. 10, the high-level composition of the medical sample is generally consistent between the configurations. The most straightforward configuration of K-means clustering with no global feature engineering yields a single cluster for the whole dataset.

The structures discovered via K-means and DiviK algorithms look consistent compared to other work [30, 35, 47] that has analysed this data. The detected regions share a molecular signature across the 3D volume and exhibit functional differences.

Algorithms were ranked concerning their scalability limits; Table 4 sums the ranks. K-means and DiviK were the only algorithms that completed computations and were assigned the best ranks.

Table 4 High-level summary of clustering algorithm characteristics

Full size table

Table 5 High-level summary of feature engineering characteristics

Full size table

Table 6 Qualitive evaluation of the effect of the feature engineering method and clustering algorithm. We conducted Kruskal-Wallis test and effect size used in the table is partial \(\eta ^2\). Furthermore, we calculated Kendall’s W concordance index

Full size table

DiviK approach does not require applying PCA or UMAP globally. The GMM-based feature filtration/selection is applied to optimise the feature space locally. Thanks to that approach, the internal structure, dominated by the main clusters, might be revealed in the subsequent divisive steps. We used PCA and UMAP-based pipelines for reference in the comparison study, and they are not a part of the proposed DiviK algorithm. We completed these pipelines using a machine equipped with 48 CPU cores and 256 GB RAM, correspondingly in 23.8 min (PCA) and 198.5 min (UMAP).

On the contrary, the GMM-based filtering procedure for the full mouse kidney dataset completes on average within 8.66 seconds (156 ms standard deviation). Due to the deglomerative pipeline of the DiviK algorithm and local optimisation of the feature space, the filtering procedure is repeated per each splitting step, which led in the case of the mouse kidney dataset up to 60 decompositions for different subregions of this tissue sample. It gives, on average, less than 9 min (8.66) in total.

Conclusions

The comprehensive comparative study considering several aspects of data exploration demonstrated that DiviK is a tool that provides a reasonable trade-off between the accuracy of unsupervised heterogeneity discovery and the consistency of obtained clusters. DiviK appears to be scalable and robust enough to cope with both small- and large-scale MSI data. Moreover, it is also insensitive to modifications of the preprocessing pipeline. Compared with other configurations already proven effective for MSI, we showed that DiviK might be feasible for real-world applications when applied to solve a complex multi-dimensional problem. DiviK provides researchers with additional insight into which features were the most important for the differentiation of a specific region, which could be a subject of an even more comprehensive analysis, crucial for doctors to understand the underlying phenomena.

One could easily extend the DiviK framework to support other kinds of -omics data, with slight adjustments to the similarity measure and local filtering schema. The heterogeneity investigations based on epigenetic, transcriptomic, proteomic and metabolomic profiles require no modifications to the algorithm. Applying the DiviK algorithm to genomics data (GWAS-type) will require the application of a data-specific similarity measure and careful selection of the characteristic for the local feature filtering scheme. Finally, it is worth noting that the DiviK algorithm does not take advantage of any specific characteristics of the MSI data. Therefore, one could apply DiviK as a generic clustering algorithm, useful for grouping images after appropriate embedding (including medical images) or any other kind of tabular data.

Availability of data and materials

OSCC dataset is available from Zenodo repository [48]. 3D Mouse Kidney dataset is available from GigaScience Database [49].

Abbreviations

cdf:: Cumulative distribution function
CLI:: Command-line interface
DiviK:: Divisive intelligent K-means
GMM:: Gaussian mixture modeling
HDDC:: High-dimensional data clustering
MSI:: Mass spectrometry imaging
OSCC:: Oral squamous cell carcinoma
PCA:: Principal components analysis
ROI:: Region of interest
ToF:: Time-of-flight
UMAP:: Universal manifold approximation and projection

References

Aichler M, Walch A. MALDI imaging mass spectrometry: current frontiers and perspectives in pathology research and practice. Lab Investig. 2015;95(4):422–31.
Article CAS Google Scholar
Miura D, Fujimura Y, Yamato M, Hyodo F, Utsumi H, Tachibana H, et al. Ultrahighly sensitive in situ metabolomic imaging for visualizing spatiotemporal metabolic behaviors. Anal Chem. 2010;82(23):9789–96.
Article CAS Google Scholar
Hattori K, Kajimura M, Hishiki T, Nakanishi T, Kubo A, Nagahata Y, et al. Paradoxical ATP elevation in ischemic penumbra revealed by quantitative imaging mass spectrometry. NY: Mary Ann Liebert, Inc.; 2010.
Djidja MC, Claude E, Snel MF, Francese S, Scriven P, Carolan V, et al. Novel molecular tumour classification using MALDI-mass spectrometry imaging of tissue micro-array. Anal Bioanal Chem. 2010;397(2):587–601.
Article CAS Google Scholar
Morita Y, Ikegami K, Goto-Inoue N, Hayasaka T, Zaima N, Tanaka H, et al. Imaging mass spectrometry of gastric carcinoma in formalin-fixed paraffin-embedded tissue microarray. Cancer Sci. 2010;101(1):267–73.
Article CAS Google Scholar
Groseclose MR, Massion PP, Chaurand P, Caprioli RM. High-throughput proteomic analysis of formalin-fixed paraffin-embedded tissue microarrays using MALDI imaging mass spectrometry. Proteomics. 2008;8(18):3715–24.
Article CAS Google Scholar
Quaas A, Bahar AS, von Loga K, Seddiqi AS, Singer JM, Omidi M, et al. MALDI imaging on large-scale tissue microarrays identifies molecular features associated with tumour phenotype in oesophageal cancer. Histopathology. 2013;63(4):455–62.
Google Scholar
Steurer S, Borkowski C, Odinga S, Buchholz M, Koop C, Huland H, et al. MALDI mass spectrometric imaging based identification of clinically relevant signals in prostate cancer using large-scale tissue microarrays. Int J Cancer. 2013;133(4):920–8.
Article CAS Google Scholar
Pietrowska M, Diehl HC, Mrukwa G, Kalinowska-Herok M, Gawin M, Chekan M, et al. Molecular profiles of thyroid cancer subtypes: classification based on features of tissue revealed by mass spectrometry imaging. Biochimica et Biophysica Acta (BBA) Proteins Proteom. 2017;1865(7):837–45.
Martinez-Outschoorn UE, Peiris-Pagés M, Pestell RG, Sotgia F, Lisanti MP. Cancer metabolism: a therapeutic perspective. Nat Rev Clin Oncol. 2017;14(1):11–31.
Article CAS Google Scholar
Vos D, Ellis S, Balluff B, Heeren R. Experimental and data analysis considerations for three-dimensional mass spectrometry imaging in biomedical research. Mol Imaging Biol. 2020;66:1–11.
Google Scholar
Palmer A, Phapale P, Chernyavsky I, Lavigne R, Fay D, Tarasov A, et al. FDR-controlled metabolite annotation for high-resolution imaging mass spectrometry. Nat Methods. 2017;14(1):57–60.
Article CAS Google Scholar
METASPACE annotation platform: datasets summary. Accessed 14 June 2020. https://metaspace2020.eu/datasets/summary.
Polanski A, Marczyk M, Pietrowska M, Widlak P, Polanska J. Signal partitioning algorithm for highly efficient Gaussian mixture modeling in mass spectrometry. PLoS ONE. 2015;10(7):66.
Article Google Scholar
Jones EA, van Remoortere A, van Zeijl RJ, Hogendoorn PC, Bovée JV, Deelder AM, et al. Multiple statistical analysis techniques corroborate intratumor heterogeneity in imaging mass spectrometry datasets of myxofibrosarcoma. PLoS ONE. 2011;6(9): e24913.
Article CAS Google Scholar
Thomas SA, Race AM, Steven RT, Gilmore IS, Bunch J. Dimensionality reduction of mass spectrometry imaging data using autoencoders. In: IEEE symposium series on computational intelligence (SSCI). IEEE; 2016. p. 1–7.
Veselkov KA, Mirnezami R, Strittmatter N, Goldin RD, Kinross J, Speller AV, et al. Chemo-informatic strategy for imaging mass spectrometry-based hyperspectral profiling of lipid signatures in colorectal cancer. Proc Natl Acad Sci. 2014;111(3):1216–21.
Article CAS Google Scholar
Verbeeck N, Caprioli RM, Van de Plas R. Unsupervised machine learning for exploratory data analysis in imaging mass spectrometry. Mass Spectrom Rev. 2020;39(3):245–91.
Article CAS Google Scholar
Postma E, van den Herik H, van der Maaten L. Dimensionality reduction: a comparative review. J Mach Learn Res. 2009;10(1–41):66–71.
Google Scholar
Dunn JC. Well-separated clusters and optimal fuzzy partitions. J Cybernet. 1974;4(1):95–104.
Article Google Scholar
Lawrence H, Phipps A. Comparing partitions. J Classif. 1985;2(1):193–218.
Article Google Scholar
Lipor J, Balzano L. Clustering quality metrics for subspace clustering. Pattern Recognit. 2020;66:107–328.
Google Scholar
Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, Costa LdF, et al. Clustering algorithms: a comparative approach. PLoS ONE. 2019;14(1):e0210236.
Deininger SO, Ebert MP, Futterer A, Gerhard M, Rocken C. MALDI imaging combined with hierarchical clustering as a new tool for the interpretation of complex human cancers. J Proteome Res. 2008;7(12):5230–6.
Article CAS Google Scholar
Bouveyron C, Girard S, Schmid C. High-dimensional data clustering. Comput Stat Data Anal. 2007;52(1):502–19.
Article Google Scholar
Alexandrov T, Becker M, So D, Ernst G, Wehder L, Grasmair M, et al. Spatial segmentation of imaging mass spectrometry data with edge-preserving image denoising and clustering. J Proteome Res. 2010;9(12):6535–46.
Article CAS Google Scholar
Alexandrov T, Kobarg JH. Efficient spatial segmentation of large imaging mass spectrometry datasets with spatially aware clustering. Bioinformatics. 2011;27(13):i230-8.
Article CAS Google Scholar
Wijetunge CD, Saeed I, Boughton BA, Spraggins JM, Caprioli RM, Bacic A, et al. EXIMS: an improved data analysis pipeline based on a new peak picking method for EXploring Imaging Mass Spectrometry data. Bioinformatics. 2015;31(19):3198–206.
Article CAS Google Scholar
Dexter A, Race AM, Steven RT, Barnes JR, Hulme H, Goodwin RJ, et al. Two-phase and graph-based clustering methods for accurate and efficient segmentation of large mass spectrometry images. Anal Chem. 2017;89(21):11293–300.
Article CAS Google Scholar
Abdelmoula WM, Pezzotti N, Hölt T, Dijkstra J, Vilanova A, McDonnell LA, et al. Interactive visual exploration of 3D mass spectrometry imaging data using hierarchical stochastic neighbor embedding reveals spatiomolecular structures at full data resolution. J Proteome Res. 2018;17(3):1054–64.
Article CAS Google Scholar
McInnes L, Healy J, Melville J. Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018;.
Smets T, Verbeeck N, Claesen M, Asperger A, Griffioen G, Tousseyn T, et al. Evaluation of distance metrics and spatial autocorrelation in uniform manifold approximation and projection applied to mass spectrometry imaging data. Anal Chem. 2019;6:66.
Google Scholar
Inglese P, McKenzie JS, Mroz A, Kinross J, Veselkov K, Holmes E, et al. Deep learning and 3D-DESI imaging reveal the hidden metabolic heterogeneity of cancer. Chem Sci. 2017;8(5):3500–11.
Article CAS Google Scholar
Gardner W, Cutts SM, Muir BW, Jones RT, Pigram PJ. Visualizing ToF-SIMS hyperspectral imaging data using color-tagged toroidal self-organizing maps. Anal Chem. 2019;91(21):13855–65.
Article CAS Google Scholar
Abdelmoula WM, Lopez BGC, Randall EC, Kapur T, Sarkaria JN, White FM, et al. Peak learning of mass spectrometry imaging data using artificial neural networks. Nat Commun. 2021;12(1):1–13.
Article Google Scholar
Wüllems K, Kölling J, Bednarz H, Niehaus K, Hans VH, Nattkemper TW. Detection and visualization of communities in mass spectrometry imaging data. BMC Bioinform. 2019;20(1):1–12.
Article Google Scholar
Zhang W, Claesen M, Moerman T, Groseclose MR, Waelkens E, De Moor B, et al. Spatially aware clustering of ion images in mass spectrometry imaging data using deep learning. Anal Bioanal Chem. 2021;413(10):2803–19.
Article CAS Google Scholar
Marczyk M, Jaksik R, Polanski A, Polanska J. Adaptive filtering of microarray gene expression data based on Gaussian mixture decomposition. BMC Bioinform. 2013;14(1):101.
Article Google Scholar
Polanski A, Marczyk M, Pietrowska M, Widlak P, Polanska J. Initializing the EM algorithm for univariate Gaussian, multi-component, heteroscedastic mixture models by dynamic programming partitions. Int J Comput Methods. 2018;15(03):1850012.
Article Google Scholar
Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B Stat Methodol. 2001;63(2):411–23.
Article Google Scholar
Widlak P, Mrukwa G, Kalinowska M, Pietrowska M, Chekan M, Wierzgon J, et al. Detection of molecular signatures of oral squamous cell carcinoma and normal epithelium-application of a novel methodology for unsupervised segmentation of imaging mass spectrometry data. Proteomics. 2016;16(11–12):1613–21.
Article CAS Google Scholar
Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, et al. API design for machine learning software: experiences from the scikit-learn project. In: ECML PKDD workshop: languages for data mining and machine learning; 2013. p. 108–22.
Satopaa V, Albrecht J, Irwin D, Raghavan B, Finding a “kneedle” in a haystack: Detecting knee points in system behavior. In: 31st international conference on distributed computing systems workshops. IEEE; 2011; p. 166–71.
Mourafiq M. Polyaxon: cloud native machine learning automation platform; 2017. https://github.com/polyaxon/polyaxon.
Oetjen J, Veselkov K, Watrous J, McKenzie JS, Becker M, Hauberg-Lotte L, et al. Benchmark datasets for 3D MALDI-and DESI-imaging mass spectrometry. GigaScience. 2015;4(1):s13742-015.
Article Google Scholar
Race AM, Bunch J. Optimisation of colour schemes to accurately display mass spectrometry imaging data based on human colour perception. Anal Bioanal Chem. 2015;407(8):2047–54.
Article CAS Google Scholar
Race AM, Palmer AD, Dexter A, Steven RT, Styles IB, Bunch J. SpectralAnalysis: software for the masses. Anal Chem. 2016;88(19):9451–8.
Article CAS Google Scholar
Widlak P, Mrukwa G, Kalinowska M, Pietrowska M, Chekan M, Wierzgon J, et al. Oral squamous cell carcinoma—mass spectrometry imaging dataset. https://doi.org/10.5281/zenodo.7377802.
Oetjen J, Veselkov K, Watrous J, McKenzie JS, Becker M, Hauberg-Lotte L, et al. Supporting materials for “Benchmark datasets for 3D MALDI-and DESI-imaging mass spectrometry”. https://doi.org/10.5524/100131.

Download references

Acknowledgements

We would like to thank Piotr Widłak, Monika Pietrowska, Marta Gawin and Mykola Chekan from Maria Skłodowska-Curie National Research Institute of Oncology, Gliwice branch, for providing the necessary biological and MSI data acquisition background. Finally, we would like to thank Katarzyna Frątczak for help in data preprocessing.

Availability and requirements

Project name: DiviK; Project home page: https://github.com/gmrukwa/divik/; Operating system(s): Linux, Windows, Mac; Programming language: Python (\(\ge\)3.6); License: Apache 2.0; Any restrictions to use by non-academics: none.

Funding

Grzegorz Mrukwa was awarded the scholarship from NCBiR Grant AIDA No. I029/17-POWR.03.02.00-IP.08-00-DOK/17. This project was financially supported by NCN Grant BITIMS No. UMO-2015/19/B/ST6/01736 for data acquisition and numerical simulations.

Author information

Authors and Affiliations

Department of Data Science and Engineering, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland
Grzegorz Mrukwa & Joanna Polanska
Netguru, Małe Garbary 9, 61-756, Poznań, Poland
Grzegorz Mrukwa

Authors

Grzegorz Mrukwa
View author publications
You can also search for this author in PubMed Google Scholar
Joanna Polanska
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

JP and GM designed the DiviK core algorithm. GM implemented the scripts, performed the tests, and developed the package for PyPI submission. JP and GM wrote the manuscript. JP critically revised the work. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Joanna Polanska.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Mrukwa, G., Polanska, J. DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data. BMC Bioinformatics 23, 538 (2022). https://doi.org/10.1186/s12859-022-05093-z

Download citation

Received: 22 December 2020
Accepted: 01 December 2022
Published: 12 December 2022
DOI: https://doi.org/10.1186/s12859-022-05093-z

DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data

Abstract

Background

Results

Conclusions

Background

Implementation

Feature space optimisation

Stop condition

Clustering

Initialisation method

Scalability

Technology

Results

Oral squamous cell carcinoma

Mouse kidney 3D

Discussion

Oral squamous cell carcinoma

Mouse kidney 3D

Conclusions

Availability of data and materials

Abbreviations

References

Acknowledgements

Availability and requirements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us