Statistical power for cluster analysis

BMC Bioinformatics

Table 1 Summary of the simulation analyses that were conducted, and the variables that were varied in each set of simulations

Analysis	N	k	Effect size	Covariance	Dimensionality reduction	Cluster algorithms
(1) What drives cluster separation	1000	2 (10/90%), 2 (equal), 3 (equal)	Δ = 0.3–8.1	15 features: None, Random, Mixed	None, MDS, UMAP	K-means, Ward, Cosine, HDBSCAN
(2) Statistical power	10–160	2 (10/90%), 2 (equal), 3 (equal), 4 (equal)	Δ = 1–10	2 features: None	None	K-means, HDBSCAN, C-means
(3) Discrete versus fuzzy clustering	120	1, 2 (equal), 3 (equal), 4 (equal)	Δ = 1–10	2 features: None	None	K-means, C-means, Mixture model

Each unique combination of the listed settings was simulated. “Ward” and “Cosine” refer to agglomerative (hierarchical) clustering, using Ward linkage with Euclidean distance and average linkage with cosine distance, respectively. “Mixture model” refers to finite Gaussian mixture modelling.
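As an illustration of what simulating each unique combination implies, the following minimal sketch enumerates the design for analysis (1) as a full-factorial grid. It is not the authors' simulation code. The variable names and the dictionary layout are assumptions made for this example, and the effect size is omitted because Table 1 reports only its range (Δ = 0.3–8.1), not the individual values that were simulated.

```python
from itertools import product

# Levels copied from the row for analysis (1) in Table 1.
# The effect size is deliberately left out: the table gives only its
# range, not the individual values, so they are not guessed here.
cluster_settings = ["2 (10/90%)", "2 (equal)", "3 (equal)"]
covariance = ["None", "Random", "Mixed"]
dim_reduction = ["None", "MDS", "UMAP"]
algorithms = ["K-means", "Ward", "Cosine", "HDBSCAN"]

# Full-factorial design: one entry per unique combination of levels.
# The dictionary keys are hypothetical names chosen for this sketch.
design = [
    {"k": k, "covariance": cov, "dim_reduction": dr, "algorithm": alg}
    for k, cov, dr, alg in product(
        cluster_settings, covariance, dim_reduction, algorithms
    )
]

# 3 * 3 * 3 * 4 = 108 combinations for each simulated effect size.
print(len(design))
```

Under these assumptions, every simulated effect size would be crossed with 108 combinations of cluster configuration, covariance structure, dimensionality reduction and clustering algorithm.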

ISSN: 1471-2105