Figure 3 | BMC Bioinformatics

Figure 3

From: Self consistency grouping: a stringent clustering method

Effect of random errors introduced in distance measurements.

(a) Dataset: We randomly generated a dataset of 80 points around four centers, (0, 8), (0, -8), (8, 0) and (-8, 0), with 20 points per center. Each point was offset from its center in both the X and Y directions by a random amount drawn from a normal distribution (µ = 0, SD = 1).

(b) Effect of random error on average cluster sizes: Euclidean distances were calculated for the dataset of 80 points. We then perturbed each pairwise distance with a random value drawn from a Gaussian distribution with µ = 0 and the SD shown on the X axis; SD = 0 means no perturbation. The perturbed distances were used to build clusters with SCG (cyan line), complete linkage (CL, blue line), average linkage (AL, green line) and single linkage (SL, red line). Since CL, AL and SL require score cut-offs, we ran them with distance cut-off values of 2 (conservative) and 4 (less conservative), denoted by the number following “/” (solid and dotted lines, respectively). Finally, the number of clusters, including singletons, was counted for each method and cut-off. The maximum possible value is therefore 80 (all singletons) and the minimum is 1; the ideal number is 4 by design (Fig. 3a). The error bars shown at different points of the curves (each curve representing a method) are derived from 100 perturbations for a given SD. Note that SCG shows the steepest rise in the number of clusters.

(c) Effect of random error on cluster qualities: The legend and the X-axis units are the same as in (b). After each method identifies clusters, we enumerate all pairs within each cluster; e.g., the cluster {1,2,3} is decomposed into three pairs: 1-2, 1-3 and 2-3. If the two objects of a pair do not belong to the same reference cluster, we increment the number of incorrect pairs by one. Note that the number of incorrect pairs is intrinsically related to the number of clusters. If the number of clusters is 1, meaning that all objects are grouped into one cluster, the number of clusters is at its minimum, which is reflected in a large number of incorrect pairs (see SL/4). If the number of clusters is 80 (as for SCG at high errors), every cluster is a singleton and no incorrect pairs exist.
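The simulation and the incorrect-pair count described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code. The function names (`make_dataset`, `perturbed_distances`, `single_linkage_clusters`, `count_incorrect_pairs`) and the random seed are hypothetical. SCG itself is not reproduced, because the caption does not describe its algorithm. Single linkage at a fixed cut-off is implemented here as the connected components of the graph of pairs whose distance does not exceed the cut-off. How the original study treated perturbed distances that become negative is not stated; this sketch leaves them unchanged.

```python
import itertools
import math
import random

# Centers of the four reference clusters, as given in panel (a).
CENTERS = [(0, 8), (0, -8), (8, 0), (-8, 0)]


def make_dataset(rng, per_center=20, sd=1.0):
    """Return (points, labels): per_center points around each center,
    each offset in X and Y by a normal deviate with mean 0 and SD sd."""
    points, labels = [], []
    for label, (cx, cy) in enumerate(CENTERS):
        for _ in range(per_center):
            points.append((cx + rng.gauss(0, sd), cy + rng.gauss(0, sd)))
            labels.append(label)
    return points, labels


def perturbed_distances(points, rng, error_sd):
    """Euclidean distance for every pair i < j, plus Gaussian noise
    with mean 0 and SD error_sd (no noise when error_sd is 0).
    Assumption: negative perturbed distances are kept as they are."""
    dist = {}
    for i, j in itertools.combinations(range(len(points)), 2):
        noise = rng.gauss(0, error_sd) if error_sd > 0 else 0.0
        dist[(i, j)] = math.dist(points[i], points[j]) + noise
    return dist


def single_linkage_clusters(n, dist, cutoff):
    """Single linkage cut at `cutoff`: connected components of the
    graph whose edges are the pairs with distance <= cutoff."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), d in dist.items():
        if d <= cutoff:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def count_incorrect_pairs(clusters, labels):
    """Count within-cluster pairs whose two members belong to
    different reference clusters, as described in panel (c)."""
    return sum(
        1
        for cluster in clusters
        for a, b in itertools.combinations(cluster, 2)
        if labels[a] != labels[b]
    )


# Small demonstration run with an arbitrary seed.
rng = random.Random(0)
points, labels = make_dataset(rng)
for error_sd in (0.0, 1.0, 2.0):
    dist = perturbed_distances(points, rng, error_sd)
    clusters = single_linkage_clusters(len(points), dist, cutoff=4)
    print(error_sd, len(clusters), count_incorrect_pairs(clusters, labels))
```

The two extremes mentioned in panel (c) follow directly from the counting rule. With 80 singletons there are no within-cluster pairs, so the count is 0. With all 80 points in one cluster there are 3160 pairs, of which 4 × 190 = 760 lie within a reference cluster, leaving 2400 incorrect pairs.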
