 Proceedings
 Open access
 Published:
Robust clustering in high dimensional data using statistical depths
BMC Bioinformatics volume 8, Article number: S8 (2007)
Abstract
Background
Meanbased clustering algorithms such as bisecting kmeans generally lack robustness. Although componentwise median is a more robust alternative, it can be a poor center representative for high dimensional data. We need a new algorithm that is robust and works well in high dimensional data sets e.g. gene expression data.
Results
Here we propose a new robust divisive clustering algorithm, the bisecting kspatialMedian, based on the statistical spatial depth. A new subcluster selection rule, Relative Average Depth, is also introduced. We demonstrate that the proposed clustering algorithm outperforms the componentwisemedianbased bisecting kmedian algorithm for high dimension and low sample size (HDLSS) data via applications of the algorithms on two real HDLSS gene expression data sets. When further applied on noisy real data sets, the proposed algorithm compares favorably in terms of robustness with the componentwisemedianbased bisecting kmedian algorithm.
Conclusion
Statistical data depths provide an alternative way to find the "center" of multivariate data sets and are useful and robust for clustering.
Background
In gene expression studies, the number of samples in most data sets is limited, while the total number of genes assayed is easily ten or twenty thousand. Such high dimension and low sample size data arise not only commonly in genomics but also frequently emerge in various other areas of science. In radiology and biomedical imaging, for example, one is typically able to collect far fewer measurements about an image of interest than the number of pixels.
These HDLSS data present a substantial challenge to many methods of classical analysis, including cluster analysis. In high dimensional data, it is not uncommon for many attributes to be irrelevant. In fact, the extraneous data can make identifying clusters very difficult [1]. Robust clustering methods are needed that are resistant to small perturbations of the data and the inclusion of unrelated variables [2].
The bisecting kmeans algorithm is a hybrid of hierarchical clustering and the kmeans algorithm. It proceeds topdown, splitting a cluster into two in each step, after which it will select one cluster based on a selection rule (commonly the cluster with the largest variance) to further split. In each splitting step, it randomly picks a pair of data points that are symmetric about the "center" of the data and assigns all other data points to one cluster or the other based on distance to the two selected points, thus the algorithm is similar to the kmeans algorithm. The center is usually the mean. This whole process continues until each point is a cluster or a predefined number of clusters is reached.
Similar to other commonly used methods that are based on mean, e.g. kmeans, bisecting kmeans is not robust because the mean is susceptible to outliers and noise [3]. As a common remedy, the bisecting kmedian algorithm, which replaces the mean by the componentwise median, is less sensitive to outliers. However, the componentwise median may be a very poor center representative of data, because it disregards the interdependence information among the components and is calculated separately on each component (dimension). For example, the componentwise median of the points (a, 0, 0), (0, b, 0) and (0, 0, c) for arbitrary reals a, b, c is (0, 0, 0) which even does not lie on the plane passing through the three points.
A new center representative for multivariate data that is robust and takes into account the interdependence among the dimensions is clearly needed.
Of the various multivariate medians, however, those defined via statistical depth functions are advantageous because the theory of statistical depth has been quite nicely established, though it is still relatively young and still under development. Analogous to linear order in one dimension, statistical depth functions provide an ordering of all points from the center outward in a multivariate data set. Linear order induces an ordering and ranking for 1dimensional observations. Median is the "deepest" point in the data set. In contrast, for dimension d ≥ 2, there is no natural order. As compensation, it is convenient and natural to orient to a "center", the deepest point, that is, the multivariate median. This leads to centeroutward ordering of points and to a description in terms of nested contours. Tukey [4] first introduced halfspace depth. Oja [5] defined Oja depth. Liu [6] proposed simplicial depth. Zuo and Serfling [7] considered projection depth. Other notions include Zonoid depth [8], generalized Tukey depth [9], and spatial depth [10] among others. See [7] for a systematic exhibition.
Of the various depth functions, the spatial depth is especially appealing because of its computational ease and mathematical tractability, see Vardi [11], Serfling [12], Chaudhuri [10] and Koltchinskii [13] among others. The spatial depth (SPD) of a point x w.r.t. a distribution F is defined asSPD(x, F) = 1  _{ F }S(x  X), x ∈ ℝ^{d},
where S(x) = x/x is the spatial sign function (S(0) = 0) with Euclidean norm ·. The sample spatial depth is
where F_{ n }(x) is the empirical distribution function of the data X_{1},...,X_{ n }. Points deep inside the data cloud have high depth values, while the points on the outskirts have lower depth values.
Figure 1 illustrates the spatial depth. Let e_{ i }= S(y x_{ i }) = (y  x_{ i }) = (y  x_{ i }) where e_{ i }represents the unit vector from y to x_{ i }. When y is located deep inside the cloud of x's, summing up e_{ i }will result in a vector close to \overrightarrow{0}, since unit vectors have different directions and they cancel each other out. The depth of y is approaching 1. See the diagram on the left in Figure 1. When y is outside the data cloud (as in the diagram on the right in Figure 1), the sum of e_{ i }has a large norm, thus the depth is approaching 0. The point where the spatial depth attains its maximum value 1 is called the spatial median. The spatial median represents the geometric center of the data, in particular, for a symmetrical distribution, the spatial median is the symmetric center. The spatial depth and the spatial median possess many nice properties. Robustness is one of them.
From the definition of the sample spatial depth, it is not difficult to see that the depth value of a point x does not change if any observations are moved to ∞ along the rays connecting them to the point x. Thus the spatial depth and the spatial median are highly robust in the presence of outliers. In fact, the breakdown point of the spatial median is 1/2, depending on neither the data nor the dimension and reaching the highest possible value for the translation equivalent location estimator. Here the "breakdown point" is the prevailing quantitative measure of robustness proposed by Donoho and Huber [14]. Roughly speaking, the breakdown point is the minimum fraction of the "bad" data points that can render the estimator beyond any boundary. It is clear to see that one bad point of a data set is enough to ruin the sample mean. Thus, the breakdown point of mean is 1/n → 0, the lowest possible value. That is, the sample mean vector is not robust, hence neither is the clustering method kmeans which is based on nonrobust sample means.
Unlike the componentwise median, the spatial median is equivariant under orthogonal transformations (e.g. rotations) of the data though it is not equivariant under general affine transformation. The spatial median may not be a reasonable estimate when the scale of different coordinates of the data are widely different. It is, however, very desirable for preprocessed gene data, where variables are isometric.
The complexity of the spatial median is O(n^{2}) for sample size n regardless of the dimension. This independence of dimension is particularly important for HDLSS data because high dimension usually causes problems for classical methods.
In our bisecting kspatialMedian algorithm, we propose the use of a robust spatial median to replace the nonrobust mean or the lessrobust componentwise median to determine the center of the data. The bisecting kspatialMedian algorithm is shown to be more robust than the bisecting kmedian algorithm in high dimension.
For the selection criterion, we replace the largest variance criterion, which is sensitive to outliers, and propose a depthbased notion, relative average depth (RAD), which characterizes the separatedness of a data set. With its range in [0, 2], a smaller value of the relative average depth indicates less separatedness and a larger value is an indication of higher separatedness. Indeed, in conjunction with the robust spatial median, we can use any existing selection criterion, including largest variance.
Results and discussion
Simulation study
To demonstrate the difference in performance between algorithms based on the spatial median and the componentwise median, we conduct a simulation of four clusters in ℝ^{3}, see Figure 2. Clusters I and II are comprised of data points (X, 0, 0) with X generated from the uniform distributions U(1.5, 2) and U(2.5, 3); and clusters III and IV comprised of data points (0, Y, 0) and (0, 0, Z) with Y from U(0.5, 1.2) and Z from U(3.5, 4.5), where III and IV have the same sample size equaling the sum of the sample sizes of I and II. We observe that the bisecting kmedian completely fails to separate the four clusters, while the proposed bisecting kspatialMedian successfully finds the four clusters. As shown in Figure 2, the four clusters were perfectly identified by the bisecting kspatialMedian algorithm. Since the output of the bisecting kmedian is the whole data set, its graph is in one color, without identification of clusters.
The phenomenon observed in the above simulated data seems unrepresentative because the data structure appears so contrived. But actually this is a quite general structure for HDLSS data. In fact, Hall et al. [15] show that there is a tendency for HDLSS data to lie deterministically at the vertices of a regular simplex and all the randomness in the data appears as a random rotation of this simplex. Based on this geometric representation, we have shown that the angle between any two distinct data points centered at their common mean is approximately perpendicular, and all these centered data points will lie on the coordinate axes. See the Methods section for more details.
The bisecting kspatialMedian algorithm
Based on the spatial median, we propose the bisecting kspatialMedian algorithm. Specifically, the bisecting kspatialMedian algorithm recursively splits a cluster by randomly choosing one point C_{ L }as the center of one subcluster. Let C be the spatial median of the whole data set. Then the center C_{ R }of the other subcluster is determined as the symmetric point of C_{ L }about C, namely, C_{ R }= C  (C_{ L } C). Every point X in the cluster is assigned to the subcluster containing C_{ L }or C_{ R }according to the smaller Euclidean distance X  C_{ L } or X  C_{ R }. This process is repeated until the convergence criterion is met, namely, the centers of the subclusters no longer change. After the cluster is split into two subclusters, a selection rule is needed to determine which subcluster is to be further split.
The basic bisecting kspatialMedian algorithm follows:
INITIALIZE:
K: number of clusters
C: center (spatial median) of the data cluster
C_{ L }: center of left subcluster
C_{ R }: center of right subcluster
FOR i = 1 to K  1 do
choose a cluster to split by the selection rule
randomly select a point C_{ L }as center of left subcluster
compute C_{ R }= C  (C_{ L } C)
for j = 1 to MAXITER do
for each data point X_{ i }
if X_{ i } C_{ L } ≥ X_{ i } C_{ R }
assign X_{ i }to the right subcluster
else
assign X_{ i }to the left subcluster
end
Let M_{ L }be the spatial median of the left subcluster
Let M_{ R }be the spatial median of the right subcluster
if M_{ L }== C_{ L }and M_{ R }== C_{ R }
break
C_{ L }= M_{ L }
C_{ R }= M_{ R }
end
END
Subcluster selection rule
In the bisecting kspatialMedian algorithm, we need to decide which cluster is to be further split in each step. Selecting the one with the largest variance is a very common approach. Here we propose a new rule based on the statistical spatial depth.
Suppose that a data set is naturally composed of two clusters J_{1} and J_{2}. Let {D}_{1}^{w} be the sum of spatial depth values of all data points in J_{1} with respect to J_{1}. Let {D}_{2}^{w} be the sum of spatial depth values of all data points in J_{2} with respect to J_{2}. Note that {D}_{1}^{w} or {D}_{2}^{w} represents "withindepth", because it is calculated with respect to the cluster to which the data points belong. Let {D}_{1}^{b} be the sum of spatial depth values of all data points in J_{1} with respect to J_{2}. Similarly, let {D}_{2}^{b} be the sum of spatial depth values of all data points in J_{2} with respect to J_{1}. {D}_{1}^{b} or {D}_{2}^{b} represents ''betweendepth", because it is calculated with respect to the other cluster. See Figure 3 for a graphic display. The withindepth is larger when a cluster is more condensed whereas the betweendepth is smaller when two clusters are further away from each other.
Let J_{1} and J_{2} represent the number of data points in J_{1} and J_{2} respectively. The relative average depth is defined as
As shown from Figure 3, if a data set is naturally composed of two clusters and thus should be split into two, the withindepth should be relatively large and the betweendepth relatively small, therefore the relative average depth (RAD) which is essentially the averaged difference between the withindepth and the betweendepth will be relatively large compared to the RAD of a data set that is more condensed and cannot be split into two clusters obviously. In fact we have shown that a larger value of RAD indicates less condenseness of a data set. See Section Methods for technical details. Hence we obtain a new selection rule: A cluster with the largest value of RAD should be selected to split.
The following simulation demonstrates the relationship between the value of RAD and the condenseness of a data set. As shown in Figure 4a, two clusters were generated from normal distributions with means μ_{1} = (0, 0) and μ_{2} = (4, 4), covariances Σ_{1} = (1, 0.5; 0.5, 1) and Σ_{2} = (1, 0.5; 0.5, 1) for the same sample size 200. Obviously the data comprises of two clusters and should be split as such. The relative average depth RAD = 0.7864. If the second cluster is moved from μ_{2} = (4, 4) to μ_{2} = (6, 6), the two clusters are further away from each other, as shown in Figure 4b. Compared with the previous situation, this new data should have higher priority to be selected for further splitting. The relative average depth increases to RAD = 0.8018. Table 1 lists the values of RAD with one cluster being moved further away from another one with μ_{1} = (0, 0). We can see that the RAD value increases slowly when the two clusters are more separated.
Applications
Data sets
We use the proposed bisecting kspatialMedian algorithm to analyze two well known data sets. The first is the colon cancer data (Alon data) [16], which is comprised of expression levels of 2000 genes describing 62 samples (40 tumor and 22 normal colon tissues, Affymetrix oligonucleotide arrays). The second is a pediatric Acute Lymphoblastic Leukemia (ALL) data from St. Jude Children's Research Hospital (SJCRH) [17], which includes 12,625 gene expression measurements (Affymetrix arrays) per patient from 246 patients with six different subtypes of ALL.
In the investigation at SJCRH, 246 cases of pediatric ALL were analyzed on the U133 A and B chips, involving six primary subtypes of ALL: BCRABL, E2APBX1, Hyperdiploid > 50, MLL, TALL and TEL. The original data has patient information with two additional subtypes, which did not fit into one of the above primary diagnostic groups or were added for the analysis of relapse and secondary AML. Our study did not include these two subtypes.
Design of the experiment
Since the mean is known to lack robustness, we will focus on the comparison of the bisecting algorithms based on componentwise median and spatial median in this paper.
The two data sets were used to compare the performance of the proposed bisecting kspatialMedian with the bisecting kmedian. Since the class labels of the two data sets are known, the number K of clusters is also known. The Alon data set has two classes, so K = 2. For the ALL data from SJCRH, K = 6. The algorithms are applied on the two datasets and terminated when K clusters have been reached.
In order to investigate the performance of the proposed clustering algorithm for HDLSS data, we test them on the two data sets for various dimensions, i.e., different number of genes selected. For the ALL data which has 12265 genes, we test the dimensions \mathcal{D} = {100; 200; 500; 1000; 1500; 2000; 3000; 4000; 5000}; for the Alon data which has 2000 genes we test the dimensions \mathcal{D} = {50; 100; 200; 500; 1000; 2000}.
For each \mathcal{D}, we trim the data with only \mathcal{D} "most important" genes. We use the SVMRFEAnnealing algorithm [18] to select the \mathcal{D} most important genes. All clustering algorithms are then applied to the trimmed data.
Validation of the clustering results is usually not easy. However, in situations where data are already categorized, as with these data, we can compare the predicted clusters from our algorithms with the true class labels. To display the results, we build a confusion matrix in which rows represent the predicted clusters while columns represent the true clusters. The number in the cell (i, j) is the number of observations that are from cluster j but are predicted to be from cluster i. The rows and columns are "matched" by a brute force algorithm, and this is optimistic. Two evaluation measures, Entropy and Misclustering rate, are used. See the Methods section for more details.
Because the bisecting divisive clustering algorithm randomly selects a point as the center of the subcluster C_{ L }, it is nondeterministic and therefore yields stochastic clustering results. To evaluate the stochastic clustering result, we ran each algorithm 20 times and calculated the average entropy and misclustering rates as the clustering measures. These algorithms select the next subcluster to split based on the criterion of the largest variance. We compare the performance of our proposed bisecting kspatialMedian with bisecting kmedian based on the same selection rule, the largest variance, on the two data sets. The performance of bisecting kspatialMedian with the selection criterion of the relative average depth is also presented.
To investigate the robustness of our proposed procedure, we compare the sensitivity of the proposed algorithm to noise with the bisecting kmedian algorithm. We add noise to the Alon data and then apply the three algorithms (bisecting kmedian, bisecting kspatialMedian with largest variance splitting rule, bisecting kspatialMedian with RAD splitting rule) on it to investigate their performance.
We generated a percentage of random noise and added to the Alon data by changing the expression value of a point to either the maximum or minimum value of all data points. In this way, some data points are changed to have extreme values and more likely to become outliers. Experiments show that our proposed algorithms based on spatial median perform better than the bisecting kmedian algorithm in this noisy environment.
The result on the Alon data
Figure 5 reports the entropy and the misclustering rates of the algorithms on the trimmed Alon data. These algorithms are the bisecting kmedian (median), the bisecting kspatialMedian (spatialMedian), the bisecting kspatialMedian based on the selection criterion of the relative average depth (SMRAD). The first two algorithms use the largest variance as selection rule.
From Figure 5a and 5b, we can see that both of the algorithms using spatial median have lower entropy and misclustering rates than the one using componentwise median in most of the cases. When we use more than 400 genes in clustering, the algorithms using spatial median are better than the one using componentwise median, which demonstrates that spatial median is more robust in higher dimensional data. Also the performance of the algorithm using median is decreasing dramatically with dimensions increasing from 200 to 1000, while the performance of the algorithms using spatial median does not degrade as much.
Figure 6 shows the entropy values with standard deviation of the three algorithms. We can see that the three algorithms display similar variation, about 0.2 in most cases. The very similar results are obtained by using misclustering rate.
Additional file 1 gives an example of the relationship of the number of runs and average entropy of the Alon data. In additional file 1, the entropy values get more stable with the number of runs increasing, which justifies the need of running the clustering algorithms multiple times. The average misclustering rate and the number of runs have the similar relationship.
The result on the SJCRH data
Similarly, Figure 7 reports the entropy and misclustering rates of the algorithms on the trimmed SJCRH data. We can see that in most of the cases after 500 genes are used, both of the algorithms using spatial median are better than the bisecting kmedian. The largest difference between bisecting kspatialMedian and median is more than 10%. The results are consistent with the results on the Alon data.
Similarly, Figure 8 shows the entropy values with standard deviation of the three algorithms. We can see that the three algorithms display similar variation, less than 0.1 in most cases, although the algorithm using median achieves the lowest standard deviation. Standard deviation appears to be more consistent with median than with spatialMedian on the SJCRH data. The very similar results are obtained by using misclustering rate.
Additional file 2 gives an example of the relationship of the number of runs and average entropy of the SJCRH data. In additional file 2, the entropy values get more stable with the number of runs increasing. The average misclustering rate and the number of runs have the similar relationship.
The result on the noisy Alon data
We randomly add noise to the Alon data to see how well the algorithms based on the componentwise median and the spatial median perform in a noisy environment.
To this end, we randomly pick 10% of data from the Alon data, and reset their values to be either the maximum or minimum value in the data matrix.
We applied the three algorithms to this noisy data and observed that all the algorithms have been influenced by the noise. However, the bisecting kmedian is more susceptible to the noise, which can be demonstrated by the fact that it cannot separate the two clusters at all.
This process is repeated several times and the results are very consistent. We further increase the amount of noise from 10% to 20% and get a similar result.
Figure 9 shows that the algorithms based on spatial median have very similar entropy values and misclustering rates on the noisy Alon data. Since the bisecting kmedian cannot separate the two clusters, its entropy value or misclustring rate is not available thus not shown in Figure 9.
Conclusion
The spatial depth function provides a robust location estimator whereas componentwise median may not work well in high dimension and low sample size data, which is illustrated by easily designed simulation. The experimental results on real data sets further verify that the spatial median based bisecting clustering algorithm is more robust to outliers and noise in high dimensional data, such as gene expression data, than the bisecting kmedian algorithm.
Methods
Geometric structure of HDLSS data
In their 2005 article, Hall, Marron and Neeman [15] point out that for ddimensional i.i.d. random vectors Z_{1},...,Z_{ m }whose coordinates are i.i.d. with the standard normal \mathcal{N}(0, 1), all distinct pairwise Euclidean distances Z_{ i } Z_{ j }_{ d }are approximately equal and all pairwise angles ang(Z_{ i }, Z_{ j }) are approximately perpendicular for large d. Without normality assumptions they further demonstrate that all pairwise distances are still approximately equal under certain moment assumptions. Specially they give the following geometric representation. For an infinite sequence X = (X^{(1)}, X^{(2)},...) of random variables, assume

(i)
There exists a constant M such that \mathbb{E}X^{(i)}^{4} <M for all i = 1, 2,....

(ii)
There is a constant σ^{2} such that
\begin{array}{cc}\frac{1}{d}{\displaystyle \sum _{k=1}^{d}\mathbb{V}ar({X}^{(k)})}\to {\sigma}^{2},& d\to \infty .\end{array}(1) 
(iii)
The infinite sequence X is ρ mixing, for detail, see [15].
Let X(d) = (X^{(1)},...,X^{(d)}) be a coordinate projection of X into the ddimensional space ℝ^{d}and let X_{1}(d),...,X_{ m }(d) be independent and identical copies of X(d). Then for all distinct pairs X_{ i }≠ X_{ j }, the distances {\Vert {X}_{i}{X}_{j}\Vert}_{d}={\left({{\displaystyle {\sum}_{k=1}^{d}({X}_{i}^{(k)}{X}_{j}^{(k)})}}^{2}\right)}^{1/2} are approximately equal:d^{1/2}X_{ i } X_{ j }_{ d }→ 2 σ, d → ∞.
Observing their result, we find, with μ = \mathbb{E}X_{ i }, that
d^{1/2}X_{ i } X_{ j }_{ d } d^{1/2}X_{ i } μ_{ d } d^{1/2}X_{ j } μ_{ d }→ 0,
as d → ∞. This shows, in view of the Pythagorean theorem, the following fact.
Fact 1. Under the above assumptions (i)–(iii), the pairwise angle between distinct X_{ i } μ_{ i }and X_{ j } μ_{ j }is approximately perpendicular:
ang(X_{ i } μ, X_{ j } μ) = π/2 + O_{ p }(d^{1/2}). (3)
It is well known that spatial depth function attains its maximum value at the symmetric center of a distribution under very mild assumptions and the spatial median is the maximizer. Thus the spatial median is the center of the regular simplex when the number of observations at every vertex is equal.
This exhibits that, for HDLSS data, the spatial depth can find the center and this helps find the right clusters, while a componentwise median may fail to find the symmetric center and thus the componentwisemedianbased procedures may be unable to find the right clusters. In fact, we expanded the dimension of our data set from the previous simulation which has three dimensions as shown in Figure 2 and found that the componentwisemedianbased bisecting kmedian breaks down more easily with increasing dimension while the bisecting kspatialMedian does not.
Theoretical verification of subcluster selection rule
Suppose that we have collected observations X_{ j }: j ∈ J = {1,...,n} which are points in ℝ^{d}. Suppose also that these observations are from two sources. We want to find a rule to measure the condenseness of the data, in other words, how different the two resources are. Statistically we suppose that X_{ j }: j ∈ J = {1,...,n} are independent observations from a population distribution F. Suppose that X_{ j }: j ∈ J_{1} and X_{ j }: j ∈ J_{2} are from population distributions F_{1} and F_{2} respectively with J_{1}, J_{2} being partitions of J. For convenience we refer to these two subclusters of J as J_{1} and J_{2} respectively. We want to use the robust depth functions to measure the condenseness of J, or in other words, the separatedness of J_{1} and J_{2}. Let D(x, F) be the population depth of a point x with respect to F. The sample depth is D(x, J) ≡ D(x, F_{ n }) where F_{ n }is the empirical distribution of F.
One of the desirable properties for most of the depth functions is monotonicity relative to the deepest point, i.e., the depthbased multivariate median. Specifically, as a point x ∈ ℝ^{d}moves away from the multivariate median M along any fixed ray through M, the depth at x decreases monotonically, namely,
D(x, F) ≤ D(M + α(x  M), F), x ∈ ℝ^{d} (4)
holds for all α ∈ [0, 1]. This property can be used to characterize the separatedness of the two clusters. For unambiguity let us write X_{ i }for the observations X_{ i }: i ∈ J_{1} and Y_{ j }for X_{ j }: j ∈ J_{2}.
Suppose that clusters J_{1} and J_{2} are separated. Observe that, by the monotonicity (4), if X_{ i }is from cluster J_{1} and Y_{ j }from cluster J_{2} then the depth of X_{ i }should be larger than the depth of Y_{ j }, both with respect to cluster J_{1}. Namely,
D(X_{ i }, J_{1}) ≽ D(Y_{ j }, J_{1}), i ∈ J_{1}, j ∈ J_{2}, (5)
where is the stochastic ordering in the sense that η ≽ ξ, if and only if ℙ (η ≥ ξ) ≥ 1/2 for two random variables η, ξ. The inequalities are useful in characterizing the separatedness of two clusters J_{1} and J_{2}.
Note that D(X_{ i }, J_{1}) and D(Y_{ j }, J_{1}) are called within and betweendepth by [19] and [2]. The population version of (5) is
D(X, F_{1}) ≽ D(Y, F_{1}), X ~ F_{1}, Y ~ F_{2}. (6)
The inequality has clear geometric interpretation. With respect to distribution F_{1}, the depth of random variable X from distribution F_{1} is larger than the depth of random variable Y from distribution F_{2}. Indeed we have the following fact for the spatial depth.
Fact 2. Suppose F_{2} = F_{1}(·  c) where c ∈ ℝ^{d}is a constant vector. If F_{1} has finite support, then for X ~ F_{1} and Y ~ F_{2},
Proof. Using \mathbb{E}S(x  ξ)^{2} = \mathbb{E}_{ξ,η}S^{⊥}(x  ξ)S(x  η) where ξ, η are independent and have a common distribution and \mathbb{E}_{ξ,η}is calculated under the joint probability of ξ and η, one has
ℙ(SPD(X, F_{1}) ≥ SPD(Y, F_{1})) = ℙ(_{ξ,η}[S^{⊥}(X  ξ)S(X  η)  S^{⊥}(X  ξ + c)S(X  η + c)] ≤ 0).
It is easy to see S^{⊥}(X  ξ + c)S(X  η + c) → 1 as c → ∞. Combining the above yields the desired result and the proof is complete.
Fact 2 implies that if one cluster is shifted away further enough then we have the stochastic ordering (6) and hence (5) for large sample.
However, the inequality is a little too strong. Instead of (6) holding for all X ~ F_{1} and Y ~ F_{2}, a less restrictive inequality would be to require (6) to hold on average, i.e.,
Analogously,
Indeed similarly to the proof of Fact 2, we may establish the above two inequalities which shall be discussed elsewhere.
In order to characterize the separatedness of the two clusters we first introduce the following notions.
Depth total, Within and BetweenDepth
Let D_{J}be the sum of the sample depths of all observations on J, i.e., D_{J}= ∑_{j∈J}D(X_{ j }, J), and we call it the depth total on J. We call the depth total on J_{1} and J_{2},
the withindepth, and
the betweendepth. Figure 3 is a graphic display of these notations.
Summing up i ∈ J_{1}, j ∈ J_{2} through (5) yields
Analogously,
These two inequalities can be used to characterize the separatedness of two clusters J_{1} and J_{2}. To exploit the inequalities simultaneously we introduce the following.
Relative average depth
is called the relative average depth. If clusters F_{1} and F_{2} are separated, then the two inequalities (7) and (8) should hold. We believe that the two inequalities can be used to characterize the separatedness of two clusters of random variables. Note that if indeed Y is from the same distribution as F_{1}, namely, F_{1} = F_{2}, then the equalities in (7) and (8) hold. In other words, a value of RAD close to zero indicates the cluster J is actually one cluster. Clearly RAD is bounded from above by 2. A value of RAD close to 2 indicates that the cluster J is comprised of two clusters J_{1} and J_{2}. Summarizing our discussion above, we have the following result.
Selection criterion
A cluster with the largest value of RAD should be selected to split. If a cluster is less condensed, the RAD value will be larger. So the cluster with the largest RAD value will be the least condensed and thus should be selected for splitting.
Evaluation measures
Suppose that Z = (z_{ ij }) is the m × n confusion matrix, where z_{ ij }is the number of data points which are predicted from cluster C_{ i }but in fact are from the true cluster C_{ j }. For generality, we use m and n where m and n can be different. But in our experiments, the number of actual clusters k is known, therefore m = n = k. {m}_{j}={\displaystyle {\sum}_{i=1}^{m}{z}_{ij}} is the number of data points in the true cluster j and {n}_{i}={\displaystyle {\sum}_{j=1}^{n}{z}_{ij}} is the number of data points in the predicted cluster i. Let N be the total number of data points.
One common measure of cluster quality is entropy. The entropy of predicted cluster i is defined as:
where k is the number of clusters.
The value of entropy ranges from 0 to 1. An entropy value of 0 means the cluster is comprised entirely of one class, while an entropy value near 1 implies that the cluster contains a uniform mixture of classes. The smaller the entropy value, the better the clustering performance.
Another measure of clustering we use is misclustering rate. Based on the confusion matrix, the accuracy jth cluster is for z_{ ij }/m_{ j }. Since each true cluster contributes m_{ j }to the total N={\displaystyle {\sum}_{i=1}^{m}{m}_{i}+}{\displaystyle {\sum}_{j=1}^{n}{n}_{j}} data points, its contribution has a weight m_{ j }/N. The global accuracy [20] is the weighted sum,
Then the misclustering rate is 1{\displaystyle {\sum}_{j=1}^{n}\frac{{z}_{jj}}{N}}.
Since we do not know how to match up the predicted clusters with the true ones, z_{ ij }on the diagonal of the confusion matrix may not be the accurately predicted number of data. We use brute force to search for the best alignment between the predicted and the true clusters. The time complexity is O(k!) if there are k true clusters and k predicted clusters. This brute force approach is not a part of the algorithm itself, but is used to aid in a fair evaluation.
References
Parsons L, Haque E, Liu H: Subspace Clustering for High Dimensional Data: a Review. SIGKDD Explor Newsl. 2004, 6: 90105. 10.1145/1007730.1007731.
Jörnsten R, Vardi Y, Zhang CH: A Robust Clustering Method and Visualization Tool Based on Data Depth. 2002, Basel: Birkhäuser
GarciaEscudero LA, Gordaliza A: Robustness Properties of k Means and Trimmed k Means. Journal of the American Statistical Association. 1999, 94 (447): 956969. 10.2307/2670010.
Tukey W: Mathematics and the Picturing of Data. Proceedings of the International Congress of Mathematicians. 1975, 2: 523531.
Oja H: Descriptive Statistics for Multivariate Distributions. Statist Probab Lett. 1983, 1: 327333. 10.1016/01677152(83)900548.
Liu RY: On a Notion of Data Depth Based upon Random Simplices. The Annals of Statistics. 1990, 18: 405414.
Zuo Y, Serfling R: General Notions of Statistical Depth Function. The Annals of Statistics. 2000, 28 (2): 461482. 10.1214/aos/1016218226.
Koshevoy G, Mosler K: Zonoid Trimming for Multivariate Distributions. Annals of Statistics. 1997, 25 (5): 19982017. 10.1214/aos/1069362382.
Zhang J: Some Extensions of Tukey's Depth Function. Journal of Multivariate Analysis. 2002, 82: 134165. 10.1006/jmva.2001.2011.
Chaudhuri P: On a Geometric Notion of Quantiles for Multivariate Data. Journal of the American Statistical Association. 1996, 91: 862872. 10.2307/2291681.
Vardi Y, Zhang CH: The Multivariate L1median and Associated Data Depth. Proc Natl Acad Sci USA. 2000, 97 (4): 14231426. 10.1073/pnas.97.4.1423.
Serfling R: A Depth Function and a Scale Curve Based on Spatial Quantiles. 2002, Birkhäeuser: Boston: Birkhauser, 2538.
Koltchinskii VI: Mestimation, Convexity and Quantiles. Ann Statistics. 1997, 25: 435477. 10.1214/aos/1031833659.
Donoho DL, Huber P: The Notion of Breakdown Point. 1983, Belmont, CA: Wadsworth, 157184.
Hall P, Marron JS, Neeman A: Geometric Representation of High Dimension, Low Sample Size Data. J R Statist Soc B. 2005, 67: 427444. 10.1111/j.14679868.2005.00510.x.
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proc Natl Acad Sci USA. 1999, 96: 67456750. 10.1073/pnas.96.12.6745.
Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, Behm FG, Raimondi SC, Relling MV, Patel A, Cheng C, Campana D, Wilkins D, Zhou X, Li J, Liu H, Pui CH, Evans WE, Naeve C, Wong L, Downing JR: Pediatric Lymphoblastic Leukemia by Gene Expression Profiling. Cancer Cell. 2002, 1: 133143. 10.1016/S15356108(02)000326.
Ding Y, Wilkins D: Improving the Performance of SVMRFE to Select Genes in Microarray Data. BMC Bioinformatics. 2006, 7 (Suppl 2):
Jörnsten R: Clustering and Classification Based on the L1 Data Depth. Journal of Multivariate Analysis. 2004, 90: 6789. 10.1016/j.jmva.2004.02.013.
Ding C, He X: Cluster Merging and Splitting in Hierarchical Clustering Algorithms. Proceedings of IEEE International Conference on Data Mining (ICDM'02). 2002
Acknowledgements
We thank Dr. Yixin Chen for his valuable suggestion. We also thank Alon et al. and St. Jude Children's Research Hospital for the use of their data sets.
This article has been published as part of BMC Bioinformatics Volume 8 Supplement 7, 2007: Proceedings of the Fourth Annual MCBIOS Conference. Computational Frontiers in Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/8?issue=S7.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
XD and HP contributed to the theoretical development. YD and DW contributed to the experimentation and development of the computer code. All authors read and approved the final manuscript.
Electronic supplementary material
12859_2007_1939_MOESM1_ESM.eps
Additional file 1: The relationship between the number of runs and average entropy of the three algorithms on the Alon data. Additional file 1 demonstrates that when the algorithms are run more times, the average entropy values of all the algorithms get more stable. In this figure, 500 genes were selected. (EPS 14 KB)
12859_2007_1939_MOESM2_ESM.eps
Additional file 2: The relationship between the number of runs and average entropy of the three algorithms on the SJCRH data. Additional file 2 demonstrates that when the algorithms are run more times, the average entropy values of all the algorithms get more stable. In this figure, 1000 genes were selected. (EPS 14 KB)
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Ding, Y., Dang, X., Peng, H. et al. Robust clustering in high dimensional data using statistical depths. BMC Bioinformatics 8 (Suppl 7), S8 (2007). https://doi.org/10.1186/147121058S7S8
Published:
DOI: https://doi.org/10.1186/147121058S7S8