Robust clustering in high dimensional data using statistical depths

Ding, Yuanyuan; Dang, Xin; Peng, Hanxiang; Wilkins, Dawn

doi:10.1186/1471-2105-8-S7-S8

Volume 8 Supplement 7

Proceedings of the Fourth Annual MCBIOS Conference. Computational Frontiers in Biomedicine

Proceedings
Open access
Published: 01 November 2007


BMC Bioinformatics volume 8, Article number: S8 (2007)


Abstract

Background

Mean-based clustering algorithms such as bisecting k-means generally lack robustness. Although the componentwise median is a more robust alternative, it can be a poor center representative for high dimensional data. A new algorithm is needed that is both robust and effective on high dimensional data sets such as gene expression data.

Results

Here we propose a new robust divisive clustering algorithm, the bisecting k-spatialMedian, based on the statistical spatial depth. A new subcluster selection rule, Relative Average Depth, is also introduced. We demonstrate that the proposed clustering algorithm outperforms the componentwise-median-based bisecting k-median algorithm for high dimension and low sample size (HDLSS) data via applications of the algorithms on two real HDLSS gene expression data sets. When further applied on noisy real data sets, the proposed algorithm compares favorably in terms of robustness with the componentwise-median-based bisecting k-median algorithm.

Conclusion

Statistical data depths provide an alternative way to find the "center" of multivariate data sets and are useful and robust for clustering.

Background

In gene expression studies, the number of samples in most data sets is limited, while the total number of genes assayed is easily ten or twenty thousand. Such high dimension and low sample size data arise not only commonly in genomics but also frequently emerge in various other areas of science. In radiology and biomedical imaging, for example, one is typically able to collect far fewer measurements about an image of interest than the number of pixels.

These HDLSS data present a substantial challenge to many methods of classical analysis, including cluster analysis. In high dimensional data, it is not uncommon for many attributes to be irrelevant. In fact, the extraneous data can make identifying clusters very difficult [1]. Robust clustering methods are needed that are resistant to small perturbations of the data and the inclusion of unrelated variables [2].

The bisecting k-means algorithm is a hybrid of hierarchical clustering and the k-means algorithm. It proceeds top-down, splitting a cluster into two in each step, after which it will select one cluster based on a selection rule (commonly the cluster with the largest variance) to further split. In each splitting step, it randomly picks a pair of data points that are symmetric about the "center" of the data and assigns all other data points to one cluster or the other based on distance to the two selected points, thus the algorithm is similar to the k-means algorithm. The center is usually the mean. This whole process continues until each point is a cluster or a predefined number of clusters is reached.

Like other commonly used mean-based methods, e.g. k-means, bisecting k-means is not robust because the mean is susceptible to outliers and noise [3]. A common remedy is the bisecting k-median algorithm, which replaces the mean by the componentwise median and is therefore less sensitive to outliers. However, the componentwise median may be a very poor center representative of the data, because it is calculated separately on each component (dimension) and so disregards the interdependence among the components. For example, the componentwise median of the points (a, 0, 0), (0, b, 0) and (0, 0, c), for arbitrary nonzero reals a, b, c, is (0, 0, 0), which does not even lie on the plane passing through the three points.
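The example can be checked numerically. The following minimal Python sketch uses the illustrative values a = 2, b = 3, c = 5 (our own choice, not taken from any data set) to compute the componentwise median and evaluate the plane equation x/a + y/b + z/c = 1 at it.

```python
import numpy as np

# Three points on the coordinate axes; a, b, c are arbitrary nonzero values
# chosen here purely for illustration.
a, b, c = 2.0, 3.0, 5.0
points = np.array([[a, 0.0, 0.0],
                   [0.0, b, 0.0],
                   [0.0, 0.0, c]])

# Componentwise median: the median of each coordinate taken separately.
cw_median = np.median(points, axis=0)

def plane_value(p):
    """Left-hand side of x/a + y/b + z/c = 1, the plane through the points."""
    return p[0] / a + p[1] / b + p[2] / c

print(cw_median)               # the origin
print(plane_value(cw_median))  # 0.0, whereas each of the three points gives 1.0
```

Each coordinate has two zeros among its three values, so every coordinate median is zero regardless of the particular a, b, c.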

A new center representative for multivariate data that is robust and takes into account the interdependence among the dimensions is clearly needed.

Among the various multivariate medians, those defined via statistical depth functions are advantageous because the theory of statistical depth is well established, although it is still relatively young and under development. Analogous to linear order in one dimension, statistical depth functions provide an ordering of all points from the center outward in a multivariate data set. Linear order induces an ordering and ranking for 1-dimensional observations, and the median is the "deepest" point in the data set. In contrast, for dimension d ≥ 2, there is no natural order. As compensation, it is convenient and natural to orient to a "center", the deepest point, that is, the multivariate median. This leads to a center-outward ordering of points and to a description in terms of nested contours. Tukey [4] first introduced halfspace depth. Oja [5] defined Oja depth. Liu [6] proposed simplicial depth. Zuo and Serfling [7] considered projection depth. Other notions include zonoid depth [8], generalized Tukey depth [9], and spatial depth [10], among others. See [7] for a systematic exposition.

Of the various depth functions, the spatial depth is especially appealing because of its computational ease and mathematical tractability; see Vardi [11], Serfling [12], Chaudhuri [10] and Koltchinskii [13], among others. The spatial depth (SPD) of a point x with respect to a distribution F is defined as

SPD(x, F) = 1 - ||E_F S(x - X)||, x ∈ ℝ^d,

where S(x) = x/||x|| is the spatial sign function (S(0) = 0) with Euclidean norm ||·||. The sample spatial depth is

SPD(x, F_{n}) = 1 - \left\| \frac{1}{n} \sum_{i = 1}^{n} S(x - X_{i}) \right\|, \quad x \in ℝ^{d},

where F_n(x) is the empirical distribution function of the data X₁,...,X_n. Points deep inside the data cloud have high depth values, while the points on the outskirts have lower depth values.
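As a rough illustration of the sample spatial depth, the following Python sketch evaluates SPD(x, F_n) at a point near the middle of a simulated data cloud and at a point far outside it. The function name `spatial_depth` and the simulated Gaussian data are illustrative assumptions and are not part of the original study.

```python
import numpy as np

def spatial_depth(x, data):
    """Sample spatial depth of point x w.r.t. the rows of `data`."""
    diffs = x - data                              # shape (n, d)
    norms = np.linalg.norm(diffs, axis=1)
    signs = np.zeros_like(diffs, dtype=float)
    nonzero = norms > 0                           # convention S(0) = 0
    signs[nonzero] = diffs[nonzero] / norms[nonzero, None]
    return 1.0 - np.linalg.norm(signs.mean(axis=0))

# Hypothetical data: 200 standard normal points in 5 dimensions.
rng = np.random.default_rng(0)
data = rng.standard_normal((200, 5))

depth_center = spatial_depth(np.zeros(5), data)       # deep inside the cloud
depth_far = spatial_depth(np.full(5, 10.0), data)     # far outside the cloud
print(depth_center, depth_far)
```

The central point should receive a depth close to 1 and the remote point a depth close to 0, in line with the description above.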

Figure 1 illustrates the spatial depth. Let e_i = S(y - x_i) = (y - x_i)/||y - x_i||, so that e_i is the unit vector pointing from x_i to y. When y is located deep inside the cloud of x's, summing up the e_i results in a vector close to $\vec{0}$, since the unit vectors point in different directions and cancel each other out. The depth of y is then close to 1. See the diagram on the left in Figure 1. When y is outside the data cloud (as in the diagram on the right in Figure 1), the average of the e_i has a norm close to 1, thus the depth is close to 0. The point where the spatial depth attains its maximum value 1 is called the spatial median. The spatial median represents the geometric center of the data; in particular, for a symmetric distribution, the spatial median is the center of symmetry. The spatial depth and the spatial median possess many nice properties, robustness among them.

From the definition of the sample spatial depth, it is not difficult to see that the depth value of a point x does not change if any observations are moved to ∞ along the rays connecting them to the point x. Thus the spatial depth and the spatial median are highly robust in the presence of outliers. In fact, the breakdown point of the spatial median is 1/2, which depends on neither the data nor the dimension and is the highest possible value for a translation equivariant location estimator. Here the "breakdown point" is the prevailing quantitative measure of robustness, proposed by Donoho and Huber [14]. Roughly speaking, the breakdown point is the minimum fraction of "bad" data points that can carry the estimator beyond any bound. It is easy to see that one bad point in a data set is enough to ruin the sample mean. Thus, the breakdown point of the mean is 1/n → 0, the lowest possible value. That is, the sample mean vector is not robust, and hence neither is the k-means clustering method, which is based on nonrobust sample means.
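The contrast between the mean and the spatial median can be seen in a small experiment. The sketch below is only an illustration: it approximates the spatial median with a Weiszfeld-type iteration, which is our assumption (the text does not say how the spatial median is computed), and the data, the outlier value 1e6 and the function name `spatial_median` are hypothetical.

```python
import numpy as np

def spatial_median(data, max_iter=500, tol=1e-9):
    """Approximate spatial median by Weiszfeld iteration (assumed solver)."""
    m = data.mean(axis=0)
    for _ in range(max_iter):
        dist = np.linalg.norm(data - m, axis=1)
        dist = np.maximum(dist, 1e-12)        # guard against division by zero
        w = 1.0 / dist
        m_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m

rng = np.random.default_rng(0)
clean = rng.standard_normal((50, 10))         # 50 points in 10 dimensions
dirty = clean.copy()
dirty[0] = 1e6                                # replace one point by a gross outlier

# How far does each center move when a single point is corrupted?
shift_mean = np.linalg.norm(dirty.mean(axis=0) - clean.mean(axis=0))
shift_sm = np.linalg.norm(spatial_median(dirty) - spatial_median(clean))
print(shift_mean, shift_sm)
```

A single corrupted observation drags the mean arbitrarily far, while the spatial median moves only slightly.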

Unlike the componentwise median, the spatial median is equivariant under orthogonal transformations (e.g. rotations) of the data, though it is not equivariant under general affine transformations. The spatial median may not be a reasonable estimate when the scales of the different coordinates of the data differ widely. It is, however, very desirable for preprocessed gene data, where the variables are isometric.

The computational complexity of the spatial median is O(n²) for sample size n, regardless of the dimension. This independence of dimension is particularly important for HDLSS data because high dimension usually causes problems for classical methods.

In our bisecting k-spatialMedian algorithm, we propose the use of a robust spatial median to replace the non-robust mean or the less-robust componentwise median to determine the center of the data. The bisecting k-spatialMedian algorithm is shown to be more robust than the bisecting k-median algorithm in high dimension.

For the selection criterion, we replace the largest variance criterion, which is sensitive to outliers, and propose a depth-based notion, relative average depth (RAD), which characterizes the separatedness of a data set. With its range in [0, 2], a smaller value of the relative average depth indicates less separatedness and a larger value is an indication of higher separatedness. Indeed, in conjunction with the robust spatial median, we can use any existing selection criterion, including largest variance.

Results and discussion

Simulation study

To demonstrate the difference in performance between algorithms based on the spatial median and the componentwise median, we conduct a simulation of four clusters in ℝ³; see Figure 2. Clusters I and II are comprised of data points (X, 0, 0) with X generated from the uniform distributions U(1.5, 2) and U(2.5, 3), respectively; clusters III and IV are comprised of data points (0, Y, 0) and (0, 0, Z) with Y from U(0.5, 1.2) and Z from U(3.5, 4.5), where III and IV each have a sample size equal to the sum of the sample sizes of I and II. We observe that the bisecting k-median completely fails to separate the four clusters, while the proposed bisecting k-spatialMedian identifies them perfectly, as shown in Figure 2. Since the output of the bisecting k-median is the whole data set, its graph is in one color, without identification of clusters.

The phenomenon observed in the above simulated data may seem unrepresentative because the data structure appears so contrived. In fact, this is a quite general structure for HDLSS data. Hall et al. [15] show that there is a tendency for HDLSS data to lie deterministically at the vertices of a regular simplex, with all the randomness in the data appearing as a random rotation of this simplex. Based on this geometric representation, we have shown that any two distinct data points, centered at their common mean, are approximately perpendicular, so that these centered data points lie approximately along mutually orthogonal axes. See the Methods section for more details.

The bisecting k-spatialMedian algorithm

Based on the spatial median, we propose the bisecting k-spatialMedian algorithm. Specifically, the bisecting k-spatialMedian algorithm recursively splits a cluster by randomly choosing one point C_L as the center of one subcluster. Let C be the spatial median of the cluster being split (initially the whole data set). Then the center C_R of the other subcluster is determined as the point symmetric to C_L about C, namely, C_R = C - (C_L - C). Every point X in the cluster is assigned to the subcluster with center C_L or C_R, whichever of the Euclidean distances ||X - C_L|| and ||X - C_R|| is smaller. The centers are then replaced by the spatial medians of the two subclusters, and the assignment is repeated until the convergence criterion is met, namely, the centers of the subclusters no longer change. After the cluster is split into two subclusters, a selection rule is needed to determine which subcluster is to be further split.

The basic bisecting k-spatialMedian algorithm follows:

INITIALIZE:
    K: number of clusters
    C: center (spatial median) of the data cluster
    C_L: center of left subcluster
    C_R: center of right subcluster
FOR i = 1 to K - 1 do
    choose a cluster to split by the selection rule
    randomly select a point C_L as center of left subcluster
    compute C_R = C - (C_L - C)
    for j = 1 to MAXITER do
        for each data point X_i
            if ||X_i - C_L|| ≥ ||X_i - C_R||
                assign X_i to the right subcluster
            else
                assign X_i to the left subcluster
        end
        Let M_L be the spatial median of the left subcluster
        Let M_R be the spatial median of the right subcluster
        if M_L == C_L and M_R == C_R
            break
        C_L = M_L
        C_R = M_R
    end
END
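The pseudocode above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the spatial median is approximated by a Weiszfeld-type iteration (the solver is not specified in the text), the largest-variance rule is used for selection, exact equality of centers is replaced by a numerical tolerance, and all function names and the two-blob test data are hypothetical.

```python
import numpy as np

def spatial_median(data, max_iter=200, tol=1e-8):
    # Weiszfeld iteration: an assumed solver, not specified in the text.
    m = data.mean(axis=0)
    for _ in range(max_iter):
        dist = np.maximum(np.linalg.norm(data - m, axis=1), 1e-12)
        w = 1.0 / dist
        m_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m_new

def bisect(cluster, rng, max_iter=50):
    """Split one cluster (rows = points); returns a boolean 'right' mask."""
    c = spatial_median(cluster)
    c_left = cluster[rng.integers(len(cluster))]   # random data point
    c_right = c - (c_left - c)                     # reflection of c_left about c
    for _ in range(max_iter):
        d_left = np.linalg.norm(cluster - c_left, axis=1)
        d_right = np.linalg.norm(cluster - c_right, axis=1)
        right = d_left >= d_right                  # ties go right, as in the pseudocode
        if right.all() or not right.any():
            break                                  # degenerate split; stop early
        m_left = spatial_median(cluster[~right])
        m_right = spatial_median(cluster[right])
        if np.allclose(m_left, c_left) and np.allclose(m_right, c_right):
            break                                  # centers no longer change
        c_left, c_right = m_left, m_right
    return right

def total_variance(points):
    return points.var(axis=0).sum() if len(points) > 1 else 0.0

def bisecting_k_spatial_median(data, k, seed=0):
    rng = np.random.default_rng(seed)
    clusters = [np.arange(len(data))]              # each cluster = array of row indices
    while len(clusters) < k:
        # Selection rule used in this sketch: largest total variance.
        idx = max(range(len(clusters)),
                  key=lambda i: total_variance(data[clusters[i]]))
        members = clusters.pop(idx)
        right = bisect(data[members], rng)
        clusters += [members[~right], members[right]]
    return clusters

# Hypothetical test data: two well separated Gaussian blobs in 5 dimensions.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 1.0, size=(30, 5)),
                  rng.normal(10.0, 1.0, size=(30, 5))])
parts = bisecting_k_spatial_median(data, k=2)
print([len(p) for p in parts])
```

Swapping in a different selection rule only requires changing the `key` function in `bisecting_k_spatial_median`.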

Subcluster selection rule

In the bisecting k-spatialMedian algorithm, we need to decide which cluster is to be further split in each step. Selecting the one with the largest variance is a very common approach. Here we propose a new rule based on the statistical spatial depth.

Suppose that a data set is naturally composed of two clusters J₁ and J₂. Let $D_{1}^{w}$ be the sum of spatial depth values of all data points in J₁ with respect to J₁. Let $D_{2}^{w}$ be the sum of spatial depth values of all data points in J₂ with respect to J₂. Note that $D_{1}^{w}$ or $D_{2}^{w}$ represents "within-depth", because it is calculated with respect to the cluster to which the data points belong. Let $D_{1}^{b}$ be the sum of spatial depth values of all data points in J₁ with respect to J₂. Similarly, let $D_{2}^{b}$ be the sum of spatial depth values of all data points in J₂ with respect to J₁. $D_{1}^{b}$ or $D_{2}^{b}$ represents "between-depth", because it is calculated with respect to the other cluster. See Figure 3 for a graphic display. The within-depth is larger when a cluster is more condensed, whereas the between-depth is smaller when two clusters are further away from each other.

Let |J₁| and |J₂| represent the number of data points in J₁ and J₂ respectively. The relative average depth is defined as

RAD = \frac{D_{1}^{w}}{|J_{1}|} + \frac{D_{2}^{w}}{|J_{2}|} - \frac{D_{1}^{b}}{|J_{1}|} - \frac{D_{2}^{b}}{|J_{2}|}.

As Figure 3 shows, if a data set is naturally composed of two clusters and thus should be split in two, the within-depth should be relatively large and the between-depth relatively small. The relative average depth (RAD), which is essentially the averaged difference between the within-depth and the between-depth, will therefore be large relative to the RAD of a data set that is more condensed and has no obvious split into two clusters. In fact we have shown that a larger value of RAD indicates less condenseness of a data set; see the Methods section for technical details. Hence we obtain a new selection rule: the cluster with the largest value of RAD should be selected to split.
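To make the definition concrete, the following Python sketch computes the within-depths, the between-depths and RAD for two given subclusters. The function names and the simulated Gaussian samples are illustrative assumptions; because the samples are drawn afresh here, the printed values are not expected to reproduce the figures reported in this paper.

```python
import numpy as np

def depth_sum(points, reference):
    """Sum of sample spatial depths of `points` w.r.t. the rows of `reference`."""
    total = 0.0
    for x in points:
        diffs = x - reference
        norms = np.linalg.norm(diffs, axis=1)
        keep = norms > 0                          # convention S(0) = 0
        signs = np.zeros_like(diffs)
        signs[keep] = diffs[keep] / norms[keep, None]
        total += 1.0 - np.linalg.norm(signs.mean(axis=0))
    return total

def relative_average_depth(j1, j2):
    d1w, d2w = depth_sum(j1, j1), depth_sum(j2, j2)   # within-depths
    d1b, d2b = depth_sum(j1, j2), depth_sum(j2, j1)   # between-depths
    return d1w / len(j1) + d2w / len(j2) - d1b / len(j1) - d2b / len(j2)

# Hypothetical samples: two bivariate normal clusters of size 200 each.
rng = np.random.default_rng(0)
j1 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=200)
base2 = rng.multivariate_normal([0, 0], [[1, -0.5], [-0.5, 1]], size=200)

# Shift the same second sample to two different centers.
rad_near = relative_average_depth(j1, base2 + np.array([4.0, 4.0]))
rad_far = relative_average_depth(j1, base2 + np.array([6.0, 6.0]))
print(rad_near, rad_far)
```

Shifting the same second sample further away leaves the within-depths unchanged and lowers the between-depths, so RAD should increase.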

The following simulation demonstrates the relationship between the value of RAD and the condenseness of a data set. As shown in Figure 4a, two clusters were generated from normal distributions with means μ₁ = (0, 0) and μ₂ = (4, 4) and covariances Σ₁ = (1, 0.5; 0.5, 1) and Σ₂ = (1, -0.5; -0.5, 1), each with sample size 200. The data clearly comprise two clusters and should be split as such. The relative average depth is RAD = 0.7864. If the second cluster is moved from μ₂ = (4, 4) to μ₂ = (6, 6), the two clusters are further away from each other, as shown in Figure 4b. Compared with the previous situation, this new data set should have higher priority to be selected for further splitting, and indeed the relative average depth increases to RAD = 0.8018. Table 1 lists the values of RAD as the second cluster is moved further away from the first, whose mean is fixed at μ₁ = (0, 0). The RAD value increases slowly as the two clusters become more separated.

Table 1 The Relative Average Depth. This table illustrates the relationship between RAD and the separatedness of two clusters. The two clusters are from normal distributions with means μ₁ = (0, 0) and μ₂ = (2, 2). With μ₂ changing from (2, 2) to (7, 7), the value of RAD increases from 0.6310 to 0.8081 as cluster 2 moves further away from cluster 1.


Applications

Data sets

We use the proposed bisecting k-spatialMedian algorithm to analyze two well known data sets. The first is the colon cancer data (Alon data) [16], which comprise expression levels of 2000 genes for 62 samples (40 tumor and 22 normal colon tissues, Affymetrix oligonucleotide arrays). The second is a pediatric Acute Lymphoblastic Leukemia (ALL) data set from St. Jude Children's Research Hospital (SJCRH) [17], which includes 12,625 gene expression measurements (Affymetrix arrays) per patient from 246 patients with six different subtypes of ALL.

In the investigation at SJCRH, 246 cases of pediatric ALL were analyzed on the U133 A and B chips, involving six primary subtypes of ALL: BCR-ABL, E2A-PBX1, Hyperdiploid > 50, MLL, T-ALL and TEL. The original data has patient information with two additional subtypes, which did not fit into one of the above primary diagnostic groups or were added for the analysis of relapse and secondary AML. Our study did not include these two subtypes.

Design of the experiment

Since the mean is known to lack robustness, we will focus on the comparison of the bisecting algorithms based on componentwise median and spatial median in this paper.

The two data sets were used to compare the performance of the proposed bisecting k-spatialMedian with the bisecting k-median. Since the class labels of the two data sets are known, the number K of clusters is also known. The Alon data set has two classes, so K = 2. For the ALL data from SJCRH, K = 6. The algorithms are applied on the two datasets and terminated when K clusters have been reached.

In order to investigate the performance of the proposed clustering algorithm for HDLSS data, we test the algorithms on the two data sets for various dimensions, i.e., different numbers of genes selected. For the ALL data, which has 12,625 genes, we test the dimensions D ∈ {100, 200, 500, 1000, 1500, 2000, 3000, 4000, 5000}; for the Alon data, which has 2000 genes, we test the dimensions D ∈ {50, 100, 200, 500, 1000, 2000}.

For each D, we trim the data to only the D "most important" genes, which we select using the SVM-RFE-Annealing algorithm [18]. All clustering algorithms are then applied to the trimmed data.

Validation of the clustering results is usually not easy. However, in situations where data are already categorized, as with these data, we can compare the predicted clusters from our algorithms with the true class labels. To display the results, we build a confusion matrix in which rows represent the predicted clusters while columns represent the true clusters. The number in the cell (i, j) is the number of observations that are from cluster j but are predicted to be from cluster i. The rows and columns are "matched" by a brute-force search over the possible matchings, which makes the reported results optimistic. Two evaluation measures, Entropy and Misclustering rate, are used. See the Methods section for more details.
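The construction of the confusion matrix and the brute-force matching can be sketched as follows. This is an illustrative sketch only: the function names and the toy labels are hypothetical, and the misclustering rate is assumed here to be the fraction of observations off the diagonal after matching, a definition that this section does not spell out.

```python
import itertools
import numpy as np

def confusion_matrix(pred, true, k):
    z = np.zeros((k, k), dtype=int)
    for p, t in zip(pred, true):
        z[p, t] += 1            # row = predicted cluster, column = true class
    return z

def best_matching(z):
    """Try every assignment of rows to columns; keep the largest diagonal."""
    k = z.shape[0]
    best = max(itertools.permutations(range(k)),
               key=lambda perm: sum(z[perm[j], j] for j in range(k)))
    matched = z[list(best), :]  # reorder rows so matched pairs sit on the diagonal
    # Assumed definition: share of observations off the matched diagonal.
    misclustering_rate = 1.0 - np.trace(matched) / z.sum()
    return matched, misclustering_rate

# Toy labels, purely illustrative.
pred = [1, 1, 0, 0, 0, 2, 2, 2]
true = [0, 0, 1, 1, 1, 2, 2, 0]
matched, rate = best_matching(confusion_matrix(pred, true, k=3))
print(matched)
print(rate)   # 0.125: one of the eight observations is off the diagonal
```

The search visits all k! permutations, which is feasible for the small numbers of clusters considered here (K = 2 and K = 6) but would not scale to large K.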

Because the bisecting divisive clustering algorithm randomly selects a point as the center of the subcluster C_L, it is non-deterministic and therefore yields stochastic clustering results. To evaluate the stochastic clustering result, we ran each algorithm 20 times and calculated the average entropy and misclustering rates as the clustering measures. These algorithms select the next subcluster to split based on the criterion of the largest variance. We compare the performance of our proposed bisecting k-spatialMedian with bisecting k-median based on the same selection rule, the largest variance, on the two data sets. The performance of bisecting k-spatialMedian with the selection criterion of the relative average depth is also presented.

To investigate the robustness of our proposed procedure, we compare the sensitivity of the proposed algorithm to noise with the bisecting k-median algorithm. We add noise to the Alon data and then apply the three algorithms (bisecting k-median, bisecting k-spatialMedian with largest variance splitting rule, bisecting k-spatialMedian with RAD splitting rule) on it to investigate their performance.

We generated random noise and added it to the Alon data by changing the expression values of a given percentage of randomly chosen data points to either the maximum or the minimum value over all data points. In this way, some data points take extreme values and are more likely to become outliers. Experiments show that our proposed algorithms based on the spatial median perform better than the bisecting k-median algorithm in this noisy environment.

The result on the Alon data

Figure 5 reports the entropy and the misclustering rates of the algorithms on the trimmed Alon data. These algorithms are the bisecting k-median (median), the bisecting k-spatialMedian (spatialMedian), and the bisecting k-spatialMedian based on the selection criterion of the relative average depth (SM-RAD). The first two algorithms use the largest variance as the selection rule.

From Figures 5a and 5b, we can see that both algorithms using the spatial median have lower entropy and misclustering rates than the one using the componentwise median in most cases. When more than 400 genes are used in clustering, the algorithms using the spatial median are better than the one using the componentwise median, which demonstrates that the spatial median is more robust in higher dimensional data. Moreover, the performance of the algorithm using the median degrades dramatically as the dimension increases from 200 to 1000, while the performance of the algorithms using the spatial median does not degrade as much.

Figure 6 shows the entropy values with standard deviation of the three algorithms. We can see that the three algorithms display similar variation, about 0.2 in most cases. Very similar results are obtained using the misclustering rate.

Additional file 1 gives an example of the relationship between the number of runs and the average entropy on the Alon data. In Additional file 1, the entropy values become more stable as the number of runs increases, which justifies running the clustering algorithms multiple times. The average misclustering rate and the number of runs have a similar relationship.

The result on the SJCRH data

Similarly, Figure 7 reports the entropy and misclustering rates of the algorithms on the trimmed SJCRH data. We can see that in most of the cases after 500 genes are used, both of the algorithms using spatial median are better than the bisecting k-median. The largest difference between bisecting k-spatialMedian and median is more than 10%. The results are consistent with the results on the Alon data.

Similarly, Figure 8 shows the entropy values with standard deviation of the three algorithms. We can see that the three algorithms display similar variation, less than 0.1 in most cases, although the algorithm using the median achieves the lowest standard deviation. Standard deviation appears to be more consistent with median than with spatialMedian on the SJCRH data. Very similar results are obtained using the misclustering rate.

Additional file 2 gives an example of the relationship between the number of runs and the average entropy on the SJCRH data. In Additional file 2, the entropy values become more stable as the number of runs increases. The average misclustering rate and the number of runs have a similar relationship.

The result on the noisy Alon data

We randomly add noise to the Alon data to see how well the algorithms based on the componentwise median and the spatial median perform in a noisy environment.

To this end, we randomly pick 10% of data from the Alon data, and reset their values to be either the maximum or minimum value in the data matrix.
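One possible reading of this noise scheme is sketched below in Python. It assumes that "10% of data" refers to 10% of the individual entries of the data matrix, which the text does not state explicitly; the function name `add_extreme_noise` and the random stand-in matrix (62 samples by 2000 genes, mimicking only the shape of the Alon data) are hypothetical.

```python
import numpy as np

def add_extreme_noise(data, fraction, rng):
    """Reset a random fraction of entries to the global minimum or maximum."""
    noisy = data.copy()
    n_noisy = int(round(fraction * noisy.size))
    # Choose distinct entries of the flattened matrix (assumed interpretation).
    flat_idx = rng.choice(noisy.size, size=n_noisy, replace=False)
    # Each chosen entry becomes the global min or max with equal probability.
    extremes = rng.choice([data.min(), data.max()], size=n_noisy)
    np.put(noisy, flat_idx, extremes)
    return noisy

rng = np.random.default_rng(0)
data = rng.standard_normal((62, 2000))   # stand-in with the Alon data's shape
noisy = add_extreme_noise(data, fraction=0.10, rng=rng)
print((noisy != data).mean())            # share of entries that were altered
```

If the intended unit of corruption were whole samples rather than single entries, only the index selection would need to change.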

We applied the three algorithms to this noisy data and observed that all the algorithms have been influenced by the noise. However, the bisecting k-median is more susceptible to the noise, which can be demonstrated by the fact that it cannot separate the two clusters at all.

This process is repeated several times and the results are very consistent. We further increase the amount of noise from 10% to 20% and get a similar result.

Figure 9 shows that the algorithms based on the spatial median have very similar entropy values and misclustering rates on the noisy Alon data. Since the bisecting k-median cannot separate the two clusters, its entropy value and misclustering rate are not available and thus are not shown in Figure 9.

Conclusion

The spatial depth function provides a robust location estimator, whereas the componentwise median may not work well for high dimension and low sample size data, as a simple simulation illustrates. The experimental results on real data sets further verify that the spatial median based bisecting clustering algorithm is more robust to outliers and noise in high dimensional data, such as gene expression data, than the bisecting k-median algorithm.

Methods

Geometric structure of HDLSS data

In their 2005 article, Hall, Marron and Neeman [15] point out that for d-dimensional i.i.d. random vectors Z₁,...,Z_m whose coordinates are i.i.d. standard normal N(0, 1), all distinct pairwise Euclidean distances ||Z_i - Z_j||_d are approximately equal and all pairwise angles ang(Z_i, Z_j) are approximately right angles for large d. Without normality assumptions they further demonstrate that all pairwise distances are still approximately equal under certain moment assumptions. Specifically, they give the following geometric representation. For an infinite sequence X = (X⁽¹⁾, X⁽²⁾,...) of random variables, assume

(i) There exists a constant M such that E|X⁽ⁱ⁾|⁴ < M for all i = 1, 2,....
(ii) There is a constant σ² such that
$\begin{matrix} \frac{1}{d} \sum_{k = 1}^{d} \mathrm{Var}(X^{(k)}) \to σ^{2}, & d \to \infty . \end{matrix}$ (1)
(iii) The infinite sequence X is ρ-mixing; for details, see [15].

Let X(d) = (X⁽¹⁾,...,X^(d)) be the coordinate projection of X into the d-dimensional space ℝ^d and let X₁(d),...,X_m(d) be independent and identical copies of X(d). Then for all distinct pairs X_i ≠ X_j, the distances ${‖ X_{i} - X_{j} ‖}_{d} = {\left( \sum_{k = 1}^{d} {(X_{i}^{(k)} - X_{j}^{(k)})}^{2} \right)}^{1 / 2}$ are approximately equal:

d^{-1/2} ||X_i - X_j||_d → √2 σ, d → ∞. (2)

Observing their result, we find, with μ = E X_i, that

d^{-1} ||X_i - X_j||_d² - d^{-1} ||X_i - μ||_d² - d^{-1} ||X_j - μ||_d² → 0,

as d → ∞. This shows, in view of the Pythagorean theorem, the following fact.

Fact 1. Under the above assumptions (i)–(iii), the pairwise angle between distinct X_i - μ and X_j - μ is approximately a right angle:

ang(X_i - μ, X_j - μ) = π/2 + O_p(d^{-1/2}). (3)
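This geometric behavior is easy to observe numerically. The sketch below assumes the simplest setting covered by the assumptions, namely i.i.d. N(0, σ²) coordinates with known mean μ = 0; the sample size m = 10 and dimension d = 20000 are arbitrary illustrative choices. For such data the scaled pairwise distances should all be close to √2·σ and the pairwise angles close to π/2.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, sigma = 10, 20000, 1.0                  # few points, very high dimension
X = rng.normal(0.0, sigma, size=(m, d))       # i.i.d. coordinates, mu = 0

i, j = np.triu_indices(m, k=1)                # all distinct pairs (i < j)

# Scaled pairwise distances d^{-1/2} ||X_i - X_j||.
dists = np.linalg.norm(X[i] - X[j], axis=1) / np.sqrt(d)

# Pairwise angles between the (already centered) points.
cosines = np.sum(X[i] * X[j], axis=1) / (
    np.linalg.norm(X[i], axis=1) * np.linalg.norm(X[j], axis=1))
angles = np.arccos(cosines)

print(dists.min(), dists.max())               # both near sqrt(2) * sigma
print(angles.min(), angles.max())             # both near pi / 2
```

Repeating the experiment with a smaller d shows a visibly wider spread in both quantities, consistent with the O_p(d^{-1/2}) rate in (3).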

It is well known that the spatial depth function attains its maximum value at the center of symmetry of a distribution under very mild assumptions, and the spatial median is the maximizer. Thus the spatial median is the center of the regular simplex when the number of observations at every vertex is equal.

This exhibits that, for HDLSS data, the spatial depth can find the center and this helps find the right clusters, while a componentwise median may fail to find the symmetric center and thus the componentwise-median-based procedures may be unable to find the right clusters. In fact, we expanded the dimension of our data set from the previous simulation which has three dimensions as shown in Figure 2 and found that the componentwise-median-based bisecting k-median breaks down more easily with increasing dimension while the bisecting k-spatialMedian does not.

Theoretical verification of subcluster selection rule

Suppose that we have collected observations X_j: j ∈ J = {1,...,n}, which are points in ℝ^d. Suppose also that these observations are from two sources. We want to find a rule to measure the condenseness of the data, in other words, how different the two sources are. Statistically we suppose that X_j: j ∈ J = {1,...,n} are independent observations from a population distribution F. Suppose that X_j: j ∈ J₁ and X_j: j ∈ J₂ are from population distributions F₁ and F₂ respectively, with J₁, J₂ forming a partition of J. For convenience we refer to these two subclusters of J as J₁ and J₂ respectively. We want to use robust depth functions to measure the condenseness of J, or in other words, the separatedness of J₁ and J₂. Let D(x, F) be the population depth of a point x with respect to F. The sample depth is D(x, J) ≡ D(x, F_n), where F_n is the empirical distribution of the observations indexed by J.

One of the desirable properties of most depth functions is monotonicity relative to the deepest point, i.e., the depth-based multivariate median. Specifically, as a point x ∈ ℝ^d moves away from the multivariate median M along any fixed ray through M, the depth at x decreases monotonically, namely,

D(x, F) ≤ D(M + α(x - M), F), x ∈ ℝ^d (4)

holds for all α ∈ [0, 1]. This property can be used to characterize the separatedness of the two clusters. To avoid ambiguity, let us write X_i for the observations X_i: i ∈ J₁ and Y_j for X_j: j ∈ J₂.

Suppose that clusters J₁ and J₂ are separated. Observe that, by the monotonicity (4), if X_i is from cluster J₁ and Y_j is from cluster J₂, then the depth of X_i should be larger than the depth of Y_j, both with respect to cluster J₁. Namely,

D(X_i, J₁) ≽ D(Y_j, J₁), i ∈ J₁, j ∈ J₂, (5)

where ≽ is the stochastic ordering in the sense that η ≽ ξ if and only if ℙ(η ≥ ξ) ≥ 1/2 for two random variables η, ξ. These inequalities are useful in characterizing the separatedness of two clusters J₁ and J₂.

Note that D(X_i, J₁) and D(Y_j, J₁) are called within- and between-depth by [19] and [2]. The population version of (5) is

D(X, F₁) ≽ D(Y, F₁), X ~ F₁, Y ~ F₂. (6)

The inequality has clear geometric interpretation. With respect to distribution F₁, the depth of random variable X from distribution F₁ is larger than the depth of random variable Y from distribution F₂. Indeed we have the following fact for the spatial depth.

Fact 2. Suppose F₂ = F₁(· - c), where c ∈ ℝ^d is a constant vector. If F₁ has finite support, then for X ~ F₁ and Y ~ F₂,

\lim_{‖c‖ \to \infty} ℙ(SPD(X, F_{1}) \geq SPD(Y, F_{1})) = 1.

Proof. Using ||E S(x - ξ)||² = E_{ξ,η} S^⊤(x - ξ)S(x - η), where ξ, η are independent with a common distribution and E_{ξ,η} is calculated under the joint distribution of ξ and η, one has

ℙ(SPD(X, F₁) ≥ SPD(Y, F₁)) = ℙ(E_{ξ,η}[S^⊤(X - ξ)S(X - η) - S^⊤(X - ξ + c)S(X - η + c)] ≤ 0).

It is easy to see that S^⊤(X - ξ + c)S(X - η + c) → 1 as ||c|| → ∞. Combining the above yields the desired result and the proof is complete.

Fact 2 implies that if one cluster is shifted far enough away, then the stochastic ordering (6), and hence (5) for large samples, holds.

However, the inequality is a little too strong. Instead of (6) holding for all X ~ F₁ and Y ~ F₂, a less restrictive inequality would be to require (6) to hold on average, i.e.,

E_{F_{1}} D(X, F_{1}) \geq E_{F_{2}} D(Y, F_{1}). (7)

Analogously,

E_{F_{2}} D(Y, F_{2}) \geq E_{F_{1}} D(X, F_{2}). (8)

Indeed, similarly to the proof of Fact 2, we may establish the above two inequalities; this will be discussed elsewhere.

In order to characterize the separatedness of the two clusters we first introduce the following notions.

Depth total, Within- and Between-Depth

Let D_{|J|} be the sum of the sample depths of all observations on J, i.e., D_{|J|} = ∑_{j∈J} D(X_j, J), and we call it the depth total on J. We call the depth totals on J₁ and J₂,

\begin{matrix} D_{1}^{w} \equiv \sum_{i \in J_{1}} D (X_{i}, J_{1}), & D_{2}^{w} \equiv \sum_{j \in J_{2}} D (Y_{j}, J_{2}), \end{matrix}

the within-depth, and

\begin{matrix} D_{1}^{b} = \sum_{i \in J_{1}} D (X_{i}, J_{2}), & D_{2}^{b} = \sum_{j \in J_{2}} D (Y_{j}, J_{1}), \end{matrix}

the between-depth. Figure 3 is a graphic display of these notations.

Summing (5) over i ∈ J₁ and j ∈ J₂ yields

\frac{D_{1}^{w}}{|J_{1}|} \geq \frac{D_{2}^{b}}{|J_{2}|}. (9)

Analogously,

\frac{D_{2}^{w}}{|J_{2}|} \geq \frac{D_{1}^{b}}{|J_{1}|}. (10)

These two inequalities can be used to characterize the separatedness of two clusters J₁ and J₂. To exploit the inequalities simultaneously we introduce the following.

Relative average depth

The quantity

\mathrm{RAD} = \frac{D_{1}^{w}}{| J_{1} |} + \frac{D_{2}^{w}}{| J_{2} |} - \frac{D_{1}^{b}}{| J_{1} |} - \frac{D_{2}^{b}}{| J_{2} |}

(11)

is called the relative average depth. If the clusters F₁ and F₂ are separated, then inequalities (7) and (8) should hold, and we believe that these two inequalities can be used to characterize the separatedness of two clusters of random variables. Note that if Y in fact has the same distribution as X, namely F₁ = F₂, then equality holds in (7) and (8). In other words, a value of RAD close to zero indicates that J is actually a single cluster. Since depth values lie between 0 and 1, RAD is bounded from above by 2. A value of RAD close to 2 indicates that J is composed of two clusters J₁ and J₂. Summarizing the discussion above, we have the following result.
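The within-depth, between-depth, and RAD quantities can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the sample spatial depth 1 - ||mean of S(x - ξ)|| as the depth function D, and the names `spatial_depth` and `relative_average_depth` as well as the toy data are hypothetical.

```python
import math

def spatial_depth(x, sample):
    """Sample spatial depth: 1 - || average unit vector from sample points to x ||."""
    total = [0.0] * len(x)
    for xi in sample:
        diff = [a - b for a, b in zip(x, xi)]
        norm = math.sqrt(sum(t * t for t in diff))
        if norm > 0:  # S(0) = 0 contributes nothing to the sum
            total = [s + t / norm for s, t in zip(total, diff)]
    return 1.0 - math.sqrt(sum(t * t for t in total)) / len(sample)

def relative_average_depth(j1, j2, depth=spatial_depth):
    """RAD of equation (11) for a candidate split of J into J1 and J2."""
    w1 = sum(depth(x, j1) for x in j1)  # within-depth D_1^w
    w2 = sum(depth(y, j2) for y in j2)  # within-depth D_2^w
    b1 = sum(depth(x, j2) for x in j1)  # between-depth D_1^b
    b2 = sum(depth(y, j1) for y in j2)  # between-depth D_2^b
    return w1 / len(j1) + w2 / len(j2) - b1 / len(j1) - b2 / len(j2)

# Toy data (assumed): a unit square and a copy shifted far away.
left = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
right = [[a + 100.0, b] for a, b in left]

rad_separated = relative_average_depth(left, right)
rad_identical = relative_average_depth(left, left)
print(rad_separated, rad_identical)
```

When the two subclusters coincide, the within- and between-depths are equal and RAD is zero; when they are far apart, the between-depths are close to zero and RAD is approximately the sum of the two average within-depths.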

Selection criterion

The cluster with the largest value of RAD should be selected for splitting. The less condensed a cluster is, the larger its RAD value, so the cluster with the largest RAD value is the least condensed and should be split next.

Evaluation measures

Suppose that Z = (z_ij) is the m × n confusion matrix, where z_ij is the number of data points that are assigned to predicted cluster C_i but in fact belong to true cluster C_j. For generality we allow m and n to differ, but in our experiments the number of actual clusters k is known, so m = n = k. $m_{j} = \sum_{i = 1}^{m} z_{i j}$ is the number of data points in the true cluster j and $n_{i} = \sum_{j = 1}^{n} z_{i j}$ is the number of data points in the predicted cluster i. Let N be the total number of data points.

One common measure of cluster quality is entropy. The entropy of predicted cluster i is defined as:

H (i) = - \frac{1}{\log k} \sum_{j = 1}^{k} \frac{z_{i j}}{n_{i}} \log (\frac{z_{i j}}{n_{i}}),

where k is the number of clusters.

The value of entropy ranges from 0 to 1. An entropy value of 0 means the cluster is comprised entirely of one class, while an entropy value near 1 implies that the cluster contains a uniform mixture of classes. The smaller the entropy value, the better the clustering performance.
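The entropy of a predicted cluster can be computed directly from one row of the confusion matrix. The sketch below assumes the usual convention 0 log 0 = 0 for empty cells, which the text does not state explicitly; the function name `cluster_entropy` and the example counts are hypothetical.

```python
import math

def cluster_entropy(row, k):
    """Normalized entropy H(i) of one predicted cluster.

    row[j] is z_ij, the number of points in predicted cluster i that come
    from true cluster j; k >= 2 is the number of clusters.
    Cells with z_ij = 0 contribute nothing (convention 0 log 0 = 0).
    """
    n_i = sum(row)
    h = 0.0
    for z in row:
        if z > 0:
            p = z / n_i
            h -= p * math.log(p)
    return h / math.log(k)

pure = cluster_entropy([30, 0], k=2)    # all points from one class
mixed = cluster_entropy([10, 10], k=2)  # uniform mixture of two classes
print(pure, mixed)
```

A pure cluster gives entropy 0 and a uniform mixture over all k classes gives entropy 1, matching the range described above.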

Another measure of clustering quality we use is the misclustering rate. Based on the confusion matrix, the accuracy for the j-th cluster is z_jj/m_j. Since each true cluster contributes m_j to the total of $N = \sum_{j = 1}^{n} m_{j} = \sum_{i = 1}^{m} n_{i}$ data points, its contribution has weight m_j/N. The global accuracy [20] is the weighted sum,

\sum_{j = 1}^{n} \frac{m_{j}}{N} \frac{z_{j j}}{m_{j}} = \sum_{j = 1}^{n} \frac{z_{j j}}{N} .

Then the misclustering rate is $1 - \sum_{j = 1}^{n} \frac{z_{j j}}{N}$ .

Since we do not know in advance how to match the predicted clusters with the true ones, the diagonal entries z_jj of the confusion matrix may not be the numbers of correctly clustered data points. We use brute force to search for the best alignment between the predicted and the true clusters. The time complexity is O(k!) when there are k true clusters and k predicted clusters. This brute-force search is not part of the clustering algorithm itself; it is used only to ensure a fair evaluation.
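The brute-force alignment can be sketched as a search over all permutations of the predicted cluster labels, keeping the matching that maximizes the number of correctly placed points. This is an illustrative sketch under the assumption of a square k × k confusion matrix; the function name `misclustering_rate` and the example counts are hypothetical.

```python
from itertools import permutations

def misclustering_rate(z):
    """Smallest misclustering rate over all matchings of predicted to true clusters.

    z is a k x k confusion matrix; z[i][j] counts points in predicted
    cluster i whose true cluster is j. All k! matchings are tried.
    """
    k = len(z)
    total = sum(sum(row) for row in z)
    best_correct = max(
        # perm[j] is the predicted cluster matched to true cluster j
        sum(z[perm[j]][j] for j in range(k))
        for perm in permutations(range(k))
    )
    return 1.0 - best_correct / total

# Predicted labels are swapped relative to the true ones in this example.
confusion = [[2, 20],
             [38, 2]]
print(misclustering_rate(confusion))
```

Because the search is factorial in k, this approach is only practical for a small number of clusters, which is the setting of the experiments described here.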

References

1. Parsons L, Haque E, Liu H: Subspace Clustering for High Dimensional Data: a Review. SIGKDD Explor Newsl. 2004, 6: 90-105. 10.1145/1007730.1007731.
2. Jörnsten R, Vardi Y, Zhang CH: A Robust Clustering Method and Visualization Tool Based on Data Depth. 2002, Basel: Birkhäuser.
3. Garcia-Escudero LA, Gordaliza A: Robustness Properties of k Means and Trimmed k Means. Journal of the American Statistical Association. 1999, 94 (447): 956-969. 10.2307/2670010.
4. Tukey JW: Mathematics and the Picturing of Data. Proceedings of the International Congress of Mathematicians. 1975, 2: 523-531.
5. Oja H: Descriptive Statistics for Multivariate Distributions. Statist Probab Lett. 1983, 1: 327-333. 10.1016/0167-7152(83)90054-8.
6. Liu RY: On a Notion of Data Depth Based upon Random Simplices. The Annals of Statistics. 1990, 18: 405-414.
7. Zuo Y, Serfling R: General Notions of Statistical Depth Function. The Annals of Statistics. 2000, 28 (2): 461-482. 10.1214/aos/1016218226.
8. Koshevoy G, Mosler K: Zonoid Trimming for Multivariate Distributions. Annals of Statistics. 1997, 25 (5): 1998-2017. 10.1214/aos/1069362382.
9. Zhang J: Some Extensions of Tukey's Depth Function. Journal of Multivariate Analysis. 2002, 82: 134-165. 10.1006/jmva.2001.2011.
10. Chaudhuri P: On a Geometric Notion of Quantiles for Multivariate Data. Journal of the American Statistical Association. 1996, 91: 862-872. 10.2307/2291681.
11. Vardi Y, Zhang CH: The Multivariate L1-median and Associated Data Depth. Proc Natl Acad Sci USA. 2000, 97 (4): 1423-1426. 10.1073/pnas.97.4.1423.
12. Serfling R: A Depth Function and a Scale Curve Based on Spatial Quantiles. 2002, Boston: Birkhäuser, 25-38.
13. Koltchinskii VI: M-estimation, Convexity and Quantiles. Ann Statistics. 1997, 25: 435-477. 10.1214/aos/1031833659.
14. Donoho DL, Huber P: The Notion of Breakdown Point. 1983, Belmont, CA: Wadsworth, 157-184.
15. Hall P, Marron JS, Neeman A: Geometric Representation of High Dimension, Low Sample Size Data. J R Statist Soc B. 2005, 67: 427-444. 10.1111/j.1467-9868.2005.00510.x.
16. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proc Natl Acad Sci USA. 1999, 96: 6745-6750. 10.1073/pnas.96.12.6745.
17. Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, Behm FG, Raimondi SC, Relling MV, Patel A, Cheng C, Campana D, Wilkins D, Zhou X, Li J, Liu H, Pui CH, Evans WE, Naeve C, Wong L, Downing JR: Pediatric Lymphoblastic Leukemia by Gene Expression Profiling. Cancer Cell. 2002, 1: 133-143. 10.1016/S1535-6108(02)00032-6.
18. Ding Y, Wilkins D: Improving the Performance of SVM-RFE to Select Genes in Microarray Data. BMC Bioinformatics. 2006, 7 (Suppl 2):
19. Jörnsten R: Clustering and Classification Based on the L1 Data Depth. Journal of Multivariate Analysis. 2004, 90: 67-89. 10.1016/j.jmva.2004.02.013.
20. Ding C, He X: Cluster Merging and Splitting in Hierarchical Clustering Algorithms. Proceedings of IEEE International Conference on Data Mining (ICDM'02). 2002.

Acknowledgements

We thank Dr. Yixin Chen for his valuable suggestion. We also thank Alon et al. and St. Jude Children's Research Hospital for the use of their data sets.

This article has been published as part of BMC Bioinformatics Volume 8 Supplement 7, 2007: Proceedings of the Fourth Annual MCBIOS Conference. Computational Frontiers in Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S7.

Author information

Authors and Affiliations

Computer & Information Science Department, The University of Mississippi, University, MS, USA
Yuanyuan Ding & Dawn Wilkins
Department of Mathematics, The University of Mississippi, University, MS, USA
Xin Dang & Hanxiang Peng

Corresponding author

Correspondence to Hanxiang Peng.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

XD and HP contributed to the theoretical development. YD and DW contributed to the experimentation and development of the computer code. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2007_1939_MOESM1_ESM.eps

Additional file 1: The relationship between the number of runs and average entropy of the three algorithms on the Alon data. Additional file 1 demonstrates that when the algorithms are run more times, the average entropy values of all the algorithms get more stable. In this figure, 500 genes were selected. (EPS 14 KB)

12859_2007_1939_MOESM2_ESM.eps

Additional file 2: The relationship between the number of runs and average entropy of the three algorithms on the SJCRH data. Additional file 2 demonstrates that when the algorithms are run more times, the average entropy values of all the algorithms get more stable. In this figure, 1000 genes were selected. (EPS 14 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Ding, Y., Dang, X., Peng, H. et al. Robust clustering in high dimensional data using statistical depths. BMC Bioinformatics 8 (Suppl 7), S8 (2007). https://doi.org/10.1186/1471-2105-8-S7-S8

Published: 01 November 2007
DOI: https://doi.org/10.1186/1471-2105-8-S7-S8
