Robust clustering in high dimensional data using statistical depths
BMC Bioinformatics volume 8, Article number: S8 (2007)
Abstract
Background
Mean-based clustering algorithms such as bisecting k-means generally lack robustness. Although the componentwise median is a more robust alternative, it can be a poor center representative for high dimensional data. A new algorithm is needed that is robust and works well on high dimensional data sets such as gene expression data.
Results
Here we propose a new robust divisive clustering algorithm, the bisecting k-spatialMedian, based on the statistical spatial depth. A new subcluster selection rule, Relative Average Depth, is also introduced. We demonstrate that the proposed clustering algorithm outperforms the componentwise-median-based bisecting k-median algorithm for high dimension and low sample size (HDLSS) data via applications of the algorithms on two real HDLSS gene expression data sets. When further applied on noisy real data sets, the proposed algorithm compares favorably in terms of robustness with the componentwise-median-based bisecting k-median algorithm.
Conclusion
Statistical data depths provide an alternative way to find the "center" of multivariate data sets and are useful and robust for clustering.
Background
In gene expression studies, the number of samples in most data sets is limited, while the total number of genes assayed is easily ten or twenty thousand. Such high dimension and low sample size data arise not only commonly in genomics but also frequently emerge in various other areas of science. In radiology and biomedical imaging, for example, one is typically able to collect far fewer measurements about an image of interest than the number of pixels.
These HDLSS data present a substantial challenge to many methods of classical analysis, including cluster analysis. In high dimensional data, it is not uncommon for many attributes to be irrelevant. In fact, the extraneous data can make identifying clusters very difficult [1]. Robust clustering methods are needed that are resistant to small perturbations of the data and the inclusion of unrelated variables [2].
The bisecting k-means algorithm is a hybrid of hierarchical clustering and the k-means algorithm. It proceeds top-down, splitting a cluster into two at each step and then selecting one cluster to split further according to a selection rule (commonly the cluster with the largest variance). In each splitting step, it randomly picks a pair of points that are symmetric about the "center" of the data, usually the mean, and assigns every other data point to one cluster or the other based on its distance to the two selected points; in this respect the algorithm resembles k-means. The process continues until each point is its own cluster or a predefined number of clusters is reached.
Like other commonly used mean-based methods such as k-means, bisecting k-means is not robust because the mean is susceptible to outliers and noise [3]. As a common remedy, the bisecting k-median algorithm replaces the mean by the componentwise median, which is less sensitive to outliers. However, the componentwise median may be a very poor center representative of the data, because it is calculated separately on each component (dimension) and therefore disregards the interdependence among the components. For example, the componentwise median of the points (a, 0, 0), (0, b, 0) and (0, 0, c) for arbitrary nonzero reals a, b, c is (0, 0, 0), which does not even lie on the plane passing through the three points.
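The three-point example can be checked in a few lines. The sketch below is only an illustration: the values of a, b and c and the function name `componentwise_median` are arbitrary choices made for this example.

```python
from statistics import median

def componentwise_median(points):
    """Median taken separately on each coordinate."""
    return tuple(median(coord) for coord in zip(*points))

# Arbitrary nonzero values, chosen only for illustration.
a, b, c = 2.0, 3.0, 5.0
points = [(a, 0.0, 0.0), (0.0, b, 0.0), (0.0, 0.0, c)]
center = componentwise_median(points)

# The plane through the three points is x/a + y/b + z/c = 1.
plane_value = center[0] / a + center[1] / b + center[2] / c
print(center)       # the origin
print(plane_value)  # differs from 1, so the origin is off the plane
```

Each coordinate has two zeros out of three values, so every coordinatewise median is zero regardless of a, b and c.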
A new center representative for multivariate data that is robust and takes into account the interdependence among the dimensions is clearly needed.
Among the various multivariate medians, those defined via statistical depth functions are advantageous because the theory of statistical depth is well established, although it is still relatively young and under development. Analogous to the linear order in one dimension, statistical depth functions provide an ordering of all points from the center outward in a multivariate data set. In one dimension, the linear order induces a ranking of the observations, and the median is the "deepest" point in the data set. For dimension d ≥ 2, in contrast, there is no natural order. As compensation, it is convenient and natural to orient to a "center", the deepest point, that is, the multivariate median. This leads to a center-outward ordering of points and to a description in terms of nested contours. Tukey [4] first introduced halfspace depth. Oja [5] defined Oja depth. Liu [6] proposed simplicial depth. Zuo and Serfling [7] considered projection depth. Other notions include zonoid depth [8], generalized Tukey depth [9], and spatial depth [10], among others. See [7] for a systematic exposition.
Of the various depth functions, the spatial depth is especially appealing because of its computational ease and mathematical tractability; see Vardi [11], Serfling [12], Chaudhuri [10] and Koltchinskii [13], among others. The spatial depth (SPD) of a point x with respect to a distribution F is defined as
$$\mathrm{SPD}(x, F) = 1 - \left\| E_F\, S(x - X) \right\|, \quad x \in \mathbb{R}^d,$$
where S(x) = x/||x|| is the spatial sign function (S(0) = 0), ||·|| is the Euclidean norm, and the expectation is taken over X ~ F. The sample spatial depth is
$$\mathrm{SPD}(x, F_n) = 1 - \left\| \frac{1}{n} \sum_{i=1}^{n} S(x - X_i) \right\|,$$
where $F_n$ is the empirical distribution function of the data $X_1, \ldots, X_n$. Points deep inside the data cloud have high depth values, while points on the outskirts have lower depth values.
Figure 1 illustrates the spatial depth. Let $e_i = S(y - x_i) = (y - x_i)/\|y - x_i\|$, the unit vector along the line joining $x_i$ and y. When y is located deep inside the cloud of x's, summing up the $e_i$ results in a vector close to 0, since the unit vectors point in different directions and cancel each other out, so the depth of y approaches 1 (left diagram in Figure 1). When y is outside the data cloud (right diagram in Figure 1), the sum of the $e_i$ has a large norm, so the depth approaches 0. The point where the spatial depth attains its maximum value is called the spatial median. The spatial median represents the geometric center of the data; in particular, for a symmetric distribution, the spatial median is the center of symmetry. The spatial depth and the spatial median possess many nice properties, robustness among them.
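The cancellation argument can be made concrete by evaluating the sample spatial depth directly from its definition. The following is a minimal pure-Python sketch; the function names and the toy point cloud are assumptions made for this illustration and are not taken from the study.

```python
import math

def spatial_sign(v):
    """S(v) = v / ||v||, with S(0) = 0."""
    norm = math.sqrt(sum(c * c for c in v))
    return [0.0] * len(v) if norm == 0.0 else [c / norm for c in v]

def spatial_depth(x, data):
    """Sample spatial depth: 1 - || (1/n) * sum_i S(x - X_i) ||."""
    total = [0.0] * len(x)
    for p in data:
        sign = spatial_sign([xi - pi for xi, pi in zip(x, p)])
        total = [t + s for t, s in zip(total, sign)]
    n = len(data)
    return 1.0 - math.sqrt(sum((t / n) ** 2 for t in total))

# Toy cloud: four points placed symmetrically around the origin.
cloud = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
depth_center = spatial_depth((0.0, 0.0), cloud)    # unit vectors cancel
depth_outside = spatial_depth((10.0, 0.0), cloud)  # unit vectors nearly aligned
print(depth_center, depth_outside)
```

At the origin the four unit vectors cancel exactly and the depth is 1; at (10, 0) they point in almost the same direction and the depth is close to 0.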
From the definition of the sample spatial depth, it is not difficult to see that the depth value of a point x does not change if any observations are moved to ∞ along the rays connecting them to the point x. Thus the spatial depth and the spatial median are highly robust in the presence of outliers. In fact, the breakdown point of the spatial median is 1/2, depending on neither the data nor the dimension and reaching the highest possible value for translation equivariant location estimators. Here the "breakdown point" is the prevailing quantitative measure of robustness proposed by Donoho and Huber [14]. Roughly speaking, the breakdown point is the minimum fraction of "bad" data points that can carry the estimator beyond any bound. Clearly, one bad point in a data set is enough to ruin the sample mean. Thus, the breakdown point of the mean is 1/n → 0, the lowest possible value. That is, the sample mean vector is not robust, and hence neither is the k-means clustering method, which is based on nonrobust sample means.
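The invariance of the depth under moving an observation outward along its ray, and the contrasting sensitivity of the mean, can be checked numerically. This is a hedged sketch with made-up data; `spatial_depth` is a helper written for this example only.

```python
import math

def spatial_depth(x, data):
    """Sample spatial depth of x with respect to the points in data."""
    total = [0.0] * len(x)
    for p in data:
        diff = [xi - pi for xi, pi in zip(x, p)]
        norm = math.sqrt(sum(c * c for c in diff))
        if norm > 0.0:  # S(0) = 0: a coincident point contributes nothing
            total = [t + c / norm for t, c in zip(total, diff)]
    return 1.0 - math.sqrt(sum(t * t for t in total)) / len(data)

def mean(points):
    return [sum(coord) / len(points) for coord in zip(*points)]

x = (0.5, 0.5)
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]

# Push the last observation 1000 times further out along the ray from x.
far = tuple(xi + 1000.0 * (pi - xi) for xi, pi in zip(x, data[-1]))
contaminated = data[:-1] + [far]

depth_before = spatial_depth(x, data)
depth_after = spatial_depth(x, contaminated)
print(depth_before, depth_after)         # equal up to rounding
print(mean(data), mean(contaminated))    # the mean is dragged far away
```

Only the direction from each observation to x enters the depth, so the displaced point contributes the same unit vector as before, while the sample mean moves with it.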
Unlike the componentwise median, the spatial median is equivariant under orthogonal transformations (e.g. rotations) of the data, though it is not equivariant under general affine transformations. The spatial median may not be a reasonable estimate when the scales of different coordinates of the data differ widely. It is, however, very desirable for preprocessed gene data, where variables are isometric.
The complexity of the spatial median is O(n²) for sample size n regardless of the dimension. This independence of dimension is particularly important for HDLSS data because high dimension usually causes problems for classical methods.
In our bisecting k-spatialMedian algorithm, we propose the use of a robust spatial median to replace the non-robust mean or the less-robust componentwise median to determine the center of the data. The bisecting k-spatialMedian algorithm is shown to be more robust than the bisecting k-median algorithm in high dimension.
For the selection criterion, we replace the largest variance criterion, which is sensitive to outliers, and propose a depth-based notion, relative average depth (RAD), which characterizes the separatedness of a data set. With its range in [0, 2], a smaller value of the relative average depth indicates less separatedness and a larger value is an indication of higher separatedness. Indeed, in conjunction with the robust spatial median, we can use any existing selection criterion, including largest variance.
Results and discussion
Simulation study
To demonstrate the difference in performance between algorithms based on the spatial median and the componentwise median, we conduct a simulation of four clusters in ℝ³; see Figure 2. Clusters I and II consist of data points (X, 0, 0) with X generated from the uniform distributions U(1.5, 2) and U(2.5, 3), respectively. Clusters III and IV consist of data points (0, Y, 0) and (0, 0, Z) with Y from U(0.5, 1.2) and Z from U(3.5, 4.5), and each has a sample size equal to the sum of the sample sizes of I and II. The bisecting k-median completely fails to separate the four clusters, while the proposed bisecting k-spatialMedian identifies all four clusters perfectly, as shown in Figure 2. Since the output of the bisecting k-median is the whole data set, its graph is in one color, without identification of clusters.
The phenomenon observed in the above simulated data may seem unrepresentative because the data structure appears so contrived. In fact, this is a quite general structure for HDLSS data. Hall et al. [15] show that there is a tendency for HDLSS data to lie deterministically at the vertices of a regular simplex, with all the randomness in the data appearing as a random rotation of this simplex. Based on this geometric representation, we have shown that the angle between any two distinct data points centered at their common mean is approximately π/2, so that all these centered data points lie approximately on the coordinate axes. See the Methods section for more details.
The bisecting k-spatialMedian algorithm
Based on the spatial median, we propose the bisecting k-spatialMedian algorithm. To split a cluster, the algorithm randomly chooses one point C_L as the center of one subcluster. Let C be the spatial median of the cluster being split. The center C_R of the other subcluster is then the point symmetric to C_L about C, namely, C_R = C - (C_L - C). Every point X in the cluster is assigned to the subcluster with center C_L or C_R according to the smaller of the Euclidean distances ||X - C_L|| and ||X - C_R||. The centers are then replaced by the spatial medians of the two subclusters, and the assignment is repeated until the convergence criterion is met, namely, the centers of the subclusters no longer change. After the cluster is split into two subclusters, a selection rule is needed to determine which subcluster is to be split further.
The basic bisecting k-spatialMedian algorithm follows:
INITIALIZE:
    K: number of clusters
    C: center (spatial median) of the data cluster
    C_L: center of left subcluster
    C_R: center of right subcluster
FOR i = 1 to K - 1 do
    choose a cluster to split by the selection rule
    compute C, the spatial median of the chosen cluster
    randomly select a point C_L as center of left subcluster
    compute C_R = C - (C_L - C)
    for j = 1 to MAXITER do
        for each data point X_i
            if ||X_i - C_L|| ≥ ||X_i - C_R||
                assign X_i to the right subcluster
            else
                assign X_i to the left subcluster
            end
        end
        Let M_L be the spatial median of the left subcluster
        Let M_R be the spatial median of the right subcluster
        if M_L == C_L and M_R == C_R
            break
        end
        C_L = M_L
        C_R = M_R
    end
END
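The splitting step of the pseudocode above can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in the experiments. It assumes that the spatial median is approximated by the sample point of largest spatial depth (one reading that is consistent with the stated O(n²) cost), and it adds a guard, not present in the pseudocode, that avoids drawing C_L equal to C. The names `bisect`, `spatial_median` and `dist` are hypothetical, and the cluster must contain at least two distinct points.

```python
import math
import random

def dist(u, v):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def spatial_depth(x, data):
    """Sample spatial depth of x with respect to the points in data."""
    total = [0.0] * len(x)
    for p in data:
        norm = dist(x, p)
        if norm > 0.0:  # S(0) = 0, so coincident points contribute nothing
            total = [t + (a - b) / norm for t, a, b in zip(total, x, p)]
    return 1.0 - math.sqrt(sum(t * t for t in total)) / len(data)

def spatial_median(data):
    """Assumption: approximate the spatial median by the deepest sample point."""
    return max(data, key=lambda p: spatial_depth(p, data))

def bisect(cluster, max_iter=20, rng=random):
    """Split one cluster into a left and a right subcluster."""
    c = spatial_median(cluster)
    # Guard added for this sketch: C_L = C would make C_R coincide with C_L.
    c_left = rng.choice([p for p in cluster if p != c])
    c_right = tuple(2.0 * ci - li for ci, li in zip(c, c_left))  # C - (C_L - C)
    left, right = [], list(cluster)
    for _ in range(max_iter):
        left = [p for p in cluster if dist(p, c_left) < dist(p, c_right)]
        right = [p for p in cluster if dist(p, c_left) >= dist(p, c_right)]
        if not left or not right:
            break  # degenerate split; a fuller implementation would re-draw C_L
        m_left, m_right = spatial_median(left), spatial_median(right)
        if m_left == c_left and m_right == c_right:
            break  # centers unchanged: converged
        c_left, c_right = m_left, m_right
    return left, right

# Two well-separated groups on a line, used only as a toy example.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
left, right = bisect(points, rng=random.Random(0))
print(sorted(left), sorted(right))
```

The outer loop over K - 1 splits and the subcluster selection rule are omitted here; they would wrap `bisect` and pick the next cluster to split.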
Subcluster selection rule
In the bisecting k-spatialMedian algorithm, we need to decide which cluster is to be further split in each step. Selecting the one with the largest variance is a very common approach. Here we propose a new rule based on the statistical spatial depth.
Suppose that a data set is naturally composed of two clusters J1 and J2. Let $D_{|J_1|}$ be the sum of the spatial depth values of all data points in J1 with respect to J1, and let $D_{|J_2|}$ be the sum of the spatial depth values of all data points in J2 with respect to J2. Note that $D_{|J_1|}$ or $D_{|J_2|}$ represents "within-depth", because it is calculated with respect to the cluster to which the data points belong. Let $D_{|J_1|}^{J_2}$ be the sum of the spatial depth values of all data points in J1 with respect to J2. Similarly, let $D_{|J_2|}^{J_1}$ be the sum of the spatial depth values of all data points in J2 with respect to J1. $D_{|J_1|}^{J_2}$ or $D_{|J_2|}^{J_1}$ represents "between-depth", because it is calculated with respect to the other cluster. See Figure 3 for a graphic display. The within-depth is larger when a cluster is more condensed, whereas the between-depth is smaller when the two clusters are further away from each other.
Let |J1| and |J2| represent the number of data points in J1 and J2 respectively. The relative average depth is defined as
$$\mathrm{RAD}(J_1, J_2) = \frac{1}{|J_1|}\left(D_{|J_1|} - D_{|J_1|}^{J_2}\right) + \frac{1}{|J_2|}\left(D_{|J_2|} - D_{|J_2|}^{J_1}\right).$$
As Figure 3 shows, if a data set is naturally composed of two clusters and thus should be split into two, the within-depth should be relatively large and the between-depth relatively small. The relative average depth (RAD), which is essentially the averaged difference between the within-depth and the between-depth, will therefore be large compared to the RAD of a data set that is more condensed and has no obvious split into two clusters. In fact we have shown that a larger value of RAD indicates less condenseness of a data set. See the Methods section for technical details. Hence we obtain a new selection rule: the cluster with the largest value of RAD should be selected to split.
The following simulation demonstrates the relationship between the value of RAD and the condenseness of a data set. As shown in Figure 4a, two clusters were generated from normal distributions with means μ1 = (0, 0) and μ2 = (4, 4) and covariances Σ1 = (1, 0.5; 0.5, 1) and Σ2 = (1, -0.5; -0.5, 1), each with sample size 200. The data clearly comprise two clusters and should be split as such. The relative average depth is RAD = 0.7864. If the second cluster is moved from μ2 = (4, 4) to μ2 = (6, 6), the two clusters are further away from each other, as shown in Figure 4b. Compared with the previous situation, this new data set should have higher priority to be selected for further splitting. The relative average depth increases to RAD = 0.8018. Table 1 lists the values of RAD as one cluster is moved further away from the other, which has μ1 = (0, 0). We can see that the RAD value increases slowly as the two clusters become more separated.
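The relative average depth can be computed directly from the four average depths it combines: the two within-cluster averages minus the two between-cluster averages. The sketch below is an illustration only. It uses a small fixed toy cluster rather than the normal samples of Figure 4, so it does not reproduce the RAD values reported above, and the function names are assumptions of this example.

```python
import math

def spatial_depth(x, ref):
    """Sample spatial depth of x with respect to the reference points ref."""
    total = [0.0] * len(x)
    for p in ref:
        diff = [xi - pi for xi, pi in zip(x, p)]
        norm = math.sqrt(sum(c * c for c in diff))
        if norm > 0.0:
            total = [t + c / norm for t, c in zip(total, diff)]
    return 1.0 - math.sqrt(sum(t * t for t in total)) / len(ref)

def relative_average_depth(j1, j2):
    """Average within-depth minus average between-depth, summed over both clusters."""
    within1 = sum(spatial_depth(p, j1) for p in j1) / len(j1)
    within2 = sum(spatial_depth(p, j2) for p in j2) / len(j2)
    between1 = sum(spatial_depth(p, j2) for p in j1) / len(j1)
    between2 = sum(spatial_depth(p, j1) for p in j2) / len(j2)
    return within1 + within2 - between1 - between2

def shifted(points, s):
    """Translate every point by (s, s)."""
    return [(x + s, y + s) for x, y in points]

base = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
rad_same = relative_average_depth(base, base)               # identical clusters
rad_near = relative_average_depth(base, shifted(base, 4.0))
rad_far = relative_average_depth(base, shifted(base, 40.0))
print(rad_same, rad_near, rad_far)
```

With identical clusters the within- and between-depths coincide and RAD is zero; moving the second cluster further away lowers the between-depth and raises RAD, in line with the selection rule.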
Applications
Data sets
We use the proposed bisecting k-spatialMedian algorithm to analyze two well known data sets. The first is the colon cancer data (Alon data) [16], which is comprised of expression levels of 2000 genes describing 62 samples (40 tumor and 22 normal colon tissues, Affymetrix oligonucleotide arrays). The second is a pediatric Acute Lymphoblastic Leukemia (ALL) data from St. Jude Children's Research Hospital (SJCRH) [17], which includes 12,625 gene expression measurements (Affymetrix arrays) per patient from 246 patients with six different subtypes of ALL.
In the investigation at SJCRH, 246 cases of pediatric ALL were analyzed on the U133 A and B chips, involving six primary subtypes of ALL: BCR-ABL, E2A-PBX1, Hyperdiploid > 50, MLL, T-ALL and TEL. The original data has patient information with two additional subtypes, which did not fit into one of the above primary diagnostic groups or were added for the analysis of relapse and secondary AML. Our study did not include these two subtypes.
Design of the experiment
Since the mean is known to lack robustness, we will focus on the comparison of the bisecting algorithms based on componentwise median and spatial median in this paper.
The two data sets were used to compare the performance of the proposed bisecting k-spatialMedian with the bisecting k-median. Since the class labels of the two data sets are known, the number K of clusters is also known. The Alon data set has two classes, so K = 2. For the ALL data from SJCRH, K = 6. The algorithms are applied on the two datasets and terminated when K clusters have been reached.
In order to investigate the performance of the proposed clustering algorithm for HDLSS data, we test the algorithms on the two data sets for various dimensions d, i.e., different numbers of selected genes. For the ALL data, which has 12,625 genes, we test the dimensions d ∈ {100, 200, 500, 1000, 1500, 2000, 3000, 4000, 5000}; for the Alon data, which has 2000 genes, we test the dimensions d ∈ {50, 100, 200, 500, 1000, 2000}.
For each d, we trim the data to keep only the d "most important" genes. We use the SVM-RFE-Annealing algorithm [18] to select the most important genes. All clustering algorithms are then applied to the trimmed data.
Validation of clustering results is usually not easy. However, in situations where the data are already categorized, as with these data, we can compare the predicted clusters from our algorithms with the true class labels. To display the results, we build a confusion matrix in which rows represent the predicted clusters and columns represent the true clusters. The number in cell (i, j) is the number of observations that are from cluster j but are predicted to be from cluster i. The rows and columns are "matched" by a brute force search for the best alignment, which gives an optimistic assessment. Two evaluation measures, entropy and misclustering rate, are used. See the Methods section for more details.
Because the bisecting divisive clustering algorithm randomly selects a point as the subcluster center C_L, it is non-deterministic and therefore yields stochastic clustering results. To evaluate these results, we ran each algorithm 20 times and calculated the average entropy and misclustering rate as the clustering measures. We compare the performance of our proposed bisecting k-spatialMedian with the bisecting k-median under the same selection rule, the largest variance, on the two data sets. The performance of bisecting k-spatialMedian with the relative average depth as selection criterion is also presented.
To investigate the robustness of our proposed procedure, we compare the sensitivity of the proposed algorithm to noise with the bisecting k-median algorithm. We add noise to the Alon data and then apply the three algorithms (bisecting k-median, bisecting k-spatialMedian with largest variance splitting rule, bisecting k-spatialMedian with RAD splitting rule) on it to investigate their performance.
We generated a given percentage of random noise and added it to the Alon data by changing the expression value of a point to either the maximum or the minimum value of all data points. In this way, some data points are changed to have extreme values and are more likely to become outliers. Experiments show that our proposed algorithms based on the spatial median perform better than the bisecting k-median algorithm in this noisy environment.
The result on the Alon data
Figure 5 reports the entropy and the misclustering rates of the algorithms on the trimmed Alon data. These algorithms are the bisecting k-median (median), the bisecting k-spatialMedian (spatialMedian), the bisecting k-spatialMedian based on the selection criterion of the relative average depth (SM-RAD). The first two algorithms use the largest variance as selection rule.
From Figure 5a and 5b, we can see that both of the algorithms using spatial median have lower entropy and misclustering rates than the one using componentwise median in most of the cases. When we use more than 400 genes in clustering, the algorithms using spatial median are better than the one using componentwise median, which demonstrates that spatial median is more robust in higher dimensional data. Also the performance of the algorithm using median is decreasing dramatically with dimensions increasing from 200 to 1000, while the performance of the algorithms using spatial median does not degrade as much.
Figure 6 shows the entropy values with standard deviation of the three algorithms. We can see that the three algorithms display similar variation, about 0.2 in most cases. Very similar results are obtained using the misclustering rate.
Additional file 1 gives an example of the relationship between the number of runs and the average entropy for the Alon data. In Additional file 1, the entropy values become more stable as the number of runs increases, which justifies running the clustering algorithms multiple times. The average misclustering rate shows a similar relationship with the number of runs.
The result on the SJCRH data
Similarly, Figure 7 reports the entropy and misclustering rates of the algorithms on the trimmed SJCRH data. We can see that in most of the cases after 500 genes are used, both of the algorithms using spatial median are better than the bisecting k-median. The largest difference between bisecting k-spatialMedian and median is more than 10%. The results are consistent with the results on the Alon data.
Similarly, Figure 8 shows the entropy values with standard deviation of the three algorithms. We can see that the three algorithms display similar variation, less than 0.1 in most cases, although the algorithm using the median achieves the lowest standard deviation. The standard deviation appears to be more consistent with median than with spatialMedian on the SJCRH data. Very similar results are obtained using the misclustering rate.
Additional file 2 gives an example of the relationship between the number of runs and the average entropy for the SJCRH data. In Additional file 2, the entropy values become more stable as the number of runs increases. The average misclustering rate shows a similar relationship with the number of runs.
The result on the noisy Alon data
We randomly add noise to the Alon data to see how well the algorithms based on the componentwise median and the spatial median perform in a noisy environment.
To this end, we randomly pick 10% of data from the Alon data, and reset their values to be either the maximum or minimum value in the data matrix.
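One way to read this contamination step is sketched below. It assumes that the 10% refers to individual entries of the expression matrix, which is an interpretation and is not confirmed by the text; the function name `add_extreme_noise` and the toy matrix are hypothetical.

```python
import random

def add_extreme_noise(matrix, fraction=0.10, rng=random):
    """Return a copy with a fraction of entries reset to the global min or max."""
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    noisy = [list(row) for row in matrix]  # leave the input untouched
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    # Assumption: entries are picked uniformly at random without replacement.
    for i, j in rng.sample(cells, round(fraction * len(cells))):
        noisy[i][j] = rng.choice((lo, hi))
    return noisy

# Toy 10 x 10 "expression matrix" with values 0..99, for illustration only.
clean = [[float(10 * i + j) for j in range(10)] for i in range(10)]
noisy = add_extreme_noise(clean, fraction=0.10, rng=random.Random(0))
changed = [(i, j) for i in range(10) for j in range(10)
           if noisy[i][j] != clean[i][j]]
print(len(changed), "entries were changed")
```

Raising `fraction` to 0.20 corresponds to the heavier contamination level mentioned below.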
We applied the three algorithms to this noisy data and observed that all the algorithms have been influenced by the noise. However, the bisecting k-median is more susceptible to the noise, which can be demonstrated by the fact that it cannot separate the two clusters at all.
This process is repeated several times and the results are very consistent. We further increase the amount of noise from 10% to 20% and get a similar result.
Figure 9 shows that the algorithms based on the spatial median have very similar entropy values and misclustering rates on the noisy Alon data. Since the bisecting k-median cannot separate the two clusters, its entropy value and misclustering rate are not available and thus not shown in Figure 9.
Conclusion
The spatial depth function provides a robust location estimator, whereas the componentwise median may not work well for high dimension and low sample size data, as illustrated by a simple simulation. The experimental results on real data sets further verify that the spatial-median-based bisecting clustering algorithm is more robust to outliers and noise in high dimensional data, such as gene expression data, than the bisecting k-median algorithm.
Methods
Geometric structure of HDLSS data
In their 2005 article, Hall, Marron and Neeman [15] point out that for d-dimensional i.i.d. random vectors $Z_1, \ldots, Z_m$ whose coordinates are i.i.d. standard normal N(0, 1), all pairwise Euclidean distances $\|Z_i - Z_j\|_d$ between distinct vectors are approximately equal and all pairwise angles $\mathrm{ang}(Z_i, Z_j)$ are approximately π/2 for large d. Without normality assumptions they further demonstrate that all pairwise distances are still approximately equal under certain moment assumptions. Specifically, they give the following geometric representation. For an infinite sequence $X = (X^{(1)}, X^{(2)}, \ldots)$ of random variables, assume
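(i) There exists a constant M such that $E|X^{(i)}|^4 < M$ for all i = 1, 2, ....
(ii) There is a constant σ² such that
$$\frac{1}{d}\sum_{k=1}^{d} \operatorname{var}\left(X^{(k)}\right) \to \sigma^2, \quad d \to \infty. \tag{1}$$
(iii) The infinite sequence X is ρ-mixing; for details, see [15].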
Let $X(d) = (X^{(1)}, \ldots, X^{(d)})$ be the coordinate projection of X into the d-dimensional space $\mathbb{R}^d$ and let $X_1(d), \ldots, X_m(d)$ be independent and identically distributed copies of X(d). Then for all distinct pairs $X_i \neq X_j$, the distances are approximately equal:
$$d^{-1/2}\|X_i - X_j\|_d \to \sqrt{2}\,\sigma, \quad d \to \infty. \tag{2}$$
Observing their result, we find, with $\mu = E X_i$, that
$$d^{-1}\|X_i - X_j\|_d^2 - d^{-1}\|X_i - \mu\|_d^2 - d^{-1}\|X_j - \mu\|_d^2 \to 0$$
as d → ∞. This shows, in view of the Pythagorean theorem, the following fact.
Fact 1. Under the above assumptions (i)–(iii), distinct $X_i - \mu$ and $X_j - \mu$ are approximately perpendicular:
$$\mathrm{ang}(X_i - \mu, X_j - \mu) = \pi/2 + O_p(d^{-1/2}). \tag{3}$$
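The near-orthogonality in Fact 1 and the approximately constant pairwise distance are easy to observe in a small simulation. The sketch below uses standard normal coordinates, so that μ = 0 and σ = 1; the dimension d = 2000 and the seed are arbitrary choices made for this illustration.

```python
import math
import random

def angle(u, v):
    """Angle in radians between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(dot / (norm_u * norm_v))

rng = random.Random(1)
d = 2000  # high dimension, chosen arbitrarily for the illustration

# Coordinates are i.i.d. N(0, 1), so the common mean is the zero vector.
z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]

pair_angle = angle(z1, z2)
scaled_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)) / d)

print(pair_angle, math.pi / 2)           # close to a right angle
print(scaled_distance, math.sqrt(2.0))   # close to sqrt(2) * sigma
```

Repeating the experiment with a smaller d makes the deviations from π/2 and √2 visibly larger, consistent with the $O_p(d^{-1/2})$ rate.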
It is well known that the spatial depth function attains its maximum value at the center of symmetry of a distribution under very mild assumptions, and the spatial median is the maximizer. Thus the spatial median is the center of the regular simplex when the number of observations at every vertex is equal.
This exhibits that, for HDLSS data, the spatial depth can find the center and this helps find the right clusters, while a componentwise median may fail to find the symmetric center and thus the componentwise-median-based procedures may be unable to find the right clusters. In fact, we expanded the dimension of our data set from the previous simulation which has three dimensions as shown in Figure 2 and found that the componentwise-median-based bisecting k-median breaks down more easily with increasing dimension while the bisecting k-spatialMedian does not.
Theoretical verification of subcluster selection rule
Suppose that we have collected observations $X_j$, j ∈ J = {1,...,n}, which are points in $\mathbb{R}^d$. Suppose also that these observations are from two sources. We want to find a rule to measure the condenseness of the data, in other words, how different the two sources are. Statistically we suppose that $X_j$, j ∈ J, are independent observations from a population distribution F. Suppose that $X_j$, j ∈ J1, and $X_j$, j ∈ J2, are from population distributions F1 and F2 respectively, with J1, J2 being a partition of J. For convenience we refer to these two subclusters of J as J1 and J2 respectively. We want to use robust depth functions to measure the condenseness of J, or in other words, the separatedness of J1 and J2. Let D(x, F) be the population depth of a point x with respect to F. The sample depth is $D(x, J) \equiv D(x, F_n)$, where $F_n$ is the empirical distribution of the observations.
One of the desirable properties of most depth functions is monotonicity relative to the deepest point, i.e., the depth-based multivariate median. Specifically, as a point $x \in \mathbb{R}^d$ moves away from the multivariate median M along any fixed ray through M, the depth at x decreases monotonically, namely,
$$D(x, F) \le D(M + \alpha(x - M), F), \quad x \in \mathbb{R}^d \tag{4}$$
holds for all α ∈ [0, 1]. This property can be used to characterize the separatedness of the two clusters. For unambiguity let us write X i for the observations X i : i ∈ J1 and Y j for X j : j ∈ J2.
Suppose that clusters J1 and J2 are separated. Observe that, by the monotonicity (4), if X i is from cluster J1 and Y j from cluster J2 then the depth of X i should be larger than the depth of Y j , both with respect to cluster J1. Namely,
D(X i , J1) ≽ D(Y j , J1), i ∈ J1, j ∈ J2, (5)
where ≽ is the stochastic ordering in the sense that η ≽ ξ if and only if ℙ(η ≥ ξ) ≥ 1/2 for two random variables η, ξ. The inequalities are useful in characterizing the separatedness of two clusters J1 and J2.
Note that D(X i , J1) and D(Y j , J1) are called within- and between-depth by [19] and [2]. The population version of (5) is
D(X, F1) ≽ D(Y, F1), X ~ F1, Y ~ F2. (6)
The inequality has clear geometric interpretation. With respect to distribution F1, the depth of random variable X from distribution F1 is larger than the depth of random variable Y from distribution F2. Indeed we have the following fact for the spatial depth.
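Fact 2. Suppose $F_2 = F_1(\cdot - c)$ where $c \in \mathbb{R}^d$ is a constant vector. If $F_1$ has finite support, then for X ~ F1 and Y ~ F2,
$$\mathbb{P}\big(\mathrm{SPD}(X, F_1) \ge \mathrm{SPD}(Y, F_1)\big) \to 1 \quad \text{as } \|c\| \to \infty.$$
Proof. Using $\|E_{\xi} S(x - \xi)\|^2 = E_{\xi,\eta}\, S^{\top}(x - \xi) S(x - \eta)$, where ξ, η are independent and have the common distribution $F_1$ and $E_{\xi,\eta}$ is calculated under the joint probability of ξ and η, one has
$$\mathbb{P}\big(\mathrm{SPD}(X, F_1) \ge \mathrm{SPD}(Y, F_1)\big) = \mathbb{P}\Big(E_{\xi,\eta}\big[S^{\top}(X - \xi)S(X - \eta) - S^{\top}(X - \xi + c)S(X - \eta + c)\big] \le 0\Big).$$
It is easy to see that $S^{\top}(X - \xi + c)S(X - \eta + c) \to 1$ as ||c|| → ∞. Combining the above yields the desired result and the proof is complete.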
Fact 2 implies that if one cluster is shifted far enough away, then we have the stochastic ordering (6) and hence (5) for large samples.
However, the inequality is a little too strong. Instead of requiring (6) to hold for all X ~ F1 and Y ~ F2, a less restrictive condition is to require it to hold on average, i.e.,
$$E\, D(X, F_1) \ge E\, D(Y, F_1), \quad X \sim F_1,\ Y \sim F_2. \tag{7}$$
Analogously,
$$E\, D(Y, F_2) \ge E\, D(X, F_2), \quad X \sim F_1,\ Y \sim F_2. \tag{8}$$
Indeed, similarly to the proof of Fact 2, we may establish the above two inequalities; this will be discussed elsewhere.
In order to characterize the separatedness of the two clusters we first introduce the following notions.
Depth total, Within- and Between-Depth
Let $D_{|J|}$ be the sum of the sample depths of all observations in J, i.e., $D_{|J|} = \sum_{j \in J} D(X_j, J)$; we call it the depth total on J. We call the depth totals on J1 and J2,
$$D_{|J_1|} = \sum_{i \in J_1} D(X_i, J_1), \qquad D_{|J_2|} = \sum_{j \in J_2} D(Y_j, J_2),$$
the within-depth, and
$$D_{|J_1|}^{J_2} = \sum_{i \in J_1} D(X_i, J_2), \qquad D_{|J_2|}^{J_1} = \sum_{j \in J_2} D(Y_j, J_1),$$
the between-depth. Figure 3 is a graphic display of these notations.
Summing (5) over i ∈ J1, j ∈ J2 yields
$$\frac{1}{|J_1|} D_{|J_1|} \ge \frac{1}{|J_2|} D_{|J_2|}^{J_1}.$$
Analogously,
$$\frac{1}{|J_2|} D_{|J_2|} \ge \frac{1}{|J_1|} D_{|J_1|}^{J_2}.$$
These two inequalities can be used to characterize the separatedness of two clusters J1 and J2. To exploit the inequalities simultaneously we introduce the following.
Relative average depth
The quantity
$$\mathrm{RAD}(J_1, J_2) = \frac{1}{|J_1|}\left(D_{|J_1|} - D_{|J_1|}^{J_2}\right) + \frac{1}{|J_2|}\left(D_{|J_2|} - D_{|J_2|}^{J_1}\right)$$
is called the relative average depth. If clusters F1 and F2 are separated, then the two inequalities (7) and (8) should hold. We believe that the two inequalities can be used to characterize the separatedness of two clusters of random variables. Note that if Y is in fact from the same distribution as X, namely, F1 = F2, then equality holds in (7) and (8). In other words, a value of RAD close to zero indicates that the cluster J is actually one cluster. Clearly RAD is bounded from above by 2. A value of RAD close to 2 indicates that the cluster J is comprised of two clusters J1 and J2. Summarizing our discussion above, we have the following result.
Selection criterion
A cluster with the largest value of RAD should be selected to split. If a cluster is less condensed, the RAD value will be larger. So the cluster with the largest RAD value will be the least condensed and thus should be selected for splitting.
Evaluation measures
Suppose that $Z = (z_{ij})$ is the m × n confusion matrix, where $z_{ij}$ is the number of data points that are predicted to be in cluster $C_i$ but in fact are from the true cluster $C_j$. For generality, we allow m and n to differ, but in our experiments the number of actual clusters k is known, so m = n = k. Let $m_j = \sum_i z_{ij}$ be the number of data points in the true cluster j and $n_i = \sum_j z_{ij}$ the number of data points in the predicted cluster i. Let N be the total number of data points.
One common measure of cluster quality is entropy. The entropy of predicted cluster i is defined as:
$$E_i = -\frac{1}{\log k} \sum_{j=1}^{k} \frac{z_{ij}}{n_i} \log \frac{z_{ij}}{n_i},$$
where k is the number of clusters and 0 log 0 is taken to be 0.
The value of entropy ranges from 0 to 1. An entropy value of 0 means the cluster is comprised entirely of one class, while an entropy value near 1 implies that the cluster contains a uniform mixture of classes. The smaller the entropy value, the better the clustering performance.
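As a small worked example of this measure, the sketch below computes the entropy of one row of a confusion matrix. It assumes the usual normalization by log k, which is what makes the value range from 0 to 1 as stated; the function name and the example rows are made up for the illustration.

```python
import math

def cluster_entropy(row, k):
    """Normalized entropy of one predicted cluster.

    row holds the counts of each true class inside the predicted cluster;
    dividing by log(k) is an assumption that keeps the value in [0, 1].
    """
    size = sum(row)
    entropy = 0.0
    for count in row:
        if count > 0:  # convention: 0 * log(0) = 0
            share = count / size
            entropy -= share * math.log(share)
    return entropy / math.log(k)

pure = cluster_entropy([10, 0], k=2)   # a single class only
mixed = cluster_entropy([5, 5], k=2)   # uniform mixture of two classes
print(pure, mixed)
```

A predicted cluster that contains a single class scores 0, and an even mixture of all k classes scores 1.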
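Another measure of clustering quality that we use is the misclustering rate. Based on the confusion matrix, the accuracy for the j-th true cluster is $z_{jj}/m_j$, where $m_j$ is the number of data points in true cluster j. Since true cluster j contributes $m_j$ of the N data points, its accuracy is given the weight $m_j/N$. The global accuracy [20] is the weighted sum
$$\sum_{j} \frac{m_j}{N} \cdot \frac{z_{jj}}{m_j} = \frac{1}{N} \sum_{j} z_{jj}.$$
Then the misclustering rate is $1 - \frac{1}{N} \sum_{j} z_{jj}$.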
Since we do not know in advance how the predicted clusters match up with the true ones, the diagonal entries $z_{jj}$ of the confusion matrix may not count the correctly clustered data points. We therefore use a brute-force search for the best alignment between the predicted and the true clusters. The time complexity is O(k!) for k true clusters and k predicted clusters. This brute-force search is not part of the clustering algorithm itself; it is used only to ensure a fair evaluation.
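The brute-force alignment can be sketched in a few lines of Python. The sketch assumes a square k × k confusion matrix stored as a list of rows, with rows indexing predicted clusters and columns indexing true clusters. The function name `misclustering_rate` is hypothetical.

```python
from itertools import permutations

def misclustering_rate(z):
    """Misclustering rate under the best one-to-one alignment.

    z[i][j] = number of points in predicted cluster i that belong to
    true cluster j. All k! assignments of predicted clusters to true
    clusters are tried, so this is only practical for small k.
    """
    k = len(z)
    n_total = sum(sum(row) for row in z)
    # perm[j] is the predicted cluster matched with true cluster j
    best_correct = max(
        sum(z[perm[j]][j] for j in range(k))
        for perm in permutations(range(k))
    )
    return 1.0 - best_correct / n_total

# Predicted cluster 0 holds mostly true class 1 and vice versa, so the
# best alignment swaps the labels: 18 of 20 points are then correct.
rate = misclustering_rate([[0, 10], [8, 2]])
```

With the labels taken as given, only 2 of the 20 points lie on the diagonal; after alignment 18 do, so the misclustering rate is 2/20.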
References
Parsons L, Haque E, Liu H: Subspace Clustering for High Dimensional Data: a Review. SIGKDD Explor Newsl. 2004, 6: 90-105. 10.1145/1007730.1007731.
Jörnsten R, Vardi Y, Zhang CH: A Robust Clustering Method and Visualization Tool Based on Data Depth. 2002, Basel: Birkhäuser
Garcia-Escudero LA, Gordaliza A: Robustness Properties of k Means and Trimmed k Means. Journal of the American Statistical Association. 1999, 94 (447): 956-969. 10.2307/2670010.
Tukey JW: Mathematics and the Picturing of Data. Proceedings of the International Congress of Mathematicians. 1975, 2: 523-531.
Oja H: Descriptive Statistics for Multivariate Distributions. Statist Probab Lett. 1983, 1: 327-333. 10.1016/0167-7152(83)90054-8.
Liu RY: On a Notion of Data Depth Based upon Random Simplices. The Annals of Statistics. 1990, 18: 405-414.
Zuo Y, Serfling R: General Notions of Statistical Depth Function. The Annals of Statistics. 2000, 28 (2): 461-482. 10.1214/aos/1016218226.
Koshevoy G, Mosler K: Zonoid Trimming for Multivariate Distributions. Annals of Statistics. 1997, 25 (5): 1998-2017. 10.1214/aos/1069362382.
Zhang J: Some Extensions of Tukey's Depth Function. Journal of Multivariate Analysis. 2002, 82: 134-165. 10.1006/jmva.2001.2011.
Chaudhuri P: On a Geometric Notion of Quantiles for Multivariate Data. Journal of the American Statistical Association. 1996, 91: 862-872. 10.2307/2291681.
Vardi Y, Zhang CH: The Multivariate L1-median and Associated Data Depth. Proc Natl Acad Sci USA. 2000, 97 (4): 1423-1426. 10.1073/pnas.97.4.1423.
Serfling R: A Depth Function and a Scale Curve Based on Spatial Quantiles. 2002, Boston: Birkhäuser, 25-38.
Koltchinskii VI: M-estimation, Convexity and Quantiles. Ann Statistics. 1997, 25: 435-477. 10.1214/aos/1031833659.
Donoho DL, Huber P: The Notion of Breakdown Point. 1983, Belmont, CA: Wadsworth, 157-184.
Hall P, Marron JS, Neeman A: Geometric Representation of High Dimension, Low Sample Size Data. J R Statist Soc B. 2005, 67: 427-444. 10.1111/j.1467-9868.2005.00510.x.
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proc Natl Acad Sci USA. 1999, 96: 6745-6750. 10.1073/pnas.96.12.6745.
Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, Behm FG, Raimondi SC, Relling MV, Patel A, Cheng C, Campana D, Wilkins D, Zhou X, Li J, Liu H, Pui CH, Evans WE, Naeve C, Wong L, Downing JR: Pediatric Lymphoblastic Leukemia by Gene Expression Profiling. Cancer Cell. 2002, 1: 133-143. 10.1016/S1535-6108(02)00032-6.
Ding Y, Wilkins D: Improving the Performance of SVM-RFE to Select Genes in Microarray Data. BMC Bioinformatics. 2006, 7 (Suppl 2):
Jörnsten R: Clustering and Classification Based on the L1 Data Depth. Journal of Multivariate Analysis. 2004, 90: 67-89. 10.1016/j.jmva.2004.02.013.
Ding C, He X: Cluster Merging and Splitting in Hierarchical Clustering Algorithms. Proceedings of IEEE International Conference on Data Mining (ICDM'02). 2002
Acknowledgements
We thank Dr. Yixin Chen for his valuable suggestion. We also thank Alon et al. and St. Jude Children's Research Hospital for the use of their data sets.
This article has been published as part of BMC Bioinformatics Volume 8 Supplement 7, 2007: Proceedings of the Fourth Annual MCBIOS Conference. Computational Frontiers in Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S7.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
XD and HP contributed to the theoretical development. YD and DW contributed to the experimentation and development of the computer code. All authors read and approved the final manuscript.
Electronic supplementary material
12859_2007_1939_MOESM1_ESM.eps
Additional file 1: The relationship between the number of runs and average entropy of the three algorithms on the Alon data. Additional file 1 demonstrates that the average entropy values of all the algorithms become more stable as the number of runs increases. In this figure, 500 genes were selected. (EPS 14 KB)
12859_2007_1939_MOESM2_ESM.eps
Additional file 2: The relationship between the number of runs and average entropy of the three algorithms on the SJCRH data. Additional file 2 demonstrates that the average entropy values of all the algorithms become more stable as the number of runs increases. In this figure, 1000 genes were selected. (EPS 14 KB)
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Ding, Y., Dang, X., Peng, H. et al. Robust clustering in high dimensional data using statistical depths. BMC Bioinformatics 8 (Suppl 7), S8 (2007). https://doi.org/10.1186/1471-2105-8-S7-S8
DOI: https://doi.org/10.1186/1471-2105-8-S7-S8