Improved shrunken centroid classifiers for high-dimensional class-imbalanced data
 Rok Blagus^{1} and
 Lara Lusa^{1}
DOI: 10.1186/1471-2105-14-64
© Blagus and Lusa; licensee BioMed Central Ltd. 2013
Received: 16 August 2012
Accepted: 31 January 2013
Published: 23 February 2013
Abstract
Background
PAM, a nearest shrunken centroid (NSC) method, is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate.
Results
We show that when data are class-imbalanced the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables or the class-imbalance is larger and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means).
Conclusions
The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy based on the minimization of the overall error rate when NSC classifiers are biased towards the majority class. The number of variables included in the NSC classifiers when using our approach is much smaller than with the original approach, both on simulated and on real data.
Background
The objective of class prediction (classification) is to develop a rule based on variables measured on a group of samples with known class membership (training set), which can be used to assign the class membership to new samples (test set). Many different classifiers exist, and they differ in the definition of the classification rule [1]. Nowadays classification rules are increasingly often developed using data that are high-dimensional (the number of variables greatly exceeds the number of samples) and also class-imbalanced (the number of samples belonging to each class is not the same). High-dimensional classification has become a popular task in the biomedical and bioinformatics communities with the advent of high-throughput technologies in biomedicine a decade ago. For example, many researchers attempted to develop gene-expression classifiers based on microarray experiments for prognostic and predictive purposes in breast cancer [2]. The number of subjects included in microarray-based classification studies is usually in the hundreds, while the number of measured genes is in the tens of thousands. Nowadays, the newly available next-generation sequencing methods provide billions of short reads for each subject, further increasing the dimensionality of the data.
For high-dimensional data Tibshirani et al. [3] proposed the nearest shrunken centroid method (NSC, known also as prediction analysis of microarrays, PAM), which can be seen as a modification of diagonal linear discriminant analysis (DLDA [4]). The classification rule of DLDA is based on the scaled distance between the expression profiles of new samples and the class centroids (vectors of class-specific means); PAM uses a very similar rule, but it shrinks the class centroids towards the overall means and it embeds a variable selection mechanism, which is generally useful in high-dimensional class prediction [4]. The amount of shrinkage is usually determined by minimizing the cross-validated error rate on the training set [5, 6]. Since its proposal, PAM has been widely used in practice. The paper that first described the PAM methodology [3] has been cited about a thousand times, mostly in journals from the biomedical field: papers from the fields of Oncology, Biochemistry and Biotechnology alone account for about half of the citations (source: ISI Web of Knowledge, accessed in November 2012).
Classifiers trained on class-imbalanced data tend to classify most of the new samples into the majority class [7], and this bias is further increased if data are high-dimensional [8]. It is somewhat surprising that while DLDA can perform fairly well with imbalanced data (provided that the number of variables is reduced by some type of variable selection method) [4, 8, 9], PAM is very sensitive to the class-imbalance problem: it assigns most new samples to the majority class and achieves very poor accuracy for the minority class even when the level of class-imbalance is only moderate [8]. Even when the differences between classes are large the predictive accuracy tends to be smaller in the minority class. For example, Reeve and colleagues [10] used PAM to build a classifier to distinguish rejection from non-rejection kidney transplants using gene expression microarray data, achieving a better predictive accuracy for the majority class of non-rejection transplants: the cross-validated predictive accuracies were 80% (108 out of 135) in the non-rejection group and 69% (35 out of 51) in the rejection group. Similarly, Korkola et al. [11] used PAM to predict the prognosis of 55 breast cancer patients and obtained a cross-validated predictive accuracy of 76% (26 out of 34) for good-prognosis patients and of 62% (13 out of 21) for poor-prognosis patients.
The class-imbalance bias of DLDA can be attributed to the larger variability of the estimate of the minority class centroid [8]; variable selection reduces the bias, but does not completely remove it if data are high-dimensional. The bias increases when the class-imbalance is larger and when more variables are measured, because in these settings large discrepancies between sample values and true values are common in the minority class. Intuitively, PAM should have an edge over DLDA in the class-imbalanced scenario, as shrinking the class centroids towards the overall centroid should reduce the extreme mean values that arise by chance and consequently diminish the class-imbalance bias.
Wang and Zhu [12] reinterpreted PAM in the framework of LASSO regression [13] and proposed two classifiers that enable a different amount of shrinkage for each variable (ALP: adaptive L_{∞}-norm penalized NSC, and AHP: adaptive hierarchically penalized NSC). They used simulated and real data to show that their methods outperform PAM in most circumstances, but did not specifically address the class-imbalance problem.
In this article we identify the features of the NSC classifiers that contribute to the class-imbalance bias and propose modified methods, GM-PAM, GM-ALP and GM-AHP, to reduce it. The GM classifiers estimate the optimal shrinkage by maximizing the cross-validated geometric mean of the class-specific predictive accuracies (g-means) and do not use the class prior correction that is embedded in the original classifiers.
The rest of the article is organized as follows. In the “Methods” section we present the PAM, AHP and ALP classifiers. In the “Results” section we first explain the pitfalls of the existing approach for the determination of the optimal threshold and present the novel algorithm; we then apply the algorithm to the three NSC classifiers and compare them with the existing approaches using simulated and real high-dimensional data. We end with a discussion and conclusions in the “Discussion” and “Conclusion” sections.
Methods
where ${d}_{kj}=\frac{{\overline{x}}_{kj}-{\overline{x}}_{j}}{{m}_{k}({s}_{j}+{s}_{0})}$, λ ≥ 0 is a threshold parameter that needs to be tuned, and (·)_{+} denotes the positive part of (·).
π_{k} is the proportion of class k samples in the population ($\sum_{k=1}^{K}{\pi}_{k}=1$), −2 log(π_{k}) is a class prior correction and ${L}_{k}=\sum_{j=1}^{p}{L}_{kj}$.
Variable j is effectively not considered in the classification rule (inactive variable) when all the ${\overline{x}}_{kj}^{\prime}$ are shrunken to ${\overline{x}}_{j}$, so that ${L}_{1j}=\cdots={L}_{Kj}$; we call the other variables active.
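To make the shrinkage step concrete, the following minimal NumPy sketch (our own illustration, not the pamr implementation; the helper name is hypothetical, and the median-based choice of s_0 follows the usual PAM convention) soft-thresholds the d_{kj} scores defined above and rebuilds the class centroids from the shrunken scores:

```python
import numpy as np

def shrunken_centroids(X, y, lam):
    """Sketch of PAM's soft-thresholding step: shrink each d_kj towards
    zero by lam and rebuild the class centroids (hypothetical helper)."""
    classes = np.unique(y)
    n, p = X.shape
    xbar = X.mean(axis=0)                                 # overall centroid
    # pooled within-class standard deviation s_j
    ss = sum(((X[y == k] - X[y == k].mean(axis=0)) ** 2).sum(axis=0)
             for k in classes)
    s = np.sqrt(ss / (n - len(classes)))
    s0 = np.median(s)                                     # PAM's fudge constant
    centroids = {}
    for k in classes:
        nk = int((y == k).sum())
        mk = np.sqrt(1.0 / nk - 1.0 / n)
        d = (X[y == k].mean(axis=0) - xbar) / (mk * (s + s0))
        d_shr = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)  # (|d| - lam)_+
        centroids[k] = xbar + mk * (s + s0) * d_shr
    return centroids
```

With lam = 0 the centroids are the plain class means (no shrinkage); a sufficiently large lam shrinks every centroid onto the overall centroid, i.e. every variable becomes inactive.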
where w_{ j }, ${w}_{j}^{\gamma}$ and ${w}_{\mathit{\text{kj}}}^{\theta}$ are prespecified weights and λ, λ_{ γ } and λ_{ θ } are threshold parameters (see Additional file 1 for the definition of γ_{ j } and θ_{ k j }).
The shrunken centroids, discriminant scores and classification rules are the same as in PAM; the classification rules that use (6) and (7) are denoted with ALP and AHP, respectively.
where L_{ k } is the discriminant score omitting the class prior correction.
In practice, for high-dimensional data the class prior correction contributes little to the discriminant scores (L_{k} ≫ −2 log(π_{k}) and δ_{k} ≈ L_{k} for large p), while it can bias the NSC classification towards the majority class if all or most of the variables are inactive (L_{k} ≈ 0 and δ_{k} ≈ −2 log(π_{k})). For these reasons we used equal class priors for all the classes (−2 log(1/K)), similarly to Huang et al. [14]. Moreover, in case of ties the class membership was assigned at random to one of the classes with the smallest discriminant score.
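With equal priors the rule reduces to picking the class with the smallest L_k, since the prior term is then a constant shared by all classes. A small sketch (hypothetical helper; it takes shrunken centroids computed elsewhere) with the random tie-breaking described above:

```python
import numpy as np

def classify(x_new, centroids, s, s0, rng=None):
    """Assign x_new to the class with the smallest discriminant score
    L_k = sum_j (x_j - xbar'_kj)^2 / (s_j + s0)^2, using equal class
    priors (the -2 log(pi_k) term cancels); ties broken at random."""
    rng = rng or np.random.default_rng(0)
    scores = {k: float((((x_new - c) / (s + s0)) ** 2).sum())
              for k, c in centroids.items()}
    best = min(scores.values())
    winners = [k for k, v in scores.items() if v <= best + 1e-12]
    return winners[rng.integers(len(winners))]
```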
Results
In this section we discuss the implications of estimating the optimal threshold for NSC classifiers by minimizing the cross-validated overall error when data are class-imbalanced and high-dimensional. We then present a modified approach for threshold estimation aimed at reducing the class-imbalance problem of NSC classifiers, and show its effectiveness on simulated and real high-dimensional class-imbalanced data.
Threshold selection
In practice the threshold parameters of the NSC classifiers are estimated by evaluating the cross-validated error rate for different values of the threshold; the threshold value that produces the lowest error is used to shrink the centroids.
and it depends on the class-specific predictive accuracies (${\text{PA}}_{k}=P(\mathcal{C}({\mathbf{x}}^{\ast})=k \mid {y}^{\ast}=k)$) and on the level of class-imbalance. Overall error and predictive accuracy (1 − error) are misleading measures of a classifier's performance when data are class-imbalanced [15]: the predictive accuracies of the minority classes are given little weight, and classifying all new samples into the majority class produces a small overall error when the class-imbalance is extreme.
The probability of classifying a new sample into the minority class is smaller when the class-imbalance is more extreme and/or when more variables are measured. As a consequence, the error rate is a decreasing function of the number of variables and, when the number of variables is large, it approaches the proportion of minority class samples in the population. For this particular setting the classification probabilities were also derived analytically, additionally assuming that the variances are known; see Additional file 2.
If we consider the shrunken centroid classifiers as a special form of DLDA, this result suggests that in the class-imbalanced scenario threshold selection based on the overall error will favor small threshold values (a large number of variables), which in turn will lead to a small probability of classification into the minority class and a large bias in favor of the majority class.
The proposed approach
g-means is an accuracy metric often used for class-imbalanced data that captures the performance of a classifier in all classes [7]. It gives the same weight to all the classes, it is independent of the class distribution of the test set and it penalizes classifiers whose performance is heterogeneous across classes. Furthermore, for a fixed total ($\sum_{k=1}^{K}{\text{PA}}_{k}$), it is maximized when the class-specific predictive accuracies are equal [16].
In practice the class-specific PA are estimated as ${\text{PA}}_{k}=1/{n}_{k}\sum_{i=1}^{n}{z}_{ik}\cdot {z}_{i\hat{y}}$, where ${z}_{i\hat{y}}$ is the indicator for a correctly classified sample i (${z}_{i\hat{y}}=1$ if $\mathcal{C}({x}_{i})={y}_{i}$ and zero otherwise); they depend on the selected threshold value. It is not feasible to evaluate the cross-validated g-means for all possible threshold values, therefore we limit our attention to a fixed number of thresholds. We consider T equally spaced threshold values, ranging from 0 (no shrinkage) to λ_{max}, the minimum threshold value that shrinks all the class centroids to the overall centroid for all the variables (complete shrinkage). In Additional file 1 we show how to derive λ_{max} for PAM, ALP and AHP.
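The g-means criterion itself is straightforward to compute from the cross-validated predictions; a minimal sketch (the function name is ours):

```python
import numpy as np

def gmeans(y_true, y_pred):
    """Geometric mean of the class-specific predictive accuracies:
    g-means = (PA_1 * ... * PA_K)^(1/K)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    pa = [float(np.mean(y_pred[y_true == k] == k)) for k in classes]
    return float(np.prod(pa)) ** (1.0 / len(classes))
```

Note that a classifier that never predicts one of the classes gets a g-means of 0, regardless of how well it does on the remaining classes; this is exactly the penalization of heterogeneous performance mentioned above.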
The proposed approach for the estimation of the threshold can be used with each of the three NSC classifiers considered in this paper; the modified classifiers are denoted GM-PAM, GM-ALP and GM-AHP, respectively. The proposed algorithm is presented below.
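The tuning loop can be sketched as follows (our own sketch of the described procedure, not the authors' code; `fit`, `predict` and `gmeans` are assumed user-supplied callables, and resolving ties in favor of more shrinkage is our assumption):

```python
import numpy as np

def gm_threshold(X, y, fit, predict, gmeans, lam_max, T=30, folds=10, seed=1):
    """Evaluate T equally spaced thresholds in [0, lam_max] by K-fold CV
    and return the threshold with the largest cross-validated g-means
    (ties resolved in favor of the larger threshold, i.e. more shrinkage)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    parts = np.array_split(idx, folds)               # CV folds
    lams = np.linspace(0.0, lam_max, T)
    best_lam, best_gm = 0.0, -np.inf
    for lam in lams:
        preds = np.empty(len(y), dtype=np.asarray(y).dtype)
        for te in parts:
            tr = np.setdiff1d(idx, te)               # training indices
            model = fit(X[tr], y[tr], lam)
            preds[te] = predict(model, X[te])
        gm = gmeans(y, preds)
        if gm >= best_gm:                            # >= prefers larger lam on ties
            best_lam, best_gm = lam, gm
    return best_lam
```

The final classifier is then refit on the complete training set using the selected threshold, as described in the simulation set-up below.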
Results on the simulated data
In this section we present a series of selected results based on simulated data to assess the performance of the GM method and compare it with the original NSC classifiers.
In a two-class classification scenario, we simulated 10,000 variables from a multivariate Gaussian distribution. We used a block exchangeable correlation structure, in which the variables in the same block were correlated (pairwise correlation ρ = 0.8) while the variables from different blocks were independent (similarly to Guo et al. [17] and Pang et al. [18]); each block contained 100 variables and all variances were equal to 1. The mean values were 0 for all variables in class 1 (μ_{1} = 0). In the null case all the variables were non-informative (μ_{1} = μ_{2} = 0), while in the alternative case 100 variables were informative about the class distinction (μ_{2} = 0.5, 1, 2 or 5 for the informative variables and μ_{2} = 0 for the non-informative variables).
The training sets contained 100 samples and the proportion of class 1 samples varied from 0.5 (balanced situation) to 0.9 (highly imbalanced situation). 10-fold CV was used to estimate the optimal threshold parameter, using 30 different threshold values. The classifier trained on the complete data set with the estimated optimal threshold was used to make predictions on a large, independent and balanced test set (n_{test} = 1,000, ${k}_{1}^{test}=0.5$, where ${k}_{1}^{test}$ is the proportion of class 1 samples in the test set), simulated from the same distribution used for the training set. The performance on the test set was evaluated in terms of class-specific predictive accuracies, g-means and the area under the ROC curve (AUC): these measures do not depend on the class distribution in the test set and can be estimated with equal precision when the test set classes are balanced. Furthermore, it was previously shown that matching the prevalence in the training and test set does not attenuate the class-imbalance problem [8]. In the simulations we also evaluated the false discovery rate (FDR, the proportion of non-informative variables among the active variables) and the false negative rate (FNR, the proportion of informative variables among the inactive variables). Each simulation was repeated 500 times.
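As an illustration, one way to generate a training set with the block exchangeable correlation structure is a single shared Gaussian factor per block (a sketch under the stated simulation design; the function name and the factor construction are ours):

```python
import numpy as np

def simulate(n1, n2, p=10000, block=100, rho=0.8, mu2=1.0, n_info=100,
             seed=0):
    """One training set as in the simulations: Gaussian variables in
    blocks of `block` with pairwise correlation rho (independent across
    blocks), unit variances, class-1 means 0 and the first n_info
    variables shifted by mu2 in class 2. Requires p % block == 0."""
    rng = np.random.default_rng(seed)
    n = n1 + n2
    nblocks = p // block
    # exchangeable correlation: x = sqrt(rho)*z_block + sqrt(1-rho)*noise,
    # so Var(x) = 1 and Cov within a block = rho
    zb = rng.standard_normal((n, nblocks))
    noise = rng.standard_normal((n, p))
    X = np.sqrt(rho) * np.repeat(zb, block, axis=1) + np.sqrt(1 - rho) * noise
    y = np.array([1] * n1 + [2] * n2)
    X[y == 2, :n_info] += mu2                         # informative variables
    return X, y
```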
Performance of the classifiers under the alternative hypothesis with large class-imbalance (k_{1} = 0.9) and moderate differences between classes (μ_{2} = 1)
Method  λ^{∗} ^{a}  # info (%)  # non-info [%]  FDR  PA_{1} (n_{1} = 90)  PA_{2} (n_{2} = 10)  g-means  AUC

(standard deviations over the simulation runs are shown in parentheses below each row)
PAM  0.05  99.94  9184.1 [92.77]  0.99  0.95  0.11  0.31  0.6 
(0.08)  (0.87)  (1136.28)  (0.00)  (0.02)  (0.05)  (0.07)  (0.04)  
GMPAM  1.29  58.81  602.1 [6.08]  0.62  0.63  0.64  0.62  0.69 
(0.51)  (43.6)  (1004.73)  (0.38)  (0.1)  (0.16)  (0.08)  (0.1)  
ALP  0.07  99.96  9004.2 [90.95]  0.99  0.95  0.11  0.3  0.61 
(0.19)  (0.54)  (2232.18)  (0.01)  (0.03)  (0.07)  (0.08)  (0.04)  
GMALP  3.76  63.08  408.8 [4.13]  0.59  0.67  0.61  0.63  0.68 
(2.21)  (41.09)  (816.3)  (0.36)  (0.1)  (0.17)  (0.09)  (0.09)  
AHP  0.36  96.89  6438.9 [65.04]  0.95  0.94  0.14  0.34  0.62 
(1.57)  (12.84)  (4235.52)  (0.11)  (0.04)  (0.1)  (0.1)  (0.05)  
GMAHP  5.29  37.45  266.5 [2.69]  0.42  0.78  0.5  0.6  0.69 
(3.94)  (37.96)  (1275.07)  (0.4)  (0.08)  (0.18)  (0.12)  (0.1) 
The GM-NSC classifiers performed very similarly to the NSC classifiers in the settings where the NSC classifiers were not biased towards the majority class, i.e. when the classes were balanced or very different (data not shown). In the other situations the GM-NSC classifiers reduced the gap between the class-specific PA, obtaining larger minority class PA, g-means and AUC, and greatly reducing the number of active variables; the removal of most of the non-informative variables reduced the FDR and the bias towards classification into the majority class (Table 1), while the removal of a part of the informative variables increased the false negative rate (FNR). The best performance was obtained when the GM threshold optimization was used with PAM, while the smallest improvement was seen for AHP. This can probably be attributed to the fact that in AHP the variables with larger d_{kj} values receive weights that shrink them less, which is not desirable for the non-informative variables: their large d_{kj} values arose by chance and should therefore actually be shrunken more. Note that the smallest bias was observed for PAM (with the GM approach), where all variables are shrunken by the same amount. Similar results were obtained simulating independent variables (data not shown).
When some variables were differentially expressed and the class-imbalance was moderate, the methods using the GM approach achieved a slightly higher PA for the minority class, while this bias was only marginal when the original approach was used. Note that the overall centroid can be expressed as ${\overline{x}}_{j}=\sum_{k=1}^{K}{\overline{x}}_{kj}{n}_{k}/n$; it is a weighted average of the class-specific mean values with more weight given to the majority class, so the overall centroid is closer to the sample mean of the majority class. When the threshold is large this has the consequence of shifting the minority class centroid towards the sample mean of the majority class, and hence some of the new samples from the majority class are closer to the minority class (shrunken) centroid than to the majority class (shrunken) centroid. Classifiers using the original approach do not suffer from this problem, as the amount of shrinkage is small when the training set is class-imbalanced. One way of diminishing this bias would be to define the overall centroid as ${\overline{x}}_{j}=\sum_{k=1}^{K}{\overline{x}}_{kj}/K$, with ${m}_{k}=\sqrt{\frac{1}{K}\left(\frac{K-2}{{n}_{k}}+\frac{1}{K}\sum_{k=1}^{K}{n}_{k}^{-1}\right)}$, so that the denominator in the calculation of d_{kj} remains the appropriate standard error. We performed a limited set of simulations with this overall centroid definition for PAM and observed that the bias in favor of the minority class was removed. However, when there was a large class-imbalance, the results were slightly more biased in favor of the majority class (data not shown) than with the original overall centroid definition. One reason for the poor performance in the case of large class-imbalance is that relatively more weight was given to a less accurate estimate.
One of the possible strategies when dealing with class-imbalanced data is to use case weighting in order to adjust for the class-imbalance bias [7]. Since case weighting is not implemented in the NSC classifiers, we performed a limited set of experiments with random oversampling to give the same weight to both classes. Class-balanced training sets were obtained by replicating a subset of randomly selected samples from the minority class (replicating max(n_{1}, n_{2}) − min(n_{1}, n_{2}) samples from the minority class and obtaining a training set of size 2 max(n_{1}, n_{2})). PAM and GM-PAM were trained on the oversampled training sets and evaluated on independent test sets. The simulation settings and the settings of PAM and GM-PAM were the same as presented above (see Additional file 8 for the results).
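A random oversampling step of this kind can be sketched as follows (our own illustration, not the experiment code; we sample with replacement, an assumption that also covers imbalance beyond 2:1):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Replicate randomly chosen samples of the smaller classes so that
    every class ends up with max_k n_k samples."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts = []
    for k, nk in zip(classes, counts):
        idx = np.flatnonzero(y == k)
        if nk < n_max:
            # replicate randomly selected minority-class samples
            idx = np.concatenate([idx, rng.choice(idx, n_max - nk, replace=True)])
        parts.append(idx)
    idx = np.concatenate(parts)
    return X[idx], y[idx]
```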
Oversampling did not significantly change the performance of PAM, while it increased the class-imbalance bias of GM-PAM. After oversampling, PAM and GM-PAM performed exactly the same, achieving poor predictive accuracy for the minority class when the original training set was class-imbalanced. The number of active variables in GM-PAM was much larger than in the simulations where oversampling was not used. After oversampling the training set contains exact copies of the minority class samples. When the level of class-imbalance is severe there are many copies of the same minority class sample in the training set, and the predictive accuracy for the minority class obtained by cross-validation is (nearly) a resubstitution (training) estimate, as the same minority class samples can be used in the training and testing phases. The fact that the GM-PAM approach favored the use of a large number of variables can be explained by realizing that classifiers that use many variables maximize the resubstitution PA (in our case the minority class PA) [19, 20]; because of the class-imbalance bias the majority class PA also increases when more variables are used. Consequently, the g-means is an increasing function of the number of variables, and is maximized when the amount of shrinkage is small. The use of a large number of variables (small threshold) translates into a large class-imbalance bias on the independent test set.
We also considered a three-class scenario, simulating 5,000 variables from a multivariate Gaussian distribution; the correlation structure was the same as in the two-class scenario. We considered the null case and an alternative case where class 2 was the minority class nested between class 1 and class 3 (μ_{1} = −μ_{3} = 1 and μ_{2} = 0 for the 100 informative variables, and μ_{1} = μ_{2} = μ_{3} = 0 for the non-informative variables; n_{2} = 20, n_{1} = n_{3} = 100). The test sets were balanced (n_{test} = 1,500) and the classifiers were trained and evaluated as described for the two-class scenario. The null case results are in Additional file 9, where additional simulation results with 1,000 variables and balanced data (n_{1} = n_{2} = n_{3} = 100) are also presented.
Multiclass classification results for the class-imbalanced scenario in the alternative case
Method  λ^{∗} ^{a}  # info (%)  # non-info [%]  FDR  PA_{1} (n_{1} = 100)  PA_{2} (n_{2} = 20)  PA_{3} (n_{3} = 100)  g-means

(standard deviations over the simulation runs are shown in parentheses below each row)
PAM  6.6  59.23  0 [0.00]  0  0.86  0.04  0.86  0.29 
(0.58)  (33.68)  (0)  (0.00)  (0.02)  (0.02)  (0.02)  (0.05)  
GMPAM  1.4  99.64  834.56 [17.03]  0.5  0.75  0.31  0.75  0.55 
(0.96)  (5.75)  (1262.68)  (0.41)  (0.04)  (0.06)  (0.04)  (0.04)  
ALP  58.76  99.98  49 [1.00]  0.01  0.82  0.15  0.82  0.46 
(15.12)  (0.14)  (490)  (0.1)  (0.04)  (0.05)  (0.03)  (0.04)  
GMALP  12.77  99.64  415.68 [8.48]  0.19  0.74  0.34  0.74  0.57 
(12.88)  (3.6)  (1330.05)  (0.31)  (0.05)  (0.08)  (0.05)  (0.05)  
AHP  59.05  99.99  98 [2.00]  0.02  0.82  0.15  0.82  0.45 
(15.74)  (0.1)  (689.46)  (0.14)  (0.04)  (0.05)  (0.04)  (0.05)  
GMAHP  13.55  100  316.73 [6.46]  0.17  0.74  0.34  0.74  0.57 
(13.05)  (0)  (1164.99)  (0.28)  (0.05)  (0.08)  (0.05)  (0.04) 
Application to real highdimensional data sets
Gene expression breast cancer data sets
Data set  # genes  Classification task  n_{1}  n_{2}  n_{3}  k_{min} ^{a}

Ivshina  22,283  ER− or ER+  34  211    0.14
Grade 1, 2 or 3  68  166  55  0.19  
Grade 1, 2 or 3  40  10 to 80  40  0.25 to 0.50  
Wang  22,283  Relapse or not  179  107  0.37  
Korkola  9,524  Good or bad prognosis  34  21  0.38  
Sotiriou  7,650  ER+ or ER−  10 to 50  10    0.50 to 0.17
Grade 1-2 or 3  10 to 40  10    0.50 to 0.20
Performance of the classifiers on real gene expression data sets for the two-class classification tasks
Data set  Method  λ^{∗} ^{a}  # genes  PA  PA_{1}  PA_{2}  g-means  AUC

Ivshina  PAM  0  22283  0.84  0.79  0.85  0.82  0.85 
(ER)  GMPAM  4.83  51  0.82  0.91  0.81  0.86  0.90 
ALP  0  22283  0.84  0.82  0.84  0.83  0.86  
GMALP  58.24  26  0.85  0.91  0.84  0.87  0.88  
AHP  185.56  20  0.89  0.88  0.89  0.88  0.90  
GMAHP  69.58  115  0.89  0.88  0.89  0.89  0.91  
Wang  PAM  3.71  14  0.61  0.60  0.62  0.61  0.62 
GMPAM  3.71  14  0.60  0.60  0.62  0.61  0.63  
ALP  8.26  654  0.56  0.57  0.55  0.56  0.63  
GMALP  8.26  654  0.56  0.56  0.56  0.56  0.63  
AHP  21.95  135  0.56  0.56  0.56  0.56  0.65  
GMAHP  21.95  135  0.56  0.56  0.55  0.56  0.63  
Korkola  PAM  0.19  7073  0.65  0.71  0.57  0.64  0.64 
GMPAM  0.19  7073  0.65  0.71  0.57  0.64  0.64  
ALP  4.87  155  0.64  0.65  0.62  0.63  0.64  
GMALP  4.87  155  0.69  0.74  0.62  0.67  0.69  
AHP  0.76  1308  0.58  0.68  0.43  0.54  0.58  
GMAHP  0.76  1308  0.62  0.71  0.48  0.58  0.60 
Relapse was more difficult to predict on Wang’s data set (Table 4). PAM used the smallest number of active genes and performed slightly better than ALP and AHP in terms of class-specific PA and g-means. Exactly the same results were obtained using the NSC and GM-NSC classifiers. This result is in line with the simulations, where we showed that for moderate class-imbalance the performance of the NSC and GM-NSC classifiers was very similar.
PAM used the largest number of variables on Korkola’s data set (Table 4) but still performed better than AHP and ALP, which used fewer variables. The performance of GM-PAM and GM-AHP was similar to the original methods, while GM-ALP outperformed ALP and achieved the best overall performance on this data set.
In order to explore the effect of class-imbalance on real data, we obtained multiple training sets from Sotiriou’s data set, varying the level of class-imbalance. We used a fixed number of samples from the minority class (n_{ER−} = 10) and varied the number of samples from the majority class (n_{ER+} = 10, 20, …, 50); the samples not included in the training set were used to estimate the accuracy measures. To account for the variability arising from the random inclusion of samples in the training or test set, we repeated the procedure 250 times and averaged the results.
Results on the Ivshina data set for the classification of tumor grade
n_{Grade 2}  Method  λ^{∗} ^{a}  # genes  PA  PA_{1}  PA_{2}  PA_{3}  g-means

10  PAM  4.25  963  0.29  0.90  0.12  0.89  0.45 
GMPAM  2.76  5768  0.34  0.86  0.20  0.88  0.51  
ALP  10.26  7425  0.33  0.88  0.18  0.88  0.51  
GMALP  28.39  3295  0.37  0.70  0.26  0.88  0.50  
AHP  25.61  9508  0.39  0.83  0.27  0.85  0.57  
GMAHP  76.40  640  0.45  0.75  0.36  0.86  0.61  
20  PAM  3.14  3023  0.34  0.89  0.18  0.88  0.51 
GMPAM  1.27  12299  0.42  0.83  0.30  0.85  0.59  
ALP  11.20  5541  0.40  0.85  0.26  0.88  0.57  
GMALP  10.67  9991  0.42  0.81  0.30  0.86  0.58  
AHP  30.66  7733  0.47  0.78  0.37  0.83  0.62  
GMAHP  48.52  3811  0.48  0.76  0.39  0.83  0.62  
40  PAM  1.55  9832  0.46  0.86  0.32  0.87  0.61 
GMPAM  0.49  17439  0.52  0.83  0.41  0.84  0.65  
ALP  12.43  6885  0.45  0.82  0.31  0.87  0.59  
GMALP  2.00  16919  0.51  0.80  0.41  0.83  0.64  
AHP  37.43  6366  0.52  0.73  0.44  0.83  0.64  
GMAHP  32.21  7148  0.53  0.74  0.44  0.82  0.64  
80  PAM  0.28  19617  0.58  0.77  0.48  0.83  0.67 
GMPAM  0.32  19311  0.58  0.78  0.47  0.83  0.67  
ALP  0.68  19149  0.59  0.75  0.50  0.82  0.67  
GMALP  0.76  18727  0.59  0.75  0.50  0.82  0.67  
AHP  15.20  11231  0.58  0.73  0.48  0.82  0.66  
GMAHP  14.69  10928  0.57  0.73  0.48  0.82  0.66 
Decreasing the number of Grade 2 samples further decreased the PA of Grade 2 and the g-means; the PA of the other two classes, which were high when the data were balanced, increased only moderately for most classifiers. The drop in the PA of Grade 2 was less pronounced for the GM classifiers. Similarly to the other classification tasks, the GM method was most useful in improving the performance of PAM and ALP.
Discussion
In this paper we proposed a modified approach (GM-NSC) to the estimation of the amount of centroid shrinkage for the NSC classifiers. The approach estimates the optimal shrinkage by maximizing the geometric mean of the class-specific predictive accuracies, rather than the overall accuracy. We used our approach with PAM and with two recently proposed NSC classifiers, ALP and AHP.
The motivation for the new approach is to alleviate the class-imbalance problem of the NSC classifiers. We showed with a limited set of simulations that ALP and AHP, similarly to PAM [8], are biased towards classification into the majority class when data are class-imbalanced: they assign most new samples to the majority class and achieve poor predictive accuracy for the minority class, unless the differences between the classes are very large. Increasing the number of measured variables further increases the bias.
We identified the main reason for the biased NSC classification in the method used in practice for estimating the threshold parameter, which is based on the minimization of the cross-validated overall error rate. The threshold parameter plays a fundamental role for NSC classifiers, as it determines how many variables are effectively used in the classification rule and by how much the centroids are shrunken.
Simulation results and the analysis of three large breast cancer data sets showed that the greatest gains were obtained by GM-NSC when the NSC classifiers had a large bias towards the majority class, while GM-NSC performed similarly to NSC in the absence of bias. The GM-NSC classifiers used fewer active variables when data were class-imbalanced.
In biomedical applications the improvements obtained using the GM-NSC classifiers are relevant from a practical point of view. The reduction of the class-imbalance bias results in more accurate predictions for the minority class samples, which are often the samples for which an accurate prediction is more important. Moreover, the inclusion of a smaller number of variables in the classifiers is a desirable property in biomedical applications where the aim is to develop prognostic or predictive models. Many researchers have argued that it is advantageous to use microarray-based classifiers that include a small number of genes (see for example [24] and references therein). The reason is that classifiers that include numerous genes can be more difficult to transfer to clinical practice, because their interpretation and practical implementation are more difficult. At the same time, it was shown that classifiers that include few genes can perform well in practice [24-26].
The current implementation of the NSC classifiers does not allow for case weighting, so we performed random oversampling in an attempt to give equal weight to the classes. We observed that random oversampling had no effect on PAM, while it increased the class-imbalance bias of GM-PAM, substantially increasing the number of active variables. The reason for the poor performance of GM-PAM is that, because of the oversampling, the g-means used to determine the optimal threshold is not a fully cross-validated estimate, as the same minority class samples are used when training and evaluating the classifier. The improperly cross-validated g-means is maximized by classifiers that use a large number of variables. Although we did not perform the oversampling experiments for ALP and AHP, we expect that the same conclusion would apply, as the determination of the optimal threshold would likely suffer from the same problem. In general, special care is needed when the tuning parameters are determined after the training set is oversampled; for example, Random Forests and Support Vector Machines require the optimization of tuning parameters, which is normally done with cross-validation.
We chose to select the optimal shrinkage by maximizing the g-means of the classifiers, which is more appropriate than the overall error for assessing the effectiveness of classifiers trained on imbalanced data [15]. Other assessment measures have been proposed for class-imbalanced data: two popular alternatives in two-class problems are the F-measure and the area under the ROC curve (AUC); however, their generalization to multiclass problems is not as straightforward as for g-means. The F-measure is a function of the predictive accuracy and the predictive value of the positive class; the weight given to each measure depends on a parameter chosen by the user. Being a function of the predictive values, it is sensitive to the data distribution, which is not a desirable property when data are class-imbalanced. AUC depends on the class-specific predictive accuracies, similarly to g-means. A possible advantage of g-means over AUC is its behavior when evaluating uninformative classifiers ($P(\mathcal{C}({x}^{\ast})=1 \mid y=1)=P(\mathcal{C}({x}^{\ast})=1 \mid y=2)$ for two classes). In this case g-means favors the classifiers that assign the same number of samples to each class, while AUC is approximately the same for all uninformative classifiers; as a consequence, the estimation of the threshold parameter using AUC is very unstable when the differences between the classes are small. We also considered the maximization of the sum of the class-specific predictive accuracies ($\sum_{k=1}^{K}{\text{PA}}_{k}$); however, this measure has a drawback similar to AUC, as it cannot distinguish between uninformative classifiers. Experimental results for PAM showed that under the null hypothesis (and when the difference between the classes was small) this approach performed slightly worse than using g-means to determine the optimal threshold, while the results were very similar when the difference between the classes was large (data not shown).
Tibshirani et al. [5] proposed a procedure for the adaptive choice of the threshold for PAM that allows a different amount of shrinkage for each class, and showed that this approach can lead to a smaller number of active variables. However, Wang and Zhu [12] observed that using the adaptive threshold procedure does not change the predictive accuracy of PAM. We obtained similar results and observed that, while in multi-class classification problems the number of active variables sometimes decreases, the predictive accuracy of PAM and GM-PAM is not affected (data not shown). Therefore, the adaptive choice of the threshold does not seem beneficial in decreasing the class-imbalance problem of the NSC classifiers.
In this paper we focused on PAM, ALP and AHP; others proposed further modifications to NSC methods [17, 27], which were not evaluated in this study. However, we believe that all the classifiers that base their tuning on the minimization of the overall error rate should present the same type of problems, and would benefit from using a tuning strategy based on g-means or other cost functions that are less sensitive to the class-imbalance problem.
Huang et al. [14] observed that the estimators of the discriminant scores for discriminant analysis are biased, and derived bias-corrected discriminant scores for DLDA and DQDA. Their findings are interesting in the context of class-imbalanced high-dimensional classification, as they show that the bias of the discriminant scores depends on the class-imbalance in the training set and on the number of variables; the bias is larger in the minority class and when more variables are considered. The bias-correction outperforms the original approach, especially when the class-imbalance is large; unfortunately this approach cannot be extended straightforwardly to NSC classifiers, as the distribution of the estimator of the shrunken class centroid is not known.
A computational issue is the estimation of the optimal threshold. We used an approach similar to that used for PAM [5], evaluating a fixed number of threshold values with cross-validation. In most situations we evaluated 30 threshold values, equally spaced between 0 (all variables active) and the minimum threshold value that shrinks all the class centroids to the overall centroid (no active variables). This choice was a compromise between the accuracy of the estimation and the computational burden, which was particularly high for AHP. However, we observed that this strategy might not be optimal in all situations since, especially for AHP, the relationship between the threshold and the number of active variables was highly non-linear. Often the smallest positive threshold in the grid already produced few active variables: the estimated number of active variables could be either very small or equal to the total number of variables, while intermediate solutions were not evaluated. This could explain why in some applications GM was not successful in improving the performance of AHP. Our observations suggest that equally spaced threshold values can be an effective choice when the number of variables distinguishing the classes is relatively small, as is often the case for microarray data. In problems where many variables distinguish the classes, a solution would be to space the threshold values equally on the logarithmic scale, which would include more thresholds associated with a large number of active genes.
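The two grid strategies can be sketched as follows. The function names and the `eps` lower bound of the log grid are illustrative assumptions; `delta_max` stands for the smallest threshold that shrinks every class centroid to the overall centroid.

```python
import math

def linear_grid(delta_max, n=30):
    # n values equally spaced between 0 (all variables active) and delta_max.
    return [i * delta_max / (n - 1) for i in range(n)]

def log_grid(delta_max, n=30, eps=1e-3):
    # 0 plus n-1 values equally spaced on the log scale between
    # eps*delta_max and delta_max; this clusters thresholds near 0,
    # i.e. near models with many active variables.
    lo, hi = math.log(eps * delta_max), math.log(delta_max)
    return [0.0] + [math.exp(lo + i * (hi - lo) / (n - 2)) for i in range(n - 1)]

lin, logg = linear_grid(3.0), log_grid(3.0)
# The log-spaced grid evaluates more small thresholds, i.e. more
# candidate models with a large number of active genes.
print(sum(t < 1.5 for t in lin), sum(t < 1.5 for t in logg))
```

Both grids contain the two extreme models (no shrinkage and full shrinkage); they differ only in where the intermediate candidates are concentrated.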
We presented the GM-NSC approach limited to the case where each class is given equal importance, but the method can be extended to incorporate different misclassification costs for each class by weighting the class-specific predictive accuracies. This extension would be useful for problems where the cost of misclassification is not equal for each class.
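One way to realize the cost-weighted extension mentioned above is a weighted geometric mean, in which each class-specific predictive accuracy PA_k is raised to a weight w_k (with the weights summing to 1) that reflects the cost of misclassifying class k. The specific weighting scheme below is an illustrative assumption, not taken from the paper.

```python
import math

def weighted_gmeans(accs, weights):
    # Weighted geometric mean of the class-specific accuracies;
    # equal weights (1/K each) recover the plain g-means.
    if any(a == 0.0 for a in accs):
        return 0.0
    return math.exp(sum(w * math.log(a) for a, w in zip(accs, weights)))

accs = [0.9, 0.6]  # class 2 is classified less accurately
print(weighted_gmeans(accs, [0.5, 0.5]))  # equal weights: plain g-means
# Putting more weight on the poorly classified class lowers the score,
# pushing the tuning towards thresholds that protect that class.
print(weighted_gmeans(accs, [0.2, 0.8]) < weighted_gmeans(accs, [0.8, 0.2]))  # True
```

With this cost function, tuning the threshold by maximization would trade majority-class accuracy for minority-class accuracy in proportion to the chosen weights.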
Conclusion
We showed that three nearest shrunken centroid classifiers (PAM, ALP and AHP) achieve poor accuracy for the minority class when data are class-imbalanced and high-dimensional, unless the difference between the classes is large. We proposed GM-NSC, a straightforward yet effective approach to diminish the class-imbalance problem of NSC classifiers, which consists in estimating the optimal amount of shrinkage by maximizing the g-means of the classifier rather than its overall accuracy. We used simulated and real data to show that when the NSC classifiers are biased towards the majority class the GM-NSC approach outperforms NSC, and it performs similarly to NSC otherwise. GM-NSC classifiers generally select fewer variables, which is a desirable property in biomedical applications where the aim is to develop prognostic or predictive models.
Our experiments with random oversampling showed no improvement for PAM, while the class-imbalance bias of GM-PAM increased. We therefore recommend against using this strategy with the NSC or GM-NSC classifiers.
Abbreviations
NSC: nearest shrunken centroid
PAM: prediction analysis of microarrays
DLDA: diagonal linear discriminant analysis
ALP-NSC: adaptive L∞-norm penalized NSC
AHP-NSC: adaptive hierarchically penalized NSC
PA: predictive accuracy
AUC: area under the ROC curve
Declarations
Acknowledgements
The high-performance computation facilities were kindly provided by the Bioinformatics and Genomics unit at the Department of Molecular Biotechnology and Health Sciences, University of Torino, Italy.
Authors’ Affiliations
References
 Bishop CM: Pattern Recognition and Machine Learning (Information Science and Statistics). 1st edition. New York: Springer; 2007.
 Weigelt B, Pusztai L, Ashworth A, Reis-Filho JS: Challenges translating breast cancer gene signatures into the clinic. Nat Rev Clin Oncol 2012, 9:58-64.
 Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002, 99(10):6567-6572. 10.1073/pnas.082099299
 Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002, 97(457):77-87. 10.1198/016214502753479248
 Tibshirani R, Hastie T, Narasimhan B, Chu G: Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat Sci 2003, 18:104-117. 10.1214/ss/1056397488
 Wu B: Differential gene expression detection using penalized linear regression models: the improved SAM statistics. Bioinformatics 2006, 21(8):1565-1571.
 He H, Garcia EA: Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009, 21(9):1263-1284.
 Blagus R, Lusa L: Class prediction for high-dimensional class-imbalanced data. BMC Bioinformatics 2010, 11:523. 10.1186/1471-2105-11-523
 Blagus R, Lusa L: Impact of class-imbalance on multi-class high-dimensional class prediction. Metodološki zvezki 2012, 9:25-45.
 Reeve J, Einecke G, Mengel M, Sis B, Kayser N, Kaplan B, Halloran PF: Diagnosing rejection in renal transplants: a comparison of molecular- and histopathology-based approaches. Am J Transplant 2009, 9(8):1802-1810. 10.1111/j.1600-6143.2009.02694.x
 Korkola J, Blaveri E, DeVries S, Moore D, Hwang ES, Chen YY, Estep A, Chew K, Jensen R, Waldman F: Identification of a robust gene signature that predicts breast cancer outcome in independent data sets. BMC Cancer 2007, 7:61. 10.1186/1471-2407-7-61
 Wang S, Zhu J: Improved centroids estimation for the nearest shrunken centroid classifier. Bioinformatics 2007, 23(8):972-979. 10.1093/bioinformatics/btm046
 Tibshirani R: Regression shrinkage and selection via the Lasso. J R Stat Soc (Ser B) 1996, 58:267-288.
 Huang S, Tong T, Zhao H: Bias-corrected diagonal discriminant rules for high-dimensional classification. Biometrics 2010, 66(4):1096-1106. 10.1111/j.1541-0420.2010.01395.x
 Pepe MS: The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press; 2003.
 Lin WJ, Chen JJ: Class-imbalanced classifiers for high-dimensional data. Brief Bioinform 2012, 14(1):13-26.
 Guo Y, Hastie T, Tibshirani R: Regularized linear discriminant analysis and its application in microarrays. Biostatistics 2007, 8:86-100. 10.1093/biostatistics/kxj035
 Pang H, Tong T, Zhao H: Shrinkage-based diagonal discriminant analysis and its applications in high-dimensional data. Biometrics 2009, 65(4):1021-1029. 10.1111/j.1541-0420.2009.01200.x
 Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003, 95:14-18. 10.1093/jnci/95.1.14
 Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2003.
 Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, Fox SB, Harris AL, Liu ET: Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci USA 2003, 100(18):10393-10398. 10.1073/pnas.1732912100
 Ivshina AV, George J, Senko O, Mow B, Putti TC, Smeds J, Lindahl T, Pawitan Y, Hall P, Nordgren H, et al: Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Res 2006, 66(21):10292-10301. 10.1158/0008-5472.CAN-05-4414
 Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EMJJ, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460):671-679.
 Wang X, Simon R: Microarray-based cancer prediction using single genes. BMC Bioinformatics 2011, 12:391. 10.1186/1471-2105-12-391
 Wang X, Gotoh O: Accurate molecular classification of cancer using simple rules. BMC Med Genomics 2009, 2:64. 10.1186/1755-8794-2-64
 Haibe-Kains B, Desmedt C, Loi S, Culhane AC, Bontempi G, Quackenbush J, Sotiriou C: A three-gene model to robustly identify breast cancer molecular subtypes. J Natl Cancer Inst 2012, 104:311-325. 10.1093/jnci/djr545
 Dabney AR: Classification of microarrays to nearest centroids. Bioinformatics 2005, 21(22):4148-4154. 10.1093/bioinformatics/bti681
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.