 Research article
 Open Access
Improved shrunken centroid classifiers for high-dimensional class-imbalanced data
BMC Bioinformatics volume 14, Article number: 64 (2013)
Abstract
Background
PAM, a nearest shrunken centroid (NSC) method, is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate.
Results
We show that when data are class-imbalanced the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables or the class imbalance is larger and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means).
Conclusions
The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy, based on the minimization of the overall error rate, when the NSC classifiers are biased towards the majority class. Moreover, the number of variables included in the NSC classifiers with our approach is much smaller than with the original approach.
Background
The objective of class prediction (classification) is to develop a rule based on variables measured on a group of samples with known class membership (training set), which can be used to assign the class membership to new samples (test set). Many different classifiers exist, and they differ in the definition of the classification rule [1]. Nowadays classification rules are increasingly often developed using data that are high-dimensional (the number of variables greatly exceeds the number of samples) and also class-imbalanced (the number of samples belonging to each class is not the same). High-dimensional classification has become a popular task in the biomedical and bioinformatics community with the advent of high-throughput technologies in biomedicine a decade ago. For example, many researchers attempted to develop gene-expression classifiers based on microarray experiments for prognostic and predictive purposes in breast cancer [2]. The number of subjects included in microarray-based classification studies is usually in the range of hundreds, while the number of measured genes is in the tens of thousands. Nowadays, the newly available next-generation sequencing methods provide billions of short reads for each subject, further increasing the high-dimensionality of the data.
For high-dimensional data Tibshirani et al. [3] proposed the nearest shrunken centroid method (NSC, also known as Prediction Analysis of Microarrays, PAM), which can be seen as a modification of diagonal linear discriminant analysis (DLDA [4]). The classification rule of DLDA is based on the scaled distance between the expression profiles of new samples and the class centroids (vectors of class-specific means); PAM uses a very similar rule, but it shrinks the class centroids towards the overall means and it embeds a variable selection mechanism, which is generally useful in high-dimensional class prediction [4]. The amount of shrinkage is usually determined by minimizing the cross-validated error rate on the training set [5, 6]. Since its proposal, PAM has been widely used in practice. The paper that first described the PAM methodology [3] has been cited about a thousand times, mostly in journals from the biomedical field: papers from the fields of Oncology, Biochemistry and Biotechnology alone account for about half of the citations (source: ISI Web of Knowledge, accessed in November 2012).
Classifiers trained on class-imbalanced data tend to classify most of the new samples in the majority class [7], and this bias is further increased if data are high-dimensional [8]. It is somewhat surprising that while DLDA can perform fairly well with imbalanced data (provided that the number of variables is reduced by some type of variable selection method) [4, 8, 9], PAM is very sensitive to the class-imbalance problem: it assigns most new samples to the majority class and achieves very poor accuracy for the minority class even when the level of class imbalance is only moderate [8]. Even when the differences between classes are large, the predictive accuracy tends to be smaller in the minority class. For example, Reeve and colleagues [10] used PAM to build a classifier distinguishing rejection from non-rejection kidney transplants using gene expression microarray data, achieving a better predictive accuracy for the majority class of non-rejection transplants: the cross-validated predictive accuracies were 80% (108 out of 135) in the non-rejection group and 69% (35 out of 51) in the rejection group. Similarly, Korkola et al. [11] used PAM to predict the prognosis of 55 breast cancer patients and obtained a cross-validated predictive accuracy of 76% (26 out of 34) for good-prognosis patients and of 62% (13 out of 21) for poor-prognosis patients.
The class-imbalance bias of DLDA can be attributed to the larger variability of the estimate of the minority-class centroid [8]; variable selection reduces the bias, but does not completely remove it if data are high-dimensional. The bias increases when the class imbalance is larger and when more variables are measured, because in these settings large discrepancies between sample values and true values are common in the minority class. Intuitively, PAM should have an edge over DLDA in the class-imbalanced scenario, as shrinking the class centroids towards the overall centroid should reduce the extreme mean values that arise by chance and consequently diminish the class-imbalance bias.
Wang and Zhu [12] reinterpreted PAM in the framework of LASSO regression [13] and proposed two classifiers that allow a different amount of shrinkage for each variable (ALP: adaptive $L_\infty$-norm penalized NSC, and AHP: adaptive hierarchically penalized NSC). They used simulated and real data to show that their methods outperform PAM in most circumstances, but did not specifically address the class-imbalance problem.
In this article we identify the features of the NSC classifiers that contribute to the class-imbalance bias and propose modified methods, GM-PAM, GM-ALP and GM-AHP, to reduce it. The GM classifiers estimate the optimal shrinkage by maximizing the cross-validated geometric mean of the class-specific predictive accuracies (g-means) and do not use the class-prior correction that is embedded in the original classifiers.
The rest of the article is organized as follows. In the “Methods” section we present the PAM, AHP and ALP classifiers. In the “Results” section we first explain the pitfalls of the existing approach for the determination of the optimal threshold and present the novel algorithm; we apply the algorithm to the three NSC classifiers and compare them with the existing approaches using simulated and real high-dimensional data. We end with a discussion and conclusions in the “Discussion” and “Conclusion” sections.
Methods
Let $x_{ij}$ be the value of variable $j$ ($j=1,\ldots,p$) for sample $i$ ($i=1,\ldots,n$). Each sample belongs to one of $K$ classes ($1,2,\ldots,K$) and $y_i$ is the class of the $i$-th sample. Let $z_{ik}$ be a class membership indicator variable ($z_{ik}=1$ if $y_i=k$, $z_{ik}=0$ otherwise), and let $n_k=\sum_{i=1}^{n} z_{ik}$ be the number of samples in class $k$. The $j$-th component of the centroid in class $k$ is $\bar{x}_{kj}=\sum_{i=1}^{n} z_{ik}x_{ij}/n_k$ and the $j$-th component of the overall centroid is $\bar{x}_j=\sum_{i=1}^{n} x_{ij}/n$. The shrunken centroid is defined as

$$\bar{x}'_{kj}=\bar{x}_j+m_k(s_j+s_0)\,\hat{d}_{kj},$$
where $s_j$ is the pooled within-class standard deviation for the $j$-th variable, $s_0$ is a constant (set at the median of the $s_j$), $m_k=\sqrt{1/n_k-1/n}$, and in PAM $\hat{d}_{kj}$ is defined as

$$\hat{d}_{kj}=\mathrm{sign}(d_{kj})\,\bigl(|d_{kj}|-\lambda\bigr)_+, \qquad (2)$$
where $d_{kj}=\dfrac{\bar{x}_{kj}-\bar{x}_j}{m_k(s_j+s_0)}$, $\lambda\ge 0$ is a threshold parameter that needs to be tuned, and $(\cdot)_+$ denotes the positive part of $(\cdot)$.
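For illustration, the centroid-shrinkage step above can be sketched in a few lines of NumPy. This is a simplified sketch written from the definitions in the text (function names are ours, not from the original pamr software), not a reference implementation.

```python
import numpy as np

def shrunken_centroids(X, y, lam):
    """Sketch of PAM-style centroid shrinkage.

    X: (n, p) data matrix; y: integer class labels; lam: threshold lambda.
    Returns the shrunken centroids, one row per class.
    """
    classes = np.unique(y)
    n, p = X.shape
    overall = X.mean(axis=0)                      # overall centroid x_bar_j
    # pooled within-class standard deviation s_j
    within_ss = sum(((X[y == k] - X[y == k].mean(axis=0)) ** 2).sum(axis=0)
                    for k in classes)
    s = np.sqrt(within_ss / (n - len(classes)))
    s0 = np.median(s)                             # s0 = median of the s_j
    shrunk = np.empty((len(classes), p))
    for idx, k in enumerate(classes):
        nk = (y == k).sum()
        mk = np.sqrt(1.0 / nk - 1.0 / n)          # m_k
        d = (X[y == k].mean(axis=0) - overall) / (mk * (s + s0))
        d_hat = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)  # soft threshold
        shrunk[idx] = overall + mk * (s + s0) * d_hat
    return shrunk
```

With $\lambda=0$ the shrunken centroids coincide with the class centroids; with $\lambda\ge\lambda_{\max}$ they all collapse to the overall centroid.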
The classification rule of PAM for a new sample $\mathbf{x}^*$ is

$$\mathcal{C}(\mathbf{x}^*)=\arg\min_k \delta_k(\mathbf{x}^*),$$
where $\delta_k(\mathbf{x}^*)$ is the discriminant score for class $k$, defined as

$$\delta_k(\mathbf{x}^*)=\sum_{j=1}^{p}\frac{(x_j^*-\bar{x}'_{kj})^2}{(s_j+s_0)^2}-2\log(\pi_k)=L_k-2\log(\pi_k),$$
where $\pi_k$ is the proportion of class $k$ samples in the population ($\sum_{k=1}^{K}\pi_k=1$), $-2\log(\pi_k)$ is a class-prior correction and $L_k=\sum_{j=1}^{p} L_{kj}$.
Variable $j$ is effectively not considered in the classification rule (inactive variable) when all $\bar{x}'_{kj}$ are shrunken to $\bar{x}_j$, in which case $L_{1j}=\cdots=L_{Kj}$; we call the other variables active.
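The classification rule can likewise be sketched directly from the definitions above; the form of $L_k$ as a scaled squared distance to the shrunken centroid follows the discriminant score just defined. Names and signatures below are our own illustration.

```python
import numpy as np

def discriminant_classify(x_new, shrunk, s, s0, priors=None):
    """Sketch of the NSC classification rule: the class with the smallest
    discriminant score delta_k = L_k - 2*log(pi_k) wins.

    shrunk: (K, p) shrunken centroids; s: (p,) pooled within-class SDs;
    s0: the constant added to each s_j; priors: class proportions pi_k.
    """
    K = shrunk.shape[0]
    if priors is None:
        # equal class priors, as used in the text; the -2*log(1/K)
        # term is then a constant and does not change the arg-min
        priors = np.full(K, 1.0 / K)
    # L_k = sum_j (x*_j - shrunken centroid_kj)^2 / (s_j + s0)^2
    L = (((x_new - shrunk) / (s + s0)) ** 2).sum(axis=1)
    delta = L - 2.0 * np.log(priors)
    return int(np.argmin(delta))
```

In case of ties one would assign the class at random among the classes with the smallest score, as described below; the sketch simply returns the first arg-min.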
Wang and Zhu [12] showed that if the observation $x_i=(x_{i1},\ldots,x_{ip})$ from class $k$ follows a multivariate normal distribution ($MVN(\mu_k,\Sigma_k)$) and the covariance matrices are the same across the classes and diagonal ($\Sigma_k=\mathrm{diag}(\sigma_1^2,\ldots,\sigma_p^2)$), then (2) is a solution to
where $\tilde{x}_{ij}=\dfrac{x_{ij}-\bar{x}_j}{m_k(s_j+s_0)}$, $\tilde{d}_{kj}=\dfrac{\mu_{kj}-\bar{x}_j}{m_k(s_j+s_0)}$ and $\sum_{j=1}^{p}\sum_{k=1}^{K}|\tilde{d}_{kj}|$ is a penalty function. Based on the observation that (5) is a LASSO-type estimator for $\hat{d}_{kj}$, Wang and Zhu [12] proposed two different penalty functions,
where $w_j$, $w_j^\gamma$ and $w_{kj}^\theta$ are pre-specified weights and $\lambda$, $\lambda_\gamma$ and $\lambda_\theta$ are threshold parameters (see Additional file 1 for the definition of $\gamma_j$ and $\theta_{kj}$).
The shrunken centroids, discriminant scores and classification rules are the same as in PAM; the classification rules that use (6) and (7) are denoted with ALP and AHP, respectively.
PAM, ALP and AHP require the estimation of the threshold parameters $\lambda$, $\lambda_\gamma$ and $\lambda_\theta$. The usual procedure is to use the training data to estimate the cross-validated (CV) error rate for different values of the threshold and to select the threshold that produces the lowest overall error [5]. Note that when the threshold is zero, the classification rules of PAM, ALP and AHP are essentially the same as the classification rule of DLDA (with the exception of the added constant $s_0$, which is not considered in DLDA), which is defined as

$$\mathcal{C}(\mathbf{x}^*)=\arg\min_k L_k,$$
where $L_k$ is the discriminant score omitting the class-prior correction.
In practice, for high-dimensional data the class-prior correction contributes little to the discriminant scores ($L_k \gg -2\log(\pi_k)$ and $\delta_k\approx L_k$ for large $p$), while it can bias the NSC classification towards the majority class if all or most of the variables are inactive ($L_k\approx 0$ and $\delta_k\approx -2\log(\pi_k)$). For these reasons we used equal class priors for all the classes ($-2\log(1/K)$), similarly to Huang et al. [14]. Moreover, in case of ties the class membership was assigned at random to one of the classes with the smallest discriminant score.
Results
In this section we discuss the implications of estimating the optimal threshold for the NSC classifiers by minimizing the cross-validated overall error when data are class-imbalanced and high-dimensional. We then present a modified approach for threshold estimation aimed at reducing the class-imbalance problem for the NSC classifiers, and show its effectiveness on simulated and real high-dimensional class-imbalanced data.
Threshold selection
In practice the threshold parameters of the NSC classifiers are estimated by minimizing the cross-validated error rate over different values of the threshold; the threshold value that produces the lowest error is used to shrink the centroids.
The overall error is the probability of misclassifying new samples,

$$\text{Error}=\sum_{k=1}^{K}\pi_k\,(1-\mathrm{PA}_k),$$
and it depends on the class-specific predictive accuracies ($\mathrm{PA}_k=P(\mathcal{C}(\mathbf{x}^*)=k \mid y^*=k)$) and on the level of class imbalance. The overall error and predictive accuracy (1 − error) are misleading measures of classifier performance when data are class-imbalanced [15]: the predictive accuracies of the minority classes are given little weight, and classifying all new samples in the majority class produces a small overall error when the class imbalance is extreme.
The high-dimensionality of the data can additionally contribute to making the overall error an inappropriate measure to minimize. For the sake of simplicity let us focus on DLDA; we consider a two-class classification problem and assume that there is no real difference between the classes (null case) and that class 1 is the minority class (known $\pi_1<0.5$). Data are simulated from $x_i\sim \text{iid } MVN(\mathbf{0},\Sigma=\mathrm{diag}(1,\ldots,1))$ for both classes for $n=100$ samples, with $\pi_1=0.10$ or $\pi_1=0.30$; the results are shown in Figure 1.
The probability of classifying a new sample in the minority class is smaller when the class imbalance is more extreme and/or when more variables are measured. As a consequence, the error rate is a decreasing function of the number of variables and, when the number of variables is large, it approaches the proportion of minority-class samples in the population. For this particular setting the classification probabilities were also derived analytically, additionally assuming that the variances are known; see Additional file 2.
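This null-case behaviour is easy to reproduce. The following is a small Monte-Carlo sketch of DLDA on data with no class difference (our own code with an arbitrary seed, not the simulation code used for Figure 1); it reports the fraction of new samples assigned to the minority class.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

def dlda_minority_fraction(n=100, pi1=0.10, p=1000, n_test=500):
    """Null case: both classes drawn from MVN(0, I), so any difference
    between the sample centroids is pure noise.  Returns the fraction
    of test samples that DLDA assigns to the minority class (class 1)."""
    n1 = int(n * pi1)                              # minority-class size
    y = np.array([1] * n1 + [2] * (n - n1))
    X = rng.standard_normal((n, p))
    mu1, mu2 = X[y == 1].mean(axis=0), X[y == 2].mean(axis=0)
    # pooled within-class variance, variable by variable
    s2 = (((X[y == 1] - mu1) ** 2).sum(axis=0)
          + ((X[y == 2] - mu2) ** 2).sum(axis=0)) / (n - 2)
    Xt = rng.standard_normal((n_test, p))          # new null-case samples
    d1 = (((Xt - mu1) ** 2) / s2).sum(axis=1)      # scaled distance to centroid 1
    d2 = (((Xt - mu2) ** 2) / s2).sum(axis=1)
    return float((d1 < d2).mean())                 # assigned to minority class
```

In runs like this the fraction assigned to the minority class falls far below one half as $p$ grows, because the minority centroid is estimated with larger variance; this mirrors the behaviour summarized in Figure 1.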
If we consider the shrunken centroid classifiers as a special form of DLDA, this result suggests that in the class-imbalanced scenario threshold selection based on the overall error will favor small threshold values (a large number of variables), which in turn will lead to a small probability of classification in the minority class and a large bias in favor of the majority class.
The proposed approach
We propose to select the optimal threshold as the value that maximizes the cross-validated geometric mean of the class-specific predictive accuracies (GM, g-means),

$$\text{g-means}=\left(\prod_{k=1}^{K}\mathrm{PA}_k\right)^{1/K}.$$
g-means is an accuracy measure often used for class-imbalanced data that captures the performance of a classifier in all classes [7]. It gives the same weight to all the classes, it is independent of the class distribution of the test set and it penalizes classifiers whose performance is heterogeneous across classes. Furthermore, for a fixed total $\sum_{k=1}^{K}\mathrm{PA}_k$, it is maximized when the class-specific predictive accuracies are equal [16].
In practice the class-specific PA are estimated as $\mathrm{PA}_k=\frac{1}{n_k}\sum_{i=1}^{n} z_{ik}\,z_{i\hat{y}}$, where $z_{i\hat{y}}$ is the indicator of a correctly classified sample $i$ ($z_{i\hat{y}}=1$ if $\mathcal{C}(x_i)=y_i$ and zero otherwise); they depend on the selected threshold value. It is not feasible to evaluate the cross-validated GM for all possible threshold values, therefore we limit our attention to a fixed number of thresholds. We consider $T$ equally spaced threshold values, ranging from 0 (no shrinkage) to $\lambda_{\max}$, the minimum threshold value that shrinks all the class centroids to the overall centroid for all the variables (complete shrinkage). In Additional file 1 we show how to derive $\lambda_{\max}$ for PAM, ALP and AHP.
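The g-means criterion itself is a one-liner; the following minimal sketch computes it from true and predicted labels exactly as in the estimator above.

```python
import numpy as np

def gmeans(y_true, y_pred):
    """Geometric mean of the class-specific predictive accuracies:
    the K-th root of the product of the per-class accuracies PA_k."""
    classes = np.unique(y_true)
    pa = np.array([np.mean(y_pred[y_true == k] == k) for k in classes])
    return float(pa.prod() ** (1.0 / len(classes)))
```

Note that a single class with zero accuracy drives g-means to zero, which is exactly the penalty for classifiers that ignore the minority class.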
The proposed approach for the estimation of the threshold can be used with each of the three NSC classifiers considered in this paper; the modified classifiers are denoted GM-PAM, GM-ALP and GM-AHP, respectively. The proposed algorithm is presented below.
Algorithm 1 GM shrunken centroid classifier
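The steps of the algorithm can be sketched end-to-end as follows. This is a compact, self-contained illustration of the GM-PAM variant written from the definitions in this paper (soft-thresholded centroids, equal class priors, a grid of $T$ thresholds from 0 to $\lambda_{\max}$, CV g-means maximization); function names and the fold-assignment details are our own, and the original implementations differ.

```python
import numpy as np

def fit_pam(X, y, lam):
    """Shrunken centroids for threshold lam (PAM-style soft thresholding)."""
    classes = np.unique(y)
    n, p = X.shape
    overall = X.mean(axis=0)
    within = sum(((X[y == k] - X[y == k].mean(0)) ** 2).sum(0) for k in classes)
    s = np.sqrt(within / (n - len(classes)))
    s0 = np.median(s)
    shrunk = np.empty((len(classes), p))
    for i, k in enumerate(classes):
        nk = (y == k).sum()
        mk = np.sqrt(1.0 / nk - 1.0 / n)
        d = (X[y == k].mean(0) - overall) / (mk * (s + s0))
        shrunk[i] = overall + mk * (s + s0) * np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)
    return classes, shrunk, s, s0

def predict(Xnew, model):
    classes, shrunk, s, s0 = model
    # equal class priors: the -2*log(pi_k) correction is constant and omitted
    L = (((Xnew[:, None, :] - shrunk[None]) / (s + s0)) ** 2).sum(axis=2)
    return classes[np.argmin(L, axis=1)]

def gmeans(y_true, y_pred):
    cls = np.unique(y_true)
    pa = np.array([np.mean(y_pred[y_true == k] == k) for k in cls])
    return pa.prod() ** (1.0 / len(cls))

def lambda_max(X, y):
    """Smallest threshold that shrinks every centroid component to zero."""
    classes = np.unique(y)
    n, p = X.shape
    overall = X.mean(axis=0)
    within = sum(((X[y == k] - X[y == k].mean(0)) ** 2).sum(0) for k in classes)
    s = np.sqrt(within / (n - len(classes)))
    s0 = np.median(s)
    dmax = 0.0
    for k in classes:
        mk = np.sqrt(1.0 / (y == k).sum() - 1.0 / n)
        dmax = max(dmax, np.abs((X[y == k].mean(0) - overall) / (mk * (s + s0))).max())
    return dmax

def gm_pam(X, y, T=30, n_folds=5, seed=0):
    """GM-PAM sketch: choose the threshold maximizing the CV g-means."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(y))   # random fold labels
    grid = np.linspace(0.0, lambda_max(X, y), T)    # T equally spaced thresholds
    scores = []
    for lam in grid:
        pred = np.empty_like(y)
        for f in range(n_folds):
            model = fit_pam(X[folds != f], y[folds != f], lam)
            pred[folds == f] = predict(X[folds == f], model)
        scores.append(gmeans(y, pred))              # CV g-means for this lam
    best = grid[int(np.argmax(scores))]
    return fit_pam(X, y, best), best                # refit on the full data
```

Replacing `fit_pam` with the ALP or AHP shrinkage would give GM-ALP or GM-AHP; the threshold search itself is unchanged.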
Results on the simulated data
In this section we present a series of selected results based on simulated data to assess the performance of the GM method and compare it with the original NSC classifiers.
In a two-class classification scenario, we simulated 10,000 variables from a multivariate Gaussian distribution. We used a block exchangeable correlation structure, in which the variables in the same block were correlated (pairwise correlation equal to $\rho=0.8$) while the variables from different blocks were independent (similarly as in Guo et al. [17] and Pang et al. [18]); each block contained 100 variables and all variances were equal to 1. The mean values were equal to 0 for all variables in class 1 ($\mu_1=0$). In the null case all the variables were non-informative ($\mu_1=\mu_2=0$), while in the alternative case 100 variables were informative about the class distinction ($\mu_2=0.5$, 1, 2 or 5 for the informative variables and $\mu_2=0$ for the non-informative variables).
The training sets contained 100 samples and the proportion of class 1 samples varied from 0.5 (balanced situation) to 0.9 (highly imbalanced situation). 10-fold CV was used to estimate the optimal threshold parameter, using 30 different threshold values. The classifier trained on the complete data set with the estimated optimal threshold was used to make predictions on a large independent and balanced test set ($n_{test}=1{,}000$, $k_1^{test}=0.5$, where $k_1^{test}$ is the proportion of class 1 samples in the test set), simulated from the same distribution used for the training set. The performance on the test set was evaluated in terms of the class-specific predictive accuracies, g-means and the area under the ROC curve (AUC): these measures do not depend on the class distribution in the test set and can be estimated with equal precision when the test-set classes are balanced. Furthermore, it was previously shown that matching the prevalence in the training and test set does not attenuate the class-imbalance problem [8]. In the simulations we also evaluated the false discovery rate (FDR, the proportion of non-informative variables among the active variables) and the false negative rate (FNR, the proportion of informative variables among the inactive variables). Each simulation was repeated 500 times.
The NSC classifiers assigned most new samples to the majority class when there was no difference between the classes (Additional files 3 and 4) or the difference between classes was small or moderate (Figure 2 and Table 1). The bias towards the majority class was smaller when the classes were more balanced, when the number of variables was smaller (the results of additional simulations using 1,000 variables are presented in Additional files 5, 6 and 7) or when the difference between classes was larger. The NSC classifiers were not effective in removing the non-informative variables: the number of active variables increased markedly with the class imbalance and most of the variables were not shrunken towards the overall centroid in the most imbalanced settings; as a consequence the FDR was close to 1. In general, AHP had fewer active variables and a slightly smaller FDR, but the overall performance of PAM, ALP and AHP was similar.
The GM-NSC classifiers performed very similarly to the NSC classifiers in the settings where the NSC classifiers were not biased towards the majority class, i.e. when the classes were balanced or very different (data not shown). In the other situations the GM-NSC classifiers reduced the gap between the class-specific PA, obtaining larger minority-class PA, g-means and AUC, and greatly reducing the number of active variables; the removal of most of the non-informative variables reduced the FDR and the bias towards classification into the majority class (Table 1), while the removal of a part of the informative variables increased the false negative rate (FNR). The best performance was obtained when the GM threshold optimization was used with PAM, while the smallest improvement was seen for AHP. This can probably be attributed to the fact that the variables with larger $d_{kj}$ values are weighted so as to be shrunken less, which is not desirable for the non-informative variables, whose large $d_{kj}$ values arose by chance and should therefore actually be shrunken more; note that the smallest bias was observed for PAM (with the GM approach), where all variables are shrunken by the same amount. Similar results were obtained when simulating independent variables (data not shown).
When some variables were differentially expressed and the class imbalance was moderate, methods using the GM approach achieved a slightly higher PA for the minority class, while this bias in favor of the minority class was only marginal if the original approach was used. Note that the overall centroid can be expressed as $\bar{x}_j=\sum_{k=1}^{K}\bar{x}_{kj}\,n_k/n$; it is a weighted average of the class-specific mean values with more weight given to the majority class, so the overall centroid is closer to the sample mean of the majority class. When the threshold is large this has the consequence of shifting the minority-class centroid towards the sample mean of the majority class, and hence some of the new samples from the majority class are closer to the minority-class (shrunken) centroid than to the majority-class (shrunken) centroid. Classifiers using the original approach do not suffer from this problem, as the amount of shrinkage is small when the training set is class-imbalanced. One way of diminishing this bias would be to define the overall centroid as $\bar{x}_j=\sum_{k=1}^{K}\bar{x}_{kj}/K$, with $m_k=\sqrt{\frac{1}{K}\left(\frac{K-2}{n_k}+\frac{1}{K}\sum_{k=1}^{K} n_k^{-1}\right)}$ so that the denominator in the calculation of $d_{kj}$ remains the appropriate standard error. We performed a limited set of simulations with this overall centroid definition for PAM and observed that the bias in favor of the minority class was removed. However, when there was large class imbalance the results were slightly more biased in favor of the majority class (data not shown) than with the original overall centroid definition. One reason for the poor performance in the case of large class imbalance is that relatively more weight was given to a less accurate estimate.
One possible strategy when dealing with class-imbalanced data is to use case weighting to adjust for the class-imbalance bias [7]. Since case weighting is not implemented in the NSC classifiers, we performed a limited set of experiments with random oversampling to give the same weight to both classes. Class-balanced training sets were obtained by replicating a subset of randomly selected samples from the minority class (replicating $\max(n_1,n_2)-\min(n_1,n_2)$ samples from the minority class and obtaining a training set of size $2\max(n_1,n_2)$). PAM and GM-PAM were trained on the oversampled training sets and evaluated on independent test sets. The simulation settings and the settings of PAM and GM-PAM were the same as presented above (see Additional file 8 for the results).
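The oversampling scheme just described can be sketched as follows; this is our own minimal version (sampling with replacement, since the deficit can exceed the minority-class size), not the code used in the experiments.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Balance a training set by replicating randomly chosen samples of
    the smaller class(es) until every class has max_k(n_k) samples."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for k, nk in zip(classes, counts):
        members = np.flatnonzero(y == k)
        # replicate n_max - nk randomly chosen class-k samples
        extra = rng.choice(members, size=n_max - nk, replace=True)
        idx.extend(members)
        idx.extend(extra)
    idx = np.array(idx)
    return X[idx], y[idx]
```

In the two-class case the resulting training set has size $2\max(n_1,n_2)$, as in the text; note that the replicated minority samples are exact copies, which is the root of the resubstitution-like CV estimates discussed next.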
Oversampling did not significantly change the performance of PAM, while it increased the class-imbalance bias of GM-PAM. After oversampling, PAM and GM-PAM performed exactly the same, achieving poor predictive accuracy for the minority class when the original training set was class-imbalanced. The number of active variables in GM-PAM was much larger than in the simulations where oversampling was not used. After oversampling the training set contains exact copies of the minority-class samples. When the level of class imbalance is severe there are many copies of the same minority-class sample in the training set, and the predictive accuracy for the minority class obtained by cross-validation is (nearly) a resubstitution (training) estimate, as the same minority-class samples can appear in both the training and the testing phase. The fact that the GM-PAM approach favored the use of a large number of variables can be explained by noting that classifiers that use many variables maximize the resubstitution PA (in our case the minority-class PA) [19, 20]; because of the class-imbalance bias the majority-class PA also increases when more variables are used. Consequently, g-means is an increasing function of the number of variables and is maximized when the amount of shrinkage is small. The use of a large number of variables (a small threshold) translates into a large class-imbalance bias on the independent test set.
We also considered a three-class scenario, simulating 5,000 variables from a multivariate Gaussian distribution; the correlation structure was the same as in the two-class scenario. We considered the null case and an alternative case where class 2 was the minority class nested between class 1 and class 3 ($\mu_1=-\mu_3=1$ and $\mu_2=0$ for the 100 informative variables, and $\mu_1=\mu_2=\mu_3=0$ for the non-informative variables; $n_2=20$, $n_1=n_3=100$). The test sets were balanced ($n_{test}=1{,}500$) and the classifiers were trained and evaluated as described for the two-class scenario. The null-case results are in Additional file 9, where additional simulation results with 1,000 variables and balanced data ($n_1=n_2=n_3=100$) are also presented.
In the null case the NSC classifiers assigned most new samples to the majority classes and the proportion of samples classified in the minority class decreased when more variables were considered. The GM-NSC classifiers assigned approximately the same number of samples to each class. All the classifiers performed similarly on balanced data, classifying approximately the same number of samples to each class. In the alternative case the NSC classifiers obtained a very low PA for the minority class and PAM performed worse than ALP and AHP (Table 2). The GM-NSC classifiers performed better, substantially increasing the minority-class PA and g-means, in spite of using a larger number of non-informative variables. Note that class 2 was the hardest class to predict also with balanced data (see Additional file 9) and that the GM-NSC classifiers achieved approximately the same classification results on balanced and imbalanced data, showing themselves to be insensitive to class imbalance in this setting.
Application to real highdimensional data sets
We used four breast cancer microarray gene expression data sets to assess the performance of the GM-NSC and NSC classifiers. We predicted estrogen receptor (ER) positivity, histological grade (Grade 1 and 2 vs Grade 3, or as a three-class prediction problem), disease relapse and the prognosis of breast cancer patients (good or bad). Grade, prognosis and relapse are harder to predict than ER status using breast cancer gene expression data [21]. Table 3 summarizes the main characteristics of the data sets, originally published by Ivshina et al. [22], Wang et al. [23], Sotiriou et al. [21] and Korkola et al. [11], and the classification tasks addressed. We used 5-fold CV to estimate the optimal threshold parameter and, if not noted otherwise, the accuracy measures were estimated using leave-one-out CV (LOOCV).
All the classifiers performed well in predicting ER status on the Ivshina data set (Table 4). AHP used few active genes and had the best performance among the NSC classifiers, while PAM and ALP did not remove any of the genes, nor did they shrink any of the components of the centroids. The gap between the class-specific PA was small despite the large class imbalance, because the difference between the classes was large. Nevertheless, the GM classifiers used fewer active genes, improved the minority-class PA, g-means and AUC of PAM and ALP, and performed similarly to AHP. The absence of shrinkage for PAM and ALP in the prediction of ER status on Ivshina's data set can be attributed to the large class imbalance ($k_{ER-}=0.14$) and to the fact that, unless there was very little or no shrinkage, the minority (ER−) class had a better class-specific PA in this data set. The bias towards classification into the majority class, caused by the use of many non-informative variables, appeared only when most or all of the variables were included in the classifier (see Additional file 10). The majority-class PA, and consequently the overall PA, were maximized by the classifier with no shrinkage.
Relapse was more difficult to predict on Wang's data set (Table 4). PAM used the smallest number of active genes and performed slightly better than ALP and AHP in terms of the class-specific PA and g-means. Exactly the same results were obtained using the NSC and GM-NSC classifiers. This is in line with the simulations, where we showed that for moderate class imbalance the performance of the NSC and GM-NSC classifiers was very similar.
PAM used the largest number of variables on Korkola's data set (Table 4) but still performed better than AHP and ALP, which used fewer variables. The performance of GM-PAM and GM-AHP was similar to that of the original methods, while GM-ALP outperformed ALP and achieved the best overall performance on this data set.
In order to explore the effect of class imbalance on real data, we obtained multiple training sets from Sotiriou's data set, varying the level of class imbalance. We used a fixed number of samples from the minority class ($n_{ER-}=10$) and varied the number of samples from the majority class ($n_{ER+}=10,20,\ldots,50$); the samples not included in the training set were used to estimate the accuracy measures. To account for the variability arising from the random inclusion of samples in the training or test set, we repeated the procedure 250 times and averaged the results.
In the balanced situation the class-specific PA were approximately equal for all classifiers, indicating that the classes were roughly equally difficult to predict (Figure 3; see Additional file 11 for exact numerical results). For the NSC classifiers the PA of the minority class decreased when majority-class samples were added to the training set, while the majority-class PA increased, similarly as observed in our simulations. Moreover, the number of active genes increased substantially, especially for PAM and ALP. GM-PAM and GM-ALP were effective in maintaining the minority-class PA above the values achieved on class-balanced data, and the gap between the class-specific PA was very small even when the class imbalance was large. The number of active genes increased with the class imbalance for GM-PAM and GM-ALP, but not as dramatically as for the original methods. GM-AHP performed very similarly to AHP. Comparable results were obtained for the two-class prediction of grade (Grade 1 or 2 vs Grade 3; Additional file 11).
We also addressed a three-class classification problem on Ivshina's data, predicting the grade of the tumors (1, 2 or 3). On the complete data set all the classifiers performed very similarly (Additional file 11). Only GM-PAM removed some of the variables, while the other classifiers did not shrink the centroids at all. Grade 2 was the majority class but it had the worst PA, indicating that the potentially heterogeneous Grade 2 class is the most difficult to predict. Similarly as on Sotiriou's data, we varied the number of samples in the Grade 2 class to try to isolate the class-imbalance effect ($n_{\text{Grade 1}}=n_{\text{Grade 3}}=40$ and $n_{\text{Grade 2}}=10, 20, 40$ or $80$). When Grade 2 was the majority class the results were very similar to those obtained on the complete data set, and the GM-NSC and NSC classifiers performed similarly (Table 5, $n_2=80$). In the balanced setting Grade 2 had the lowest PA, confirming that it was the most difficult class to predict; GM improved the performance of PAM and ALP.
Decreasing the number of Grade 2 samples further decreased the PA of Grade 2 and the g-means; the PA of the other two classes, which were high when the data were balanced, increased only moderately for most classifiers. The drop in the PA of Grade 2 was less pronounced for the GM classifiers. Similarly to the other classification tasks, the GM method was most useful in improving the performance of PAM and ALP.
Discussion
In this paper we proposed a modified approach (GM-NSC) to the estimation of the amount of centroid shrinkage for the NSC classifiers. The approach estimates the optimal shrinkage by maximizing the geometric mean of the class-specific predictive accuracies (g-means), rather than the overall accuracy. We used our approach with PAM and with two recently proposed NSC classifiers, ALP and AHP.
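The intuition behind the criterion can be illustrated with a short sketch (Python; `gmeans` is a hypothetical helper written for this illustration, not the authors' implementation): for a degenerate classifier that assigns every sample to the majority class, the overall accuracy looks excellent, while the g-means collapses to zero because one class-specific accuracy is zero.

```python
import numpy as np

def gmeans(y_true, y_pred, classes):
    """Geometric mean of the class-specific predictive accuracies (PA_k)."""
    pa = [np.mean(y_pred[y_true == k] == k) for k in classes]
    return float(np.prod(pa) ** (1.0 / len(pa)))

# A trivial classifier that assigns every sample to the majority class:
y_true = np.array([1] * 90 + [2] * 10)   # 90:10 class-imbalance
y_pred = np.ones(100, dtype=int)         # always predicts class 1

overall_accuracy = float(np.mean(y_true == y_pred))  # 0.90, looks good
g = gmeans(y_true, y_pred, classes=[1, 2])           # 0.0: PA_2 = 0 exposes the bias
```

Minimizing the overall error rate cannot distinguish this classifier from a genuinely good one when the imbalance is large, while maximizing g-means rules it out immediately.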
The motivation for the new approach is to alleviate the class-imbalance problem of the NSC classifiers. We showed with a limited set of simulations that ALP and AHP, similarly to PAM [8], are biased towards classification into the majority class when data are class-imbalanced: they assign most new samples to the majority class and achieve poor predictive accuracy for the minority class, unless the differences between the classes are very large. Increasing the number of measured variables further increases the bias.
We identified the main reason for the biased NSC classification in the method used in practice for estimating the threshold parameter, which is based on the minimization of the cross-validated overall error rate. The threshold parameter plays a fundamental role for the NSC classifiers, as it determines how many variables are effectively used in the classification rule and by how much the centroids are shrunken.
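The role of the threshold can be made concrete with a minimal sketch of the soft-thresholding operation that PAM applies to the standardized differences between the class centroids and the overall centroid (a simplified illustration of the shrinkage step, not the full PAM algorithm):

```python
import numpy as np

def soft_threshold(d, delta):
    """Shrink the standardized centroid differences d towards zero by delta;
    components with |d| <= delta become exactly zero (inactive variables)."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

d = np.array([2.5, -0.3, 1.1, -4.0])      # one standardized difference per gene
shrunk = soft_threshold(d, delta=1.0)     # approximately [1.5, 0.0, 0.1, -3.0]
n_active = int(np.count_nonzero(shrunk))  # 3 genes remain active
```

A larger threshold shrinks more components exactly to zero, so the threshold simultaneously controls the amount of shrinkage and the number of active variables.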
Simulation results and the analysis of three large breast cancer data sets showed that the greatest gains were obtained by GM-NSC when the NSC classifiers had a large bias towards the majority class, while GM-NSC performed similarly to NSC in the absence of bias. The GM-NSC classifiers used fewer active variables when the data were class-imbalanced.
In biomedical applications the improvements obtained using the GM-NSC classifiers are relevant from a practical point of view. The reduction of the class-imbalance bias results in more accurate predictions for the minority class samples, which are often the samples for which an accurate prediction matters most. Moreover, the inclusion of a smaller number of variables in the classifiers is a desirable property in biomedical applications where the aim is to develop prognostic or predictive models. Many researchers argued that it is advantageous to use microarray-based classifiers that include a small number of genes (see for example [24] and the references therein), the reason being that classifiers that include numerous genes can be more difficult to transfer to clinical practice, as their interpretation and practical implementation are more difficult. At the same time, it was shown that classifiers that include few genes can perform well in practice [24-26].
The current implementation of the NSC classifiers does not allow for case weighting, so we performed random oversampling in an attempt to give equal weight to the classes. We observed that random oversampling had no effect on PAM, while it increased the class-imbalance bias of GM-PAM, substantially increasing the number of active variables. The reason for the poor performance of GM-PAM is that the g-means used to determine the optimal threshold is, because of oversampling, not a fully cross-validated estimate, as the same minority class samples are used both when training and when evaluating the classifier. The improperly cross-validated g-means is maximized by classifiers that use a large number of variables. Although we did not perform the oversampling experiments for ALP and AHP, we expect the same conclusion to apply, as the determination of the optimal threshold would likely suffer from the same problem. In general, special care is needed when the tuning parameters are determined after the training set is oversampled; for example, Random Forests and Support Vector Machines require the optimization of tuning parameters, which is normally done with cross-validation.
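The leakage mechanism can be demonstrated with a small sketch (Python; illustrative, not the authors' experimental code): when the minority class is oversampled before the cross-validation split, copies of the same minority sample routinely end up in both the training and the test folds.

```python
import numpy as np

rng = np.random.default_rng(0)
minority = np.arange(10)           # ids of the 10 minority-class samples
majority = np.arange(10, 100)      # ids of the 90 majority-class samples

# Random oversampling before CV: replicate minority samples to balance classes.
extra = rng.choice(minority, size=80, replace=True)
ids = np.concatenate([majority, minority, extra])
rng.shuffle(ids)

# Split the oversampled set into 5 CV folds and collect ids shared across folds.
folds = np.array_split(ids, 5)
shared = set()
for i in range(5):
    for j in range(i + 1, 5):
        shared |= set(np.intersect1d(folds[i], folds[j]))
# Every shared id is a sample used both to train and to test the classifier,
# so the CV estimate of g-means is optimistic rather than fully cross-validated.
```

Only minority ids can be shared, because only they were replicated; oversampling inside each training fold, rather than before splitting, would avoid the leak.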
We chose to select the optimal shrinkage by maximizing the g-means of the classifiers, which is more appropriate than the overall error for assessing the effectiveness of classifiers trained on imbalanced data [15]. Other assessment measures were proposed for class-imbalanced data: two popular alternatives in two-class problems are the F-measure and the area under the ROC curve (AUC); however, their generalization to multi-class problems is not as straightforward as for g-means. The F-measure is a function of the predictive accuracy and the predictive value of the positive class; the weight given to each measure depends on a parameter chosen by the user. Being a function of the predictive values, it is sensitive to the data distribution, which is not a desirable property when data are class-imbalanced. AUC depends on the class-specific predictive accuracies, similarly to g-means. A possible advantage of g-means over AUC is its behavior when evaluating uninformative classifiers ($P(\mathcal{C}(x^{\ast})=1 \mid y=1)=P(\mathcal{C}(x^{\ast})=1 \mid y=2)$ for two classes). In this case g-means favors the classifiers that assign the same number of samples to each class, while AUC is approximately the same for all uninformative classifiers; as a consequence the estimation of the threshold parameter using AUC is very unstable when the differences between the classes are small. We also considered the maximization of the sum of the class-specific predictive accuracies ($\sum_{k=1}^{K}\mathrm{PA}_{k}$); however, this measure has a drawback similar to AUC, as it cannot distinguish between the uninformative classifiers. Experimental results for PAM showed that under the null hypothesis (and when the difference between the classes was small) this approach performed slightly worse than using g-means to determine the optimal threshold, while the results were very similar when the difference between the classes was large (data not shown).
Tibshirani et al. [5] proposed a procedure for the adaptive choice of the threshold for PAM that enables a different amount of shrinkage for each class, and showed that this approach can lead to a smaller number of active variables. However, Wang and Zhu [12] observed that using the adaptive threshold procedure does not change the predictive accuracy of PAM. We obtained similar results and observed that, while in multi-class classification problems the number of active variables sometimes decreases, the predictive accuracy of PAM and GM-PAM is not affected (data not shown). Therefore, the adaptive choice of the threshold does not seem beneficial in decreasing the class-imbalance problem of the NSC classifiers.
In this paper we focused on PAM, ALP and AHP; others proposed further modifications of the NSC methods [17, 27], which were not evaluated in this study. However, we believe that all classifiers that base their tuning on the minimization of the overall error should present the same type of problems, and would benefit from a tuning strategy based on g-means or other cost functions that are less sensitive to the class-imbalance problem.
Huang et al. [14] observed that the estimators of the discriminant scores for discriminant analysis are biased, and derived a bias-corrected discriminant score for DLDA and DQDA. Their findings are interesting in the context of class-imbalanced high-dimensional classification, as they show that the bias of the discriminant scores depends on the class-imbalance in the training set and on the number of variables; the bias is larger in the minority class and when more variables are considered. The bias-correction outperforms the original approach, especially when the class-imbalance is large; unfortunately, this approach cannot be extended straightforwardly to the NSC classifiers, as the distribution of the estimator of the shrunken class centroid is not known.
A computational issue is the estimation of the optimal threshold. We used an approach similar to the one used for PAM [5], evaluating a fixed number of threshold values with cross-validation. In most situations we evaluated 30 threshold values, equally spaced between 0 (all variables active) and the minimum threshold value that shrinks all the class centroids to the overall centroid (no active variables). This choice was a compromise between the accuracy of the estimation and the computational burden, which was particularly high for AHP. However, we observed that this strategy might not be optimal in all situations since, especially for AHP, the relationship between the threshold and the number of active variables was highly non-linear. Often even the smallest positive threshold produced few active variables: the estimated number of active variables could be either very small or equal to the total number of variables, while intermediate solutions were not evaluated. This could explain why in some applications GM was not successful in improving the performance of AHP. Our observations suggest that equally spaced threshold values can be an effective choice when the number of variables distinguishing the classes is relatively small, as is often the case for microarray data. In problems where many variables distinguish the classes, a solution would be to equally space the threshold values on the logarithmic scale, which would include more thresholds associated with a large number of active genes.
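The two grid choices can be sketched as follows (an illustration assuming the threshold ranges over [0, λ_max]; the mapping used for the log grid is one possible construction, not the one used in the paper):

```python
import numpy as np

lambda_max = 4.0   # smallest threshold shrinking all centroids to the overall one
n_grid = 30

# Equally spaced thresholds between 0 (all variables active)
# and lambda_max (no active variables):
linear_grid = np.linspace(0.0, lambda_max, n_grid)

# Log-scale alternative: map a log-spaced sequence on [1, 10] back onto
# [0, lambda_max], concentrating candidate thresholds near zero, i.e. on
# classifiers with a large number of active genes:
log_grid = lambda_max * (np.logspace(0.0, 1.0, n_grid) - 1.0) / 9.0
```

Both grids share the endpoints 0 and λ_max, but the log grid places many more candidate values in the low-threshold region.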
We presented the GM-NSC approach limited to the case where each class is given equal importance, but the method can be extended to incorporate different misclassification costs for each class by weighting the class-specific predictive accuracies. This approach would be useful for problems where the cost of misclassification is not equal for each class.
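As a sketch, the weighted extension could take the following form (`weighted_gmeans` is a hypothetical helper written for this illustration; the weights express the relative importance of the classes and are an assumption of the example):

```python
import numpy as np

def weighted_gmeans(pa, w):
    """Weighted geometric mean of the class-specific predictive accuracies pa,
    with class-importance weights w (normalized to sum to one)."""
    pa = np.asarray(pa, dtype=float)
    w = np.asarray(w, dtype=float) / np.sum(w)
    return float(np.prod(pa ** w))

pa = [0.9, 0.6]                       # PA_1 and PA_2
equal = weighted_gmeans(pa, [1, 1])   # ordinary g-means: sqrt(0.9 * 0.6)
skewed = weighted_gmeans(pa, [3, 1])  # errors in class 1 weighted more heavily
```

With equal weights the ordinary g-means is recovered, so the extension contains the method presented here as a special case.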
Conclusion
We showed that three nearest shrunken centroid classifiers (PAM, ALP and AHP) achieve poor accuracy for the minority class when data are class-imbalanced and high-dimensional, unless the difference between the classes is large. We proposed GM-NSC, a straightforward yet effective approach to diminish the class-imbalance problem of the NSC classifiers, which consists in estimating the optimal amount of shrinkage by maximizing the g-means of the classifiers, rather than the overall accuracy. We used simulated and real data to show that when the NSC classifiers are biased towards the majority class the GM-NSC approach outperforms NSC, while it performs similarly to NSC otherwise. GM-NSC classifiers generally select fewer variables, which seems a desirable property in biomedical applications where the aim is to develop prognostic or predictive models.
Our experiments with random oversampling showed no improvement for PAM, while the class-imbalance bias of GM-PAM increased. We therefore recommend against using this strategy with the NSC or GM-NSC classifiers.
Abbreviations
 NSC: nearest shrunken centroid
 PAM: prediction analysis of microarrays
 DLDA: diagonal linear discriminant analysis
 ALP-NSC: adaptive L∞ norm penalized NSC
 AHP-NSC: adaptive hierarchically penalized NSC
 PA: predictive accuracy
 AUC: area under the ROC curve
References
 1. Bishop CM: Pattern Recognition and Machine Learning (Information Science and Statistics). 1st edition. New York: Springer; 2007.
 2. Weigelt B, Pusztai L, Ashworth A, Reis-Filho JS: Challenges translating breast cancer gene signatures into the clinic. Nat Rev Clin Oncol 2012, 9: 58-64.
 3. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002, 99(10): 6567-6572. 10.1073/pnas.082099299
 4. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002, 97(457): 77-87. 10.1198/016214502753479248
 5. Tibshirani R, Hastie T, Narasimhan B, Chu G: Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat Sci 2003, 18: 104-117. 10.1214/ss/1056397488
 6. Wu B: Differential gene expression detection using penalized linear regression models: the improved SAM statistics. Bioinformatics 2006, 21(8): 1565-1571.
 7. He H, Garcia EA: Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009, 21(9): 1263-1284.
 8. Blagus R, Lusa L: Class prediction for high-dimensional class-imbalanced data. BMC Bioinformatics 2010, 11: 523. 10.1186/1471-2105-11-523
 9. Blagus R, Lusa L: Impact of class-imbalance on multi-class high-dimensional class prediction. Metodološki zvezki 2012, 9: 25-45.
 10. Reeve J, Einecke G, Mengel M, Sis B, Kayser N, Kaplan B, Halloran PF: Diagnosing rejection in renal transplants: a comparison of molecular and histopathology-based approaches. Am J Transplant 2009, 9(8): 1802-1810. 10.1111/j.1600-6143.2009.02694.x
 11. Korkola J, Blaveri E, DeVries S, Moore D, Hwang ES, Chen YY, Estep A, Chew K, Jensen R, Waldman F: Identification of a robust gene signature that predicts breast cancer outcome in independent data sets. BMC Cancer 2007, 7: 61. 10.1186/1471-2407-7-61
 12. Wang S, Zhu J: Improved centroids estimation for the nearest shrunken centroid classifier. Bioinformatics 2007, 23(8): 972-979. 10.1093/bioinformatics/btm046
 13. Tibshirani R: Regression shrinkage and selection via the Lasso. J R Stat Soc (Ser B) 1996, 58: 267-288.
 14. Huang S, Tong T, Zhao H: Bias-corrected diagonal discriminant rules for high-dimensional classification. Biometrics 2010, 66(4): 1096-1106. 10.1111/j.1541-0420.2010.01395.x
 15. Pepe MS: The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press; 2003.
 16. Lin WJ, Chen JJ: Class-imbalanced classifiers for high-dimensional data. Brief Bioinform 2012, 14(1): 13-26.
 17. Guo Y, Hastie T, Tibshirani R: Regularized linear discriminant analysis and its application in microarrays. Biostatistics 2007, 8: 86-100. 10.1093/biostatistics/kxj035
 18. Pang H, Tong T, Zhao H: Shrinkage-based diagonal discriminant analysis and its applications in high-dimensional data. Biometrics 2009, 65(4): 1021-1029. 10.1111/j.1541-0420.2009.01200.x
 19. Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003, 95: 14-18. 10.1093/jnci/95.1.14
 20. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2003.
 21. Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, Fox SB, Harris AL, Liu ET: Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci USA 2003, 100(18): 10393-10398. 10.1073/pnas.1732912100
 22. Ivshina AV, George J, Senko O, Mow B, Putti TC, Smeds J, Lindahl T, Pawitan Y, Hall P, Nordgren H, et al.: Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Res 2006, 66(21): 10292-10301. 10.1158/0008-5472.CAN-05-4414
 23. Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EMJJ, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460): 671-679.
 24. Wang X, Simon R: Microarray-based cancer prediction using single genes. BMC Bioinformatics 2011, 12: 391. 10.1186/1471-2105-12-391
 25. Wang X, Gotoh O: Accurate molecular classification of cancer using simple rules. BMC Med Genomics 2009, 2: 64. 10.1186/1755-8794-2-64
 26. Haibe-Kains B, Desmedt C, Loi S, Culhane AC, Bontempi G, Quackenbush J, Sotiriou C: A three-gene model to robustly identify breast cancer molecular subtypes. J Natl Cancer Inst 2012, 104: 311-325. 10.1093/jnci/djr545
 27. Dabney AR: Classification of microarrays to nearest centroids. Bioinformatics 2005, 21(22): 4148-4154. 10.1093/bioinformatics/bti681
Acknowledgements
The high-performance computation facilities were kindly provided by the Bioinformatics and Genomics unit at the Department of Molecular Biotechnology and Health Sciences, University of Torino, Italy.
Additional information
Competing interests
Both authors declare that they have no competing interests.
Authors’ contributions
RB derived the GM approach, performed the computations and wrote the manuscript; LL designed research and wrote the manuscript. Both authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: Derivation of the expressions for λ_{max}. In the additional file we derive the expressions for λ_{max} for the classifiers considered in the paper. See text for more details. (PDF 131 KB)
Additional file 2: Classification error and the probability of classification in class 1. In the additional file we derive the classification error as a function of the probability of classification in class k and the class-prior probabilities. We also derive the probability of classification in class 1 for the example presented in the main text, additionally assuming that the pooled variances are known. See text for more details. (PDF 145 KB)
Additional file 3: Classification results under the null hypothesis for a large number of variables (p=10000) and the correlated scenario (ρ=0.8). In the additional file we show the predictive accuracy for class 1 (PA_{1}) and for class 2 (PA_{2}) for different levels of class-imbalance (k_{1}) in the training set containing 100 samples. There was no difference between the classes (μ_{2}=0). (PDF 14 KB)
Additional file 4: Classification results for a large number of variables. In the additional file we show the optimal threshold parameter (λ^{∗}), the number of active irrelevant variables (# non-info), the PA for class 1 (PA_{1}) and for class 2 (PA_{2}), the g-means and the AUC for different levels of class-imbalance (k_{1}) in the training set containing 100 samples, for a situation where there was no difference between the classes (μ_{2}=0; Table 1), where the difference between the classes was small (μ_{2}=0.5; 100 variables were differentially expressed; Table 2) and where the difference between the classes was moderate (μ_{2}=1; 100 variables were differentially expressed; Table 3). (PDF 121 KB)
Additional file 5: Classification results under the null hypothesis for a small number of variables. The additional file shows the PA for class 1 (PA_{1}) and for class 2 (PA_{2}) for different levels of class-imbalance (k_{1}) in the training set containing 100 samples. There was no difference between the classes (μ_{2}=0). (PDF 15 KB)
Additional file 6: Classification results under the alternative hypothesis for a small number of variables. The additional file shows the PA for class 1 (PA_{1}) and for class 2 (PA_{2}) for different levels of class-imbalance (k_{1}) in the training set containing 100 samples. The difference between the classes was moderate (μ_{2}=1) and 20 variables were differentially expressed (100 for AHP). (PDF 15 KB)
Additional file 7: Classification results for a small number of variables. In the additional file we show the optimal threshold parameter (λ^{∗}), the number of active irrelevant variables (# non-info), the PA for class 1 (PA_{1}) and for class 2 (PA_{2}), the g-means and the AUC for different levels of class-imbalance (k_{1}) in the training set containing 100 samples. There was no difference between the classes (μ_{2}=0; Table 1) or the differences between the classes were moderate (μ_{2}=1; 20 variables were differentially expressed (100 for AHP); Table 2). (PDF 117 KB)
Additional file 8: Classification results for the oversampled training set for PAM and GM-PAM. In the additional file we show the simulation results obtained by training PAM and GM-PAM on oversampled training sets, obtained by replicating randomly selected samples from the minority class in order to obtain a class-balanced training set. The simulation settings were the same as in Additional file 4. See text for more details. (PDF 185 KB)
Additional file 9: Classification results for the three-class scenario. In the additional file we show the same information as in Additional file 4 for the three-class scenario. In Tables 1 and 2 we report the results for the class-balanced scenario (n_{1}=n_{2}=n_{3}=100) and a large number of variables (p=5000) under the null and the alternative hypothesis, respectively. Table 3 reports the results for the class-imbalanced scenario (n_{1}=n_{3}=100, n_{2}=20) under the null hypothesis and Table 4 under the alternative hypothesis. Results for a smaller number of variables are in Table 5. (PDF 121 KB)
Additional file 10: Error rate, class-specific predictive accuracies and g-means as a function of the threshold parameter for the Ivshina data set and prediction of ER. In the additional file we report the error rate (left panel), the accuracy for the ER− class (PA_{ER−}), the accuracy for the ER+ class (PA_{ER+}) and the g-means (right panel) for different values of the threshold parameter (λ) obtained on the Ivshina data set. See text for more details. (PDF 13 KB)
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Blagus, R., Lusa, L. Improved shrunken centroid classifiers for high-dimensional class-imbalanced data. BMC Bioinformatics 14, 64 (2013). https://doi.org/10.1186/1471-2105-14-64
DOI: https://doi.org/10.1186/1471-2105-14-64
Keywords
 Predictive Accuracy
 Classification Rule
 Minority Class
 Class Imbalance
 Discriminant Score