Class prediction for high-dimensional class-imbalanced data
 Rok Blagus^{1} and
 Lara Lusa^{1}
DOI: 10.1186/1471-2105-11-523
© Blagus and Lusa; licensee BioMed Central Ltd. 2010
Received: 6 May 2010
Accepted: 20 October 2010
Published: 20 October 2010
Abstract
Background
The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate whether high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance.
Results
Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, downsizing and asymmetric bagging embedding variable selection work well, while oversampling does not. Variable normalization can further worsen the performance of the classifiers.
Conclusions
Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when dealing with high-dimensional data. Researchers using class-imbalanced data should be careful in assessing the predictive accuracy of the classifiers and, unless the class imbalance is mild, they should always use an appropriate method for dealing with the class imbalance problem.
Background
High-throughput technologies simultaneously measure tens of thousands of variables for each of the observations included in the study; data produced by these technologies are often called high-dimensional, because the number of variables greatly exceeds the number of observations. Microarrays are high-dimensional tools commonly used in the biomedical field; they measure the expression of genes [1] or miRNAs [2], the presence of DNA copy number alterations [3] or of variation at a single site in DNA [4], across the entire genome of a subject.
Microarrays are frequently used for class prediction (classification). In these studies the goal is to develop a rule based on the measurements (variables) obtained from the microarrays from samples (observations) that belong to distinct and well-defined groups (classes); these rules can be used to predict the class membership of new samples for which the values of the variables are known but the class membership is unknown. For example, many studies tried to predict the clinical outcome of breast cancer using gene expression [5]; in this case the classes are the clinical outcome of breast cancer while the variables are the expression of the genes. Some of the classification methods most frequently used for microarray data are discriminant analysis methods, nearest neighbor (k-NN, [6]) and nearest centroid classifiers [7], classification trees [8], random forests (RF, [9]) and support vector machines (SVM, [10]) (see [11] or [12] for an introduction to these methods).
An important aspect that specifically characterizes classification for high-dimensional data is the need to perform some type of variable selection. Variable selection consists in the identification of a subset of variables that are used to define the classification rule, and it can be performed either before developing the classifier or it can be embedded in the classification method [13]. The importance of variable selection for high-dimensional data rests on two facts: some classification rules cannot be derived if the number of variables is larger than the number of observations, and removing the variables that have little variability across observations improves the predictive accuracy [14].
In this paper we focus on classification problems for class-imbalanced data, i.e., on data sets where the number of observations belonging to each class is not the same. Class-imbalanced data are common in the biomedical field and they also arise when data are high-dimensional. For example, using gene-expression microarray data, Ramaswamy et al. [15] classified primary versus metastatic adenocarcinomas: metastatic specimens comprised about 16% of the training set (64 versus 12 samples); Shipp et al. [16] developed a classifier to distinguish diffuse large B-cell lymphoma from follicular lymphoma using a data set with a 25% class imbalance (58 versus 19 samples); Iizuka et al. [17] predicted early intrahepatic recurrence or non-recurrence for patients with hepatocellular carcinoma, with a training set with a 36% class imbalance (12 versus 21 samples). The classification methods used by these studies were some variants of the diagonal linear discriminant analysis (DLDA); the third study also used support vector machines.
Standard classification methods applied to class-imbalanced data often produce classification rules that do not accurately predict the minority class [18]; for this reason the between-class imbalance problem has been receiving increasing attention in recent years and many different strategies have been proposed for deriving classification rules for imbalanced data (see [19] for a review). However, their use is not widespread in practice and very often standard classification methods are used when the classes are imbalanced [20]. For example, Ramaswamy et al. [15] and Shipp et al. [16] did not modify the classification rules to take class imbalance into account, while Iizuka et al. [17] tried to adjust for it by making training and test set equally imbalanced.
The aim of our study was to investigate how class imbalance affects classification for high-dimensional data, and to evaluate whether high-dimensionality poses additional challenges when dealing with class-imbalanced data. We devoted special attention to isolating the possible effect of variable selection and to investigating the effectiveness of some strategies that were proposed to deal with class imbalance. To our knowledge the joint effect of high-dimensionality and class imbalance on classification has not been thoroughly investigated.
The few works that dealt with the class imbalance problem for high-dimensional data mostly focused on developing methods for variable selection [21], on the comparison of the performance of classifiers using different variable selection methods and/or classifiers [21–23], or on proposing and evaluating different strategies for adjusting classifiers trained on class-imbalanced data [24–27].
To investigate the effect of class imbalance on high-dimensional data, we evaluated the performance of six types of classifiers on imbalanced data. The classification methods were chosen among those most commonly used for high-dimensional data and for the sake of simplicity we considered only classification problems where the number of classes was two (two-class classification problems). The classifiers were evaluated both on simulated data and on a publicly available data set from a breast cancer gene-expression microarray study [28]; we assessed both the overall and the class-specific predictive accuracy of the classifiers. We simulated situations where there was no difference between classes (null case) and where the two classes were different (alternative case), varying the number of different variables and the magnitude of their difference. We used oversampling, downsizing and a variant of asymmetric bagging to correct the class imbalance problem.
In Results we present a series of selected simulation studies showing the consequences of using class-imbalanced high-dimensional data sets for classification, we show the performance of the corrections for class imbalance, and the results obtained on the breast cancer data. In Discussion we outline the problems related to classification for high-dimensional data. In the Methods section we briefly describe the classification methods that we used and the strategies to deal with the class imbalance problem; we also describe the simulations that were performed and the breast cancer gene-expression microarray data.
Results
The classifiers were developed on the training sets, while the predictive accuracy (PA, overall and class-specific: PA_{1} for Class 1 and PA_{2} for Class 2), predictive values (PV_{1} and PV_{2}) and area under the ROC curve (AUC) were evaluated on the test sets. If not otherwise stated, the samples were normalized (mean-centered), while the variables were not (see Methods), and the test sets were balanced (${k}_{1}^{test}$ = 0.5). The classifications with RF and penalized logistic regression (PLR) were based on the 0.5 threshold, unless otherwise specified (see Methods). Each simulation was repeated 1000 times. Most of the figures show the results only for DLDA, PLR and one of the nearest neighbor classifiers; the results for the other classifiers are shown in the Additional Files.
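The analyses in the paper were carried out in R; purely as an illustration, the overall and class-specific performance measures can be computed from true and predicted labels as in the following Python sketch (the function name is ours):

```python
import numpy as np

def class_specific_accuracy(y_true, y_pred):
    """Overall and class-specific predictive accuracies and predictive values
    for a two-class problem with labels 1 and 2."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pa = np.mean(y_true == y_pred)           # overall PA
    pa1 = np.mean(y_pred[y_true == 1] == 1)  # PA_1: accuracy on true Class 1
    pa2 = np.mean(y_pred[y_true == 2] == 2)  # PA_2: accuracy on true Class 2
    pv1 = np.mean(y_true[y_pred == 1] == 1)  # PV_1: fraction of Class 1 calls that are correct
    pv2 = np.mean(y_true[y_pred == 2] == 2)  # PV_2: fraction of Class 2 calls that are correct
    return pa, pa1, pa2, pv1, pv2
```

Note that the overall PA mixes the two class-specific accuracies with weights given by the test-set class proportions, which is why it can look acceptable even when the minority-class accuracy is poor.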
Simulations results: Null case
Under the null case there was no difference between the two classes, as all the variables were simulated from the same distribution (see Methods for details on data simulation). In the first set of simulations only p = 40 variables were generated, and all were used to derive the classification rule (G = p). The imbalance was the same in the training and in the test set (${k}_{1}^{train}={k}_{1}^{test}$).
The overall PA reached its minimum value when the data were balanced (PA = 0.5), and increased when the class imbalance of the test set became larger (Figure 1 and Eq. 5). The average class-specific PA depended on the class imbalance of the training set but not on the class imbalance of the test set (Additional File 1); moreover, the overall PA was equal to 0.5 for all the classifiers when the test set was balanced, regardless of the imbalance of the training set (Additional File 1 and Eq. 5).
The effect of variable selection can be explained by recognizing that the sampling variability is larger in the minority class. Sample mean values far from the true population values arise more frequently in the minority class, and the variables that show large differences between the classes are more likely to be selected. The new samples from the test set are therefore more similar to the samples of the majority class, and as a consequence they have a larger probability of being classified in that class. We observed this behavior not only for the t-test with equal variances but also for other commonly used parametric and nonparametric variable selection methods (Additional File 2).
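This selection bias is easy to reproduce in a small simulation. The following Python sketch (our own illustration, not the original R code; sample sizes and the number of variables are arbitrary choices) generates null data, ranks variables by the absolute two-sample t statistic, and compares how far the class-specific sample means of the selected variables lie from the true mean of 0:

```python
import numpy as np

rng = np.random.default_rng(1)

def tstat(X1, X2):
    # per-variable two-sample t statistic with pooled (equal) variances
    n1, n2 = len(X1), len(X2)
    sp2 = ((n1 - 1) * X1.var(axis=0, ddof=1) + (n2 - 1) * X2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return (X1.mean(axis=0) - X2.mean(axis=0)) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

def minority_vs_majority_deviation(n1=5, n2=35, p=1000, G=40, reps=50):
    # under the null all variables are N(0, 1) in both classes; for the
    # selected variables, measure the mean absolute deviation of the
    # class-specific sample means from the true mean 0
    dev_minority, dev_majority = [], []
    for _ in range(reps):
        X1 = rng.standard_normal((n1, p))  # minority class
        X2 = rng.standard_normal((n2, p))  # majority class
        top = np.argsort(-np.abs(tstat(X1, X2)))[:G]  # top-G variables by |t|
        dev_minority.append(np.mean(np.abs(X1[:, top].mean(axis=0))))
        dev_majority.append(np.mean(np.abs(X2[:, top].mean(axis=0))))
    return np.mean(dev_minority), np.mean(dev_majority)
```

Run with the defaults, the deviation is consistently larger in the minority class, matching the mechanism described above: the selected "differences" are driven mostly by minority-class sampling noise.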
Among the classifiers that we considered, RF, SVM and PAM (Prediction Analysis of Microarrays) were the most sensitive to class imbalance when we did not perform variable selection (showing the largest difference between class-specific PA, Figure 1), while apparently variable selection had little or no effect on their class-specific PA (Figure 2). The reason is that these classifiers automatically perform some type of variable selection, therefore for these classifiers the results of Figure 1 embed variable selection. When the classification rules of RF and PLR were adjusted to take the class imbalance into account (RF-THR and PLR-THR, see Methods), the dependency of the class-specific PA on class imbalance diminished but did not disappear (data not shown). Variable normalization (see Methods) did not change the null case results: regardless of the class imbalance, its impact on the data was very limited since the true means of all the variables were equal (data not shown).
Simulation results: Alternative case
For the alternative case we considered situations in which some of the variables had different means in the two classes, varying the number of different variables (p_{ DE } ) and the mean difference (μ^{(2)}) (see Methods for details).
The noise introduced in the classifier by selecting null variables was only partially responsible for the decrease in the PA of the minority class (observed when the number of variables was increased). In an attenuated form this effect was still present even when all the variables were different between the two classes (p_{ DE } = p, Figure 3, right panels). Similarly to the null case, we were more likely to select variables for which the discrepancy between the true and the sample values was larger in the minority class; as a consequence we were less likely to classify new samples in this class. This behavior was not a peculiarity of the t-test with equal variances, but was observed also for the other variable selection methods that we considered (Additional File 2).
The classifier that showed the smallest decrease in the PA for the minority class was DLDA, which was practically insensitive to class imbalance when the number of variables was small (p = 40 and p_{ DE } = 20, 40, Figure 3); PAM, SVM and RF were the most sensitive to class imbalance also under the alternative case (Additional File 4 and 5).
Similarly to previous findings [14], we also observed that variable selection improved the performance of the classifiers under the alternative case: the class-specific PA were consistently better when variable selection was performed, even in situations where there was a large class imbalance (see Additional File 6 for results where all the variables were included in the classifiers).
Solutions
All the solutions were evaluated using the same simulation settings described for Figure 3, left panels, with p = 1000.
Oversampling
The performance of 1-NN, DLDA and PLR improved when oversampling was used together with variable normalization, but only when the test set was balanced (${k}_{1}^{test}$ = 0.5), therefore this result seems of limited practical utility (in Additional File 9 we give a possible explanation of this phenomenon for DLDA). Oversampling with variable normalization partly removed the dependence of class-specific PA on class imbalance for RF and PAM when there was the same imbalance in training and test set (${k}_{1}^{test}={k}_{1}^{train}$), however only if the class imbalance was moderate (0.30 ≤ ${k}_{1}^{train}$ ≤ 0.70, Additional File 9).
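Oversampling replicates minority-class samples, drawn with replacement, until both classes reach the majority-class size. A minimal Python sketch of this balancing step (the function name is ours; the actual analyses used R):

```python
import numpy as np

def oversample(X, y, rng):
    """Balance a two-class training set by sampling each class with replacement
    until every class has as many samples as the majority class."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.where(y == c)[0]
        # keep all original samples, then add random replicates as needed
        extra = rng.choice(c_idx, size=n_max - len(c_idx), replace=True)
        idx.extend(c_idx)
        idx.extend(extra)
    idx = np.array(idx)
    return X[idx], y[idx]
```

Because the replicates are exact copies, classifiers whose decision rules are insensitive to duplicated points (e.g. 1-NN, or a t-test-based variable ranking) are barely changed by this procedure, which is consistent with the limited benefit reported above.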
Downsizing
In a second set of simulations we obtained a balanced training set by removing a subset of samples from the majority class (downsizing, see Methods). The PA of the minority class was greatly improved by downsizing and the class-specific PA became the same for both classes, regardless of the class imbalance in the original training set (Figure 5 and Additional File 8, third column). For example, using only 4 samples per class (${k}_{1}^{train}$ = 0.1), all the classifiers achieved a PA of about 0.70 for both classes, while the full-data analysis assigned all the samples from the test set to the majority class (PA_{1} = 0 and PA_{2} = 1 using n_{1} = 4 and n_{2} = 76). The PV of the majority class further increased, while the PV of the minority class decreased substantially as the PA of the majority class moved away from 1 (see Eq. 6). The classifiers that were the most sensitive to class imbalance were those that benefitted the most from downsizing. For example, PA_{1} increased from 0.5 (full data) to 0.8 when we downsized the training set using PAM with ${k}_{1}^{train}$ = 0.2 (Additional File 8).
Importantly, the variability of the estimated PA obtained by downsizing increased with class imbalance; the 95% prediction intervals obtained for ${k}_{1}^{train}$ = 0.5 were between 0.8 and 1.0 while they were between 0.50 and 0.90 when ${k}_{1}^{train}$ = 0.10 (Additional File 10).
The PA (overall and class-specific) obtained by downsizing decreased as the class imbalance increased (as ${k}_{1}^{train}$ moved away from 0.50); this effect was not due to class imbalance but to the decrease in the sample size of the training set.
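The downsizing step itself is simple to sketch (Python illustration with a hypothetical function name; the paper's analyses used R): majority-class samples are discarded at random until both classes have the minority-class size.

```python
import numpy as np

def downsize(X, y, rng):
    """Balance a two-class training set by randomly discarding majority-class
    samples until every class has the minority-class size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]
```

The price of the balance is the reduced effective sample size, which explains both the larger variability of the estimated PA and the drop in PA when the imbalance (and hence the number of discarded samples) is large.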
Multiple downsizing (Asymmetric bagging with variable selection)
Neglecting information from the majority class, as in simple downsizing, is intuitively unappealing; we therefore considered multiple downsizing (MultDS): for each training set we repeatedly downsized the training set, randomly selecting the samples from the majority class and including all the samples from the minority class, developed a classifier on each downsized training set, and assigned new samples to the class to which they were most frequently classified (see Methods for details). The performance of MultDS (Figure 5, fourth column) was similar to, but consistently better than, downsizing in terms of average PA. The decrease of PA for imbalanced data due to smaller sample size was still present but less pronounced. In our simulation settings the PA of all classifiers did not vary over a wide range of imbalance levels (${k}_{1}^{train}$ = 0.20 to 0.80). MultDS had a smaller variability of PA compared to simple downsizing (Additional File 10); for example, when ${k}_{1}^{train}$ = 0.1 the average PA was around 0.85 for most classifiers and the 95% prediction intervals were between 0.7 and 1.0.
To evaluate whether the results obtained by MultDS were influenced by the number of samples left out to obtain a balanced training set, we ran an additional set of simulations with a smaller sample size (n_{ train } = 50, n_{ test } = 20): at the same level of class imbalance MultDS worked better when the number of samples was larger; the observed differences increased with class imbalance and the performance of MultDS became similar to simple downsizing when the sample size was small (data not shown).
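The MultDS procedure can be sketched as follows (Python illustration; `fit_predict` stands for any classifier-training-and-prediction routine, including the embedded variable selection, and is an assumption of ours, as are the function and variable names):

```python
import numpy as np

def multiple_downsizing_predict(X, y, X_new, fit_predict, n_models, rng):
    """Asymmetric bagging with majority voting: repeatedly draw a balanced,
    downsized training set (all minority samples plus a random subset of the
    majority class), fit a classifier on each, and take a majority vote."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    min_idx = np.where(y == minority)[0]
    maj_idx = np.where(y == majority)[0]
    votes = np.zeros((n_models, len(X_new)))
    for m in range(n_models):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub])
        votes[m] = fit_predict(X[idx], y[idx], X_new)  # labels predicted by model m
    # assign each new sample to the class predicted most frequently
    # (use an odd n_models to avoid ties)
    return np.where((votes == minority).mean(axis=0) > 0.5, minority, majority)
```

Compared with a single downsized training set, the vote over many random majority-class subsets reuses the discarded information and averages away part of the variability of the estimated PA.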
Use of different threshold for RF and PLR
We considered two variants of RF and PLR, where the threshold value used for classification was based on the class imbalance of the training set rather than on the fixed value of 0.5 (RF-THR and PLR-THR, see Methods). Using these variants the dependency of the class-specific PA on the class imbalance was less pronounced but still present (Additional File 11).
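The threshold adjustment itself is a one-line rule. In this Python sketch (our own illustration; the function name is an assumption), the standard classifier uses a fixed threshold of 0.5 on the predicted Class 1 probability (or vote fraction), while the THR variants replace it with the training-set proportion ${k}_{1}^{train}$:

```python
import numpy as np

def threshold_predict(prob_class1, threshold):
    """Assign a sample to Class 1 when its predicted Class 1 probability (or
    the fraction of trees voting for Class 1) exceeds the threshold; the THR
    variants set threshold = k1_train instead of the fixed 0.5."""
    return np.where(np.asarray(prob_class1) > threshold, 1, 2)
```

With a minority Class 1 (e.g. ${k}_{1}^{train}$ = 0.2), lowering the threshold from 0.5 to 0.2 lets borderline samples be called Class 1, partially counteracting the bias towards the majority class.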
An application to breast cancer microarray data
We used the published gene-expression microarray data set of Sotiriou et al. [28] to evaluate the effect of class imbalance on classification for real high-dimensional data (see Methods for details on the data). We considered two classification problems: the prediction of estrogen receptor (ER) status (ER+ vs ER−) and of the grade of breast cancer (grade 1 or 2 vs grade 3).
We obtained different levels of class imbalance by repeatedly randomly selecting subsets of the samples from the complete data set: 500 different training and test sets were obtained for each situation.
Using the smaller but balanced training set (downsizing) the class-specific PA were approximately the same (about 0.75 and 0.85, using 5 or 10 samples per class, respectively). For most classifiers the AUC and PV_{ER+} were smaller than those obtained using the larger imbalanced data, while the PV_{ER−} were larger. Similarly to the simulation studies, DLDA was the least sensitive to class imbalance. Oversampling did not remove the dependency of class-specific PA on class imbalance for most of the classifiers, with differences between class-specific PA as large as 0.22 (20 ER+ vs 10 ER−, DQDA). Oversampling worked reasonably well only for 3-NN and 5-NN when the imbalance was not too large (data not shown); these results were in line with the simulation study results.
We used MultDS for several different levels of class imbalance in the training set (Figure 6 and Additional File 14). PAM seemed to benefit the most from MultDS; however, the gains in PA achieved by using MultDS rather than simple downsizing were not as considerable as those observed in the simulation studies for the same level of class imbalance. This could be the consequence of using a smaller sample size in the breast cancer application.
The major advantage of MultDS over simple downsizing was the reduction of the variability of the estimated predictive accuracy; for example, when the minority class included only 5 samples (upper panels of Figure 6) the prediction intervals obtained by simple downsizing included the value of 0.50 for all the classifiers, while the lower limits of the prediction intervals obtained by MultDS were above 0.60 for most classifiers even for the largest degree of class imbalance. Compared to simple downsizing, the PV_{ER+} slightly increased, while the PV_{ER−} decreased, ranging between 0.50 and 0.60; the AUC increased (Additional File 14).
Grade of breast cancer was more difficult to predict than ER status (overall PA about 0.60 using the small balanced training set and about 0.70 when using the larger training set, data not shown). The smaller between-class differences in gene expression translated into a larger sensitivity of the classifiers to class imbalance (the PA for the minority class was between 0.10 and 0.20 when we used the most imbalanced data set), therefore the effect of class imbalance was stronger. Overall, the results obtained for grade prediction were in line with those obtained from the simulations.
Variable normalization further increased the effect of class imbalance, causing even more cases to be classified into the training set majority class (data not shown). Using normalized variables downsizing worked well, and so did oversampling for k-NN, DLDA and PLR, but only if the test set was balanced; therefore, the practical importance of this result seems very limited.
Discussion
Our results showed that some of the classifiers most frequently used for class prediction with high-dimensional data are highly sensitive to class imbalance. All the classifiers that we considered assign most of the new samples to the majority class from the training set, unless the difference between classes is large. This problem arises for two reasons: the probability of assigning a new sample to a given class depends on the prevalence of that class in the training set, and with variable selection this probability is further biased towards the majority class. As a consequence, when classifiers are trained on class-imbalanced data there are usually large differences between class-specific predictive accuracies; moreover, the overall predictive accuracy is not informative, especially when both the training and the test set are imbalanced ([29], chapter 2). In most circumstances, the unequal predictive accuracies produced by class-imbalanced classifiers slightly decrease the difference in the class-specific predictive values that is present when the predictive accuracies are equal and the classes are imbalanced; generally, the predictive values of the minority class increase, while those of the majority class are slightly reduced, remaining large even when the classifier is uninformative. Similarly to our previous findings for another classifier [30], we observed that in this setting all these properties are maintained even if the prevalence in the training and test set is matched. Normalizing (centering) the variables generally biases the classification results further; only if the training and test set have the same imbalance does it not produce an additional negative effect. Using the embedded class imbalance corrections available for RF, SVM or PAM does not remove the dependency of the classification probabilities on class imbalance.
Our results indicate that variable selection further increases the probability of assigning a new sample to the majority class; the reason is that the sampling variability is larger in the minority class and therefore the biggest deviations between the true and the observed values arise in this class. As a consequence, the selected variables are those that have the biggest departures from the true values in the minority class, either indicating differences between classes that do not exist (null variables) or amplifying differences that do exist (non-null variables). However, at the same time variable selection also plays a positive role in high-dimensional classification; similarly to previous findings [14], our results indicate that the predictive accuracy of all the classifiers is improved if the classification rule is derived using only a selected subset of the measured variables.
The next question is whether there are satisfactory remedies for these problems. A first set of solutions consisted of creating a balanced training set, either by replicating (oversampling) or by removing (downsizing) some of the samples. Oversampling did not remove or attenuate the class imbalance problem [31] in our settings either, because we considered classification rules and a variable selection method that are hardly modified by the presence of replicated samples; the k-NN classifiers (k > 1) are the only exception. On the other hand, simple downsizing works well in removing the discrepancy between the class-specific predictive accuracies, but as expected it has a large variability, and the predictive accuracy of the classifiers worsens when the effective sample size is reduced considerably because the class imbalance is large.
Our attempt at overcoming these problems was the combination of classifiers trained on balanced, downsized training sets. Multiple downsizing can be seen as a special case of asymmetric bagging [24], except that the variables are selected anew in each downsized training set. We based the combination of the classifiers on majority voting even though more complex methods are available; for example, EasyEnsemble [32] combines the outputs from the classifiers using AdaBoost [33, 34].
In practice multiple downsizing improves on simple downsizing: the major advantage is the reduction of the variability of the estimated predictive accuracy [35]; in some situations we observed also a limited improvement of the predictive accuracy. The relative benefit of multiple downsizing over simple downsizing depends on the amount of information discarded by simple downsizing, i.e., on the level of class imbalance but also on the number of left-out samples. The real data had a smaller sample size compared to the simulated data and for that reason multiple downsizing was not as beneficial on the real data as in the simulations.
We used penalized logistic regression (PLR) as a classification method and evaluated its predictive accuracy as the fraction of correctly classified samples (see the limitations of this approach in [36], page 247). The classification based on the 0.50 threshold on the predicted probabilities assigns most samples to the majority class from the training set, similarly to simple logistic regression. Using the threshold based on the imbalance from the training set, which works well for logistic regression, reduces but does not remove the classification bias towards the majority class, also when no variable selection is performed before fitting the PLR model; variable selection further increases this bias. Similar results hold for random forest.
We did not attempt to perform a comprehensive study on class imbalance for high-dimensional data, but focused solely on some types of classifiers and classification strategies. We selected the classifiers to evaluate among those that are most commonly used for high-dimensional data. It is possible that other methods that we did not consider might be less sensitive to class imbalance. Most of our results were based on one single method for variable selection, i.e., the t-test with equal variances, which bases the selection of the genes on the difference between their means. Some of the effects of variable selection on class-imbalanced classification might depend on this choice. However, our results show that other parametric and nonparametric variable selection methods also have the same type of problem on imbalanced data. This is because they all attempt to select, among a very large number of candidate variables, those that differ the most between the classes, using different metrics to define the difference. We decided to focus on some of the solutions for the class imbalance problem: oversampling, downsizing and multiple downsizing. With the exception of oversampling, these approaches performed well in removing the bias towards classification into the majority class. More complex methods might be more effective in reducing the variability of the predictive accuracy. The development of guidelines for the design of class prediction studies with class-imbalanced data is also a very important issue, which we considered only marginally in this paper.
Conclusions
Our results show that the naive use of classifiers on class-imbalanced high-dimensional data can produce classification results that are highly biased towards classification into the majority class. The extent of this bias depends on the classification method, on the magnitude of the difference between the classes, and on the level of class imbalance, and it is further increased when variable selection methods are used; variable normalization generally increases the bias and should be avoided, unless the class imbalance is equal in the training and test set. When the class imbalance is moderate, and no correction for class imbalance is applied, our results indicate that DLDA performs well. In addition to its relative robustness to class imbalance, other advantages of DLDA are its simplicity and interpretability.
Our results suggest that using a balanced training set is a good choice for the design of high-dimensional class prediction studies, even in situations where the proportion of samples from each class is not equal in the population. If class imbalance cannot be avoided, researchers should take the class imbalance problem into account and appropriately adjust their classification rules. We showed that multiple downsizing can be used effectively if the class imbalance is not too severe. This method is useful in removing the bias towards classification into the majority class, and in reducing the variability of the predictive accuracy compared to simple downsizing. Further work is needed in order to assess whether more complex approaches to the correction of the class imbalance problem can further increase the class-specific predictive accuracies and predictive values, and reduce their variability.
Methods
Notation
Let x_{ ij } be the expression of the j-th variable (j = 1, ..., p) on the i-th individual (i = 1, ..., n). Some of the samples are known to belong to Class 1 (n_{1} samples) and others to Class 2 (n_{2} samples). Let κ_{ i }, ${k}_{i}^{train}$ and ${k}_{i}^{test}$ denote the proportion of samples from Class i in the population, in the training set and in the test set, respectively. We limit our attention to the G variables (G ≤ p) that are the most informative about the class distinction. We defined the most informative variables to be those with the largest absolute value of the univariate statistic derived from the two-sided t-test with equal variances. For sample i we denote the set of selected variables by x_{ i }. Let ${\overline{x}}_{j}^{(1)}$ and ${\overline{x}}_{j}^{(2)}$ denote the mean expression of the j-th selected variable in Classes 1 and 2, respectively. Let x* represent the set of selected variables for a new sample.
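The selection of the G most informative variables can be sketched in a few lines (Python illustration, numpy only; the actual analyses were carried out in R, and the function name is ours):

```python
import numpy as np

def select_top_g(X, y, G):
    """Return the column indices of the G variables with the largest absolute
    two-sample t statistic (equal variances) between Class 1 and Class 2."""
    X1, X2 = X[y == 1], X[y == 2]
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled per-variable variance estimate
    sp2 = ((n1 - 1) * X1.var(axis=0, ddof=1) + (n2 - 1) * X2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    t = (m1 - m2) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    return np.argsort(-np.abs(t))[:G]
```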
Analysis
Statistical analyses and simulations were carried out using the R language for statistical computing (R version 2.8.1) [37].
Classification methods
k-nearest neighbors
Nearest neighbor rules (kNN, [38]) are simple nonparametric methods that classify a new specimen based on the class labels of its nearest neighbors, i.e., of the specimens in the training set to which its variables are most similar. The class of the new specimen is predicted to be the majority class label of its k nearest neighbors [12]. In this paper we used the Euclidean distance to define the distance between samples and used 3 different kNN classifiers, with k = 1, 3 and 5.
Analysis was performed with the knn function of the class package in R.
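The majority-vote rule described above can be sketched as follows; this is an illustrative Python sketch (the paper itself used the class::knn function in R), and knn_predict is a hypothetical helper name.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new the majority class label of its k nearest
    training samples under Euclidean distance."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]           # majority class among neighbors
```

Using an odd k (1, 3 or 5, as in the paper) avoids ties between the two classes in the vote.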
Discriminant analysis
Discriminant analysis methods find linear combinations of variables that maximize the between-class variance while at the same time minimizing the within-class variance [11, 12]. Special cases of discriminant analysis are diagonal linear discriminant analysis (DLDA) and diagonal quadratic discriminant analysis (DQDA). DLDA assumes that the variables are independent and have the same variability in both classes, while DQDA allows the variability to differ between classes. A new sample x* is assigned to Class 1 if
$${\displaystyle {\sum}_{g=1}^{G}}\left[\frac{({x}_{g}^{*}-{\overline{x}}_{g}^{(1)})^{2}}{{s}_{1g}^{2}}+\mathrm{log}{s}_{1g}^{2}\right]<{\displaystyle {\sum}_{g=1}^{G}}\left[\frac{({x}_{g}^{*}-{\overline{x}}_{g}^{(2)})^{2}}{{s}_{2g}^{2}}+\mathrm{log}{s}_{2g}^{2}\right]$$
and to Class 2 otherwise. ${s}_{1g}^{2}$ and ${s}_{2g}^{2}$ are the estimated variances of variable g in Class 1 and Class 2, respectively; for DLDA both are replaced by a common pooled variance ${s}_{g}^{2}$, so that the log-variance terms cancel.
The function stat.diag.da in the sma package was used to perform DLDA and DQDA.
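The DLDA rule can be sketched as below; this is a minimal Python illustration (not the sma::stat.diag.da implementation used in the paper), with dlda_predict a hypothetical helper name.

```python
import numpy as np

def dlda_predict(X1, X2, x_new):
    """Diagonal LDA sketch: per-variable pooled variances, assign
    x_new to the class whose centroid is closer in scaled squared
    distance (Class 1 on ties)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # pooled within-class variance of each variable
    s2 = (((X1 - m1) ** 2).sum(axis=0) +
          ((X2 - m2) ** 2).sum(axis=0)) / (n1 + n2 - 2)
    d1 = (((x_new - m1) ** 2) / s2).sum()
    d2 = (((x_new - m2) ** 2) / s2).sum()
    return 1 if d1 <= d2 else 2
```

DQDA differs only in using class-specific variances plus the log-variance terms instead of the pooled variance.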
Random forest
Random forest (RF, [39]) is a classifier consisting of an ensemble of classification trees. Each of the T trees is built on a bootstrap sample drawn from the complete data; at each node, m of the p variables are randomly chosen and used to find the best split; trees are not pruned, i.e., they are grown to their full extent; and each tree assigns the new samples to a class. RF classifies a new sample into the class to which it was assigned most frequently by the T trees.
To take the class imbalance of the training set into account, we also considered a different classifier, which assigned a new sample to Class 1 if the proportion of classification assignments to Class 1 was greater than the proportion of samples from Class 1 in the training set (${k}_{1}^{train}$). This corrected classifier is referred to in the text as RF-THR. We assessed the performance of both classification rules.
We used the function randomForest in the randomForest package with default values for the parameters (T = 500, m = $\sqrt{p}$) and defined class priors equal to the proportion of samples in each class in the training set (option classwt). The RF-THR classification was obtained using the option cutoff in the randomForest function.
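The adjusted-threshold rule amounts to replacing the usual 0.5 majority cutoff on the tree votes with the Class 1 training proportion; a minimal sketch (threshold_vote is a hypothetical helper, not part of the randomForest package):

```python
def threshold_vote(votes_class1, n_trees, k1_train):
    """RF-THR-style rule: assign Class 1 when the fraction of trees
    voting for Class 1 exceeds the Class 1 training proportion
    k1_train, rather than the usual 0.5 majority threshold."""
    return 1 if votes_class1 / n_trees > k1_train else 2
```

With a highly imbalanced training set (say k1_train = 0.2), 40% of the votes already suffice for a Class 1 call, whereas the plain majority rule would predict Class 2.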
Support vector machines
Linear support vector machines (SVM; [40]) attempt to find a linear combination of the variables that best separates the samples into two groups based on their class labels. When perfect separation is not possible, the optimal linear combination is determined by a criterion to minimize the number of misclassifications and simultaneously maximize the distance between the classes.
In this paper we used the functions svm and predict from the e1071 package with a linear kernel; we specified the class weights as the proportion of samples from each class and did not use the default variable scaling of the svm function.
PAM
Nearest shrunken centroids classification (also known as "Prediction Analysis of Microarrays", PAM, [41]) is an enhancement of simple nearest centroid classification, in which a new sample is classified into the class whose centroid it is closest to, in squared distance. The centroid of a class is the vector containing the mean expression profile of all samples from that class in the training set.
A new sample x* is classified in Class 1 if
$${\displaystyle {\sum}_{g=1}^{G}}\frac{({x}_{g}^{*}-{\overline{x}}_{g}^{\prime (1)})^{2}}{({s}_{g}+{s}_{0})^{2}}-2\mathrm{log}{\pi}_{1}<{\displaystyle {\sum}_{g=1}^{G}}\frac{({x}_{g}^{*}-{\overline{x}}_{g}^{\prime (2)})^{2}}{({s}_{g}+{s}_{0})^{2}}-2\mathrm{log}{\pi}_{2}\phantom{\rule{2em}{0ex}}(3)$$
and in Class 2 otherwise. In Eq. (3), s_{ g } is the pooled within-class standard deviation of variable g, s_{0} is the median value of the s_{ g } over the set of variables, and ${\overline{x}}_{g}^{\prime (1)}$ and ${\overline{x}}_{g}^{\prime (2)}$ are the Class 1 and Class 2 shrunken centroids for variable g, respectively. The second term in the equation is a correction based on the class prior probability π_{ k }.
We used the functions pamr.train and pamr.predict from the pamr package. The threshold value in each step of the simulation was set to the largest threshold giving the smallest number of misclassification errors in the training set.
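The nearest-centroid step that PAM builds on can be sketched as below; this Python sketch deliberately omits the centroid shrinkage and the class-prior correction applied by pamr, and nearest_centroid is a hypothetical helper name.

```python
import numpy as np

def nearest_centroid(X1, X2, x_new):
    """Plain nearest-centroid rule underlying PAM: assign x_new to
    the class whose mean expression profile (centroid) is closer in
    squared distance (Class 1 on ties)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    d1 = ((x_new - m1) ** 2).sum()
    d2 = ((x_new - m2) ** 2).sum()
    return 1 if d1 <= d2 else 2
```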
PLR
Penalized logistic regression (PLR, [42]) models the log-odds of belonging to Class 1 as a linear function of the variables,
$$\mathrm{log}\frac{{p}_{i}}{1-{p}_{i}}=\alpha +{x}_{i}^{T}\beta ,$$
where y_{ i } denotes the response for the i-th individual (1 for Class 1 and 0 for Class 2), and p_{ i } is the probability of belonging to Class 1 for a sample with variables x_{ i } (p_{ i } = P(y_{ i } = 1 | x_{ i })), which is a function of the observed data x and of the regression coefficients (α and β); the coefficients are estimated by maximizing the log-likelihood penalized with the term λ‖β‖^{2}/2. The penalization parameter λ can be chosen or estimated (by cross-validation or using Akaike's information criterion), while α and β are estimated using the Newton-Raphson procedure. For a new sample with features x* the probabilities of belonging to Class 1 or Class 2 are estimated ($\widehat{p}$ and 1 − $\widehat{p}$) and the sample is classified in Class 1 if $\widehat{p}$ > 1 − $\widehat{p}$, i.e., if $\widehat{p}$ > 0.5, and in Class 2 otherwise, with ties broken at random.
Another possibility is to classify a sample in Class 1 if $\widehat{p}>{k}_{1}^{train}$; we refer to the results obtained using this classification rule as those based on the theoretical threshold (PLR-THR).
We used the functions plr and predict.plr from the stepPlr package. We used λ = 1 for all the analyses; this choice was determined after an exploratory analysis that showed that in our simulation settings using any value for λ in the range from 0 to 1.5 had little effect on the classification rules. However, we observed that large values of λ worsened the performance of PLR when the training set was highly imbalanced.
Data simulation
We simulated p = 40, 1000 or 10000 independent variables for each of n = 100 samples. Under the null case all the variables were simulated independently from the standard normal distribution (mean μ = 0 and standard deviation σ = 1, N(0, 1)) and the class membership of the samples was randomly assigned. Under the alternative case, class membership depended on the variables: for each sample, p_{0} variables were generated independently from N(0, 1) (null variables), while the remaining variables (p_{ DE }, non-null variables) were generated independently from a normal distribution with mean μ^{(2)} and standard deviation σ = 1 for samples from Class 2, and from N(0, 1) for samples from Class 1. Different values of μ^{(2)} (μ^{(2)} = 0.1, 0.2, ..., 1.9, 2, 3) and of p_{ DE } (p_{ DE } = 20, 40 or p) were considered.
The data set was split into a training set (n_{ train } = 80 samples) and a test set (n_{ test } = 20 samples). Different levels of imbalance between the two classes were considered, varying the proportion of samples from Class 1 from 5% to 95% (k_{1} = 0.05, 0.10,..., 0.95). We looked at situations where we had (i) imbalanced training sets (${k}_{1}^{train}$ = 0.05, ..., 0.95) and balanced test sets (${k}_{1}^{test}$ = 0.50), (ii) the same imbalance in training and test set (${k}_{1}^{train}$ = ${k}_{1}^{test}$ = 0.05, ..., 0.95) and (iii) balanced training set and imbalanced test set (${k}_{1}^{train}$ = 0.50, ${k}_{1}^{test}$ = 0.05, ..., 0.95). In a limited set of simulations we also used smaller sample size (n = 70, n_{ train } = 50, n_{ test } = 20). All the simulations were repeated 1,000 times.
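The simulation design under the alternative case can be sketched as follows; this is an illustrative Python sketch (the paper's simulations were run in R), and simulate is a hypothetical function name.

```python
import numpy as np

def simulate(n=100, p=1000, p_de=20, mu2=1.0, k1=0.5, seed=0):
    """Sketch of the alternative-case design: p independent N(0, 1)
    variables; for Class 2 samples the first p_de (non-null)
    variables have mean mu2 instead of 0. k1 is the proportion of
    Class 1 samples."""
    rng = np.random.default_rng(seed)
    n1 = int(round(n * k1))
    y = np.array([1] * n1 + [2] * (n - n1))
    X = rng.normal(0.0, 1.0, size=(n, p))
    X[y == 2, :p_de] += mu2  # shift the non-null variables in Class 2
    return X, y
```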
Derivation of the classification rules
Normalization
We evaluated the effect of data normalization, developing classification rules (i) without normalizing the data (i.e., using raw data x_{ ij }), (ii) normalizing the samples (i.e., setting the mean expression of each sample to zero, using ${x}_{ij}^{s}={x}_{ij}-\frac{1}{p}{\displaystyle {\sum}_{k=1}^{p}{x}_{ik}}$) and (iii) normalizing the variables (i.e., setting the mean expression of each variable to zero, using ${x}_{ij}^{v}={x}_{ij}-\frac{1}{n}{\displaystyle {\sum}_{k=1}^{n}{x}_{kj}}$). Normalizations were performed separately on the training and test sets.
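The two centering schemes can be written compactly for a samples-by-variables matrix; a minimal Python sketch with hypothetical helper names:

```python
import numpy as np

def normalize_samples(X):
    """Centre each row (sample) at zero: x_ij - mean over variables."""
    return X - X.mean(axis=1, keepdims=True)

def normalize_variables(X):
    """Centre each column (variable) at zero: x_ij - mean over
    samples; applied separately to training and test sets."""
    return X - X.mean(axis=0, keepdims=True)
```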
Variable selection
Variable selection was performed exclusively on the training set, selecting the G variables with the largest absolute value of the univariate two-sample t test statistic (G = 40); we also considered the situation where all the variables were used (G = p). The classification rules were derived completely on the training set, using the variables selected on the training set and the six classification methods described in the section Classification methods.
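Ranking variables by the absolute pooled-variance t statistic computed on the training set only can be sketched as below (an illustrative Python sketch; select_top_t is a hypothetical helper name):

```python
import numpy as np

def select_top_t(X, y, G=40):
    """Return the indices of the G variables with the largest |t|
    from a two-sample t test with pooled (equal) variances,
    computed on the training data only."""
    X1, X2 = X[y == 1], X[y == 2]
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled variance of each variable
    sp2 = (((X1 - m1) ** 2).sum(axis=0) +
           ((X2 - m2) ** 2).sum(axis=0)) / (n1 + n2 - 2)
    t = (m1 - m2) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return np.argsort(-np.abs(t))[:G]
```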
Evaluation of the performance of the classifiers
The performance of the classifiers was evaluated on the test set. It is well known that for imbalanced data the proportion of correctly classified samples can be a misleading measure of the performance of a classifier ([29], chapter 2). For this reason several measures of accuracy were considered: (i) the overall predictive accuracy (PA, the number of correctly classified samples from the test set divided by the total number of samples in the test set), (ii) the predictive accuracy of Class 1 (PA_{1} = P(Predict Class_{1}|True Class_{1}), i.e., PA evaluated using only samples from Class 1), (iii) the predictive accuracy of Class 2 (PA_{2}), and (iv) the class-specific predictive values (PV_{1} = P(True Class_{1}|Predict Class_{1}), i.e., the probability that a sample assigned to Class 1 truly belongs to Class 1, and PV_{2}, defined analogously for Class 2).
We derived the 95% prediction intervals for the overall PA as the 2.5th percentile to the 97.5th percentile from the distribution of PAs obtained from 1000 replications, as described in [43].
If not otherwise stated, we assumed that the proportion of samples in each class was the same in the training set and in the population (${\kappa}_{i}={k}_{i}^{train}$).
In biomedical research the two classes often refer to a disease status (positive or negative), and sensitivity (true positive fraction) and specificity (1 − false positive fraction) are used to describe the accuracy of the classifier. Assuming that Class 1 is the positive class, PA_{1} is the sensitivity of the classifier and PA_{2} is its specificity. Furthermore, PV_{1} is the positive predictive value (PPV) and PV_{2} is the negative predictive value (NPV).
We also calculated the area under the receiver operating characteristic (ROC) curve (AUC) ([29], chapter 4), using the Hmisc package.
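The distinction between overall and class-specific accuracies is easy to demonstrate; in the sketch below (class_accuracies is a hypothetical helper), a classifier that always predicts the majority class obtains a high overall PA while PA of the minority class is zero.

```python
import numpy as np

def class_accuracies(y_true, y_pred):
    """Overall PA plus class-specific PA_1 and PA_2 on a test set."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pa = (y_true == y_pred).mean()               # overall accuracy
    pa1 = (y_pred[y_true == 1] == 1).mean()      # accuracy on Class 1
    pa2 = (y_pred[y_true == 2] == 2).mean()      # accuracy on Class 2
    return pa, pa1, pa2
```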
Solutions for the development of classifiers for classimbalanced data
Oversampling
Simple oversampling consists in obtaining a class-balanced training set by replicating a subset of randomly selected samples from the minority class (replicating max(n_{1}, n_{2}) − min(n_{1}, n_{2}) samples from the minority class, obtaining a training set of size 2max(n_{1}, n_{2})) [19, 44, 45]. The classification rule is derived on the replicated training set as described for the original data and evaluated on the test set.
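A minimal sketch of this replication step (illustrative Python, with the hypothetical name oversample; sampling is done with replacement so the scheme also covers the case where the class-size difference exceeds the minority-class size):

```python
import numpy as np

def oversample(X, y, seed=0):
    """Replicate randomly chosen minority-class samples until both
    classes contain max(n1, n2) samples."""
    rng = np.random.default_rng(seed)
    n1, n2 = (y == 1).sum(), (y == 2).sum()
    minority = 1 if n1 < n2 else 2
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=abs(n1 - n2), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```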
Downsizing
Simple downsizing consists in obtaining a class-balanced training set by removing a subset of randomly selected samples from the majority class (removing max(n_{1}, n_{2}) − min(n_{1}, n_{2}) samples from the majority class, obtaining a training set of size 2min(n_{1}, n_{2})) [19, 45]. The classification rule is derived on the reduced training set as described for the original data and evaluated on the test set.
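The mirror-image step can be sketched as follows (illustrative Python; downsize is a hypothetical helper name):

```python
import numpy as np

def downsize(X, y, seed=0):
    """Randomly drop majority-class samples so that both classes
    keep min(n1, n2) samples."""
    rng = np.random.default_rng(seed)
    n1, n2 = (y == 1).sum(), (y == 2).sum()
    majority = 1 if n1 > n2 else 2
    drop = rng.choice(np.flatnonzero(y == majority),
                      size=abs(n1 - n2), replace=False)
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return X[keep], y[keep]
```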
Multiple downsizing (MultDS, asymmetric bagging with embedded variable selection)
With multiple downsizing (MultDS) we tried to use all the information available in the majority class by performing downsizing multiple times. 101 random selections of samples from the majority class were made, and a classification rule was derived on each of the 101 balanced training sets (note that the minority class was the same in each training set). The 101 downsized classifiers were combined by majority voting: the class assignments for new samples were obtained from each of the downsized classifiers and the new samples were assigned to the class with the larger number of votes. Multiple downsizing can be seen as a special case of asymmetric bagging [24], except that the variables are selected anew in each downsized training set.
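The voting scheme above can be sketched generically; in this illustrative Python sketch, fit_predict is a hypothetical callable that performs its own variable selection, trains a classifier on a balanced training set, and returns class labels for the test samples.

```python
import numpy as np

def mult_downsize_predict(fit_predict, X, y, X_test, n_rep=101, seed=0):
    """MultDS sketch: build n_rep balanced training sets (all
    minority samples plus a random majority subset), refit via
    fit_predict on each, and combine by majority vote."""
    rng = np.random.default_rng(seed)
    n1, n2 = (y == 1).sum(), (y == 2).sum()
    majority = 1 if n1 > n2 else 2
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y != majority)
    votes1 = np.zeros(len(X_test))
    for _ in range(n_rep):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        keep = np.concatenate([min_idx, sub])
        votes1 += (fit_predict(X[keep], y[keep], X_test) == 1)
    return np.where(votes1 > n_rep / 2, 1, 2)   # majority of n_rep votes
```

An odd number of repetitions (101 in the paper) guarantees that the vote cannot tie.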
Microarray Data
Sotiriou et al. [28] analyzed cDNA gene expression profiles from 99 tumor specimens from breast cancer patients. In addition to gene expression values for 7650 genes (probes), preprocessed as described in Sotiriou et al. (2003), standard prognostic variable information was available for each patient (the data are publicly available at http://linus.nci.nih.gov/~brb/DataArchive.html). Missing log-expression values were replaced with 0. Here we considered two two-class prediction problems: the first was to predict estrogen receptor (ER) status, which was negative (ER−) for 34 patients and positive (ER+) for 65 patients, according to a ligand-binding assay; the second was to predict the grade of the tumors, which was 1 or 2 for 54 patients and 3 for 45 patients.
On these data we evaluated the performance of all the classifiers used in the simulation studies, obtaining different levels of class imbalance in the training and test sets by including a selected subset of the samples in the analyses. For each setting (imbalance level) we replicated the analysis 500 times, randomly selecting which samples to include in the training and in the test set. Variable selection consisted in selecting, on each training set, the 40 probes with the largest absolute value of the univariate statistic derived from the two-sided t test with equal variances. The overall PA, the class-specific PAs and PVs, and the AUC were obtained by averaging the results from the 500 analyses. We evaluated the classifiers both without applying any specific correction for class imbalance and using the solutions presented in the previous paragraphs (oversampling, downsizing and multiple downsizing). We evaluated the 95% prediction intervals for the class-specific PAs using the 500 replications.
Abbreviations
AUC: area under the ROC curve
DLDA: diagonal linear discriminant analysis
DQDA: diagonal quadratic discriminant analysis
ER: estrogen receptor
ER+: positive estrogen receptor status
ER−: negative estrogen receptor status
kNN: nearest neighbor classifier with k neighbors
LOOCV: leave-one-out cross-validation
MultDS: multiple downsizing
PA: predictive accuracy
PA_{1}: predictive accuracy for Class 1
PA_{2}: predictive accuracy for Class 2
PAM: prediction analysis of microarrays
PLR: penalized logistic regression
PLR-THR: penalized logistic regression with adjusted threshold
PV: predictive value
PV_{1}: predictive value for Class 1
PV_{2}: predictive value for Class 2
RF: random forests
RF-THR: random forests with adjusted threshold
SVM: support vector machines
References
Brown P, Botstein D: Exploring the new world of the genome with DNA microarrays. Nat Genet 1999, 21(Suppl 1):33–37. 10.1038/4462
Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA, Downing JR, Jacks T, Horvitz HR, Golub TR: MicroRNA expression profiles classify human cancers. Nature 2005, 435(7043):834–838. 10.1038/nature03702
Ishkanian AS, Malloff CA, Watson SK, DeLeeuw RJ, Chi B, Coe BP, Snijders A, Albertson DG, Pinkel D, Marra MA, Ling V, MacAulay C, Lam WL: A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet 2004, 36(3):299–303. 10.1038/ng1307
Gunderson KL, Steemers FJ, Lee G, Mendoza LG, Chee MS: A genome-wide scalable SNP genotyping assay using microarray technology. Nat Genet 2005, 37(5):549–554. 10.1038/ng1547
Massague J: Sorting out breast-cancer gene signatures. N Engl J Med 2007, 356(3):294–297. 10.1056/NEJMe068292
Li L, Darden TA, Weinberg CR, Levine AJ, Pedersen LG: Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Comb Chem High Throughput Screen 2001, 4:727–739.
Oberthuer A, Berthold F, Warnat P, Hero B, Kahlert Y, Spitz R, Ernestus K, Konig R, Haas S, Eils R, Schwab M, Brors B, Westermann F, Fischer M: Customized oligonucleotide microarray gene expression-based classification of neuroblastoma patients outperforms current clinical risk stratification. J Clin Oncol 2006, 24(31):5070–5078. 10.1200/JCO.2006.06.1879
Tan PJ, Dowe DL, Dix TI: Building classification models from microarray data with tree-based classification algorithms. In Proceedings of the 20th Australian Joint Conference on Advances in Artificial Intelligence, Volume 4830 of Lecture Notes in Computer Science. Edited by: Orgun MA, Thornton J. Springer; 2007:589–598. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.9577&rep=rep1&type=pdf]
Wu J, Liu H, Duan X, Ding Y, Wu H, Bai Y, Sun X: Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature. Bioinformatics 2009, 25:30–35. 10.1093/bioinformatics/btn583
Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci 2000, 97:262–267. 10.1073/pnas.97.1.262
Speed TP: Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall/CRC; 2003.
Simon RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y: Design and Analysis of DNA Microarray Investigations. New York: Springer; 2004.
Saeys Y, Inza I, Larrañaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23(19):2507–2517. 10.1093/bioinformatics/btm344
Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002, 97(457):77–87. 10.1198/016214502753479248
Ramaswamy S, Ross KN, Lander ES, Golub TR: A molecular signature of metastasis in primary solid tumors. Nat Genet 2003, 33:49–54. 10.1038/ng1060
Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 2002, 8:68–74. 10.1038/nm0102-68
Iizuka N, Oka M, Yamada-Okabe H, Nishida M, Maeda Y, Mori N, Takao T, Tamesa T, Tangoku A, Tabuchi H, Hamada K, Nakayama H, Ishitsuka H, Miyamoto T, Hirabayashi A, Uchimura S, Hamamoto Y: Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection. The Lancet 2003, 361(9361):923–929. 10.1016/S0140-6736(03)12775-4
Japkowicz N, Stephen S: The class imbalance problem: a systematic study. Intell Data Anal 2002, 6(5):429–449.
He H, Garcia EA: Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009, 21(9):1263–1284. 10.1109/TKDE.2008.239
Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16(10):906–914. 10.1093/bioinformatics/16.10.906
Yang K, Cai Z, Li J, Lin G: A stable gene selection in microarray data analysis. BMC Bioinformatics 2006, 7:228. 10.1186/1471-2105-7-228
Levner I: Feature selection and nearest centroid classification for protein mass spectrometry. BMC Bioinformatics 2005, 6:68. 10.1186/1471-2105-6-68
Meng HH, Li GZ, Wang RS, Zhao XM, Chen L: The imbalanced problem in mass-spectrometry data analysis. In LNOR 9: The Second International Symposium on Optimization and Systems Biology (OSB'08). Edited by: Du DZ, Zhang XS. Lijiang, China; 2008:136–143. [http://www.aporc.org/LNOR/9/OSB2008F18.pdf]
Tao D, Tang X, Li X, Wu X: Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Trans Pattern Anal Mach Intell 2006, 28(7):1088–1099. 10.1109/TPAMI.2006.134
Ahn H, Moon H, Fazzari MJ, Lim N, Chen JJ, Kodell RL: Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal 2007, 51(12):6166–6179. 10.1016/j.csda.2006.12.043
Al-Shahib A, Breitling R, Gilbert D: Feature selection and the class imbalance problem in predicting protein function from sequence. Appl Bioinformatics 2005, 4(3):195–203.
Li GZ, Meng HH, Lu WC, Yang JY, Yang MQ: Asymmetric bagging and feature selection for activities prediction of drug molecules. BMC Bioinformatics 2008, 9(Suppl 6):S7. 10.1186/1471-2105-9-S6-S7
Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, Fox SB, Harris AL, Liu ET: Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci USA 2003, 100(18):10393–10398. 10.1073/pnas.1732912100
Pepe MS: The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press; 2003.
Lusa L, McShane LM, Reid JF, De Cecco L, Ambrogi F, Biganzoli E, Gariboldi M, Pierotti MA: Challenges in projecting clustering results across gene expression profiling datasets. J Natl Cancer Inst 2007, 99(22):1715–1723. 10.1093/jnci/djm216
Japkowicz N: The class imbalance problem: significance and strategies. Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI) 2000:111–117. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.35.1693&rep=rep1&type=pdf]
Liu XY, Wu J, Zhou ZH: Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern B Cybern 2009, 39(2):539–550. 10.1109/TSMCB.2008.2007853
Freund Y, Schapire RE: A decision-theoretic generalization of on-line learning and an application to boosting. Lect Notes Comput Sci 1995, 904:23–37.
Freund Y, Schapire RE: Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning 1996:148–156.
Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2003.
Harrell F: Regression Modeling Strategies. New York: Springer; 2001.
R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2008. [ISBN 3-900051-07-0] [http://www.R-project.org]
Fix E, Hodges JL: Discriminatory analysis. Nonparametric discrimination: consistency properties. Tech. Rep. Project 21-49-004, Report Number 4, USAF School of Aviation Medicine, Randolph Field, Texas; 1951.
Breiman L: Random forests. Mach Learn 2001, 45:5–32. 10.1023/A:1010933404324
Cortes C, Vapnik V: Support-vector networks. Mach Learn 1995, 20(3):273–297. [http://www.springerlink.com/content/k238jx04hm87j80g/fulltext.pdf]
Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002, 99(10):6567–6572. 10.1073/pnas.082099299
Zhu J, Hastie T: Classification of gene microarrays by penalized logistic regression. Biostatistics 2004, 5(3):427–443. 10.1093/biostatistics/kxg046
Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005, 365(9458):488–492. 10.1016/S0140-6736(05)17866-0
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP: SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002, 16:341–378. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.6835&rep=rep1&type=pdf]
Batista GEAPA, Prati RC, Monard MC: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 2004, 6:20–29. 10.1145/1007730.1007735
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.