
Table 1 Summary of the theoretical properties of SMOTE for high-dimensional data

From: SMOTE for high-dimensional class-imbalanced data

| Property | Consequence of using SMOTE on high-dimensional data |
| --- | --- |
| E(SMOTE) = E(X) | Little impact on classifiers that depend on mean values (DLDA). |
| var(SMOTE) = (2/3) var(X) | Minority class variability is underestimated; negative impact on classifiers that use class-specific variances (DQDA); inflated statistical significance of tests for comparing classes (t-test). |
| d(SMOTE, TEST) < d(X, TEST), where d is the Euclidean distance | Test samples are classified mostly in the minority class by classifiers based on Euclidean distance (k-NN); variable selection helps reduce this problem. |
| cor(SMOTE, X) ≥ 0; cor(SMOTE_s, SMOTE_t) ≥ 0 | Training set samples are no longer independent; independence of samples is assumed by most classifiers (DLDA, PLR, ...) and variable selection methods (t-test, Mann-Whitney, ...). |
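
The first three properties in the table can be checked numerically with a small simulation. The sketch below is illustrative only and is not the authors' code: it assumes that each synthetic sample is an interpolation X_i + u (X_j - X_i) with u drawn uniformly from [0, 1], and, as a simplifying assumption, it picks the partner X_j at random instead of running a nearest-neighbor search. The function name `smote_like_samples` and all parameter values are hypothetical. Under these assumptions the variance factor is E[(1 - u)^2 + u^2] = 1/3 + 1/3 = 2/3.

```python
import numpy as np


def smote_like_samples(X, n_new, rng):
    """Interpolate between random pairs of distinct rows of X.

    Simplifying assumption: the partner row is chosen at random,
    not by a k-nearest-neighbor search as in real SMOTE.
    """
    n = X.shape[0]
    i = rng.integers(0, n, size=n_new)
    # Offset by 1..n-1 so that the partner j always differs from i.
    j = (i + rng.integers(1, n, size=n_new)) % n
    # One interpolation weight per synthetic sample, shared by all features.
    u = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return X[i] + u * (X[j] - X[i])


rng = np.random.default_rng(0)
n_features = 50

# Hypothetical minority class: independent Gaussian features.
X = rng.normal(loc=1.0, scale=2.0, size=(2000, n_features))
S = smote_like_samples(X, 20000, rng)

# Property 1: the mean is preserved.
mean_gap = abs(S.mean() - X.mean())

# Property 2: the per-feature variance shrinks by a factor of about 2/3.
var_ratio = S.var(axis=0).mean() / X.var(axis=0).mean()

# Property 3: a test sample from the same distribution is, on average,
# closer (in Euclidean distance) to synthetic samples than to original ones.
test_sample = rng.normal(loc=1.0, scale=2.0, size=n_features)
d_smote = np.linalg.norm(S - test_sample, axis=1).mean()
d_orig = np.linalg.norm(X - test_sample, axis=1).mean()

print(f"mean gap: {mean_gap:.4f}")
print(f"variance ratio: {var_ratio:.3f}")
print(f"mean distance to test sample: SMOTE {d_smote:.2f}, original {d_orig:.2f}")
```

With a true nearest-neighbor search the shrinkage factor can differ from 2/3, especially in low dimensions where neighbors are far from independent of the sample they are paired with, so the random-partner choice here should be read as an idealization of the high-dimensional setting.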