### Hellinger distance under binormal assumption

The Hellinger distance is a measure of divergence between probability distributions [30]. Let *P* and *Q* be two probability measures that are absolutely continuous with respect to a third probability measure *λ*. The square of the Hellinger distance is defined as follows:

$$ D_{H}^{2}(P,Q) = \int\left(\sqrt{\frac{dP}{d\lambda}}-\sqrt{\frac{dQ}{d\lambda}}\right)^{2}d\lambda $$

(1)

Here, *λ* is set to be the Lebesgue measure, so that \(\frac {dP}{d\lambda }\) and \(\frac {dQ}{d\lambda }\) are two probability density functions. Under the binormal assumption, *P* and *Q* are two normal distributions, and

$$ \left\{\begin{array}{l} \frac{dP}{d\lambda} = f_{1}(x) \sim N\left(\mu_{1}, \sigma_{1}^{2}\right),\\ \\ \frac{dQ}{d\lambda} = f_{0}(x) \sim N\left(\mu_{0}, \sigma_{0}^{2}\right) \end{array} \right. $$

(2)

Thus Eq. (1) can be rewritten as

$$\begin{array}{@{}rcl@{}} D_{H}^{2}(P,Q)&=&\int\left(\sqrt{f_{1}(x)}-\sqrt{f_{0}(x)}\right)^{2}dx\\ &=&2 - 2\int\sqrt{f_{1}(x)f_{0}(x)}dx\\ &=&2-2\sqrt{\frac{2\sigma_{1}\sigma_{0}}{\sigma_{1}^{2} + \sigma_{0}^{2}}}\exp\left\{-\frac{(\mu_{1}-\mu_{0})^{2}}{4\left(\sigma_{1}^{2} + \sigma_{0}^{2}\right)}\right\} \end{array} $$

(3)

In practice, the parameters *μ*_{1}, \(\sigma _{1}^{2}\), *μ*_{0}, and \(\sigma _{0}^{2}\) can be replaced by the corresponding sample statistics \(\bar {X}_{1}\), \(S_{1}^{2}\), \(\bar {X}_{0}\), and \(S_{0}^{2}\), respectively.
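Under the binormal assumption, Eq. (3) gives the squared Hellinger distance in closed form, so it can be evaluated directly from the sample statistics just mentioned. A minimal sketch in Python (the function name `hellinger_sq` is ours, not from the paper):

```python
import math

def hellinger_sq(mu1, var1, mu0, var0):
    """Squared Hellinger distance between N(mu1, var1) and N(mu0, var0),
    evaluated in closed form via Eq. (3)."""
    bc = math.sqrt(2.0 * math.sqrt(var1 * var0) / (var1 + var0)) \
         * math.exp(-((mu1 - mu0) ** 2) / (4.0 * (var1 + var0)))
    return 2.0 - 2.0 * bc
```

In practice, one would call it with the plug-in estimates, e.g. `hellinger_sq(x1_bar, s1_sq, x0_bar, s0_sq)`; identical distributions give 0, and the value approaches 2 as the distributions separate.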

Without loss of generality, let **y**=(*y*_{1},*y*_{2},⋯,*y*_{n})^{T} be a binary categorical response with *y*_{i}=1 if the *i*^{th} observation belongs to the minority class (positive) and *y*_{i}=−1 if it belongs to the majority class (negative); let *x*_{j} be the *j*^{th} feature (*j*=1,2,⋯,*p*) and **X**=(*x*_{1},*x*_{2},⋯,*x*_{p}); and let *β*=(*β*_{1},*β*_{2},⋯,*β*_{p})^{T} be the vector of coefficient estimates. The normality assumption here is placed on a linear combination of the predictor matrix **X** rather than on each single feature, namely

$$ P = \textbf{X}\boldsymbol{\beta}\Big|(\textbf{y} = 1)=\beta_{1}{\boldsymbol{x}}_{1} + \beta_{2}{\boldsymbol{x}}_{2} + \cdots + \beta_{p}{\boldsymbol{x}}_{p}\Big|(\textbf{y} = 1), $$

(4)

$$ Q = \textbf{X}\boldsymbol{\beta}\Big|(\textbf{y} = -1)=\beta_{1}{\boldsymbol{x}}_{1} + \beta_{2}{\boldsymbol{x}}_{2} + \cdots + \beta_{p}{\boldsymbol{x}}_{p}\Big|(\textbf{y} = -1) $$

(5)

Obviously, the binormal assumption is more likely to hold for a linear combination than for a single variable, particularly for a moderate to large number of features *p*, by the central limit theorem (CLT).

**Definition** 1. A quantity is called skew-insensitive if it is not influenced by the class priors.

**Definition** 2. Let *t*_{i} (*i*=1,2,⋯,*n*) be the original score of each observation and *c* be a constant (*c*≠0). A quantity is called translation-invariant if it remains unchanged when every score is shifted to *t*_{i}+*c* (*i*=1,2,⋯,*n*).

**Property** 1. The Hellinger distance is skew-insensitive under the binormal assumption.

Equation (3) shows that the Hellinger distance does not depend on the class-imbalance ratio: it involves only the means *μ*_{0}, *μ*_{1} and variances \(\sigma _{0}^{2}\), \(\sigma _{1}^{2}\) of *P* and *Q*. By the law of large numbers, these four quantities are well approximated by their corresponding sample statistics when the sample size is large enough, and they are independent of the class-imbalance ratio. An example is given in Fig. 2, which exhibits this skew-insensitivity by calculating the Hellinger distance between two normal distributions. Fig. 2 shows that the value of the Hellinger distance stays nearly constant as the class-imbalance ratio varies from 1 to 99, and this consistency strengthens as the sample size increases. That is, the magnitude of the Hellinger distance is not influenced by the class-imbalance ratio. Such skew-insensitivity has also been demonstrated in [31] by comparing isometrics and through a synthetic example.
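This skew-insensitivity can be checked numerically in the spirit of Fig. 2. The plug-in estimate below (our own helper, using sample means and variances as suggested above) stays nearly constant as the majority class grows from a 1:1 to a 99:1 ratio:

```python
import math
import random
import statistics

def hellinger_sq_from_samples(pos, neg):
    """Plug-in version of Eq. (3): sample means/variances replace mu and sigma^2."""
    m1, v1 = statistics.fmean(pos), statistics.variance(pos)
    m0, v0 = statistics.fmean(neg), statistics.variance(neg)
    bc = math.sqrt(2.0 * math.sqrt(v1 * v0) / (v1 + v0)) \
         * math.exp(-((m1 - m0) ** 2) / (4.0 * (v1 + v0)))
    return 2.0 - 2.0 * bc

rng = random.Random(0)
estimates = []
for ratio in (1, 10, 99):  # majority:minority ratio
    pos = [rng.gauss(2.0, 1.0) for _ in range(2000)]          # minority (positive)
    neg = [rng.gauss(0.0, 1.0) for _ in range(2000 * ratio)]  # majority (negative)
    estimates.append(hellinger_sq_from_samples(pos, neg))
```

Up to sampling noise, the three estimates agree, because only the per-class means and variances enter Eq. (3), not the class sizes.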

**Property** 2. The Hellinger distance is translation-invariant under the binormal assumption.

Since the two variances \(\sigma _{0}^{2}\) and \(\sigma _{1}^{2}\) as well as the difference *μ*_{1}−*μ*_{0} remain unchanged when every score is shifted by the same constant, the Hellinger distance stays the same by Eq. (3).
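Property 2 can likewise be verified with a quick numerical check; the helper names and the shift constant below are illustrative only:

```python
import math
import random
import statistics

def hellinger_sq(m1, v1, m0, v0):
    # Closed-form Eq. (3)
    bc = math.sqrt(2.0 * math.sqrt(v1 * v0) / (v1 + v0)) \
         * math.exp(-((m1 - m0) ** 2) / (4.0 * (v1 + v0)))
    return 2.0 - 2.0 * bc

def stats(scores):
    return statistics.fmean(scores), statistics.variance(scores)

rng = random.Random(1)
pos = [rng.gauss(1.5, 1.2) for _ in range(500)]
neg = [rng.gauss(0.0, 0.8) for _ in range(500)]

c = 7.3  # arbitrary nonzero shift
d_before = hellinger_sq(*stats(pos), *stats(neg))
d_after = hellinger_sq(*stats([t + c for t in pos]), *stats([t + c for t in neg]))
```

Shifting every score by `c` leaves the variances and the mean difference unchanged, so `d_before` and `d_after` coincide up to floating-point rounding.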

As discussed above, the Hellinger distance captures the divergence between the feature-value distributions of the two classes and, under the binormal assumption, is not influenced by the class ratios. This is why the Hellinger distance is adopted for class-imbalanced data in this study. In addition, its translation-invariance helps combat the output shift that arises when a standard classifier is applied to class-imbalanced data. Unlike previous work [5, 31], where the feature attributes had to be discrete (or continuous features had to be discretized) to calculate the Hellinger distance, the Hellinger distance in this study is computed directly from continuous variables without discretization.

### Hellinger distance-based stable sparse selection (sssHD) and its algorithm

Considering the above issues in class-imbalance learning and motivated by the properties of the Hellinger distance, we propose a Hellinger distance-based stable sparse selection (sssHD) approach to perform feature selection when the data are class-imbalanced. An ordinary classifier generally cannot perform feature selection automatically, but a sparse penalty coupled with the Hellinger distance metric can be embedded into the classifier to achieve this task. For convenience, a linear SVM classifier is employed as an example to establish the sssHD algorithm.

SVM [32–35] has shown promising capability in solving many classification problems. It performs binary classification by constructing a hyperplane in the multidimensional space that separates the two classes with a maximal margin. Let **x**_{i}=(*x*_{i1},*x*_{i2},⋯,*x*_{ip})^{T} be the *i*^{th} instance and its class label *y*_{i}=1 or −1 (*i*=1,2,⋯,*n*). The decision function of the SVM can be expressed as follows:

$$ f(\textbf{x}_{i})=\bigg\{\begin{array}{lll} 1,\hspace{5mm}\boldsymbol{\beta}^{T}\textbf{x}_{i}+\beta_{0} > 0,\\ -1,\hspace{2mm}\boldsymbol{\beta}^{T}\textbf{x}_{i}+\beta_{0} \leq 0\\ \end{array} $$

(6)

where *β*_{0} is the constant coefficient, and *β*^{T}**x**_{i}+*β*_{0} is called the decision score in this study. The soft margin support vector classifier can be estimated by solving the following quadratic optimization problem:

$$ \min \hspace{6mm} \frac{1}{2}\|\boldsymbol{\beta}\|^{2}+C\sum_{i=1}^{n}\xi_{i}, $$

(7)

$$ s.t.\hspace{6mm}\bigg\{\begin{array}{ll} y_{i}(\boldsymbol{\beta}^{T}\textbf{x}_{i}+\beta_{0})\geq 1-\xi_{i},\\ \xi_{i}\geq0,\hspace{4mm}(i=1,2,\cdots,n),\\ \end{array} $$

(8)

where *ξ*=(*ξ*_{1},*ξ*_{2},⋯,*ξ*_{n})^{T} are slack variables which are associated with the misclassified individuals. Formula (7) with constraint (8) can be rewritten as

$$ \min \hspace{3mm} Loss + \lambda\|\boldsymbol{\beta}\|^{2}_{2}, $$

(9)

where \(Loss = \sum _{i=1}^{n}max(0, 1- y_{i}(\boldsymbol {\beta }^{T}\textbf {x}_{i}+\beta _{0}))\) is the hinge loss, and \(\|\boldsymbol {\beta }\|_{2}^{2}=\sum _{j=1}^{p}\beta _{j}^{2}\) is the ridge penalty. The ridge penalty shrinks the coefficient estimates towards zero and hence can improve the model’s prediction accuracy; however, it cannot perform feature selection automatically. Therefore, the ridge penalty is replaced by a sparse regularization penalty that induces sparsity and thereby achieves feature selection. Sparse selection is a popular technique for variable selection in high-dimensional data [24, 36–39]. Taking the elastic-net [38] as an example, it is defined as follows:

$$ C_{\alpha}(\boldsymbol{\beta})=\frac{1}{2}(1-\alpha)\|\boldsymbol{\beta}\|_{2}^{2}+\alpha\|\boldsymbol{\beta}\|_{1}, $$

(10)

where \(\|\boldsymbol {\beta }\|_{1}=\sum _{j=1}^{p}|\beta _{j}|\) is the LASSO penalty [24], and *α*∈[0,1]. The elastic-net penalty is a combination of the ridge and LASSO penalties, which is particularly useful and effective for feature selection when the data are strongly correlated and high-dimensional. The sparse support vector machine with the elastic-net penalty can be expressed as
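Eq. (10) is straightforward to compute; a minimal sketch (the function name is ours):

```python
def elastic_net_penalty(beta, alpha):
    """C_alpha(beta) = (1/2)(1 - alpha)*||beta||_2^2 + alpha*||beta||_1, Eq. (10)."""
    l2_sq = sum(b * b for b in beta)   # squared L2 norm (ridge part)
    l1 = sum(abs(b) for b in beta)     # L1 norm (LASSO part)
    return 0.5 * (1.0 - alpha) * l2_sq + alpha * l1
```

Setting `alpha=1` recovers the pure LASSO penalty and `alpha=0` the pure (halved) ridge penalty; intermediate values blend the two.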

$$ \min \hspace{3mm} Loss + \lambda C_{\alpha}(\boldsymbol{\beta}), $$

(11)

where *λ* is the tuning parameter that controls the tradeoff between loss and penalty.

The optimal estimate \((\hat {\boldsymbol {\beta }}, \hat {\beta _{0}})\) in objective (11) is a function of *λ* and *α*. Consequently, the decision score \(\hat {\textbf {t}} =(\hat {t}_{1},\hat {t}_{2},\cdots,\hat {t}_{n})^{T}= \textbf {X}\hat {\boldsymbol {\beta }} + \hat {\beta _{0}}\) also depends on *λ* and *α*. Denote \(\hat {\textbf {t}}\) by \(\hat {\textbf {t}}(\lambda, \alpha) = (\hat {\textbf {t}}_{0}(\lambda,\alpha), \hat {\textbf {t}}_{1}(\lambda,\alpha))^{T}\), where \(\hat {\textbf {t}}_{0}(\lambda,\alpha) =\{\hat {t}_{i}|y_{i} = -1\}\) and \(\hat {\textbf {t}}_{1}(\lambda,\alpha) =\{\hat {t}_{i}|y_{i} = 1\}\); then the objective of sparse selection with the Hellinger distance can be defined as

$$ \hat{\boldsymbol{\beta}} = \underset{\lambda,\,\alpha}{\arg\max} \hspace{3mm} D_{H}\left(\hat{\textbf{t}}_{0}(\lambda,\alpha), \hat{\textbf{t}}_{1}(\lambda,\alpha)\right) $$

(12)

A potential issue with sparse feature selection is its instability under variation of the training data [40, 41], and class imbalance exacerbates this drawback. A sound strategy to overcome it is to combine sparse selection with subsampling. Meinshausen et al. [40] pointed out that this combination yields finite-sample family-wise error control and markedly improves selection methods. In this study, objective (12) is solved many times over subsamples to achieve stable selection. Denote the \(\hat {\boldsymbol {\beta }}\) from objective (12) in the *k*^{th} subsample by \(\hat {\boldsymbol {\beta }}^{(k)}\) (*k*=1,2,⋯,*K*). The importance of the features is measured by the inclusion frequency **f**=(*f*_{1},*f*_{2},⋯,*f*_{p})^{T}, defined as follows:

$$ f_{j} = \frac{1}{K}\sum_{k=1}^{K}g\left(\hat{\beta}^{(k)}_{j}\right), \hspace{5mm}j = 1, 2, \cdots, p, $$

(13)

where \(g\left (\hat {\beta }^{(k)}_{j}\right) = 1\) if \(\hat {\beta }^{(k)}_{j} \neq 0\), and \(g\left (\hat {\beta }^{(k)}_{j}\right) = 0\) otherwise. All features are ranked by their inclusion frequencies, and the feature with the maximal inclusion frequency is the most important. More details of the sssHD algorithm are given in Algorithm 1. The subsampling ratios from the majority (*r*_{0}) and the minority (*r*_{1}) are set equal in this study, so that each subset keeps the same class-imbalance ratio as the original data. sssHD is very general and can easily be extended; for example, the sparse SVM can be replaced by sparse logistic regression [42] or Fisher linear discriminant [43]; re-balancing methods such as over-sampling [44] or under-sampling [45] could be incorporated if necessary; and the sparse regularization (Eq. (10)) also has many alternatives, such as SCAD [36], the adaptive LASSO [39], the group LASSO [46], and the group bridge penalty [47].
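The subsampling loop and Eq. (13) can be sketched as follows. Here `fit_sparse` is a hypothetical stand-in for solving objective (12) on a subsample, and the stratified draw keeps *r*_{0} = *r*_{1} as described above (a sketch, not the full Algorithm 1):

```python
import random

def inclusion_frequencies(fit_sparse, X, y, K=100, r=0.5, seed=0):
    """Inclusion frequency f_j of Eq. (13) over K stratified subsamples.
    `fit_sparse(Xs, ys)` is an assumed routine returning a coefficient vector;
    drawing the same fraction r from each class preserves the imbalance ratio."""
    rng = random.Random(seed)
    p = len(X[0])
    counts = [0] * p
    pos_idx = [i for i, yi in enumerate(y) if yi == 1]
    neg_idx = [i for i, yi in enumerate(y) if yi == -1]
    for _ in range(K):
        idx = (rng.sample(pos_idx, max(1, int(r * len(pos_idx)))) +
               rng.sample(neg_idx, max(1, int(r * len(neg_idx)))))
        beta = fit_sparse([X[i] for i in idx], [y[i] for i in idx])
        for j, b in enumerate(beta):
            if b != 0:          # g(beta_j) = 1 iff the feature was selected
                counts[j] += 1
    return [c / K for c in counts]
```

Ranking the returned frequencies in decreasing order gives the feature importance ordering used by sssHD.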

### Assessment metrics and experimental methods

In class-imbalance learning, the majority and minority classes are generally called negative and positive, respectively. A binary classifier predicts each instance as either positive or negative, producing four types of outcome: true positive (*TP*), true negative (*TN*), false positive (*FP*) and false negative (*FN*). Several metrics can be defined from these outcomes, such as

$$\begin{array}{*{20}l} &TPR = recall = \frac{TP}{TP + FN} = \frac{TP}{n_1};\\ &TNR = \frac{TN}{TN + FP} = \frac{TN}{n_{0}};\\ &FPR = \frac{FP}{FP + TN} = \frac{FP}{n_{0}};\\ &precision = \frac{TP}{TP + FP};\\ &G-mean = \sqrt{TPR \times TNR};\\ &F-measure = \frac{2(precision \times recall)}{precision + recall} \end{array} $$

As shown above, *precision* is the proportion of true positives among the positive predictions; *recall* (*TPR*) measures the proportion of positives that are correctly identified; *G-mean* is the geometric mean of *TPR* and *TNR*, which measures the accuracy on both the majority and the minority class; *F-measure* is a combination of *precision* and *recall*, and is high only when both of them are high.
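The metrics above follow directly from the four outcomes; a small reference helper (function and key names are ours):

```python
import math

def imbalance_metrics(tp, tn, fp, fn):
    """Confusion-matrix metrics listed above."""
    tpr = tp / (tp + fn)            # recall: TP / n1
    tnr = tn / (tn + fp)            # TN / n0
    fpr = fp / (fp + tn)            # FP / n0
    precision = tp / (tp + fp)
    g_mean = math.sqrt(tpr * tnr)
    f_measure = 2.0 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "TNR": tnr, "FPR": fpr,
            "precision": precision, "G-mean": g_mean, "F-measure": f_measure}
```

For example, with 10 positives and 100 negatives, 8 true positives and 10 false positives give a high recall (0.8) but a modest precision (8/18), which the F-measure penalizes.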

The ROC curve is created by plotting *TPR* on the y-axis against *FPR* on the x-axis at various threshold settings. Let *T*, *T*_{1} and *T*_{0} denote the continuous outputs of the binary classifier (such as an SVM) for all, positive and negative examples, respectively; *T* is the mixture of *T*_{1} and *T*_{0}. Larger output values are associated with positive examples, so for a given threshold *c* (−*∞*<*c*<+*∞*), an example is predicted positive if its output exceeds *c*. Thus,

$$TPR(c) = P(T_{1}>c)=P(T>c|y=1),$$

$$FPR(c) = P(T_{0}>c)=P(T>c|y=-1)$$

A ROC curve may be defined as the set of points:

$$ROC(\cdot) = \left\{\left(FPR(c), TPR(c)\right) \Big|-\infty< c< +\infty\right\}$$

The area under the ROC curve (*AUCROC*) is calculated here using the trapezoidal rule [48]: when there is more than one *TPR* value at the same *FPR* value, the point with the minimal *TPR* at that *FPR* is linked to the point with the maximal *TPR* at the next *FPR* value. Let *F*_{1},*F*_{2},⋯,*F*_{K} be all the distinct *FPR* values satisfying *F*_{1}<*F*_{2}<⋯<*F*_{K}, and let \(T_{k}^{max}\) and \(T_{k}^{min}\) be the maximal and minimal *TPR* values corresponding to *F*_{k} (*k*=1,2,⋯,*K*), respectively. The empirical *AUCROC* with the lower trapezoidal rule is

$$ AUCROC = \sum_{k=1}^{K-1}\frac{1}{2}\left(T_{k}^{min} + T_{k+1}^{max}\right)(F_{k+1}-F_{k}) $$

(14)
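Eq. (14) can be implemented by grouping the TPR values that share an FPR value; a sketch (function name ours):

```python
from collections import defaultdict

def auc_roc_lower_trapezoid(points):
    """Empirical AUCROC via the lower trapezoidal rule of Eq. (14).
    `points` is a list of (FPR, TPR) pairs; for each FPR value, the minimal
    TPR at F_k is joined to the maximal TPR at F_{k+1}."""
    by_fpr = defaultdict(list)
    for fpr, tpr in points:
        by_fpr[fpr].append(tpr)
    fprs = sorted(by_fpr)            # F_1 < F_2 < ... < F_K
    area = 0.0
    for f_k, f_k1 in zip(fprs, fprs[1:]):
        area += 0.5 * (min(by_fpr[f_k]) + max(by_fpr[f_k1])) * (f_k1 - f_k)
    return area
```

On the diagonal curve through (0, 0), (0.5, 0.5), (1, 1) this returns 0.5, as expected for a random classifier.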

In this study, *TPR*, *G-mean*, *F-measure*, *AUCROC* and *precision* are employed as assessment metrics in the real-data experiments. Cross-validation is performed for each real data set when computing these measures. To keep the imbalance ratio invariant in each fold, stratified sampling is utilized: each fold contains the same number of negative (and positive) instances, so its class-imbalance ratio equals that of the original data set.

To evaluate the sssHD algorithm on the above real data sets, we compare it with four other filter-based feature selection methods: the Fisher score [49], Relief [50], the area under the receiver operating characteristic curve (AUCROCfilter) [51] and the area under the precision-recall curve (AUCPRCfilter) [52].

**Fisher score**: the Fisher score quantifies how well each feature separates the two classes by comparing the difference of its class means with the sum of its class variances:

$$ F_{j} = \frac{\left |\mu_{1j} - \mu_{0j}\right |}{\sigma_{1j}^{2} + \sigma_{0j}^{2}}, \hspace{5mm}j = 1, 2, \cdots, p, $$

(15)

where *μ*_{0j}, *μ*_{1j}, \(\sigma _{0j}^{2}\), and \(\sigma _{1j}^{2}\) are the means and variances of the *j*^{th} predictor in the majority and minority classes, respectively. Attributes with a higher score are more important for separating the two classes. The Fisher score has been utilized successfully in many classification problems [53, 54].
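A direct implementation of Eq. (15) (function name ours):

```python
import statistics

def fisher_scores(X, y):
    """Fisher score of Eq. (15) per feature, with labels y in {1, -1}."""
    pos = [x for x, yi in zip(X, y) if yi == 1]
    neg = [x for x, yi in zip(X, y) if yi == -1]
    scores = []
    for j in range(len(X[0])):
        a = [x[j] for x in pos]
        b = [x[j] for x in neg]
        num = abs(statistics.fmean(a) - statistics.fmean(b))
        den = statistics.variance(a) + statistics.variance(b)
        scores.append(num / den)
    return scores
```

A feature whose class means differ sharply relative to its within-class spread receives a high score, while an uninformative feature scores near zero.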

**Relief**: Relief is a randomized algorithm that assigns each predictor a weight indicating its level of relevance to the target. In each iteration, Relief first finds the two nearest neighbors of a selected example (**x**_{i}): one from the same class (**nearhit**_{i}) and one from the other class (**nearmiss**_{i}); then, if the Euclidean distance is employed, the weight vector **w**=(*w*_{1},*w*_{2},⋯,*w*_{p})^{T} is updated so that

$$ {\begin{aligned} w_{j} \longleftarrow w_{j} - (x_{ij} - nearhit_{ij})^{2} &+ (x_{ij}- nearmiss_{ij})^{2},\\&\quad \hspace{4mm}j = 1, 2, \cdots, p, \end{aligned}} $$

(16)

where *x*_{ij}, *nearhit*_{ij} and *nearmiss*_{ij} correspond to the *j*^{th} component of **x**_{i}, **nearhit**_{i} and **nearmiss**_{i}, respectively. Attributes with a larger weight are more relevant to the response. The Relief method can be applied to a variety of complicated situations and has several generalizations, such as ReliefF [55].
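The Relief update of Eq. (16) can be sketched as follows; this plain-Python version (one nearest hit and one nearest miss per iteration, names ours) is illustrative rather than optimized:

```python
import random

def relief_weights(X, y, n_iter=20, seed=0):
    """Relief update of Eq. (16) with squared Euclidean distance:
    penalize features that differ from the nearest hit, reward features
    that differ from the nearest miss."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    w = [0.0] * p

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(n)  # randomly selected example x_i
        hit = min((k for k in range(n) if k != i and y[k] == y[i]),
                  key=lambda k: dist2(X[i], X[k]))
        miss = min((k for k in range(n) if y[k] != y[i]),
                   key=lambda k: dist2(X[i], X[k]))
        for j in range(p):
            w[j] += -(X[i][j] - X[hit][j]) ** 2 + (X[i][j] - X[miss][j]) ** 2
    return w
```

A feature that separates the classes ends up with a large positive weight, since its values agree within a class (small hit term) but disagree across classes (large miss term).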

**AUCROCfilter**: as stated above, the ROC curve can be used as a metric to evaluate the final model. In addition, the ROC curve and its area can serve as a filter feature selection method by considering a single predictor at a time. To obtain the predicted class labels, the ROC must be combined with a classifier. An attribute with a larger area under the curve is more important for classifying the target. An ROC-based filter feature selection strategy has been used for high-dimensional class-imbalanced data [51].

**AUCPRCfilter**: as an alternative to the ROC, the precision-recall curve (PRC) has gained increasing attention in class-imbalance learning [56]. The PRC is created by plotting recall on the x-axis against precision on the y-axis at various threshold settings. The area under the PRC (AUCPRC) can be seen as the average precision weighted by the probability of a given threshold, and it summarizes how a classifier performs over the whole threshold space. Similarly, AUCPRC coupled with a classifier can be used as a filter-based feature selection method for each attribute. Attributes with a larger AUCPRC are more significant for separating the classes.