### Support Vector Machines

Suppose a training data set with input data vectors **x**_{i} ∈ ℝ^{p} and corresponding class labels *y*_{i} ∈ {-1, 1}, *i* = 1,..., *n*, is given. The SVM finds a maximal margin hyperplane, i.e. a hyperplane that maximises the distance between the two classes. A linear hyperplane can always perfectly separate *n* samples in *n* + 1 dimensions. Since high-dimensional data with *p* ≫ *n* can generally be assumed to be linearly separable [6], increasing model complexity by using non-linear kernels is usually not needed. Thus, we use a linear SVM model throughout the paper.

The linear SVM separates the classes by a linear boundary

*f* (**x**) = **w** · **x** + *b* = 0,

where **w** = (*w*_{1}, *w*_{2},..., *w*_{p}) is a unique vector of hyperplane coefficients with ||**w**||_{2} = 1 and *b* denotes the intercept of the hyperplane. We use '·' to denote the inner product. The class assignment for a test data vector **x**_{test} ∈ ℝ^{p} is given by *y*_{test} = sign[*f* (**x**_{test})].
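As an illustration, the linear decision rule can be sketched in a few lines of Python; the hyperplane parameters **w** and *b* below are made-up values, not fitted ones:

```python
import numpy as np

# Hypothetical hyperplane parameters: w with ||w||_2 = 1, intercept b.
w = np.array([0.6, 0.8])
b = -0.5

def f(x):
    """Linear decision function f(x) = w . x + b."""
    return np.dot(w, x) + b

def classify(x):
    """Class assignment y = sign(f(x)) in {-1, +1}."""
    return 1 if f(x) >= 0 else -1

x_test = np.array([1.0, 1.0])
print(classify(x_test))  # f = 0.6 + 0.8 - 0.5 = 0.9 > 0, so class +1
```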

#### Soft margin SVM

Soft margin SVM allows some data points to lie on the wrong side of the margin. To account for such erroneous decisions, slack variables *ξ*_{i} ≥ 0, *i* = 1,..., *n*, are defined as the distance between a misclassified data point and the corresponding margin. For data points on the correct side of the margin *ξ*_{i} = 0, for data points inside the margin 0 < *ξ*_{i} ≤ 1, and for misclassified data points *ξ*_{i} > 1. The sum of the non-zero *ξ*_{i} is weighted by a cost parameter *C* and added as a penalty to the minimisation problem:

min_{**w**, *b*, *ξ*} ½ ||**w**||_{2}^{2} + *C* ∑_{i=1}^{n} *ξ*_{i} subject to *y*_{i}(**w** · **x**_{i} + *b*) ≥ 1 - *ξ*_{i}, *ξ*_{i} ≥ 0, *i* = 1,..., *n*. (2)

The optimisation problem (2) is called the *soft margin SVM*. The cost parameter *C* is a data-dependent tuning parameter that controls the balance between minimising the coefficients of the hyperplane and correctly classifying the training data set.
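To illustrate the role of the cost parameter *C*, the following Python sketch (using scikit-learn on synthetic data, rather than the R packages the paper's analyses rely on) scores a soft margin linear SVM for several values of *C* by cross validation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic two-class data: n = 100 samples, p = 10 features.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Larger C penalizes slack (margin violations) more strongly;
# smaller C favours a wider margin at the cost of training errors.
scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LinearSVC(C=C, max_iter=10000)
    scores[C] = cross_val_score(clf, X, y, cv=5).mean()
    print(C, round(scores[C], 3))
```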

*C* is often chosen by cross validation. Problem (2) can be solved by convex optimisation techniques, namely by the method of Lagrange multipliers [4]. Convex optimisation provides a unique solution for the hyperplane parameters **w** and *b*:

**w** = ∑_{i=1}^{n} *α*_{i} *y*_{i} **x**_{i}, (3)

where *α*_{i} ≥ 0, *i* = 1,..., *n*, are the Lagrange multipliers. The data points with positive *α*_{i} are called *support vectors* (SVs). All data points lying on the correct side of their margin have *α*_{i} = 0. Thus, they do not have any impact on the hyperplane, and we can rewrite Eq. (3) as

**w** = ∑_{i ∈ S} *α*_{i} *y*_{i} **x**_{i},

where the index set of the support vectors is determined by *S* := {*i* : *α*_{i} > 0}.

The intercept *b* can be calculated from *y*_{i}(**w** · **x**_{i} + *b*) = 1 for any *i* with *α*_{i} > 0. In practice, an average over all such solutions for *b* is used for numerical stability.
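The representation of **w** as a sum over support vectors can be checked numerically. This Python sketch (scikit-learn on synthetic data, illustrative only) recovers the hyperplane coefficients from the fitted products *α*_{i}*y*_{i}:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters; relabel classes as {-1, +1}.
X, y = make_blobs(n_samples=60, centers=2, random_state=1)
y = np.where(y == 0, -1, 1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only,
# so w = sum_{i in S} alpha_i y_i x_i:
w = clf.dual_coef_ @ clf.support_vectors_
assert np.allclose(w, clf.coef_)
print(len(clf.support_), "support vectors")
```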

### SVM as a penalization method

Hastie et al. [4] showed that the SVM optimisation problem is equivalent to a penalization problem in "*loss and penalty*" form:

min_{*b*, **w**} ∑_{i=1}^{n} [1 - *y*_{i} *f* (**x**_{i})]_{+} + pen_{λ}(**w**),

where the loss term is a sum of hinge loss functions *l*(*y*_{i}, *f* (**x**_{i})) = [1 - *y*_{i} *f* (**x**_{i})]_{+} = max(1 - *y*_{i} *f* (**x**_{i}), 0) for the sample vectors **x**_{i}, *i* = 1,..., *n*. The penalty term is denoted by pen_{λ}(**w**) and can take different forms:

### Ridge penalty

The penalty term for the ordinary SVM uses the squared *L*_{2} norm:

pen_{λ}(**w**) = λ ||**w**||_{2}^{2} = λ ∑_{j=1}^{p} *w*_{j}^{2}.

The *L*_{2} penalty shrinks the coefficients to control their variance. However, the ridge penalty does not shrink coefficients exactly to zero, and hence no feature selection is performed.

### LASSO

The use of an *L*_{1} penalization function was originally proposed by Tibshirani [8] for generalized linear models. This technique for parameter estimation with constraints is called LASSO ('least absolute shrinkage and selection operator'). Later, Bradley [18] adapted *L*_{1} regularisation to the SVM. The penalty term then has the form

pen_{λ}(**w**) = λ ||**w**||_{1} = λ ∑_{j=1}^{p} |*w*_{j}|.

As a result of the singularity of the *L*_{1} penalty function at zero, the *L*_{1} SVM automatically selects features by shrinking coefficients of the hyperplane to zero.

However, the *L*_{1} penalty has two limitations. First, the number of selected features is bounded by the number of samples. Second, it tends to select only one feature from a group of correlated features and drops the others.

Fung and Mangasarian [19] published a fast *L*_{1} SVM modification, the Newton Linear Programming Support Vector Machine (NLPSVM), which we use in our analyses.
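The sparsity induced by the *L*_{1} penalty can be seen in a generic illustration. The Python sketch below uses scikit-learn's *L*_{1}-penalized linear SVM (which minimises the squared hinge loss, not the NLPSVM formulation used in the paper) on synthetic data with only a few informative features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# 5 informative features out of 50: the L1 penalty should zero out
# the coefficients of many of the irrelevant ones.
X, y = make_classification(n_samples=80, n_features=50, n_informative=5,
                           random_state=0)

l1_svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                   C=0.5, max_iter=10000).fit(X, y)
n_selected = int(np.sum(l1_svm.coef_ != 0))
print(n_selected, "of", X.shape[1], "features selected")
```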

### Smoothly clipped absolute deviation penalty (SCAD)

The SCAD penalty is a non-convex penalty function first proposed by Fan and Li [20]. Later, Zhang et al. [10] combined the SVM technique with the SCAD penalty for feature selection. The SCAD penalty function for a single coefficient *w*_{j} is defined as

pen_{λ}(*w*_{j}) = λ |*w*_{j}| for |*w*_{j}| ≤ λ,
pen_{λ}(*w*_{j}) = -(|*w*_{j}|^{2} - 2*a*λ |*w*_{j}| + λ^{2}) / (2(*a* - 1)) for λ < |*w*_{j}| ≤ *a*λ,
pen_{λ}(*w*_{j}) = (*a* + 1)λ^{2} / 2 for |*w*_{j}| > *a*λ,

where *w*_{j}, *j* = 1,..., *p*, are the coefficients defining the hyperplane, and *a* > 2 and λ > 0 are tuning parameters. Fan and Li [21] showed that SCAD prediction is not sensitive to the choice of the tuning parameter *a*. Their suggested value *a* = 3.7 is therefore used in our analyses.

The penalty term for the SCAD SVM has the form

pen_{λ}(**w**) = ∑_{j=1}^{p} pen_{λ}(*w*_{j}).

The SCAD penalty corresponds to a quadratic spline function with knots at λ and *a*λ. For small coefficients *w*_{j}, *j* = 1,..., *p*, SCAD yields the same behaviour as *L*_{1}. For large coefficients, however, SCAD applies a constant penalty, in contrast to *L*_{1}. This reduces the estimation bias. Furthermore, the SCAD penalty has better theoretical properties than the *L*_{1} penalty [21].
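The piecewise behaviour of the SCAD penalty — *L*_{1}-like near zero, constant for large coefficients — can be made concrete in a small numpy sketch (the value *a* = 3.7 follows the text; λ = 1 is arbitrary):

```python
import numpy as np

def scad(w, lam, a=3.7):
    """SCAD penalty for a single coefficient w (a > 2, lam > 0)."""
    absw = np.abs(w)
    return np.where(
        absw <= lam,
        lam * absw,                                        # L1-like near zero
        np.where(
            absw <= a * lam,
            -(absw**2 - 2 * a * lam * absw + lam**2) / (2 * (a - 1)),
            (a + 1) * lam**2 / 2,                          # constant for large w
        ),
    )

lam = 1.0
print(scad(0.5, lam))   # 0.5  (= lam * |w|)
print(scad(10.0, lam))  # 2.35 (= (a + 1) * lam^2 / 2)
```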

### Elastic Net

To overcome the limitations of the LASSO, Zou and Hastie [9] proposed a linear combination of the *L*_{1} and *L*_{2} penalties which they called the *Elastic Net*:

pen_{λ1, λ2}(**w**) = λ_{1} ||**w**||_{1} + λ_{2} ||**w**||_{2}^{2}.

The Elastic Net penalty provides automatic feature selection similar to *L*_{1}, but the number of selected features is no longer bounded by the sample size. At the same time, this penalty tends to select highly correlated features together (*grouping effect*). Increasing *λ*_{1} reduces the number of features in the classifier, whereas a large *λ*_{2} gives better control of the grouping effect. Wang [22] adapted the Elastic Net penalty to SVM classification problems. The Elastic Net SVM optimisation problem can therefore be written as

min_{*b*, **w**} ∑_{i=1}^{n} [1 - *y*_{i} *f* (**x**_{i})]_{+} + λ_{1} ||**w**||_{1} + λ_{2} ||**w**||_{2}^{2},

where *λ*_{1}, *λ*_{2} ≥ 0 are the corresponding tuning parameters.
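A rough, stochastic analogue of the Elastic Net SVM is available in scikit-learn via `SGDClassifier` with hinge loss and an elasticnet penalty; note its `alpha`/`l1_ratio` parametrisation differs from the *λ*_{1}, *λ*_{2} used here, so this is only an illustration on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100, n_features=30, n_informative=5,
                           random_state=0)

# hinge loss + elasticnet penalty; l1_ratio mixes the L1 and L2 parts.
clf = SGDClassifier(loss="hinge", penalty="elasticnet", alpha=0.01,
                    l1_ratio=0.5, max_iter=2000, random_state=0).fit(X, y)
n_zero = int((clf.coef_ == 0).sum())
print(n_zero, "coefficients set exactly to zero")
```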

### Elastic SCAD

Fan and Li [21] demonstrated the advantages of the SCAD penalty over the *L*_{1} penalty. However, the SCAD penalty might be too strict in selecting features from non-sparse data. A modification of the SCAD penalty analogous to the Elastic Net could keep the advantages of the SCAD penalty and, at the same time, avoid overly restrictive sparsity for non-sparse data.

We therefore propose a combination of the SCAD and *L*_{2} penalties. The new penalty term has the form

pen_{λ1, λ2}(**w**) = ∑_{j=1}^{p} pen_{λ1}(*w*_{j}) + λ_{2} ||**w**||_{2}^{2},

where *λ*_{1}, *λ*_{2} ≥ 0 are the tuning parameters. We expect the Elastic SCAD to improve on the SCAD method for less sparse data. Given the nature of the SCAD and *L*_{2} penalties, the Elastic SCAD should show good prediction accuracy for both sparse and non-sparse data.

It can be shown that the combined penalty provides sparsity, continuity, and asymptotic normality when the tuning parameter of the ridge penalty converges to zero, i.e. *λ*_{2} → 0. The asymptotic normality and sparsity of the Elastic SCAD lead to the oracle property in the sense of Fan and Li [21].

The Elastic SCAD SVM optimisation problem has the form

min_{*b*, **w**} ∑_{i=1}^{n} [1 - *y*_{i} *f* (**x**_{i})]_{+} + ∑_{j=1}^{p} pen_{λ1}(*w*_{j}) + λ_{2} ||**w**||_{2}^{2}, (9)

where *λ*_{1}, *λ*_{2} ≥ 0 are the tuning parameters.

### Elastic SCAD SVM: Algorithm

In solving Eq. (9), the same problems as for the SCAD SVM occur: the hinge loss function is not differentiable at zero and the SCAD penalty is not convex in **w**. The Elastic SCAD SVM objective function can, however, be locally approximated by a quadratic function, and the minimisation problem can then be solved iteratively, similar to the SCAD approach [10, 21].

For simplicity, we denote the SCAD penalty for a single coefficient by *p*_{λ}(·) and its first-order derivative by *p*′_{λ}(·). Denote the penalized objective function in Eq. (9) by *A*(*b*, **w**). For each *i* (using the fact that *y*_{i}^{2} = 1), the loss term can be split according to the cases *y*_{i} *f* (**x**_{i}) < 1 and *y*_{i} *f* (**x**_{i}) ≥ 1. Given an initial value (*b*_{0}, **w**_{0}) close to the minimum of *A*(*b*, **w**), we consider the following local quadratic approximations:

When *w*_{j0} is close to zero, set *w*_{j} = 0; otherwise use the local quadratic approximation of the SCAD penalty

*p*_{λ}(|*w*_{j}|) ≈ *p*_{λ}(|*w*_{j0}|) + ½ (*p*′_{λ}(|*w*_{j0}|) / |*w*_{j0}|)(*w*_{j}^{2} - *w*_{j0}^{2}),

where, due to the symmetry of the SCAD penalty, |*w*_{j}| is used instead of *w*_{j}.

It can be shown that both approximations and their original functions have the same gradient at the point (*b*_{0}, **w**_{0}). Therefore, the solution of the local quadratic approximation corresponds approximately to the solution of the original problem.

The local quadratic approximation of *A*(*b*, **w**) has the form

When minimising *A*(*b*, **w**) with respect to **w** and *b*, terms that do not involve the optimisation parameters **w** and *b* can be dropped (their derivatives are zero):

To write the equations in matrix form we define:

Moreover, we define the matrix *X* = [**1**, **x**_{1},..., **x**_{p}], where **1** is the vector of ones of length *n* and **x**_{j} is the *j*th input vector. Set

Minimizing *A*(*b*, **w**) is then equivalent to minimizing the quadratic function

The solution to Eq. (10) satisfies the linear equation system

The Elastic SCAD SVM can be implemented by the following iterative algorithm.

**Step 1** Set *k* = 1 and specify the initial value (*b*^{(1)}, **w**^{(1)}) by standard *L*_{2} SVM according to Zhang et al. [10].

**Step 2** Store the solution of the *k*th iteration: (*b*_{0}, **w**_{0}) = (*b*^{(k)}, **w**^{(k)}).

**Step 3** Minimize Ã (*b*, **w**) by solving Eq. (11), and denote the solution as (*b*^{(k+1)}, **w**^{(k+1)}).

**Step 4** Let *k* = *k* + 1. Go to step 2 until convergence.

If an element *w*_{j}^{(k)} is close to zero, for instance smaller than 10^{-4} in absolute value, then the *j*th variable is considered redundant and is removed from the model in the next step. The algorithm stops after convergence of (*b*^{(k)}, **w**^{(k)}).

### Choosing tuning parameters

All SVM problems with or without feature selection use one or two tuning parameters which balance the trade-off between data fit and model complexity. Since these parameters are data dependent, finding optimal tuning parameters is part of the classification task.

### Fixed grid search

Tuning parameters are usually determined by a grid search. The grid search method calculates a target criterion, e.g. the misclassification rate, at each point of a fixed grid of parameter values. This method may offer some protection against local minima, but it is not very efficient. The density of the grid plays a critical role in finding a global optimum: for very sparse grids, the search is likely to return a merely local optimum, while increasing the density of the grid rapidly increases the computational cost with no guarantee of finding a global optimum. The major disadvantage of the fixed grid approach lies in the systematic evaluation of the misclassification rate at every point of the grid; there is no possibility to skip redundant points or to add new ones.

When more parameters are included in the model, the computational complexity grows accordingly. Thus, the fixed grid search is only suitable for tuning very few parameters.
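A fixed grid search over the single parameter *C* can be sketched in Python/scikit-learn; every grid point is evaluated regardless of how promising it looks, and the grid values here are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Fixed grid over C: every point is evaluated by 5-fold cross validation,
# whether promising or not.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```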

### Interval search

Froehlich and Zell [15] suggested an efficient algorithm for finding a *global* optimum in the tuning parameter space, called EPSGO (Efficient Parameter Selection via Global Optimisation).

The main idea of the EPSGO algorithm is to treat the search for optimal tuning parameters as a global optimisation problem. For that purpose, a Gaussian Process (GP) model is learned from the points in the parameter space that have already been visited. Training and testing of the GP is very efficient in comparison to the calculation of the original SVM models. New points in the parameter space are sampled using the expected improvement criterion as described in the EGO algorithm [23], which avoids getting stuck in local minima. The stopping criterion of the EPSGO algorithm is either convergence of the algorithm or no change of the optimum during the last ten iterations.

### Stratified cross validation

Using *k*-fold cross validation, the data set is randomly split into *k* disjoint parts of roughly equal size, usually *k* = 5 or *k* = 10. In addition, the data is often split such that each fold contains approximately the same distribution of class labels as the whole data set; this is denoted *stratified* cross validation. For each part, one fits the model on the other *k* - 1 parts and calculates the prediction error on the held-out part.

The case *k* = *n* is called *leave-one-out cross validation* (LOO CV). The choice of *k* determines a trade-off between bias and variance of the prediction error. Kohavi [24] found that ten-fold stratified cross validation performed better in terms of bias and variance than choices 10 < *k* < *n*. Hastie et al. [4] recommended five- or ten-fold cross validation as a good compromise between variance and bias. We used five- and ten-fold stratified cross validation for the simulation study and the real data applications, respectively.
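The stratified cross validation procedure can be sketched in Python/scikit-learn; the imbalanced synthetic data (roughly 70/30 class ratio) makes the point of stratification, and the fold count of 5 matches the simulation study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# Imbalanced synthetic data: roughly 70% / 30% class distribution.
X, y = make_classification(n_samples=100, n_features=10, weights=[0.7, 0.3],
                           random_state=0)

errors = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each fold keeps approximately the class ratio of the full data set.
    clf = LinearSVC(max_iter=10000).fit(X[train_idx], y[train_idx])
    errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))

print("CV error:", round(float(np.mean(errors)), 3))
```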

In the next two sections, the penalized SVM classification methods are compared. We used simulated and publicly available data to investigate the behaviour of the different feature selection SVMs. For all comparisons, the R packages "penalizedSVM" [25] and "e1071" [26] were used, which are freely available from CRAN (http://cran.r-project.org/, R version 2.10.1). The R package "e1071" is a wrapper for the well-known LIBSVM software [27]. We used five- and ten-fold stratified cross validation in combination with the interval search for tuning parameters as described above.