 Methodology article
 Open Access
Enhancing SVM for survival data using local invariances and weighting
BMC Bioinformatics volume 21, Article number: 193 (2020)
Abstract
Background
The need to analyze medium-throughput data in epidemiological studies with small sample sizes, particularly in biomedical research, may hinder the use of classical statistical methods. Support vector machine (SVM) models can be successfully applied in this setting because they are a powerful tool for analyzing data with a large number of predictors and limited sample size, especially when handling binary outcomes. However, biomedical research often involves the analysis of time-to-event outcomes and has to account for censoring. Methods to handle censored data in the SVM framework can be divided into two classes: those based on support vector regression (SVR) and those based on binary classification. Methods based on SVR seem to be suboptimal for handling sparse data and yield results comparable to the Cox proportional hazards model and kernel Cox regression. The limited work dedicated to assessing methods based on SVM for binary classification has relied on SVM learning using privileged information and SVM with uncertain classes.
Results
This paper proposes alternative methods and extensions within the binary classification framework: specifically, a conditional survival approach for weighting censored observations and a semi-supervised SVM with local invariances. Using simulation studies and real datasets, we evaluate these two methods and compare them with a weighted SVM model, SVM extensions found in the literature, kernel Cox regression and the Cox model.
Conclusions
Our proposed methods generally perform better under a wide variety of realistic scenarios about the structure of biomedical data. Specifically, the local invariances method using the conditional survival approach is the most robust method across scenarios and is a good alternative to other time-to-event methods. It is a method to be considered and recommended when analysing real data, since it outperforms other methods in proportional and non-proportional scenarios and with sparse data, which are common in biomedical data and biomarker analysis.
Background
Biomedical studies are often based on small sample sizes and a medium to large number of variables. Support vector machine (SVM) models are a powerful tool for analysing this type of data because of their performance with sparse data, i.e., data with as many or more predictors than observations. SVMs have been widely applied to the analysis of binary outcomes. As originally developed [1], these models discriminate two classes of observations by a linear decision surface (hyperplane), maximizing the distance between the hyperplane and the individual observations. If the classes are not separable by a linear surface, a nonlinear transformation can be obtained by mapping the data into a space of different dimension (feature space). This nonlinear transformation can be obtained without explicitly mapping into the feature space through the use of a kernel function.
A common outcome in biomedical research is time-to-event. The challenge of analyzing time-to-event data is associated with the occurrence of censoring, the term for the partially observed time-to-event of a participant whose follow-up ends before the event has occurred. There are different types of censoring, but the most common is right censoring, which occurs when an observation leaves the study before the end of follow-up or before presenting an event, or when the study ends before the event has occurred. The most common traditional approach to analyzing time-to-event data and handling censoring is Cox proportional hazards regression [2]. This is a semiparametric model based on a partial likelihood function (similar to ordinary likelihood functions) that is defined in terms of the hazard function and assumes: i) a baseline hazard common to all observations; ii) linearity and additivity of the predictors with respect to the log-hazard or log-cumulative hazard; and iii) proportionality of the hazards across predictor classes, i.e., constant hazard ratios over time. Another important requirement for obtaining unbiased estimates with proportional hazards models is that the minimum number of events is at least 5 [3,4,5].
When the data are sparse, proportional hazards regression may fail to converge and may yield unreliable and biased point estimates and statistical tests. Under sparsity, SVM or a kernelized (i.e., penalized) version of the Cox model [6] may be more appropriate. Generally, extensions of SVM to handle time-to-event and censored data can be based on a regression approach (SVR) or a classification approach (SVM). Most work has focused on SVR [7,8,9] and on a ranking (ordinal) methodology [10,11,12], and suggested that both approaches were comparable to the proportional hazards model in non-sparse scenarios and to kernel Cox regression and, thus, may not provide any gains in prediction accuracy. Only two methods have extended the SVM to survival data and handled censoring based on a binary classification approach: SVM learning using privileged information (LUPI) [13] and SVM with uncertain classes [14], both applied to survival data in the work of Shiao and Cherkassky [15]. In both methods, the censored data are essentially weighted using the follow-up time, without considering the overall probability of the event at the end of the follow-up period.
In this paper, we propose an alternative extension that allows SVM to model time-to-event data based on a binary classification SVM. To do so, we assign a probability to the censored data using a conditional survival approach that considers the survival probability at each censoring time. Moreover, we propose using a semi-supervised version of SVM with local invariances to model time-to-event data, and we compare the performance of the proposed approaches with Cox proportional hazards regression, kernel Cox regression, and other SVM methods for survival analysis, such as LUPI and weighted SVM.
Related work
Traditional models for survival data – Kaplan-Meier estimator and Cox proportional hazards model
In survival analysis, the non-negative time-to-event (death or any other event) of a subject can be defined by the continuous random variable T^{∗}. An important function related to time-to-event data is the survival function S(t^{∗}) = P(T^{∗} ≥ t^{∗}), that is, the probability that an individual survives beyond time t^{∗}. Due to censoring, T^{∗} is not always observable; instead we observe the pair (T, δ), where T is the time to censoring or to the event of interest and δ is the censoring indicator (0 for censored data and 1 for an event).
The empirical survival function is an estimate of the survival function and is commonly obtained with the nonparametric Kaplan-Meier estimator [16], given by the product:
\( \hat{S}(t)=\prod_{i:\,{T}_{(i)}\le t}\left(1-\frac{\delta_{(i)}}{n-i+1}\right) \) (1)
where n is the total number of individuals, T_{(i)} are the order statistics of the observed times and δ_{(i)} is the censoring indicator of the ith ordered observation. The estimator in (1) is a decreasing step function that changes only at event times.
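The product-limit calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `kaplan_meier` and `survival_at` are hypothetical (they are not part of the R functions mentioned in this paper), tied times are ordered with events before censored observations, and the step function is evaluated as right-continuous.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event time, S_hat) pairs.

    times  : observed times T_i (event or censoring time)
    events : censoring indicators delta_i (1 = event, 0 = censored)
    """
    n = len(times)
    # Order statistics; at tied times events come before censored observations.
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    surv = 1.0
    curve = []
    for rank, i in enumerate(order, start=1):
        if events[i] == 1:
            # Factor 1 - delta_(i) / (n - i + 1) of the product.
            surv *= 1.0 - 1.0 / (n - rank + 1)
            if curve and curve[-1][0] == times[i]:
                curve[-1] = (times[i], surv)  # merge tied event times
            else:
                curve.append((times[i], surv))
    return curve


def survival_at(curve, t):
    """Evaluate the step function: 1.0 before the first event time."""
    value = 1.0
    for time, s in curve:
        if time > t:
            break
        value = s
    return value
```

Censored observations do not create a step; they only reduce the number at risk for later event times, which is why the curve changes at event times only.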
A second important function in analyses of time-to-event data is the hazard function. The Cox proportional hazards model [2], the most popular model used in the analysis of survival data, is defined in terms of the hazard function:
\( \lambda \left(t\mid {\boldsymbol{x}}_i\right)={\lambda}_0(t)\exp \left(\left\langle {\boldsymbol{x}}_i,\boldsymbol{\beta} \right\rangle \right) \) (2)
where λ(t | x_{i}) is the hazard at time t of an observation i with covariates vector x_{i}, λ_{0}(t) is the baseline hazard function, β is the vector of coefficients of the model and 〈x_{i}, β〉 is the dot product between x_{i} and β, i.e., the linear predictor function. The model assumes a baseline hazard that is common to all observations in the study population. In this model, the hazard of a subject increases multiplicatively with covariates.
In the Cox proportional hazards model, the baseline hazard is modelled semiparametrically, i.e., the baseline hazard does not need to be specified and the optimization function is based on a partial likelihood. The Cox model is more robust to outliers than other models because it uses only the rank ordering of the failure and censoring times. The partial likelihood accounting for censored observations can be expressed as:
\( L\left(\boldsymbol{\beta} \right)=\prod_{i=1}^n{\left[\frac{\exp \left(\left\langle {\boldsymbol{x}}_i,\boldsymbol{\beta} \right\rangle \right)}{\sum_{j\in {R}_i}\exp \left(\left\langle {\boldsymbol{x}}_j,\boldsymbol{\beta} \right\rangle \right)}\right]}^{\delta_i} \) (3)
where R_{i} is the set of individuals at risk of having an event at time t_{i}, δ_{i} is the censoring indicator of the observation with time t_{i} and x_{i} is the vector of covariates of observation i. Applying the logarithm to the partial likelihood gives the log partial likelihood, which is maximized with the Newton-Raphson algorithm. The maximum partial likelihood estimator is asymptotically unbiased, efficient and normally distributed [17].
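To make the role of the risk set R_{i} concrete, the log partial likelihood can be evaluated directly for a given β. The sketch below is illustrative only: `cox_log_partial_likelihood` is a hypothetical name, the risk set is taken as all subjects with observed time at least t_{i}, and no special correction for tied event times is applied.

```python
import math


def cox_log_partial_likelihood(beta, X, times, events):
    """Sum over events of <x_i, beta> - log(sum_{j in R_i} exp(<x_j, beta>))."""
    # Linear predictor <x_i, beta> for every subject.
    eta = [sum(b * v for b, v in zip(beta, row)) for row in X]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:  # censored subjects contribute only through risk sets
            risk = [eta[j] for j, t_j in enumerate(times) if t_j >= t_i]
            loglik += eta[i] - math.log(sum(math.exp(e) for e in risk))
    return loglik
```

Only the ordering of the times enters the calculation, which is the rank-based property noted in the text. In practice this function would be handed to a numerical optimizer rather than evaluated by hand.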
Kernel Cox regression
This is a penalized version of the Cox model, in which a kernel is added to model the hazard as a function of covariates. For the general Cox model for observation i, at time t, with a vector of covariates x_{i}, the hazard can be expressed as
\( \lambda \left(t\mid {\boldsymbol{x}}_i\right)={\lambda}_0(t)\exp \left(f\left({\boldsymbol{x}}_i\right)\right) \) (4)
where λ_{0}(t) is the unspecified baseline hazard function and f(x_{i}) is an arbitrary function. Li and Luan [6] proposed using the negative log partial likelihood as a loss function and reformulated the problem as finding the function f that minimizes the penalized log-likelihood,
\( \hat{f}=\underset{f\in H}{\arg \min}\left\{-\sum_{i=1}^n{\delta}_i\left[f\left({\boldsymbol{x}}_i\right)-\log \sum_{j\in {R}_i}\exp \left(f\left({\boldsymbol{x}}_j\right)\right)\right]+\xi {\left\Vert f\right\Vert}_H^2\right\} \) (5)
where f is assumed to belong to a Reproducing Kernel Hilbert Space, H, defined by a kernel function k, and ξ > 0 is a regularization parameter [18]. The solution to this problem is given by the representer theorem [18], where the optimal f(x) has the form
\( f\left(\boldsymbol{x}\right)=\sum_{i=1}^n{\alpha}_ik\left(\boldsymbol{x},{\boldsymbol{x}}_i\right)+b \) (6)
The optimal α = (α_{1}, …, α_{n}) in (6) can be found by plugging (6) into (5), resulting in a convex optimization problem that can be solved by any unconstrained optimization method. The term b is the intercept or bias, usually computed as the average error between the target and predicted values.
Survival analysis using the SVM based on binary classification
Two approaches have been proposed in this class of models by Shiao and Cherkassky [15]: the LUPI approach developed by Vapnik and Vashist [13] and the SVM with uncertain classes developed by Niaf et al. [14]. LUPI uses the censoring information as privileged information (only available for the training data) and, thus, includes additional information in the training process to enrich learning. Two different spaces are described: the decision space and the correcting space (the one with the censoring information). SVM with uncertain classes allows the class membership of some observations to be specified with less than full certainty, i.e., it allows a degree of confidence regarding the class.
Shiao and Cherkassky suggested measuring the privileged information for LUPI and the SVM uncertainty using the proportion of the follow-up time contributed by the ith censored subject. Therefore, for the censored observation i, the weight or probability assigned to the observation is \( {W}_i=\frac{{\mathrm{T}}_{\mathrm{i}}}{\uptau} \), where τ is the maximum follow-up time in the study cohort. For an event, this value is fixed at 0.
LUPI SVM
The LUPI approach is based on a triplet \( \left({\boldsymbol{x}}_i,{\boldsymbol{x}}_i^{\ast },{y}_i\right) \) for i = 1, …, n observations, where \( {\boldsymbol{x}}_i\in {\mathbb{R}}^d,{\boldsymbol{x}}_i^{\ast}\in {\mathbb{R}}^k \) and y_{i} ∈ {±1}. The pairs (x_{i}, y_{i}) are the usual training data and \( {\boldsymbol{x}}_i^{\ast } \) is the privileged information, i.e., the variables that are present only in the training data when the model is fitted. The privileged information is not available when predicting the class of a new observation. In the LUPI approach two different spaces are described: i) the space related to x, known as the decision space, which is the same feature space used in standard SVM; and ii) the space related to x^{∗}, known as the correcting space, which contains the privileged information about the training data and is not available for predictions of future observations. LUPI estimates the decision function and corrects it using the correcting function via the privileged information. The main optimization problem is expressed as in equation (7).
where w is the weight vector of the separating hyperplane, x_{i} is the vector of covariates for subject i, ξ_{i} are the slack variables and b is the bias term of the hyperplane of the decision space. The analogous parameters, w^{∗}, \( {\boldsymbol{x}}_i^{\ast } \) and b^{∗}, are in the correcting space.
The decision function and the correcting function depend on the decision and correcting spaces, respectively. Although the decision function has the same expression as in the usual SVM, the coefficients of the LUPI decision function depend on kernels in both spaces. The SVM and the LUPI solutions are exactly the same when the privileged information is rejected (when γ tends to 0 in expression (7)).
The follow-up time and the time-to-event are observable and known in the training set but not in the test set. Thus, the censoring information that is only present in the training set can be used as privileged information. Shiao and Cherkassky proposed using the pair (T_{i}, W_{i}) as the privileged information.
Uncertainty SVM
This method allows the class of some observations to be specified with less than full certainty, assigning them an uncertainty in their class. For these uncertain observations a confidence level or probability regarding the class is provided. From here on we refer to the Uncertainty SVM as pSVM (probabilistic SVM).
The pSVM assigns observations to a class through a hinge loss and estimates the probability of belonging to the class through the ϵ-insensitive cost function. Given an observation i, we define the pair (x_{i}, l_{i}) as the training set of input vectors along with their corresponding group of classes. These classes can be defined as
\( {l}_i=\left\{\begin{array}{ll}{y}_i\in \left\{-1,+1\right\},& i=1,\dots, n\\ {}{p}_i\in \left[0,1\right],& i=n+1,\dots, m\end{array}\right. \) (8)
where n is the number of observations with known classes (perfectly definite), (m − n) is the number of observations with uncertain classes, and p_{i} is the uncertainty associated with x_{i} in a regression setting. More specifically, the posterior probability for class 1 is given by
\( {p}_i=P\left({y}_i=1\mid {\boldsymbol{x}}_i\right) \) (9)
The resulting associated optimization problem is
\( \underset{\boldsymbol{w},b}{\operatorname{minimize}}\ \frac{1}{2}{\left\Vert \boldsymbol{w}\right\Vert}^2 \)
subject to
\( {y}_i\left(\left\langle \boldsymbol{w},{\boldsymbol{x}}_i\right\rangle +b\right)\ge 1,\kern2.5em i=1,\dots, n \)
\( {z}_i^{-}\le \left\langle \boldsymbol{w},{\boldsymbol{x}}_i\right\rangle +b\le {z}_i^{+},\kern1.6em i=n+1,\dots, m \) (10)
where w is the weight vector of the hyperplane, x_{i} is the vector of covariates for subject i, b is the bias term of the hyperplane, and y_{i} is the class of subject i. The terms \( {z}_i^{-} \) and \( {z}_i^{+} \) are boundaries depending on p_{i}. If n = m the problem reduces to a hard margin SVM. To allow misclassification in classes, slack variables \( {\xi}_i,{\xi}_i^{-}\ \mathrm{and}\ {\xi}_i^{+} \) are introduced and the optimization problem expressed in (10) can be rewritten as
\( \underset{\boldsymbol{w},\boldsymbol{\xi}, {\boldsymbol{\xi}}^{-},{\boldsymbol{\xi}}^{+},b}{\operatorname{minimize}}\ \frac{1}{2}{\left\Vert \boldsymbol{w}\right\Vert}^2+C{\sum}_{i=1}^n{\xi}_i+\overset{\sim }{C}{\sum}_{i=n+1}^{m}\left({\xi}_i^{-}+{\xi}_i^{+}\right) \)
subject to
\( {y}_i\left(\left\langle \boldsymbol{w},{\boldsymbol{x}}_i\right\rangle +b\right)\ge 1-{\xi}_i,\kern2.5em i=1,\dots, n \)
\( {z}_i^{-}-{\xi}_i^{-}\le \left\langle \boldsymbol{w},{\boldsymbol{x}}_i\right\rangle +b\le {z}_i^{+}+{\xi}_i^{+},\kern1.6em i=n+1,\dots, m \)
\( {\xi}_i\ge 0,\kern1.5em i=1,\dots, n \)
\( {\xi}_i^{-}\ge 0,\kern0.5em {\xi}_i^{+}\ge 0,\kern1.5em i=n+1,\dots, m \)
The proportional follow-up time approach computes the probability p_{i} for censored data, and subsequently \( {z}_i^{-} \) and \( {z}_i^{+} \), as \( \frac{{\mathrm{T}}_{\mathrm{i}}}{\uptau} \), where τ is the maximum follow-up time established in the study cohort. For an event, this value is fixed at 0.
Weighted SVM
Another approach that has not been tested in the literature is to address the survival SVM as a weighted SVM (wSVM) problem (see eq. 11). The basic idea of wSVM is to assign each observation a different weight according to its relative importance in the class, such that different data points contribute differently to the learning of the decision surface [19]. This methodology is particularly useful for handling outliers because, upon detecting an outlier, we can diminish its effect on the estimation of the separating hyperplane.
\( \underset{\boldsymbol{w},\boldsymbol{\xi},b}{\operatorname{minimize}}\ \frac{1}{2}{\left\Vert \boldsymbol{w}\right\Vert}^2+C{\sum}_{i=1}^n{W}_i{\xi}_i \)
subject to
\( {y}_i\left(\left\langle \boldsymbol{w},{\boldsymbol{x}}_i\right\rangle +b\right)\ge 1-{\xi}_i,\kern1.5em {\xi}_i\ge 0,\kern1.5em i=1,\dots, n \) (11)
where W_{i} is the weight or probability of each observation. A censored observation can be seen as a partial or weighted observation: an observation censored right at the beginning of the study, for instance, adds almost no information to the data and has a weight close to 0, whereas an observation censored just before the end of the follow-up period should be treated almost as a complete observation (a weight close to 1).
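The effect of the weights W_{i} is easiest to see in the unconstrained form of the weighted SVM objective, where each slack variable equals the hinge loss of its observation. The following is a minimal sketch for a linear kernel only; `wsvm_objective` is a hypothetical helper written for illustration, not the implementation used in this paper.

```python
def wsvm_objective(w, b, X, y, weights, C=1.0):
    """0.5 * ||w||^2 + C * sum_i W_i * max(0, 1 - y_i * (<w, x_i> + b))."""
    regularizer = 0.5 * sum(w_j * w_j for w_j in w)
    weighted_loss = 0.0
    for x_i, y_i, W_i in zip(X, y, weights):
        decision = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        slack = max(0.0, 1.0 - y_i * decision)  # hinge loss = optimal slack
        weighted_loss += W_i * slack
    return regularizer + C * weighted_loss
```

An observation with weight 0 does not influence the hyperplane at all, and one with weight 1 counts as in the standard soft-margin SVM, which matches the interpretation of early and late censoring given above.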
Proposed approaches
Proposed weighting methods
Censored data have been handled by assigning a weight or probability to an observation assuming proportionality of follow-up time, i.e., a weight linearly associated with the observed follow-up period. The approach consists of computing the weight as \( {\mathrm{W}}_{\mathrm{i}}=\frac{{\mathrm{T}}_{\mathrm{i}}}{\uptau} \), where τ is the maximum follow-up time in the study cohort and T_{i} is the censoring time of observation i. For events, this value is fixed and is equal to that of non-events who completed the follow-up period, i.e., subjects who were free of events by the end of the study period. Therefore, this method does not account for the overall survival probability. Use of this information in time-to-event data is important because, for example, an observation censored at the beginning of the follow-up will be more likely to have an event if the overall survival probability is 0.1 than if it is 0.9. The methods currently proposed in the SVM literature do not use this information because they rely on proportionality of follow-up time. We propose weighting the censored observations based on the probability of surviving t_{i} + z years (12) (conditional survival probability), given that participant i is still alive at t_{i} (censoring time), which can be estimated through the Kaplan-Meier estimator as in eq. (1).
\( {\hat{S}}_z\left({t}_i\right)=P\left(T>{t}_i+z\mid T>{t}_i\right)=\frac{\hat{S}\left({t}_i+z\right)}{\hat{S}\left({t}_i\right)} \) (12)
This modification would improve the accuracy of the method by including in the weighting process information about the overall survival probability of the cohort and the shape of the survival curve. More specifically, our proposal is to weight the censoring information (or to assign it a probability of being an event or non-event) using the conditional survival probability in the following way for each specific SVM method:
For the LUPI method, our proposal is to define the weight (importance) of the privileged information based on the Kaplan-Meier estimation of eq. (12), i.e., \( {\boldsymbol{x}}_i^{\ast }={\hat{S}}_z\left({t}_i\right)=\frac{\hat{S}\left({t}_i+z\right)}{\hat{S}\left({t}_i\right)} \)
For pSVM, our proposal is to compute the uncertain probability of the censored data based on the conditional probability of having the event, using the Kaplan-Meier estimator (12).
For wSVM, our proposal, following the same idea as in the previous methods, is to use a conditional survival approach based on the Kaplan-Meier estimator of (12). Events and non-events at the end of the follow-up time will have a weight of 1.
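As a rough illustration of the conditional survival weighting for wSVM, the sketch below computes Kaplan-Meier based weights for a small cohort. It rests on assumptions that the text does not fix: the horizon is taken as z = τ − t_{i}, so that the weight of a censored observation is the estimated probability of remaining event-free until the end of follow-up, tied times are ordered with events first, and the names `km_survival` and `conditional_survival_weights` are hypothetical.

```python
def km_survival(times, events):
    """Return a function t -> S_hat(t) (Kaplan-Meier step function)."""
    n = len(times)
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    steps, surv = [], 1.0
    for rank, i in enumerate(order, start=1):
        if events[i] == 1:
            surv *= 1.0 - 1.0 / (n - rank + 1)
            steps.append((times[i], surv))

    def s_hat(t):
        value = 1.0
        for time, s in steps:
            if time > t:
                break
            value = s
        return value

    return s_hat


def conditional_survival_weights(times, events, tau):
    """Weight 1 for events and for subjects followed up to tau;
    S_hat(tau) / S_hat(t_i) for subjects censored at t_i < tau."""
    s_hat = km_survival(times, events)
    weights = []
    for t, d in zip(times, events):
        if d == 1 or t >= tau:
            weights.append(1.0)
        else:
            # Assumed horizon z = tau - t_i (not specified in the text).
            weights.append(s_hat(tau) / s_hat(t))
    return weights
```

Unlike T_{i}/τ, these weights depend on where the events fall along the survival curve: a subject censored after the last observed event receives a weight of 1 regardless of how much follow-up time remains.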
Local invariances SVM
Alternatively, censoring can be treated as a semi-supervised problem, an approach that has not been considered in the SVM literature. In the semi-supervised setting, there are observations with a known class and others whose class is unknown. The goal is to learn from both types of data to find the decision surface that separates the two classes. We propose treating censored observations as unknown classes, i.e., observations whose event status within the follow-up period we do not know, and events and non-events at the end of follow-up as known classes, i.e., observations with known event status.
In the non-SVM-specific literature [20], a framework has been proposed for semi-supervised learning in the reproducing kernel Hilbert space (RKHS) H associated with a given kernel function k, using local invariances that explicitly characterize the behaviour of the target function around both known and unknown data. Three types of invariances have been proposed: i) invariance to small changes in the observations, restricting the gradient of the function to be small at the observed data; ii) invariance to averaging across a small neighbourhood around observations, restricting the function value at each observation to be similar to the average value in a small neighbourhood of that observation; and iii) invariance to local transformations, such as rotational and translational invariance (especially relevant to problems such as handwritten digit recognition and vision). The third invariance is not relevant for survival analysis. The optimization problem (13) includes the hinge loss for known data and the ϵ-insensitive loss for unknown data to obtain a semi-supervised SVM with local invariances (inSVM).
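The split into known and unknown classes can be written down directly from the observed pair (T, δ) and the end of follow-up τ. The coding below (+1 for an event, −1 for a non-event at τ, `None` for a censored observation) and the name `semi_supervised_labels` are assumptions made for this sketch only.

```python
def semi_supervised_labels(times, events, tau):
    """Label each subject for a semi-supervised classifier.

    +1   : event observed within the follow-up period (known class)
    -1   : still event-free at the end of follow-up tau (known class)
    None : censored before tau, event status unknown (unlabelled)
    """
    labels = []
    for t, d in zip(times, events):
        if d == 1 and t <= tau:
            labels.append(1)
        elif t >= tau:
            labels.append(-1)
        else:
            labels.append(None)
    return labels
```

Unlabelled observations are not discarded; they enter the fit through the loss for unknown data.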
where g ∈ H is the target function and z_{i} is the representer of the functional associated with the invariance [20]. In particular, given the following expression of the Gaussian kernel
where σ is the parameter of the Gaussian kernel, the evaluation functional of the representer of the derivative functional \( {L}_{x_{i,j}}(f)={\left.\frac{\partial f}{\partial {x}^j}\right|}_{{\boldsymbol{x}}_i} \), for any f in the RKHS H associated with the Gaussian kernel is:
and the dot product between two representers of the functional derivative is expressed as:
where i and p are the subject indices and j and q are the indices of the specific variable in the specific x vector.
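The representer of the derivative functional can be checked numerically. By the reproducing property, its value at a point x is the partial derivative of the kernel with respect to the jth coordinate of its first argument, evaluated at x_{i}. The sketch below assumes the parameterization k(x, y) = exp(−‖x − y‖² / (2σ²)), which may differ by a constant factor in σ from the expression used in this paper; the function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Assumed form: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def derivative_representer(x_i, j, x, sigma):
    """Value at x of the representer of f -> df/dx^j evaluated at x_i.

    Equals d k(u, x) / d u^j at u = x_i for the kernel above.
    """
    return -(x_i[j] - x[j]) / sigma ** 2 * gaussian_kernel(x_i, x, sigma)
```

Comparing the closed form with a central finite difference of `gaussian_kernel` is a quick way to confirm the sign and the factor 1/σ² under this parameterization.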
Another type of local invariance is the local averaging. So, considering the Gaussian kernel in (14) and the following Gaussian density
and given that the convolution of two Gaussian densities is a Gaussian density, the representer of the local averaging functional, \( {L}_{x_i}(f)={\int}_Xf(u)p\left({x}_i-u\right) du-f\left({x}_i\right) \), for any f in the RKHS H associated with the Gaussian kernel, can be expressed as:
and the dot product between two representers of the averaging functional
where σ_{k} and σ_{p} are the sigma values specified for the Gaussian kernel and Gaussian density respectively, d is the number of covariates and \( {z}_{x_j}\left({\boldsymbol{x}}_i\right) \) is as defined in eq. (18).
Calculations and proofs associated with the inSVM methodology can be found in Lee et al. [20].
Implementation
The SVM methods presented in this paper (LUPI, pSVM, wSVM and inSVM) had not been implemented in the widely used R software [21]. Therefore, we have written R functions that will be included in an R package.
Simulation studies
We conducted simulation studies to compare the proposed approaches in different scenarios. Simulations included varying sample sizes (50 and 300 subjects), 30 predictor variables (or features), and proportional and non-proportional hazards of the comparison groups. Moreover, we varied the proportion of censoring (10–30%) and the distribution of the follow-up time (uniform, positively skewed and negatively skewed). These choices were based on realistic scenarios encountered in data we previously analysed. Based on the proportional hazards framework, the time-to-event was generated using the Gompertz distribution.
Specifically, the 30 predictor variables were generated following a multivariate normal distribution with mean defined by a realization of a uniform distribution U (0.03,0.06). The variables were classified in four groups according to their pairwise correlation: no correlation (around 0), low correlation (around 0.2), medium correlation (around 0.5) and high correlation (around 0.8). These four levels of correlation reflected the correlation of predictors in the biomedical field, such as transcriptional profiles or the inflammatory process. These variables were used to compare two scenarios of time-to-event data using the Cox proportional hazards model. In the proportional hazards framework the time-to-event variable can be generated, based on the Gompertz distribution [22], as
\( T=\frac{1}{\alpha}\log \left[1-\frac{\alpha \log (U)}{\gamma \exp \left(\left\langle \boldsymbol{\beta}, {\boldsymbol{x}}_i\right\rangle \right)}\right] \) (20)
where U follows a Uniform (0,1) distribution, β is the vector of coefficients associated with each variable, and α ∈ (−∞, ∞) and γ > 0 are the scale and shape parameters, respectively, of the Gompertz distribution. The values for these parameters were selected so that overall survival was around 0.6 at 18 months of follow-up time.
To generate scenarios in which the hazards of the comparison groups were not proportional, noise was added to the exp(〈β, x_{i}〉) term in eq. (20), so that the hazard follows a shared frailty model [23]. The frailty was chosen so that there were 5 groups of observations of the same size that shared a common frailty.
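The inverse-transform step behind this generation scheme can be sketched as follows. The sketch assumes a Gompertz baseline hazard of the form γ·exp(αt) with α ≠ 0, so that the survival function is exp(−(γ/α)(exp(αT) − 1)·exp(〈β, x_{i}〉)); the function name `gompertz_time` and the parameter values in the usage note are illustrative, not those used in the simulations of this paper.

```python
import math
import random


def gompertz_time(x, beta, alpha, gamma, u=None, rng=random):
    """Draw T = (1/alpha) * log(1 - alpha*log(U) / (gamma*exp(<beta, x>))).

    Requires alpha != 0 and gamma > 0. If u is not given, U is drawn
    from (0, 1] so that log(U) is always defined.
    """
    if u is None:
        u = 1.0 - rng.random()
    linear_predictor = sum(b * v for b, v in zip(beta, x))
    arg = 1.0 - alpha * math.log(u) / (gamma * math.exp(linear_predictor))
    if arg <= 0.0:
        # With alpha < 0 the cumulative hazard is bounded, so some
        # subjects never experience the event.
        return math.inf
    return math.log(arg) / alpha
```

For example, `gompertz_time([1.0, 2.0], [0.5, -0.3], alpha=0.2, gamma=0.05)` returns one simulated event time; censoring times would then be drawn separately and the minimum of the two taken as the observed time.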
Tuning parameters and test performance
For the cost parameters C and \( \overset{\sim }{C} \), we selected the values 0.1, 1, 10, and 100, and for the Gaussian kernel parameter σ the values 0.25, 0.5, 1, 2, and 4. A two-step approach was used to estimate the tuning parameters and to evaluate the operating characteristics of the SVM models using the best combination of tuning parameters. In the first step, for each combination of parameters, 10 training datasets were fitted and each of them was validated using 10 different validation datasets. The combination of parameters with the largest accuracy was used to measure the performance of the models in the second step. In the second step, 10 new datasets were simulated to estimate the models given the best combination of tuning parameters found in the first step and, for each of those, 10 testing datasets were simulated to compare the performance of the SVM models based on the following metrics: accuracy (proportion of correctly classified observations), Matthews' correlation, normalized mutual information, area under the ROC curve (AUC-ROC), sensitivity, specificity and F1 score. Therefore, 100 datasets were tested and used to compute the mean and the standard deviation of each metric as a summary of the performance of each method.
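Most of the metrics listed above follow directly from the confusion matrix of a test dataset. A minimal sketch is given below; the function name `classification_metrics`, the default coding of the positive class as 1, and the convention of returning 0 when a denominator is empty are assumptions. AUC-ROC and normalized mutual information are omitted because they require decision values or additional definitions.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity, F1 and Matthews' correlation."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

Averaging these values over the 100 test datasets, together with their standard deviation, gives the summary described in the text.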
Real-life datasets
We applied our approaches to three datasets from the “Survival” package available in the R software repository [17]. Parameters were tuned and the accuracy, AUC-ROC, sensitivity, specificity and F1 score were estimated through 5-fold nested cross-validation, repeating the process in 10 resampled datasets. The follow-up time was censored at the third quartile of the maximum observed follow-up time in each dataset. We used the same analytical methods and the same grid of tuning parameter values as in the simulation studies described above. Briefly, datasets from the following studies were analyzed:
Lung Study: this study was conducted by the North Central Cancer Treatment Group (NCCTG) and aimed to estimate the survival of patients with advanced lung cancer. The available dataset comprised 167 observations, 89 events during the follow-up time of 420 days, and 10 variables. A total of 36 observations were censored before the end of follow-up. The overall survival probability at the end of the follow-up period was 0.40.
Stanford2 Study: this dataset was extracted from the Stanford Heart Transplant study and comprised 157 observations and 4 variables, with a maximum follow-up time of 1264 days. A total of 88 events were recorded and 29 observations were censored before the end of follow-up. The overall survival probability at the end of the follow-up period was 0.41.
PBC Study: this study was nested in the Mayo Clinic trial of primary biliary cirrhosis (PBC) of the liver that was conducted between 1974 and 1984. A total of 424 PBC patients were referred to Mayo Clinic during the ten-year interval and met eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The data subset used in the current paper contains 258 observations and 22 variables. Of the whole cohort, 93 observations experienced the event, 65 completed the follow-up period without presenting an event and 100 were censored before the end of the follow-up time of 2771 days. The overall survival probability at the end of the follow-up period was 0.57.
Results
Simulated datasets
In the four simulated scenarios with a sample size of 300 in which the hazards of the comparison groups were proportional, the Cox proportional hazards model and pSVM (linear kernel) performed comparably to inSVM (gradient and averaging). Specifically, the accuracy was 0.89 for the Cox proportional hazards model, 0.87 for the linear pSVM and 0.84 for inSVM (Table 1). The AUC-ROC of the three models ranged from 0.92 to 0.96. Generally, the distribution and proportion of censoring did not affect results, with the inSVM-gradient being the most sensitive to the proportion of observations that were censored. LUPI methods (proposed Kaplan-Meier and proportional approach) performed similarly to pSVM using a radial kernel. The accuracy with 10% and 30% censoring was 0.77.
Conversely, when the sample size was decreased to 50, the proportion of censored observations affected all metrics of predictive accuracy, even for data simulated to meet the proportional hazards assumption (Table 2). pSVM, inSVM and kernel Cox regression had the best performance in the 10% censoring scenario, with an accuracy of approximately 0.75. The Cox model, wSVM and pSVM-radial had the worst performance, with an accuracy of 0.62–0.67. Predictive accuracy decreased slightly when the proportion of censoring increased to 30%, except for wSVM.
Performance of all approaches was worse under non-proportional hazards (Tables 3 and 4). The difference between the proportional and non-proportional settings was larger in the 300-observation scenario (Table 3) than in the 50-observation scenario (Table 4).
In all scenarios, approaches based on conditional survival performed better than those based on proportional follow-up time, particularly when the sample size was 50 observations and especially when hazards were non-proportional. Overall differences between the two methods were small (around 0.02 units in accuracy and around 0.02 units in AUC-ROC) but consistent.
The inSVM, with either the gradient or the averaging approach, performed closest to the best method within each scenario. Although the averaging approach was slightly better and less sensitive to the proportion of censored observations, there were no clear differences between the averaging and gradient approaches.
Other scenarios yielded comparable results and are presented in supplementary Tables S1, S2, S3, S4, S5, S6, S7 and S8.
Real-life datasets
In the three datasets compared, the conditional survival approach attained the largest predictive accuracy based on accuracy values and AUC-ROC (Table 5) when compared with the proportionality approach within each method. Within each dataset the performance of the LUPI method was one of the best, with almost no difference between the conditional survival and the proportionality approach.
The averaging approach of the inSVM method performed better than the gradient approach in both accuracy and AUC-ROC in all three datasets, and was one of the best methods within each of the datasets.
Discussion and conclusions
In this article we proposed alternative methods and extensions within the SVM binary classification framework for dealing with censored data: specifically, a conditional survival approach for weighting censored observations when fitting SVM through LUPI, Uncertainty SVM and Weighted SVM, and a semisupervised SVM with local invariances. The former takes into account the events and the follow-up period, and thus includes more information in the weighting process than a proportionality of time approach. The latter allows two types of invariances: gradient over variables and averaging over observations. We showed that both approaches outperformed the other studied methods on most compared metrics.
As expected, when the sample size was as small as 50 observations and the proportional hazards assumption was violated, the Cox proportional hazards model performed more poorly. Results with the wSVM were highly dependent on the proportion of censoring but much less so on the distribution of time to censoring. Moreover, wSVM results were comparable to the LUPI results, as also observed by Lapin et al. [24]. This similarity may be explained by the fact that both methods draw on the same single source of additional information (the censored data). It also suggests that the wSVM method may be more advantageous in practice because it is much less time consuming, although it is less robust than the LUPI method.
When applying the LUPI approach, we included the censoring data as privileged information in the correcting space. Our results were consistent with those of Shiao and Cherkassky [15], i.e., LUPI performed worse than the Cox proportional hazards model and pSVM in all compared scenarios. Indeed, some of our simulated scenarios were similar to those used by Shiao and Cherkassky. The correcting space is used as complementary information to be combined with the decision space; therefore, it is not directly used to define the class of the observations, as it is in pSVM or wSVM. We agree with Serra-Toro et al. [25] that further work is needed to fully understand the LUPI approach and how the correcting and decision spaces interact.
The performance of the pSVM and the Cox proportional hazards model was similar when the sample size was larger, and both performed better than kernel Cox regression; the linear kernel was slightly superior to the radial kernel, as observed by Shiao and Cherkassky [15]. Perhaps a finer grid search could benefit the overall performance of the nonlinear approach. Our results and those of Shiao and Cherkassky [15] were consistent regarding the superior performance of the linear pSVM compared with the pSVM using a Gaussian kernel.
The conditional survival approach that we propose performed better than the proportional follow-up time approach in all compared scenarios. The conditional method takes into account the events and the follow-up period; it thereby includes more information and is more accurate in estimating the weights than the proportionality of time approach. The latter assumes linearity and does not take into account specificities of the data, for instance, variability in survival that is intrinsic to the data. However, it should be noted that the conditional approach assumes that the survival probability in the test data is similar to that in the training data. This is a reasonable assumption, but the prediction accuracy may be affected depending on how much the survival probabilities differ.
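To make the contrast between the two weighting schemes concrete, the following is a minimal Python sketch; it is not the implementation used in this study. It rests on assumptions that are ours for illustration: the binary outcome is defined at a horizon `tau`, a subject censored at time c < tau receives the weight P(T > tau | T > c) = S(tau)/S(c) with S estimated by Kaplan-Meier [16] on the training data, and the proportionality approach is read as the fraction c/tau of the horizon that was observed. The function names and the toy data are hypothetical.

```python
from bisect import bisect_right


def kaplan_meier(times, events):
    """Return (event_times, survival) for the Kaplan-Meier estimator.

    times  : follow-up time of each subject
    events : 1 if the event was observed, 0 if the subject was censored
    """
    event_times, survival = [], []
    s = 1.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        s *= 1.0 - deaths / at_risk
        event_times.append(t)
        survival.append(s)
    return event_times, survival


def surv_at(t, event_times, survival):
    """Evaluate the right-continuous Kaplan-Meier step function S(t)."""
    k = bisect_right(event_times, t)
    return 1.0 if k == 0 else survival[k - 1]


def conditional_weight(c, tau, event_times, survival):
    """Assumed form: P(T > tau | T > c) = S(tau) / S(c), for c < tau."""
    s_c = surv_at(c, event_times, survival)
    return surv_at(tau, event_times, survival) / s_c if s_c > 0 else 0.0


def proportional_weight(c, tau):
    """Assumed form: fraction of the horizon tau that was observed."""
    return min(c / tau, 1.0)


# Toy training data (hypothetical): 8 subjects, horizon tau = 8.
times = [2, 3, 4, 5, 6, 8, 9, 10]
events = [1, 0, 1, 1, 0, 1, 0, 0]
tau = 8
km = kaplan_meier(times, events)

for c in (3, 6):  # the two subjects censored before tau
    print(c, conditional_weight(c, tau, *km), proportional_weight(c, tau))
```

In this toy example the proportional weight depends only on c and tau, whereas the conditional weight changes with the events observed in the training data. This is the sense in which the conditional approach uses more information, and also why it relies on the test data having survival similar to that of the training data.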
With respect to the proposed inSVM approach (both gradient and averaging), results in the 300-observation scenarios were very similar to those of Cox, kernel Cox regression and pSVM. However, in scenarios in which the number of observations was small and close to the number of variables, the inSVM outperformed all other approaches in all compared metrics, and it was one of the most robust approaches to varying numbers of variables and to violations of proportionality of hazards. Although inSVM is a semisupervised approach that does not account for censoring, its performance is comparable to that of other methods that do. This could be explained by our assumption that censoring is independent of the events and representative of the data. Therefore, patterns in the observed data are applicable to the censored observations, and the local invariance assumptions should be valid. Additionally, an advantage of this approach is that no extra assumptions about the censoring distribution are necessary. The main drawback of the local invariances approach is that it is computationally intensive, especially the gradient approach.
All simulations were based on balanced data, i.e., the proportions of events and nonevents were similar. SVM models are sensitive to imbalance between classes. Therefore, future investigations should consider imbalanced scenarios.
Given the large number of compared methods and datasets, the presented work has been restricted to the two most commonly used kernels, linear and Gaussian. Further work should evaluate the performance of the proposed methods using other kernels. Additionally, we addressed overfitting through standard procedures: by simulating completely different datasets to tune parameters and validate models, and by applying nested cross-validation to estimate and validate parameters when analysing real data. However, future work may assess the performance of the proposed methods with even more simulation scenarios and a larger range of parameter values.
Among the compared methods, the proposed inSVM method using the conditional survival approach is the most robust under different scenarios and is a good alternative to other time-to-event methods. It is a method to be considered and recommended when analysing sparse data, since it outperforms other methods even when the proportional hazards assumption is not met, a situation that often occurs in the analysis of biomedical data and biomarkers.
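The structure of nested cross-validation can be sketched as follows. This is a generic Python illustration under our own assumptions, not the code used for the real-data analyses: the numbers of folds, the parameter grid and the `fit_score` callback are hypothetical placeholders. The inner loop selects a parameter value using only the outer training part, and the outer loop scores that choice on data never used for tuning.

```python
import random


def k_folds(indices, k, seed=0):
    """Split indices into k shuffled, roughly equal folds."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv(n, param_grid, fit_score, outer_k=5, inner_k=3):
    """Return a list of (chosen_param, outer_score), one per outer fold.

    fit_score(train_idx, test_idx, param) is a user-supplied callback that
    trains on train_idx and returns a score (higher is better) on test_idx.
    """
    results = []
    outer = k_folds(range(n), outer_k)
    for i, test_idx in enumerate(outer):
        # Outer training part: every outer fold except the held-out one.
        train_idx = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner = k_folds(train_idx, inner_k, seed=i + 1)

        def inner_score(param):
            # Mean validation score of param over the inner folds.
            scores = []
            for m, val_idx in enumerate(inner):
                fit_idx = [j for f, fold in enumerate(inner)
                           if f != m for j in fold]
                scores.append(fit_score(fit_idx, val_idx, param))
            return sum(scores) / len(scores)

        best = max(param_grid, key=inner_score)
        # Refit on the whole outer training part, score on the held-out fold.
        results.append((best, fit_score(train_idx, test_idx, best)))
    return results
```

In practice `fit_score` would wrap the fitting of an SVM with a given parameter value (for example a cost or kernel width) and return a metric such as accuracy; the held-out outer scores then estimate performance without reusing the data on which the parameter was chosen.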
Availability of data and materials
The datasets simulated during the current study are available from the corresponding author on reasonable request.
The PBC, Lung and Stanford datasets are freely available at https://CRAN.R-project.org/package=survival.
Abbreviations
SVM: Support vector machines
SVR: Support vector regression
LUPI: Learning using privileged information
inSVM: Invariances support vector machine
pSVM: Probabilistic support vector machine
wSVM: Weighted support vector machines
References
1. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
2. Cox DR. Regression models and life-tables. J R Stat Soc Ser B Methodol. 1972;34(2):187–220.
3. Concato J, Peduzzi P, Holford TR, Feinstein AR. Importance of events per independent variable in proportional hazards analysis. I. Background, goals, and general strategy. J Clin Epidemiol. 1995;48(12):1495–501.
4. Peduzzi P, Concato J, Feinstein AR, Holford TR. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol. 1995;48(12):1503–10.
5. Vittinghoff E, McCulloch CE. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol. 2007;165(6):710–8.
6. Li H, Luan Y. Kernel Cox regression models for linking gene expression profiles to censored survival data. Pac Symp Biocomput. 2003;8(2):65–76.
7. Shivaswamy PK, Chu W, Jansche M. A support vector approach to censored targets. In: Seventh IEEE International Conference on Data Mining (ICDM 2007); 2007. p. 655–60.
8. Khan FM, Zubek V. Support vector regression for censored data (SVRc): a novel tool for survival analysis. In: Proceedings of the 2007 Seventh IEEE International Conference on Data Mining. Vol IEEE International Conference; 2008. p. 863–8.
9. Van Belle V, Pelckmans K, Suykens JAK, Van Huffel S. Additive survival least-squares support vector machines. Stat Med. 2010;29(2):296–308.
10. Van Belle V, Pelckmans K, Suykens JAK, Van Huffel S. Support vector machines for survival analysis. In: Proceedings of the Third International Conference on Computational Intelligence in Medicine and Healthcare (CIMED2007); 2007. p. 1–8.
11. Van Belle V, Pelckmans K, Suykens JAK, Van Huffel S. Survival SVM: a practical scalable algorithm. In: ESANN; 2008. p. 89–94.
12. Evers L, Messow CM. Sparse kernel methods for high-dimensional survival data. Bioinformatics (Oxford, England). 2008;24(14):1632–8. https://doi.org/10.1093/bioinformatics/btn253.
13. Vapnik VN, Vashist V. 2009 Special issue: a new learning paradigm: learning using privileged information. Neural Netw. 2009;22(5–6):544–557.
14. Niaf E, Flamary R, Lartizien C, Canu S. Handling uncertainties in SVM classification. In: Statistical Signal Processing Workshop (SSP); 2011. p. 757–60.
15. Shiao HT, Cherkassky V. SVM-based approaches for predictive modeling of survival data. In: Proceedings of the International Conference on Data Mining (DMIN); 2013. p. 1.
16. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53(282):457–81.
17. Therneau T, Grambsch P. Modeling survival data: extending the Cox model. New York: Springer; 2000.
18. Scholkopf B, Smola AJ. Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press; 2001.
19. Yang X, Song Q, Wang Y. A weighted support vector machine for data classification. Int J Pattern Recognit Artif Intell. 2007;21(5):961–76.
20. Lee W, Zhang X, Teh Y. Semi-supervised learning in reproducing kernel Hilbert spaces using local invariances. NUS Technical Report TRB3/06. 2006.
21. R Core Team. R: A Language and Environment for Statistical Computing. 2014. http://www.R-project.org/.
22. Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med. 2005;24(11):1713–23. https://doi.org/10.1002/sim.2059.
23. Duchateau L, Janssen P. Statistics for biology and health. The frailty model. New York: Springer Science Business Media, LLC; 2007.
24. Lapin M, Hein M, Schiele B. Learning using privileged information: SVM+ and weighted SVM. Neural Netw. 2014;53:95–108.
25. Serra-Toro C, Traver VJ, Pla F. Exploring some practical issues of SVM+: is really privileged information that helps? Pattern Recogn Lett. 2014;42:40–6.
Acknowledgements
The R code for the local invariances approach was part of Hector Sanz’s thesis.
Funding
Not applicable.
Author information
Contributions
HS designed the study and carried out all programming work. FR supervised and provided input on all aspects of the study. CV provided helpful input from the perspective of study design. FR and HR contributed algorithms for kernel methods. HS, FR and CV discussed the results and wrote the manuscript. All authors have read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1 Table S1. Results for the scenarios with proportional hazards, positive skew, 10% and 30% censoring, and 300 observations.
Mean (standard deviation) of accuracy, Matthews’ correlation, normalized mutual information (NMI), area under the ROC curve (AUC), sensitivity (Sn), specificity (Sp) and F1-score (F1) are shown.
Additional file 2 Table S2. Results for the scenarios with proportional hazards, negative skew, 10% and 30% censoring, and 300 observations.
Mean (standard deviation) of accuracy, Matthews’ correlation, normalized mutual information (NMI), area under the ROC curve (AUC), sensitivity (Sn), specificity (Sp) and F1-score (F1) are shown.
Additional file 3 Table S3. Results for the scenarios with nonproportional hazards, negative skew, 10% and 30% censoring, and 300 observations.
Mean (standard deviation) of accuracy, Matthews’ correlation, normalized mutual information (NMI), area under the ROC curve (AUC), sensitivity (Sn), specificity (Sp) and F1-score (F1) are shown.
Additional file 4 Table S4. Results for the scenarios with nonproportional hazards, positive skew, 10% and 30% censoring, and 300 observations.
Mean (standard deviation) of accuracy, Matthews’ correlation, normalized mutual information (NMI), area under the ROC curve (AUC), sensitivity (Sn), specificity (Sp) and F1-score (F1) are shown.
Additional file 5 Table S5. Results for the scenarios with proportional hazards, negative skew, 10% and 30% censoring, and 50 observations.
Mean (standard deviation) of accuracy, Matthews’ correlation, normalized mutual information (NMI), area under the ROC curve (AUC), sensitivity (Sn), specificity (Sp) and F1-score (F1) are shown.
Additional file 6 Table S6. Results for the scenarios with proportional hazards, positive skew, 10% and 30% censoring, and 50 observations.
Mean (standard deviation) of accuracy, Matthews’ correlation, normalized mutual information (NMI), area under the ROC curve (AUC), sensitivity (Sn), specificity (Sp) and F1-score (F1) are shown.
Additional file 7 Table S7. Results for the scenarios with nonproportional hazards, negative skew, 10% and 30% censoring, and 50 observations.
Mean (standard deviation) of accuracy, Matthews’ correlation, normalized mutual information (NMI), area under the ROC curve (AUC), sensitivity (Sn), specificity (Sp) and F1-score (F1) are shown.
Additional file 8 Table S8. Results for the scenarios with nonproportional hazards, positive skew, 10% and 30% censoring, and 50 observations.
Mean (standard deviation) of accuracy, Matthews’ correlation, normalized mutual information (NMI), area under the ROC curve (AUC), sensitivity (Sn), specificity (Sp) and F1-score (F1) are shown.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sanz, H., Reverter, F. & Valim, C. Enhancing SVM for survival data using local invariances and weighting. BMC Bioinformatics 21, 193 (2020). https://doi.org/10.1186/s12859-020-3481-2
Keywords
 Support vector machines
 Survival analysis
 Kernel
 Classification