Enhancing SVM for survival data using local invariances and weighting


 
 The need to analyze medium-throughput data in epidemiological studies with small sample sizes, particularly in biomedical research, may hinder the use of classical statistical methods. Support vector machine (SVM) models can be applied successfully in this setting because they are a powerful tool for analyzing data with a large number of predictors and a limited sample size, especially when handling binary outcomes. However, biomedical research often involves the analysis of time-to-event outcomes and has to account for censoring. Methods to handle censored data in the SVM framework can be divided into two classes: those based on support vector regression (SVR) and those based on binary classification. Methods based on SVR seem to be suboptimal for handling sparse data and yield results comparable to the Cox proportional hazards model and kernel Cox regression. The limited work assessing methods based on SVM for binary classification has relied on SVM learning using privileged information and SVM with uncertain classes.
 
 
 This paper proposes alternative methods and extensions within the binary classification framework, specifically, a conditional survival approach for weighting censored observations and a semi-supervised SVM with local invariances. Using simulation studies and some real datasets, we evaluate those two methods and compare them with a weighted SVM model, SVM extensions found in the literature, kernel Cox regression and Cox model.
 
 
 Our proposed methods generally perform better under a wide variety of realistic scenarios about the structure of biomedical data. Specifically, the local invariances method using the conditional survival approach is the most robust method across scenarios and is a good alternative to other time-to-event methods. When analysing real data, it is a method to be considered and recommended, since it outperforms other methods in proportional and non-proportional scenarios and with sparse data, which are common in biomedical data and biomarker analysis.



Background
Biomedical studies are oftentimes based on small sample sizes and on a medium to large number of variables. Support vector machine (SVM) models are a powerful tool to analyse this type of data because of their performance in analysis of sparse data, i.e., data with as many or more predictors than observations. SVMs have been widely applied for analysis of binary outcomes. As originally developed [1] these models are based on discriminating two classes of observations by a linear decision surface (hyperplane) and maximizing the distance between the hyperplane and the individual observations. If the classes are not separable by a linear surface, a non-linear transformation can be obtained through mapping the data on a different dimension space (feature space). This non-linear transformation can be obtained without explicitly mapping into the feature space through the use of a kernel function.
A common outcome in biomedical research is time-to-event. The challenge of analyzing time-to-event data lies in censoring, that is, the partially observed time-to-event of a participant whose follow-up ends before the event has occurred. There are different types of censoring, but the most common is right censoring, which occurs when a participant leaves the study before the end of follow-up without having presented an event, or when the study ends before the event has occurred. The most common traditional approach to analyze time-to-event data and handle censoring is Cox proportional hazards regression [2]. This is a semi-parametric model based on a partial likelihood function (similar to ordinary likelihood functions) that is defined in terms of the hazard function and assumes: i) a baseline hazard common to all observations; ii) linearity and additivity of the predictors with respect to the log-hazard or log-cumulative hazard; and iii) proportionality of the hazards across predictor classes, i.e., hazard ratios that are constant over time. Another important requirement to obtain unbiased estimates with proportional hazards models is that the minimum number of events is at least 5 [3][4][5].
When the data are sparse, proportional hazards regression may fail to converge and may yield unreliable and biased point estimates and statistical tests. Under sparsity, SVM or a kernelized (i.e., penalized) version of the Cox model [6] may be more appropriate. Generally, extensions of SVM to handle time-to-event and censored data can be based on a regression (SVR) or a classification (SVM) approach. Most work has focused on SVR [7][8][9] and on a ranking (ordinal) methodology [10][11][12], and has suggested that both approaches are comparable to the proportional hazards model (in non-sparse scenarios) and to kernel Cox regression and, thus, may not provide any gain in prediction accuracy. Only two methods have extended the SVM to survival data and handled censoring based on a binary classification approach: SVM learning using privileged information (LUPI) [13] and SVM with uncertain classes [14], both proposed for this purpose by Shiao and Cherkassky [15]. In both methods, the censored data are essentially weighted using the follow-up time, without considering the overall probability of the event at the end of the follow-up period.
In this paper, we propose an alternative extension that allows SVM to model time-to-event data based on a binary classification SVM. To do so, we assign a probability to the censored data using a conditional survival approach that considers the survival probability at each censoring time. Moreover, we propose using a semi-supervised version of SVM with local invariances to model time-to-event data, and we compare the performance of the proposed approaches with Cox proportional hazards regression, kernel Cox regression, and other SVM methods for survival analysis, such as LUPI and weighted SVM.

Related work
Traditional models for survival data: Kaplan-Meier estimator and Cox proportional hazards model
In survival analysis, the non-negative time-to-event (be it death or any other event) of a subject can be defined by the continuous random variable $T^*$. An important function related to time-to-event data is the survival function $S(t^*) = P(T^* \ge t^*)$, that is, the probability that an individual survives beyond time $t^*$. Due to censoring, $T^*$ is not always observable; instead we observe the pair (T, δ), where T is the time to censoring or to the event of interest and δ is the censoring indicator (0 for censored data and 1 for an event).
The empirical survival function is an estimate of the survival function and is commonly obtained with the non-parametric Kaplan-Meier estimator [16], computed as the product
$$\hat{S}(t) = \prod_{i:\, T_{(i)} \le t} \left(1 - \frac{\delta_{(i)}}{n - i + 1}\right), \qquad (1)$$
where n is the total number of individuals, $T_{(i)}$ are the order statistics of the observed times and $\delta_{(i)}$ is the censoring indicator of the i-th ordered observation. The estimator in (1) is a decreasing step function that changes only at event times.
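As an illustration, the product-limit computation can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in the paper: it assumes the usual form in which the event at the i-th ordered time multiplies the estimate by 1 − 1/(n − i + 1), it places events before censorings at tied times, and the function names `kaplan_meier` and `survival_at` are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(event_time, S(event_time))] for right-censored data.

    events[i] is 1 for an observed event and 0 for a censored time.
    """
    # Sort by time; at tied times, events (1) are processed before censorings (0)
    order = sorted(range(len(times)), key=lambda i: (times[i], -events[i]))
    n = len(times)
    surv = 1.0
    curve = []
    for rank, i in enumerate(order):
        at_risk = n - rank  # subjects still under observation at this ordered time
        if events[i] == 1:
            surv *= 1.0 - 1.0 / at_risk
            curve.append((times[i], surv))
    return curve


def survival_at(curve, t):
    """Evaluate the step function at time t (equal to 1 before the first event)."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s
```

For five subjects with times 1 to 5 and the event indicators 1, 0, 1, 1, 0, the estimate drops only at times 1, 3 and 4, and the censored subjects simply reduce the number at risk.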
A second important function in the analysis of time-to-event data is the hazard function. The Cox proportional hazards model [2], the most popular model in the analysis of survival data, is defined in terms of the hazard function:
$$\lambda(t \mid x_i) = \lambda_0(t) \exp(\langle x_i, \beta \rangle), \qquad (2)$$
where $\lambda(t \mid x_i)$ is the hazard at time t of observation i with covariate vector $x_i$, $\lambda_0(t)$ is the baseline hazard function, β is the vector of coefficients of the model and $\langle x_i, \beta \rangle$ is the dot product between $x_i$ and β, i.e., the linear predictor. The model assumes a baseline hazard that is common to all observations in the study population. In this model, the hazard of a subject increases multiplicatively with the covariates.
In the Cox proportional hazards model, the baseline hazard is modelled semi-parametrically, i.e., the baseline hazard does not need to be specified and the optimization function is based on a partial likelihood. The Cox model is more robust to outliers than other models because it uses only the rank ordering of the failure and censoring times. The partial likelihood accounting for censored observations can be expressed as:
$$L(\beta) = \prod_{i=1}^{n} \left[ \frac{\exp(\langle x_i, \beta \rangle)}{\sum_{j \in R_i} \exp(\langle x_j, \beta \rangle)} \right]^{\delta_i}, \qquad (3)$$
where $R_i$ is the set of individuals at risk of having an event at time $t_i$, $\delta_i$ is the censoring indicator of the observation with time $t_i$ and $x_i$ is the vector of covariates of observation i.
Applying the logarithm transformation to the partial likelihood, we obtain the log partial likelihood, which is maximized through the Newton-Raphson algorithm. The maximum partial likelihood estimator is asymptotically unbiased, efficient and normally distributed [17].
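To make the quantity being maximized concrete, the following sketch evaluates the log partial likelihood for a given coefficient vector. It is illustrative only: it assumes that the risk set of an event at time t_i contains every subject whose observed time is at least t_i, it applies no correction for tied event times, and it does not perform the Newton-Raphson maximization. The function name is hypothetical.

```python
import math


def cox_log_partial_likelihood(beta, X, times, events):
    """Sum over events of <x_i, beta> - log(sum over the risk set of exp(<x_j, beta>))."""
    # Linear predictor <x_i, beta> for every subject
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:
            # Risk set: subjects still under observation at time t_i
            risk = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
            total += eta[i] - math.log(risk)
    return total
```

With all coefficients equal to zero, each event contributes minus the logarithm of the size of its risk set, which gives a quick sanity check for any optimizer built on top of this function.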

Kernel Cox regression
This is a penalized version of the Cox model in which a kernel is used to model the hazard as a function of the covariates. In the general Cox model, the hazard for observation i at time t, with a vector of covariates $x_i$, can be expressed as
$$\lambda(t \mid x_i) = \lambda_0(t) \exp(f(x_i)), \qquad (4)$$
where $\lambda_0(t)$ is the unspecified baseline hazard function and $f(x_i)$ is an arbitrary function. Li and Luan [6] proposed using the negative log partial likelihood as a loss function and reformulated the problem as finding the function f that solves the penalized problem
$$\min_{f \in H} \; -\frac{1}{n} \sum_{i=1}^{n} \delta_i \left[ f(x_i) - \log \sum_{j \in R_i} \exp(f(x_j)) \right] + \xi \lVert f \rVert_H^2, \qquad (5)$$
where f is assumed to belong to a reproducing kernel Hilbert space H defined by a kernel function k, and ξ > 0 is a regularization parameter [18]. The solution to this problem is given by the representer theorem [18], by which the optimal f(x) has the form
$$f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i) + b. \qquad (6)$$
The optimal $\alpha = (\alpha_1, \ldots, \alpha_n)$ in (6) can be found by plugging (6) into (5), which results in a convex optimization problem that can be solved with any unconstrained optimization method. The term b is the intercept or bias, usually computed as the average error between the target and predicted value.

Survival analysis using the SVM based on binary classification
Two approaches have been proposed in this class of models by Shiao and Cherkassky [15]: the LUPI approach developed by Vapnik and Vashist [13] and the SVM with uncertain classes developed by Niaf et al. [14]. LUPI uses the censoring information as privileged information (only available for the training data) and thus includes additional information to enrich the learning process. Two different spaces are described: the decision space and the correcting space (the one with the censoring information). SVM with uncertain classes allows the class membership of some observations to be defined imperfectly, i.e., with some degree of confidence regarding the class.
Shiao and Cherkassky suggested measuring the privileged information for LUPI and the SVM uncertainty using the proportion of follow-up time that the i-th censored subject contributes. Therefore, for the censored observation i, the weight or probability assigned to the observation is $W_i = T_i / \tau$, where τ is the maximum follow-up time in the study cohort. For events, this value is fixed at 0.
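The proportional follow-up rule is simple enough to state as code. The sketch below assumes that τ defaults to the largest observed time when it is not supplied, which is a choice made here for illustration; the function name is hypothetical.

```python
def proportional_followup_weights(times, events, tau=None):
    """W_i = T_i / tau for censored observations and 0 for events."""
    if tau is None:
        tau = max(times)  # assumption: maximum observed follow-up stands in for tau
    weights = []
    for t, d in zip(times, events):
        if d == 1:
            weights.append(0.0)  # events carry a fixed value of 0
        else:
            weights.append(min(t / tau, 1.0))  # fraction of follow-up contributed
    return weights
```

A subject censored halfway through a 10-unit follow-up receives 0.5 regardless of how many events occurred in the cohort, which is the limitation addressed by the conditional survival weighting proposed later in the paper.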

LUPI SVM
The LUPI approach is based on a triplet $(x_i, x_i^*, y_i)$ for i = 1, …, n observations, where $x_i \in \mathbb{R}^d$, $x_i^* \in \mathbb{R}^k$ and $y_i \in \{\pm 1\}$. The pairs $(x_i, y_i)$ are the usual training data and $x_i^*$ defines the privileged information, i.e., the information (variables) that is only present when modelling the data. The privileged information is not available when predicting the class of a new observation. In the LUPI approach two different spaces are described: i) the space related to x, known as the decision space, which is the same feature space used in standard SVM, and ii) the space related to $x^*$, known as the correcting space, which contains the privileged information about the training data and is not available for predictions of future observations. LUPI estimates the decision function and corrects it using the correcting function via the privileged information. The main optimization problem is expressed as in equation (7).
In equation (7), w is the weight vector of the separating hyperplane, $x_i$ is the vector of covariates for subject i, $\xi_i$ are the slack variables and b is the bias term of the hyperplane of the decision space. The analogous parameters $w^*$, $x_i^*$ and $b^*$ belong to the correcting space.
The decision function and the correcting function depend on the decision and correcting spaces, respectively. Although the decision function has the same expression as in the usual SVM, the coefficients of the LUPI decision function depend on the kernels in both spaces. The SVM and LUPI solutions are exactly the same when the privileged information is rejected (when γ tends to 0 in expression (7)).
The time to follow-up and time-to-event are observable and known in the training set but not in the test set. Thus, the censoring information that is only present in the training set can be used as privileged information. Shiao and Cherkassky proposed using the pair (T i , W i ) as the privileged information.

Uncertainty SVM
This method allows the class of some observations to be defined imperfectly by assigning them an uncertainty in their class. For these uncertain observations, a confidence level or probability regarding the class is provided. We will refer to the Uncertainty SVM onward in this manuscript as pSVM (probabilistic SVM).
The pSVM assigns observations to a class through a hinge loss and estimates the probability of belonging to the class through the ϵ-insensitive cost function. Given an observation i, we define the pair $(x_i, l_i)$ as the training set of input vectors along with their corresponding group of classes. These classes are defined separately for the observations i = 1, …, n with known (perfectly defined) classes and for the (m − n − 1) observations with uncertain classes, where $p_i$ is the uncertainty associated with $x_i$ in a regression setting, that is, the posterior probability for class 1.
In the resulting optimization problem (10), w is the weight vector of the hyperplane, $x_i$ is the vector of covariates for subject i, b is the bias term of the hyperplane, and $y_i$ is the class of subject i. The terms $z_i^-$ and $z_i^+$ are boundaries depending on $p_i$. If n = m, the problem reduces to a hard-margin SVM. To allow misclassification, slack variables $\xi_i$, $\xi_i^-$ and $\xi_i^+$ are introduced into the optimization problem (10).
The proportional follow-up time approach computes the probability $p_i$ for censored data, and subsequently $z_i^-$ and $z_i^+$, as $T_i / \tau$, where τ is the maximum follow-up time established in the study cohort. For an event, this value is fixed at 0.

Weighted SVM
Another approach that has not been tested in the literature is to address the survival-SVM as a weighted SVM (wSVM) problem (see eq. 11). The basic idea of wSVM is to assign to each observation a different weight according to its relative importance in the class such that different data points contribute differently to the learning of the decision surface [19]. This methodology is particularly useful to handle outliers because upon detecting an outlier, we can diminish its effect in the estimation of the separating hyperplane.
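In its standard soft-margin form, the wSVM problem can be written as
$$\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} W_i \xi_i \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \qquad (11)$$
where $W_i$ is the weight or probability of each observation. A censored observation can be seen as a partial or weighted observation: an observation censored at the very beginning of the study, for instance, adds almost no information to the data and has a weight close to 0, whereas an observation censored just before the end of the follow-up period should be treated almost as a complete observation (a weight close to 1).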
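The role of the weights can be seen by evaluating a weighted hinge objective directly. The sketch below assumes the common linear soft-margin formulation, in which each observation's hinge loss is multiplied by its weight W_i before being scaled by the cost C; it only evaluates the objective for a candidate hyperplane and is not a solver. The function name is hypothetical.

```python
def weighted_svm_objective(w, b, X, y, weights, C=1.0):
    """0.5 * ||w||^2 + C * sum_i W_i * max(0, 1 - y_i * (<w, x_i> + b))."""
    margin_term = 0.5 * sum(w_j * w_j for w_j in w)
    loss = 0.0
    for x_i, y_i, W_i in zip(X, y, weights):
        score = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        # A small weight shrinks the penalty for violating the margin
        loss += W_i * max(0.0, 1.0 - y_i * score)
    return margin_term + C * loss
```

An observation with a weight near 0, such as a subject censored shortly after enrolment, barely moves the objective even when it lies on the wrong side of the margin, whereas a weight of 1 recovers the ordinary hinge penalty.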

Proposed approaches
Proposed weighting methods
Censored data have so far been handled by assigning a weight or probability to an observation assuming proportionality of follow-up time, i.e., a weight linearly associated with the observed follow-up period. The approach consists in computing the weight as $W_i = T_i / \tau$, where τ is the maximum follow-up time in the study cohort and $T_i$ is the censoring time of observation i. For events, this value is fixed and equal to that of non-events who completed the follow-up period, i.e., subjects who were free of events at the end of the study period. Therefore, this method does not account for the overall survival probability. This information is important in time-to-event data because, for example, an observation censored at the beginning of follow-up is more likely to have an event if the overall survival probability is 0.1 than if it is 0.9. The methods currently proposed in the SVM literature do not use this information because they rely on proportionality of follow-up time. We propose weighting the censored observations based on the probability of surviving $t_i + z$ years given that participant i is still alive at the censoring time $t_i$ (conditional survival probability),
$$\hat{S}_z(t_i) = \frac{\hat{S}(t_i + z)}{\hat{S}(t_i)}, \qquad (12)$$
which can be estimated through the Kaplan-Meier estimator in eq. (1).
This modification would improve the accuracy of the method by including in the weighting process information about the overall survival probability of the cohort and the shape of the survival curve. More specifically, our proposal is to weight the censoring information (or to assign it a probability of being an event or non-event) using the conditional survival probability in the following way for each specific SVM method:
For the LUPI method, our proposal is to define the weight (importance) of the privileged information based on the Kaplan-Meier estimate of eq. (12), i.e., $x_i^* = \hat{S}_z(t_i) = \hat{S}(t_i + z) / \hat{S}(t_i)$.
For pSVM, our proposal is to compute the uncertain probability of the censored data based on the conditional probability of having the event using the Kaplan-Meier estimator (12).
For wSVM, our proposal, following the same idea as in the previous methods, is to use a conditional survival approach based on the Kaplan-Meier estimator of (12). Events and non-events at the end of the follow-up time will have a weight of 1.
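A short sketch of the conditional survival computation follows. It assumes a Kaplan-Meier curve in its usual product-limit form and, purely for illustration, chooses z so that t_i + z equals the end of follow-up τ; the text does not fix this choice, so it should be read as an assumption. The helper names `km_curve`, `km_at` and `conditional_survival` are hypothetical.

```python
def km_curve(times, events):
    """Kaplan-Meier steps [(event_time, S)], events before censorings at ties."""
    order = sorted(range(len(times)), key=lambda i: (times[i], -events[i]))
    n, surv, curve = len(times), 1.0, []
    for rank, i in enumerate(order):
        if events[i] == 1:
            surv *= 1.0 - 1.0 / (n - rank)
            curve.append((times[i], surv))
    return curve


def km_at(curve, t):
    """Value of the step function at time t."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s


def conditional_survival(curve, t, z):
    """Estimated probability of surviving to t + z given survival to t."""
    s_t = km_at(curve, t)
    return km_at(curve, t + z) / s_t if s_t > 0 else 0.0


# Illustrative data: follow-up ends at tau = 5 (assumed horizon for t + z)
times, events, tau = [1, 2, 3, 4, 5], [1, 0, 1, 1, 0], 5
curve = km_curve(times, events)
censored_weights = {
    t: conditional_survival(curve, t, tau - t)
    for t, d in zip(times, events)
    if d == 0 and tau > t
}
```

In this toy cohort the subject censored at time 2 receives a conditional survival probability of one third, because survival falls from 0.8 at time 2 to about 0.27 at the end of follow-up; the proportional rule would have used 2/5 irrespective of the events observed.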

Local invariances SVM
Alternatively, censoring can be treated as a semi-supervised problem, an approach that has not been considered in the SVM literature. In the semi-supervised setting, some observations have a known class and others have an unknown class. The goal is to learn from both types of data to find the decision surface that separates the two classes. We propose to treat censored observations as unknown classes, i.e., observations whose event status within the follow-up period is unknown, and events and non-events at the end of follow-up as known classes, i.e., observations with known event status.
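The labelling rule described in this paragraph can be sketched as follows. The string labels and the function name are hypothetical, and τ denotes the end of the follow-up period.

```python
def semi_supervised_labels(times, events, tau):
    """Split observations into known classes and unknown (censored) cases."""
    labels = []
    for t, d in zip(times, events):
        if d == 1 and tau >= t:
            labels.append("event")       # event observed within follow-up
        elif t >= tau:
            labels.append("non-event")   # still event-free at the end of follow-up
        else:
            labels.append("unknown")     # censored before tau: status not observed
    return labels
```

Only the observations labelled "event" or "non-event" would enter the supervised loss, while the "unknown" ones would be handled by the loss for unlabelled data.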
In the non-SVM specific literature [20], a framework has been proposed for semi-supervised learning in the reproducing kernel Hilbert space (RKHS) H associated with a given kernel function k, using local invariances that explicitly characterize the behaviour of the target function around both known and unknown data. Three types of invariances have been proposed: i) invariance to small changes in the observations, which restricts the gradient of the function to be small at the observed data; ii) invariance to averaging across a small neighbourhood around observations, which restricts the function value at each observation to be similar to the average value in a small neighbourhood of that observation; and iii) invariance to local transformations, such as rotational and translational invariance (especially relevant in problems such as handwritten digit recognition and vision problems). The third invariance is not relevant for survival analysis. The optimization problem (13) includes the hinge loss for known data and the ϵ-insensitive loss for unknown data to obtain a semi-supervised SVM with local invariances (inSVM).
In the optimization problem (13), g ∈ H is the target function and $z_i$ is the representer of the functional associated with the invariance [20]. In particular, for the Gaussian kernel (14) with parameter σ, the representer of the derivative functional $L_{x_i, j}(f) = \left. \frac{\partial f}{\partial x_j} \right|_{x_i}$, for any f in the RKHS H associated with the Gaussian kernel, and the dot product between two representers of the derivative functional can be written explicitly, where i and p are the subject indices and j and q are the indices of the specific variables in the corresponding x vectors.
Another type of local invariance is local averaging. Considering the Gaussian kernel in (14) and a Gaussian density, and given that the convolution of two Gaussian densities is a Gaussian density, the representer of the local averaging functional (18), for any f in the RKHS H associated with the Gaussian kernel, and the dot product between two representers of the averaging functional can likewise be written explicitly, where $\sigma_k$ and $\sigma_p$ are the sigma values specified for the Gaussian kernel and the Gaussian density, respectively, d is the number of covariates and $z_{x_j}(x_i)$ is as defined in eq. (18).
Calculations and proofs associated with the inSVM methodology can be found in Lee et al. [20].
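To illustrate the gradient invariance, the sketch below evaluates the representer of the derivative functional for a Gaussian kernel and checks it against a finite difference. It rests on an assumption about the kernel parametrization, namely k(x, y) = exp(−‖x − y‖² / (2σ²)), which may differ from the form used in eq. (14). Under that assumption the reproducing property gives the value of the representer at a point x as the partial derivative of k(u, x) with respect to u_j at u = x_i. All function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Assumed parametrization: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def derivative_representer(x_i, j, x, sigma):
    """Value at x of the representer of f -> df/dx_j evaluated at x_i."""
    return (x[j] - x_i[j]) / sigma ** 2 * gaussian_kernel(x_i, x, sigma)


def finite_difference(x_i, j, x, sigma, h=1e-6):
    """Central difference of u -> k(u, x) in coordinate j at u = x_i."""
    up, down = list(x_i), list(x_i)
    up[j] += h
    down[j] -= h
    return (gaussian_kernel(up, x, sigma) - gaussian_kernel(down, x, sigma)) / (2.0 * h)
```

Penalizing the norm of these representers applied to the target function keeps its gradient small at the observed points, which is the first of the invariances listed above.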

Implementation
The SVM methods presented in this paper (LUPI, pSVM, wSVM and inSVM) had not been implemented in the widely used R software [21]. Therefore, we have written R functions that will be included in an R package.

Simulation studies
We conducted simulation studies to compare the proposed approaches in different scenarios. Simulations included varying sample size (50 and 300 subjects), 30 predictor variables (or features), and a proportional and non-proportional hazard of comparison groups. Moreover we varied the proportion of censoring (10-30%) and the distribution of the follow-up time (uniform, positive skewed and negative skewed). Those choices were based on realistic scenarios encountered in data we previously analysed. Based on the proportional hazards framework, the time-to-event was generated using the Gompertz distribution.
Specifically, the 30 predictor variables were generated following a multivariate normal distribution with mean defined by a realization of a uniform distribution U(0.03, 0.06). The variables were classified in four groups according to their pairwise correlation: no correlation (around 0), low correlation (around 0.2), medium correlation (around 0.5) and high correlation (around 0.8). These four levels of correlation reflect the correlation of predictors in the biomedical field, such as in transcriptional profiles or the inflammatory process. These variables were used to compare two scenarios of time-to-event data using the Cox proportional hazards model. In the proportional hazards framework, the time-to-event variable can be generated based on the Gompertz distribution [22] as
$$T = \frac{1}{\alpha} \log\left(1 - \frac{\alpha \log(U)}{\gamma \exp(\langle \beta, x_i \rangle)}\right), \qquad (20)$$
where U follows a Uniform(0, 1) distribution, β is the vector of coefficients associated with each variable, and α ∈ (−∞, ∞) and γ > 0 are the scale and shape parameters, respectively, of the Gompertz distribution. The values for these parameters were selected so that overall survival was around 0.6 at 18 months of follow-up.
To generate scenarios in which the hazards of the comparison groups were not proportional, noise was added to the exp(〈β, x_i〉) term in eq. (20), forcing the hazard to follow a shared frailty model [23]. The frailty was chosen so that there were 5 groups of observations of the same size that shared a common frailty.
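The data-generating step can be sketched with the inverse transform method. The sketch assumes the Gompertz inversion T = (1/α) log(1 − α log(U) / (γ exp(η))), where η is the linear predictor, and models the shared frailty as a multiplicative factor on the hazard drawn from a gamma distribution with mean 1. The frailty distribution, the parameter values and the function names are illustrative assumptions, not the settings used in the simulations, and α is taken to be positive so that the logarithm is always defined.

```python
import math
import random


def gompertz_time(eta, alpha, gamma, u, frailty=1.0):
    """Invert the Gompertz survival function at survival probability u."""
    rate = gamma * frailty * math.exp(eta)
    return math.log(1.0 - alpha * math.log(u) / rate) / alpha


def gompertz_survival(t, eta, alpha, gamma, frailty=1.0):
    """S(t) = exp(-(rate / alpha) * (exp(alpha * t) - 1))."""
    rate = gamma * frailty * math.exp(eta)
    return math.exp(-(rate / alpha) * (math.exp(alpha * t) - 1.0))


rng = random.Random(0)
alpha, gamma = 0.05, 0.02  # hypothetical values chosen only for the example
# Five groups sharing a common frailty (assumed gamma distribution with mean 1)
group_frailty = [rng.gammavariate(2.0, 0.5) for _ in range(5)]
times = []
for subject in range(50):
    eta = rng.gauss(0.0, 0.5)   # stand-in for the linear predictor of a subject
    u = 1.0 - rng.random()      # uniform on (0, 1], avoids log(0)
    times.append(gompertz_time(eta, alpha, gamma, u, group_frailty[subject % 5]))
```

Setting every frailty to 1 recovers the proportional hazards scenario, while group-specific frailties make the marginal hazards of the comparison groups non-proportional.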

Tuning parameters and test performance
For the cost parameters C andC, we selected the values 0.1, 1, 10, and 100, and for the Gaussian kernel parameters σ the values 0.25, 0.5, 1, 2, and 4. A two-step approach was used to estimate tuning parameters and evaluate operational characteristics of the SVM models using the best combination of tuning parameters: in the first step, for each combination of parameters, 10 training datasets were fitted and each of them was validated using 10 different validation datasets. The combination of parameters with largest accuracy was used to measure the performance of the models in the second step. In the second step, new 10 datasets were simulated for estimation of models given the best combination of tuning parameters found in the first step and for each of those, 10 testing datasets were simulated to compare the performance of the SVM models based on the following metrics: accuracy (proportion of correctly classified observations), Matthews' correlation, normalized mutual information, area under the ROC curve (AUC-ROC), sensitivity, specificity and F1 score. Therefore, 100 datasets have been tested and used to compute the mean and the standard deviation of the metrics used as a summary performance of each method.
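Several of the reported metrics follow directly from the confusion matrix, as the sketch below shows. It covers accuracy, sensitivity, specificity, F1 score and Matthews' correlation; normalized mutual information and AUC-ROC are omitted for brevity. Returning 0.0 for an undefined ratio is a convention adopted here, and the function name is hypothetical. The last line enumerates the tuning grid quoted in the text.

```python
import math
from itertools import product


def classification_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix metrics for binary labels; undefined ratios become 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Tuning grid from the text: 4 cost values x 5 Gaussian kernel parameters
grid = list(product([0.1, 1, 10, 100], [0.25, 0.5, 1, 2, 4]))
```

Each of the 20 grid combinations would be scored with such metrics on the validation datasets, and the combination with the largest accuracy carried forward to the second step.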

Real-life datasets
We applied our approaches to three datasets from the "Survival" package available in the R software repository [17]. Parameters were tuned and the accuracy, AUC-ROC, sensitivity, specificity and F1 score were estimated through 5-fold nested cross-validation, repeating the process in 10 resampled datasets. The follow-up time was censored to the third quartile of the maximum observed follow-up time in each dataset. We used the same analytical methods and the same grid of tuning parameter values as in the simulation studies described above. Briefly, datasets from the following studies were analyzed:
Lung Study: this study was conducted by the North Central Cancer Treatment Group (NCCTG) and aimed to estimate the survival of patients with advanced lung cancer. The available dataset comprised 167 observations, 89 events during the follow-up time of 420 days, and 10 variables. A total of 36 observations were censored before the end of follow-up. The overall survival probability at the end of the follow-up period was 0.40.
Stanford2 Study: this dataset was extracted from the Stanford Heart Transplant study and comprised 157 observations and 4 variables, with a maximum follow-up time of 1264 days. A total of 88 events were recorded and 29 observations were censored before the end of follow-up. The overall survival probability at the end of the follow-up period was 0.41.
PBC Study: this study was nested in the Mayo Clinic trial of primary biliary cirrhosis (PBC) of the liver, conducted between 1974 and 1984. A total of 424 PBC patients were referred to Mayo Clinic during the ten-year interval and met the eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The data subset used in the current paper contains 258 observations and 22 variables. Of the whole cohort, 93 observations experienced the event, 65 completed the follow-up period without presenting an event and 100 were censored before the end of the follow-up time of 2771 days. The overall survival probability at the end of the follow-up period was 0.57.

Simulated datasets
In the four simulated scenarios with a sample size of 300 in which the hazards of the comparison groups were proportional, the Cox proportional hazards model and pSVM (linear kernel) performed comparably to inSVM (gradient and averaging). Specifically, the accuracy was 0.89 for the Cox proportional hazards model, 0.87 for the linear pSVM and 0.84 for inSVM (Table 1). The AUC-ROC of the three models ranged from 0.92 to 0.96. Generally, the distribution and proportion of censoring did not affect the results, with the inSVM-gradient being the most sensitive to the proportion of censored observations. LUPI methods (proposed Kaplan-Meier and proportional approaches) performed similarly to pSVM using a radial kernel. The accuracy under both 10% and 30% censoring was 0.77.
Conversely, when the sample size was decreased to 50, the proportion of censored observations affected all metrics of predictive accuracy, even for data simulated under the proportional hazards assumption (Table 2). pSVM, inSVM and kernel Cox regression had the best performance in the 10% censoring scenario, with an accuracy of approximately 0.75. The Cox model, wSVM and pSVM-radial had the worst performance, with an accuracy of 0.62-0.67. Predictive accuracy decreased slightly when the proportion of censoring increased to 30%, except for wSVM.
The performance of all approaches was worse under non-proportional hazards (Tables 3 and 4). The difference between proportional and non-proportional hazards was larger in the 300-observation scenario (Table 3) than in the 50-observation scenario (Table 4).
In all scenarios, approaches based on conditional survival performed better than those based on proportional follow-up time, particularly when the sample size was 50 observations and especially when hazards were non-proportional. Overall differences between both methods were small (around 0.02 units in accuracy and around 0.02 units in AUC-ROC) but consistent.
The inSVM, based on both gradient and averaging approach, performed closest to the best method within each scenario. Although the averaging approach was slightly better and more insensitive to the proportion of censored observations, there were no clear differences between the averaging and gradient approach.

Real-life datasets
In the three compared datasets, the conditional survival approach attained the largest predictive accuracy, based on accuracy values and AUC-ROC (Table 5), when compared with the proportionality approach within each method. Within each dataset the performance of the LUPI method was one of the best, with almost no difference between the conditional survival and the proportionality approach. The inSVM averaging approach performed better than the gradient approach in both accuracy and AUC-ROC in all three datasets, and it was one of the best methods within each of the datasets.

Table 1 Accuracy results in a 300 observations proportional hazards, zero skew, 10 and 30% censoring. Prediction accuracy of all tested approaches when simulated data was generated with 300 observations and the following assumptions: proportional hazards, zero skew, 10 and 30% censoring. The table summarizes the mean (and standard deviation) of the following metrics: accuracy, Matthews' correlation, normalized mutual information (NMI), area under the Receiver Operating Characteristic curve (AUC-ROC), sensitivity (Sn), specificity (Sp) and F1-score (F1)

Table 3 Accuracy results in a 300 observations non-proportional hazards, zero skew, 10 and 30% censoring. Prediction accuracy of all tested approaches when simulated data was generated with 300 observations and the following assumptions: non-proportional hazards, zero skew, 10 and 30% censoring.

Discussion and conclusions
In this article we proposed alternative methods and extensions within the SVM for binary classification framework for dealing with censored data. Specifically, a conditional survival approach for weighting censored observations when fitting SVM through LUPI, Uncertainty SVM, Weighted SVM, and a semi-supervised SVM with local invariances. The former takes into account the events and follow-up period including more information in the weighting process than using a proportionality of time approach. The latter is a semi-supervised SVM with local invariances method that allows using two types of invariances: gradient over variables and averaging over observations. We showed that both approaches outperformed the other studied methods on most compared metrics.
As expected, when the sample size was as limited as 50 observations and the proportional hazards assumption was violated, the Cox proportional hazards model had a poorer performance. Results with the wSVM, were highly dependent on the proportion of censoring but not so much on the distribution of time to censor. Moreover, wSVM results were comparable to the LUPI results, and that has also been observed by Lapin et al. [24]. This similarity may be explained by the common unique information (censored data) used by both methods. This similarity suggests that the wSVM method may be more advantageous in practice because is much less time consuming, although is less robust than the LUPI method.
When applying the LUPI approach, we have included the censoring data as privileged information in the correcting space. Our results were consistent with Shiao and Cherkassky [15], i.e., LUPI performs worse than the Cox proportional hazards model and pSVM in all compared scenarios. Actually, some of our simulated scenarios were similar to simulated scenarios used by Shiao and Cherkassky. The correcting space is used as complementary information to be combined with the decision space. Therefore, is not directly used to define the class of the observations, as it is in pSVM or wSVM. We agree with Serra-Toro et al. [25], that further work is needed to fully understand the LUPI approach and how the correcting and decision spaces interact.
When the sample size was larger, the pSVM and the Cox proportional hazards model performed similarly and both performed better than kernel Cox regression, with the linear kernel being slightly superior to the radial kernel, as observed by Shiao and Cherkassky [15]. A finer grid search might improve the overall performance of the nonlinear approach. Our results and those of Shiao and Cherkassky [15] were consistent with regard to the superior performance of the linear pSVM compared with the pSVM using a Gaussian kernel.
The conditional survival approach that we proposed performed better than the proportional follow-up time approach in all compared scenarios. The conditional method takes into account the events and the follow-up period; it thereby includes more information and is more accurate in estimating the weights than the proportional follow-up time approach. The latter assumes linearity and does not take into account specificities of the data, for instance, variability in survival that is intrinsic to the data. However, it should be noted that the conditional approach assumes that the survival probability of the test data is similar to that of the training data. This is a reasonable assumption, but the prediction accuracy may be affected depending on how much the survival probabilities differ.

Table 5 Real-life datasets metrics.

With respect to the proposed inSVM approach (both gradient and averaging), in the scenarios with 300 observations, results were very similar to those of Cox, kernel Cox regression and pSVM. However, in scenarios in which the number of observations was small and close to the number of variables, the inSVM outperformed all other approaches in all compared metrics, and it was one of the most robust approaches to a varying number of variables and to violations of the proportional hazards assumption. Although inSVM is a semi-supervised approach that does not account for censoring, its performance is comparable to that of other methods that do account for censoring. This could be explained by our assumption that censoring is independent of the events and representative of the data. Therefore, patterns in the observed data are applicable to the censored observations, and the local invariance assumptions should be valid. Additionally, an advantage of this approach is that no extra assumptions about the censoring distribution are necessary. The main drawback of the local invariances approach is that it is computationally intensive, especially the gradient approach.
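To make the contrast between the two weighting schemes discussed above more concrete, the following is a minimal sketch, not the implementation used in this study. It assumes that the weight of an observation censored at time c expresses the confidence that the subject is still event-free at a prediction horizon tau. Under that assumption, the conditional weight is taken as S(tau)/S(c), with S estimated by Kaplan-Meier on the training data, and the proportional follow-up time weight is taken as c/tau. The function names, the horizon `tau` and the toy data are all hypothetical.

```python
def kaplan_meier(times, events):
    """Return a step function S(t) estimated by Kaplan-Meier.

    times  : follow-up times
    events : 1 if the event was observed, 0 if censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    steps = []
    surv = 1.0
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        steps.append((t, surv))

    def survival(t):
        value = 1.0
        for step_time, step_surv in steps:
            if step_time <= t:
                value = step_surv
            else:
                break
        return value

    return survival


def conditional_weight(c, tau, survival):
    """Assumed conditional weight: P(T > tau | T > c) = S(tau) / S(c)."""
    if c >= tau:
        return 1.0  # followed up to the horizon: known to be event-free at tau
    s_c = survival(c)
    return survival(tau) / s_c if s_c > 0 else 0.0


def proportional_weight(c, tau):
    """Assumed proportional follow-up time weight: linear in c."""
    return min(c / tau, 1.0)


# Hypothetical toy data: follow-up times and event indicators.
times = [2, 3, 4, 5, 6, 8, 9, 10]
events = [1, 0, 1, 1, 0, 1, 0, 1]
tau = 8

S = kaplan_meier(times, events)
for c, e in zip(times, events):
    if e == 0:  # only censored observations need a weight
        print(c, conditional_weight(c, tau, S), proportional_weight(c, tau))
```

In this sketch the proportional weight depends only on the ratio c/tau, whereas the conditional weight changes with the events observed between c and tau. This is the sense in which the conditional approach uses more information, and it also shows why that approach relies on the training survival curve being representative of the test data.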
All simulations were based on balanced data, i.e., the proportions of events and non-events were similar. SVM models are sensitive to imbalance between classes. Therefore, future investigations should consider imbalanced scenarios.
Given the large number of compared methods and datasets, the presented work was restricted to the two most commonly used kernels, linear and Gaussian. Further work should evaluate the performance of the proposed methods using other kernels. Additionally, we addressed overfitting through standard procedures: by simulating completely different datasets to tune parameters and validate models, and by applying nested cross-validation to estimate and validate parameters when analysing real data. However, future work may assess the performance of the proposed methods with more simulation scenarios and a larger range of parameter values.
Among the compared methods, the proposed inSVM method using the conditional survival approach was the most robust under different scenarios and is a good alternative to other time-to-event methods. When analysing sparse data, it is a method to be considered and recommended, since it outperforms other methods even when the proportional hazards assumption is not met, a situation that often occurs in the analysis of biomedical data and biomarkers.