Correntropy induced loss based sparse robust graph regularized extreme learning machine for cancer classification

Background: As a machine learning method with high performance and excellent generalization ability, extreme learning machine (ELM) is gaining popularity in various studies, and ELM-based methods have been proposed for many fields. However, robustness to noise and outliers remains the main problem affecting the performance of ELM.

Results: In this paper, an integrated method named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM) is proposed. The introduction of correntropy induced loss improves the robustness of ELM and weakens the negative effects of noise and outliers. By using the L2,1-norm to constrain the output weight matrix, we tend to obtain a sparse output weight matrix and thus a simpler single hidden layer feedforward neural network model. By introducing graph regularization to preserve the local structural information of the data, the classification performance of the new method is further improved. Besides, we design an iterative optimization method based on the idea of half-quadratic optimization to solve the non-convex problem of CSRGELM.

Conclusions: The classification results on benchmark datasets show that CSRGELM can obtain better classification results than other methods. More importantly, we also apply the new method to the classification of cancer samples and obtain good classification results.

is to randomly initialize an input weight matrix A and keep it fixed throughout the process. Then, using a nonlinear piecewise continuous activation function $g(x)$, the input-layer data are mapped into the ELM feature space, and a hidden layer output matrix $H = [h(x_1), h(x_2), \ldots, h(x_N)]^T$ is obtained. Finally, by solving a ridge regression problem [13], the output weights $\beta = [\beta_1, \beta_2, \ldots, \beta_L]^T$ connecting the hidden layer and the output layer can be determined [14]. Since the output weight matrix does not need to be solved iteratively, ELM can achieve better generalization performance at a faster speed than the traditional backpropagation algorithm [2,3,7]. Because of its simple theory, high efficiency, and low manual intervention, ELM has been used as a tool for various applications, such as image classification [15,16], label learning [17], image quality assessment [18], and traffic sign recognition [19].
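The three steps described above (fixed random input weights, a nonlinear hidden-layer mapping, and a ridge regression for the output weights) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' MATLAB implementation: the function names `elm_fit` and `elm_predict`, the Gaussian initialization, the hidden bias vector `b`, and the `ridge` value are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation used as the hidden-layer nonlinearity g(x)."""
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, ridge=1e-3, seed=0):
    """Fit a basic ELM: random fixed input weights, ridge-regressed output weights."""
    rng = np.random.default_rng(seed)
    # Input weight matrix A (and an assumed bias b) are drawn once and never updated.
    A = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    # Hidden layer output matrix H has one row per sample and one column per hidden node.
    H = sigmoid(X @ A + b)
    # Output weights from a ridge regression, solved in closed form (no iteration).
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ T)
    return A, b, beta

def elm_predict(X, A, b, beta):
    """Map inputs through the fixed hidden layer and apply the output weights."""
    return sigmoid(X @ A + b) @ beta
```

For classification, `T` would typically be a one-hot label matrix and the predicted class is the column with the largest output.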
Although ELM has been widely used, its robustness and sparseness remain active research topics. Huang et al. proposed RELM in [5], in which the L2-norm is used to constrain both the loss function and the output weight matrix. Their experimental results showed that RELM was better than the original ELM. However, the square loss based on the L2-norm amplifies the negative impact of noise and outliers and leads to inaccurate results. In [9], Li et al. introduced the L2,1-norm into ELM as both the loss function and the regularization constraint, and proposed a new method named LR21ELM. The classification results showed that the robustness of the L2,1-norm was significantly better than that of the L2-norm.
As a local similarity measure, correntropy is proposed based on information theory and the kernel method [20]. Through a nonlinear feature mapping, correntropy can project the data from the input space into the feature space. It also computes the L2-norm distance and defines a correntropy induced metric (CIM) in the feature space [21]. The correntropy induced loss [22] is defined as

$$C(t_i, f(x_i)) = 1 - \exp\left(-\frac{(t_i - f(x_i))^2}{2\sigma^2}\right),$$

where $t_i$ is the target vector, $f(x_i)$ is the prediction, and σ is the kernel bandwidth. Figure 1 depicts the correntropy induced loss function for different kernel bandwidths within the same error range. We can observe that correntropy induced loss is a non-convex, bounded, and robust loss function [23].
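To make the boundedness concrete, the loss defined above can be evaluated directly and compared with the square loss. The sketch below is a scalar illustration only; the function names are hypothetical and not taken from the paper.

```python
import math

def correntropy_loss(target, prediction, sigma):
    """Correntropy induced loss 1 - exp(-(t - f)^2 / (2 sigma^2)) for scalar inputs."""
    err = target - prediction
    return 1.0 - math.exp(-(err ** 2) / (2.0 * sigma ** 2))

def squared_loss(target, prediction):
    """Ordinary square loss, which grows without bound as the error grows."""
    return (target - prediction) ** 2

# A large error (an outlier) saturates the correntropy loss but dominates the square loss.
for error in (0.5, 2.0, 10.0):
    print(error, correntropy_loss(error, 0.0, sigma=1.0), squared_loss(error, 0.0))
```

Because the loss never exceeds 1, a single outlier contributes at most a fixed amount to the training objective, whereas its contribution to the square loss grows quadratically with the error.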
The robustness of correntropy to noise and outliers has been proved theoretically and experimentally. Ren et al. [21] integrated the correntropy loss and hinge loss (CH-loss) into ELM and proposed a robust extreme learning machine with the CH-loss (CHELM). They verified the robustness of the method at different noise levels. The results showed that correntropy loss could effectively reduce the influence of noise on classification results. In [24], Zhao et al. proposed the C-loss based ELM (CELM) and applied their method to estimate the power of small-scale turbojet engines. Chen et al. [25] introduced the correntropy loss to the multilayer ELM and proposed a robust multilayer ELM auto-encoder. The results showed that the feature extraction ability of the method was improved with the improvement of robustness.
In this paper, by integrating the correntropy induced loss into the ELM instead of the original L2-norm, an integrated model named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM) is proposed. Different from the traditional ELM, we use the L2,1-norm instead of the L2-norm to constrain the output weight matrix and reduce the complexity of the neural network model. Moreover, graph regularization is integrated into our method so that the neural network model can learn the local structural information between data. The main contributions of this paper are as follows:
(1) A new correntropy induced loss based sparse robust graph regularized extreme learning machine is proposed. Compared with the original ELM, the introduction of correntropy induced loss improves the robustness. The L2,1-norm is used as a sparse constraint to regularize the output weight matrix β, which reduces the complexity of the model. To preserve the manifold structure information of the original data, graph regularization is introduced into our method.
(2) Based on the theory of [26], we design an iterative optimization method to cope with the non-convex problem of CSRGELM. The convergence of the new method is proved and its computational time complexity is analyzed. We also design experiments to verify the robustness of the method. It is observed that the robustness and classification ability of CSRGELM are better than those of ELM based on the traditional L2-norm loss function. Compared with other robust ELMs, CSRGELM also achieves competitive results.
(3) We first perform classification experiments on five benchmark datasets and evaluate the performance of CSRGELM through multiple evaluation measures. The results show that on most datasets, the classification results of CSRGELM are superior to those of other methods.
(4) The new method is applied to the cancer sample classification problems of integrated TCGA datasets. On both the integrated binary datasets and the integrated multiclass datasets, the classification performance of CSRGELM is superior to that of other methods. The experimental results indicate that CSRGELM can be a powerful tool for studying biological omics data.

Results
First, five benchmark datasets are used to evaluate the classification performance of RELM, L2,1-RFELM, LR21ELM, CELM [24], and CSRGELM. Then, CSRGELM is applied to the cancer sample classification tasks of the TCGA integrated datasets. In the experiments, the sigmoid function is chosen as the activation function. Classification performance is evaluated with four commonly used measures: Accuracy (Acc), Precision (Pre), Recall, and F-measure (F-mea). The experiments are described in detail below.

Evaluation criteria
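According to Table 1, the measures are defined as follows:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Pre} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \text{F-mea} = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Recall}}{\mathrm{Pre} + \mathrm{Recall}},$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix.

For a multi-class dataset, we use one of the classes as the positive class and the remaining classes as the negative class to compute the accuracy, precision, recall, and F-measure. Finally, each measure is averaged over all classes. All methods are implemented in MATLAB R2016a on a computer with a 3.60-GHz CPU and 64 GB of memory.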
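The one-versus-rest averaging described above can be sketched as follows, assuming the usual confusion-matrix definitions of the four measures. The function name `one_vs_rest_metrics` and the convention of returning 0 when a denominator is zero are choices made for this example, not details taken from the paper.

```python
def one_vs_rest_metrics(y_true, y_pred):
    """Average accuracy, precision, recall and F-measure over classes (one-vs-rest)."""
    classes = sorted(set(y_true))
    n = len(y_true)
    sums = {"acc": 0.0, "pre": 0.0, "recall": 0.0, "f_mea": 0.0}
    for c in classes:
        # Treat class c as positive and every other class as negative.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        tn = n - tp - fp - fn
        pre = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_mea = 2 * pre * recall / (pre + recall) if pre + recall else 0.0
        sums["acc"] += (tp + tn) / n
        sums["pre"] += pre
        sums["recall"] += recall
        sums["f_mea"] += f_mea
    # Final score of each measure is the mean over all classes.
    return {name: total / len(classes) for name, total in sums.items()}

scores = one_vs_rest_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(scores)
```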

Datasets
We use five popular benchmark datasets to test the classification performance, and every dataset has been widely applied in supervised problems [13, 27-29].
(a) Iris: Taken from the UCI database (https://archive.ics.uci.edu/ml/index.php), Iris is a multi-class classification dataset with 150 samples and 4 features, which is already widely used in unsupervised learning [30,31] and supervised learning [5].
(b) COIL20: As a multi-class classification image dataset, the Columbia object image library is often used as a benchmark dataset to test the performance of machine

Table 1 Classification results confusion matrix
To evaluate the performance of CSRGELM in practical applications, we apply CSRGELM to cancer classification. In recent years, cancer has become the biggest threat to human health. The most effective way to treat cancers has always been to develop different treatments for different types of cancers. Therefore, improving cancer classification is crucial to the progress of cancer treatments [32]. In this paper, four integrated TCGA datasets are used in the experiments. Known as the world's largest cancer genome database, the TCGA database is of immense value to the field of cancer research [33]. Several types of cancer data are included in the TCGA database. The details of benchmark datasets are listed in Table 2.
In the experiments, each integrated dataset is a combination of data from two or more cancers. In the integration process, to reduce the sample imbalance rate and ensure the credibility of the experimental results, we remove all normal samples and integrate only the disease samples of each cancer for classification experiments. Tables 3 and 4 list the information about the cancer data used in our experiments.

Convergence and sensitivity
There are four parameters (σ, λ, C, L) that need to be tuned in the experiments, and different combinations of parameters may produce different classification results. Hence, ten-fold cross-validation and grid search are used to find the optimal combination of parameters. The selection range of the parameter σ is $2^{-4.5}, \ldots, 2^{4.5}$, λ and C are selected from $10^{-4}, 10^{-3}, \ldots, 10^{5}$, and L is selected from (100, 200, ..., 2000). Taking the datasets COIL20 and CEHP as examples, Figs. 2 and 3 depict the sensitivity of CSRGELM to different parameters. Because there are so many different combinations of parameters, we only show the first 180. In the 4-D figures, the X-axis represents the range of λ, the Y-axis represents the range of σ, and the Z-axis represents the range of C. Each point in the figure represents the classification accuracy obtained by one parameter combination. From Figs. 2 and 3 we can conclude that CSRGELM is sensitive to σ and C, while it is insensitive to λ. For the benchmark datasets, the classification performance of CSRGELM is better when $\sigma > 2^{-2.5}$ and $C < 10^{-1}$. For the TCGA datasets, the classification performance of CSRGELM is better when $\sigma \ge 2^{-2.5}$ and $C \ge 10^{-4}$.
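The search space described above can be enumerated as in the following sketch. Several details are assumptions made for illustration: the exponent of σ is stepped by 1 (the text only gives the end points), the graph regularization parameter is called `lam`, and the folds are formed by simple striding without shuffling.

```python
from itertools import product

# Candidate values for each parameter; the sigma exponent step of 1 is an assumption.
sigma_grid = [2.0 ** (k - 4.5) for k in range(10)]   # 2^-4.5, ..., 2^4.5
lam_grid = [10.0 ** k for k in range(-4, 6)]         # 10^-4, ..., 10^5
c_grid = [10.0 ** k for k in range(-4, 6)]           # 10^-4, ..., 10^5
hidden_grid = list(range(100, 2001, 100))            # 100, 200, ..., 2000

def parameter_grid():
    """Yield every (sigma, lam, C, L) combination of the grid search."""
    return product(sigma_grid, lam_grid, c_grid, hidden_grid)

def ten_fold_indices(n_samples):
    """Split sample indices into ten disjoint folds by striding (no shuffling)."""
    return [list(range(start, n_samples, 10)) for start in range(10)]
```

Each combination would then be scored by its mean validation accuracy over the ten folds, and the best-scoring combination kept.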
Taking four datasets as examples, we also show the effect of the number of hidden layer nodes on classification performance in Fig. 4. As the number of hidden layer nodes increases, the classification performance of CSRGELM on the benchmark datasets fluctuates greatly. On the TCGA datasets, however, CSRGELM obtains good classification results. Besides, because the L2,1-norm and the correntropy induced loss are both introduced into our method, the optimization is more complicated, so an iterative optimization algorithm is designed to solve the optimization problem. As shown in Figs. 5 and 6, we plot the convergence curves to illustrate the convergence of the method. In the experiments, the method is run for 40 iterations, and it is worth noting that CSRGELM converges after about 10 iterations. This indicates that the convergence rate of the method is relatively fast and that our iterative optimization algorithm is efficient.

Classification results on benchmark datasets and TCGA datasets
In this subsection, the classification results of every method are provided. On every dataset, each method is run 20 times, and the average results and variances of the 20 runs are listed in Tables 5 and 6. Besides, the running time of each method on different datasets is listed in Tables 7 and 8. The best results are highlighted in italics. On both the benchmark datasets and the integrated TCGA datasets, our method obtains better results than the other methods, or at least competitive results, under every evaluation measure. Compared with RELM, L2,1-RFELM, LR21ELM, and CELM, CSRGELM obtains better results in most cases. In terms of running time, RELM completes the training of the network model in the shortest time because it requires no iterative adjustment. Compared with the other methods, CSRGELM requires the most running time. According to our analysis, in addition to the repeated iterations needed to optimize the output weights, the calculation of $H^T Z H$ or $Z H H^T$ also takes a lot of time. How to shorten the training time is a problem we need to study in the future.
As stated in the previous section, the L2,1-norm is applied to the output weight matrix as a sparse regularization constraint. To verify the validity of the sparse constraint and the sparseness of the output weight matrix, we analyze the weight distributions of CSRGELM and CELM. Figures 7 and 8 show the output weight distributions of CELM and CSRGELM on CE and CEHP. From Figs. 7 and 8, we can conclude that the elements of the output weight matrix are almost all concentrated around zero. This indicates that by constraining β with the L2,1-norm, we can obtain a sparser network model, which makes the model easier to explain and saves storage space and resources. In the neural network model, a sparse network model can achieve feature selection, and we can then remove the unrelated hidden layer nodes to get a simpler and more efficient neural network model.

Discussion
Our method is applied to sample classification problems, and its generalization performance is better than that of other methods. The main reason is that the non-convex correntropy induced loss is introduced to improve the robustness. CSRGELM is more efficient and accurate than CELM because of the introduction of graph regularization. Moreover, the L2,1-norm regularization constraint also contributes to the improvement of classification performance. Although LR21ELM [9] also uses the L2,1-norm as a loss function to improve the robustness, the experimental results show that, in most cases, the robustness of the L2,1-norm is weaker than that of the correntropy induced loss. In other words, correntropy induced loss based methods can effectively reduce the negative influence of noise and outliers on classification results. At the same time, the introduction of graph regularization preserves the local structural information of the data. Their combination improves not only the classification performance but also the generalization ability of the model.
The introduction of L2,1-norm regularization tends to produce structural sparsity. It is capable of reducing some rows of the output weight matrix to zero and thereby simplifying the neural network model. The results in Figs. 7 and 8 also support the validity of the L2,1-norm regularization.
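Row sparsity can be inspected directly on an output weight matrix: each row corresponds to one hidden node, so a row whose norm is (numerically) zero marks a node that can be dropped. The sketch below illustrates this with hypothetical helper names and an arbitrary tolerance; it is not part of the published method.

```python
import numpy as np

def l21_norm(beta):
    """L2,1-norm: the sum of the Euclidean norms of the rows of beta."""
    return float(np.sum(np.linalg.norm(beta, axis=1)))

def prune_hidden_nodes(beta, tol=1e-6):
    """Return a mask of hidden nodes to keep and the reduced output weight matrix."""
    keep = np.linalg.norm(beta, axis=1) > tol   # rows with non-negligible norm
    return keep, beta[keep]

beta = np.array([[3.0, 4.0], [0.0, 0.0], [0.0, 1e-9]])
keep, pruned = prune_hidden_nodes(beta)
print(l21_norm(beta), keep, pruned.shape)
```

The same mask would also be applied to the corresponding columns of the hidden layer output matrix so that the pruned network produces the same predictions up to the tolerance.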

Conclusions
In this paper, we propose a new method named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM) and apply it to the classification of cancer samples. The introduction of correntropy induced loss weakens the influence of noise and outliers on the classification performance and improves the robustness of the method. As a powerful sparse regularization constraint, the L2,1-norm is used to constrain the output weight matrix, which reduces the complexity of the network model. Besides, graph regularization is introduced to preserve the local manifold structure of the data and reduce the loss of information. To solve the resulting optimization problem, we propose an efficient iterative optimization algorithm and analyze its computational complexity. On both the benchmark datasets and the TCGA integrated datasets, the classification performance and generalization performance of CSRGELM are comparable to those of other methods. In future work, we will continue in-depth research on the robustness of ELM and apply it to the field of bioinformatics.

RELM
Huang et al. proposed the regularized extreme learning machine (RELM) in [5] and demonstrated its good performance in classification and regression problems. For a dataset where N is the number of samples and m is the number of features, the objective function of RELM can be expressed as: where γ is a regularization parameter, $\xi_i$ is the error vector of the i-th sample, and T is the target label matrix. Substituting the constraints into Eq. (5), we get the following unconstrained optimization problem: Let L be the number of hidden nodes. If $N \ge L$, the solution of β can be obtained by calculating the partial derivative of Eq. (6) with respect to β and setting it to zero: and where $I_L$ is an identity matrix of dimension L. If $N < L$, β can be calculated as: where $I_N$ is an identity matrix of dimension N. Finally, we get the solution of β:

L2,1-RFELM
As a regularization constraint, Zhou et al. introduced the L2,1-norm to constrain the output weight matrix β [34]. L2,1-norm regularization can generate row sparsity, which eliminates redundant nodes and achieves feature selection [35-37]. The mathematical model of L2,1-RFELM is: where C is a parameter of the regularization term. Then, Eq. (11) can be rewritten as: where D is a diagonal matrix with $d_{ii} = \frac{1}{2\|\beta_i\|_2}$. By computing the derivative with respect to β and setting it equal to zero, we have: According to the relationship between the number of samples and the number of hidden layer nodes, there are two analytic solutions for β:
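The two closed-form RELM solutions described above (one for $N \ge L$, one for $N < L$) can be sketched as follows. The sketch assumes the standard ridge-type objective $\frac{1}{2}\|\beta\|^2 + \frac{\gamma}{2}\|T - H\beta\|^2$; this form, and the function names, are assumptions of the example rather than a transcription of Eqs. (5) to (10).

```python
import numpy as np

def relm_primal(H, T, gamma):
    """Solution that inverts an L x L matrix; the cheaper choice when N >= L."""
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + np.eye(L) / gamma, H.T @ T)

def relm_dual(H, T, gamma):
    """Solution that inverts an N x N matrix; the cheaper choice when N < L."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(H @ H.T + np.eye(N) / gamma, T)

def relm_solve(H, T, gamma):
    """Pick the formulation whose linear system is smaller."""
    N, L = H.shape
    return relm_primal(H, T, gamma) if N >= L else relm_dual(H, T, gamma)
```

Under this assumed objective the two expressions give the same β for any γ > 0, so the choice between them is purely a matter of which matrix is smaller to invert.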

LR21ELM
In [9], Li et al. introduced the L2,1-norm to constrain both the error matrix ξ and the output weight matrix β, and proposed a robust sparse ELM method named LR21ELM. The objective function of LR21ELM is: Following the KKT theorem, the Lagrangian function of Eq. (15) is defined as: where $\theta_{ij}$ is the Lagrange multiplier. Based on the solution in [38], Eq. (16) is equivalent to: where $D_1$ and $D$ are diagonal matrices with entries $\frac{1}{2\|\xi_i\|_2}$ and $\frac{1}{2\|\beta_i\|_2}$, respectively. According to Eq. (17), the optimal conditions can be written as: From Eq. (18), we obtain an alternative solution of β: So, the analytic solution of β is:

Graph regularization
The graph regularization framework [39] has been widely used in semi-supervised learning [13] and unsupervised learning [40-43]. During data processing, graph regularization can preserve the local manifold structure of the data, so that structural information can be extracted, which is beneficial to clustering or classification problems. Mathematically, graph regularization is expressed as follows: where $P(t|x_i)$ and $P(t|x_j)$ are conditional probabilities, and $W = [W_{i,j}]$ is the similarity matrix. Equation (26) is equal to where $t_i$ and $t_j$ are the predictions of $x_i$ and $x_j$, respectively. The matrix form of Eq. (27) is: where T is the prediction matrix, Tr(·) denotes the trace of a matrix, and $Z = D - W$ is the graph Laplacian matrix. D is a diagonal matrix with $d_{ii} = \sum_j W_{i,j}$.
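The relation between the pairwise form and the trace form of the graph regularizer can be checked numerically. In the sketch below the similarity matrix uses a simple assumed rule (weight 1 for two different samples with the same label, 0 otherwise); the weighting actually used in the paper may differ, and the function names are hypothetical.

```python
import numpy as np

def label_graph_laplacian(labels):
    """Build W from labels (assumed 0/1 same-class weights) and return Z = D - W and W."""
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)                 # no self-loops
    D = np.diag(W.sum(axis=1))               # d_ii = sum_j W_ij
    return D - W, W

def smoothness(T, Z):
    """Trace form Tr(T^T Z T) of the graph regularizer for a prediction matrix T."""
    return float(np.trace(T.T @ Z @ T))

Z, W = label_graph_laplacian([0, 0, 1, 1, 1])
T = np.random.default_rng(0).normal(size=(5, 2))
# Pairwise form: half the weighted sum of squared distances between predictions.
pairwise = 0.5 * sum(W[i, j] * np.sum((T[i] - T[j]) ** 2)
                     for i in range(5) for j in range(5))
print(smoothness(T, Z), pairwise)
```

The two printed values agree, which is the identity that lets the pairwise penalty be written compactly with the Laplacian Z.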

Proposed CSRGELM
In practical applications, datasets usually include a lot of noise and outliers, which seriously interfere with the experiments and lead to inaccurate results [44]. Because of noise and outliers, the classification performance of ELM often fails to meet expectations. Many studies have shown that introducing graph regularization into ELM can effectively improve the classification performance or feature extraction ability of the algorithm [45,46]. Therefore, it is necessary to develop a method that is both efficient and robust to outliers and noise.
In this section, we propose a novel method named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM). The correntropy induced loss function is introduced to replace the square loss, which effectively improves the robustness of the method. As an adaptive sparse regularization term, the L2,1-norm is used to constrain the output weight matrix β, which generates row sparsity, eliminates redundant hidden layer nodes, and simplifies the structure of the neural network. In recent years, how to use the local consistency of data to improve the performance of machine learning methods has attracted researchers' attention [45]. Based on the theory that similar samples should have similar properties, graph regularization is combined with our method to preserve the local structural information, which may improve the classification performance of the method [13,47]. We use the label information of the training samples to construct an adjacency graph, and the graph regularization term is integrated to constrain the output weight matrix, so that similar samples produce similar outputs.

The objective function of CSRGELM
This section introduces the objective function of CSRGELM. For a dataset in which $T_{\mathrm{train}}$ is the label matrix of $X_{\mathrm{train}}$, N is the number of samples, and m is the number of features, the mathematical model of CSRGELM can be expressed as: In Eq. (29), $\xi_i$ is the error vector, σ is the bandwidth, and Z is the graph Laplacian matrix. λ and C are regularization parameters. Since Eq. (29) is not a convex function, it cannot be solved by commonly used convex optimization methods. Following the solution process in [23], we can effectively solve this non-convex optimization problem.

The optimization of CSRGELM
Since the correntropy induced loss is a differentiable and smooth function, a gradient-based optimization algorithm can be employed [23]. However, gradient-based optimization converges slowly, so we use the half-quadratic optimization algorithm to solve the optimization problem of CSRGELM. First, we define a convex function as: where $\tau < 0$. Following the definition of the conjugate function in [48], for a differentiable function $\psi(x): \mathbb{R}^n \to \mathbb{R}$, the conjugate function $\psi^*(x): \mathbb{R}^n \to \mathbb{R}$ can be expressed as $\psi^*(x) = \sup_p (px - \psi(p))$. If $\psi(x)$ is a convex function, we have $(\psi^*(x))^* = \psi(x)$ [49]. We can thus obtain the conjugate function of the convex function defined above: When we assume $\upsilon = \frac{\xi_i^2}{2\sigma^2}$, we have: As described in [23], the supremum is reached when $\tau = -\exp\left(-\frac{\xi_i^2}{2\sigma^2}\right) < 0$.
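As a worked illustration of the conjugate-function argument above, the following derivation uses one convex function that is consistent with the stated supremum point, namely $\psi(\tau) = -\tau\log(-\tau) + \tau$ for $\tau < 0$. This particular choice of $\psi$ is an assumption made for the example; it reproduces the supremum condition $\tau = -\exp(-\xi_i^2 / 2\sigma^2)$ quoted from [23], but it is not copied from the paper's equations.

```latex
\begin{aligned}
\psi(\tau) &= -\tau\log(-\tau) + \tau, \qquad \tau < 0
  && \text{(assumed form)} \\
\psi''(\tau) &= -\frac{1}{\tau} > 0
  && \text{(so } \psi \text{ is convex on } \tau < 0\text{)} \\
\psi^{*}(\upsilon) &= \sup_{\tau < 0}\,\bigl\{\upsilon\tau - \psi(\tau)\bigr\}
  = \sup_{\tau < 0}\,\bigl\{\upsilon\tau + \tau\log(-\tau) - \tau\bigr\} \\
\frac{\partial}{\partial\tau}\bigl\{\upsilon\tau + \tau\log(-\tau) - \tau\bigr\}
  &= \upsilon + \log(-\tau) = 0
  \;\Longrightarrow\; \tau = -e^{-\upsilon} \\
\psi^{*}(\upsilon) &= -\upsilon e^{-\upsilon} + \upsilon e^{-\upsilon} + e^{-\upsilon}
  = e^{-\upsilon} \\
\exp\!\left(-\frac{\xi_i^{2}}{2\sigma^{2}}\right)
  &= \sup_{\tau < 0}\left\{\tau\,\frac{\xi_i^{2}}{2\sigma^{2}} - \psi(\tau)\right\},
  \qquad \text{attained at } \tau = -\exp\!\left(-\frac{\xi_i^{2}}{2\sigma^{2}}\right)
\end{aligned}
```

The last line is what makes the half-quadratic scheme work: for a fixed τ the exponential term is replaced by an expression that is quadratic in the error $\xi_i$, so each sample simply receives a weight $-\tau_i$ that shrinks toward zero as its error grows.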
Combining Eq. (35) with Eq. (29), we obtain the following mathematical model: where $\tau = [\tau_1, \tau_2, \ldots, \tau_N]^T$. Equation (36) can be rewritten as: There are two variables that need to be optimized: τ and β. To solve Eq. (37), we alternately fix one variable and optimize the other.
For a given $\beta^{n}$, Eq. (37) can be expressed as: Substituting the constraints into Eq. (38), we get: According to Eq. (32), the solution of Eq. (39) is: For a given $\tau^{n+1}$, we focus on solving the problem: By eliminating the constraints and rewriting Eq. (41) in matrix form, we get: Following the conclusion in [38], Eq. (42) is equivalent to: where κ is a very small regularization term; in the experiments, $\kappa = 10^{-6}$. It is clear that $d_{ii} = d'_{ii}$ when $\kappa \to 0$. Computing the derivative of $\ell_{\mathrm{CSRGELM}}$ with respect to $\beta^{n+1}$, we have: When the number of hidden nodes is less than the number of training samples, the output weight matrix $\beta^{n+1}$ can be solved as:

that is
If the number of hidden nodes is larger than the number of training samples, $\beta^{n+1}$ may have infinitely many solutions. Inspired by the solution of Huang et al. [13], and according to Eq. (46), we set: Substituting Eq. (48) into Eq. (46), we have: Multiplying both sides of Eq. (49) by $(HH^T)^{-1}H$, we get: Then we obtain the solution of α: and $\beta^{n+1}$ can be computed as: where I is an identity matrix of dimension N. The analytical solution of $\beta^{n+1}$ can finally be determined as:

Figure 9a shows the classification results with no noise; the two classes are separated easily. Figure 9b shows the classification results with 50 noisy points, which originally belong to class 2 but are mixed into class 1. Figure 9b shows that, under the interference of noise, the classification decision boundaries of all four methods change, and the changes for RELM and L2,1-RFELM are more obvious. Another dataset is then generated in which class 1 and class 2 have 500 samples each. First, the four methods are trained on this dataset, and the classification decision boundaries are shown in Fig. 10a. Each of the four methods separates the data with a straight line. Then, 100 points belonging to class 2 are mixed into class 1 as noise. The final classification results are shown in Fig. 10b. Clearly, RELM and L2,1-RFELM try to fit the noise, and their classification decision boundaries are no longer reliable. In contrast, owing to the robust loss function, the classification decision boundaries of CSRGELM and LR21ELM are hardly affected.
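The alternating scheme of this section (update τ for fixed β, then update β for fixed τ) can be sketched as below. The sketch is a hedged illustration, not the published algorithm: it assumes an objective of the form $\|\beta\|_{2,1} + C\sum_i \bigl(1 - \exp(-\|\xi_i\|^2 / 2\sigma^2)\bigr) + \lambda\,\mathrm{Tr}(\beta^T H^T Z H \beta)$ with $\xi_i = t_i - h(x_i)\beta$, smooths the row norms with κ, starts from a ridge solution, and derives the β update from that assumed objective. The exact placement of C and λ in Eq. (29) may differ, and the names `objective` and `hq_fit` are hypothetical.

```python
import numpy as np

def objective(beta, H, T, Z, sigma, lam, C, kappa):
    """Assumed CSRGELM-style objective with kappa-smoothed row norms."""
    E = T - H @ beta
    sparse = np.sum(np.sqrt(np.sum(beta ** 2, axis=1) + kappa))
    closs = np.sum(1.0 - np.exp(-np.sum(E ** 2, axis=1) / (2.0 * sigma ** 2)))
    F = H @ beta
    return sparse + C * closs + lam * np.trace(F.T @ Z @ F)

def hq_fit(H, T, Z, sigma=1.0, lam=0.1, C=10.0, kappa=1e-6, n_iter=40):
    """Alternate between the auxiliary variable tau and the output weights beta."""
    L = H.shape[1]
    beta = np.linalg.solve(H.T @ H + np.eye(L), H.T @ T)   # ridge start (assumption)
    G = H.T @ Z @ H                                        # graph term, fixed over iterations
    history = [objective(beta, H, T, Z, sigma, lam, C, kappa)]
    for _ in range(n_iter):
        # Step 1: for fixed beta, tau_i = -exp(-||xi_i||^2 / (2 sigma^2)) < 0.
        E = T - H @ beta
        tau = -np.exp(-np.sum(E ** 2, axis=1) / (2.0 * sigma ** 2))
        # Reweighting of the L2,1 term: d_ii = 1 / (2 sqrt(||beta_i||^2 + kappa)).
        d = 1.0 / (2.0 * np.sqrt(np.sum(beta ** 2, axis=1) + kappa))
        # Step 2: for fixed tau and d, beta solves a weighted regularized least squares.
        w = -tau * C / sigma ** 2          # positive per-sample weights, small for outliers
        HtW = H.T * w                      # equals H^T diag(w)
        beta = np.linalg.solve(2.0 * np.diag(d) + HtW @ H + 2.0 * lam * G, HtW @ T)
        history.append(objective(beta, H, T, Z, sigma, lam, C, kappa))
    return beta, history
```

Under these assumptions each pass minimizes a quadratic upper bound of the objective, so the recorded `history` should not increase, which mirrors the fast convergence reported in the convergence curves. Samples with large errors, such as the mislabeled points in the noise experiment, receive weights close to zero and therefore have little influence on the fitted boundary.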