Correntropy induced loss based sparse robust graph regularized extreme learning machine for cancer classification

Abstract

Background

As a machine learning method with high performance and excellent generalization ability, the extreme learning machine (ELM) is gaining popularity in various studies, and ELM-based methods have been proposed for many fields. However, limited robustness to noise and outliers remains a main problem affecting the performance of ELM.

Results

In this paper, an integrated method named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM) is proposed. The introduction of correntropy induced loss improves the robustness of ELM and weakens the negative effects of noise and outliers. By using the L2,1-norm to constrain the output weight matrix, we tend to obtain a sparse output weight matrix to construct a simpler single hidden layer feedforward neural network model. By introducing the graph regularization to preserve the local structural information of the data, the classification performance of the new method is further improved. Besides, we design an iterative optimization method based on the idea of half quadratic optimization to solve the non-convex problem of CSRGELM.

Conclusions

The classification results on the benchmark dataset show that CSRGELM can obtain better classification results compared with other methods. More importantly, we also apply the new method to the classification problems of cancer samples and get a good classification effect.

Background

Universal approximation capability plays a crucial role in solving regression and classification problems. Because of this ability, the single hidden layer feedforward neural network (SLFN) has long been a focus of research [1]. As a method for training SLFNs [2], the extreme learning machine (ELM) [3,4,5,6,7,8] has attracted the attention of researchers in recent decades [9]. Unlike traditional training methods for neural networks, such as the backpropagation (BP) algorithm [10, 11], ELM is trained in one step rather than iteratively [12]. In the original ELM, the first step is to randomly initialize an input weight matrix \({\mathbf{A}}\), which remains fixed throughout the process. Then, by using a nonlinear piecewise continuous activation function \({\text{g}} \left( x \right)\), the input data are mapped into the feature space of the ELM, and a hidden layer output matrix \({\mathbf{H}} = \left[ {{\mathbf{h}}\left( {{\mathbf{x}}_{{\mathbf{1}}} } \right),{\mathbf{h}}\left( {{\mathbf{x}}_{{\mathbf{2}}} } \right), \ldots ,{\mathbf{h}}\left( {{\mathbf{x}}_{{\mathbf{N}}} } \right)} \right]^{T}\) is obtained. Finally, by solving a ridge regression problem [13], the output weights \({{\varvec{\upbeta}}} = \left[ {{{\varvec{\upbeta}}}_{1} , \, {{\varvec{\upbeta}}}_{2} , \ldots , \, {{\varvec{\upbeta}}}_{L} } \right]^{T}\) connecting the hidden layer and the output layer are determined [14]. Because the output weight matrix does not need to be solved iteratively, ELM trains faster than the traditional backpropagation algorithm and can achieve better generalization performance [2, 3, 7]. Because of its simple theory, high efficiency, and low manual intervention, ELM has been used in various applications, such as image classification [15, 16], label learning [17], image quality assessment [18], and traffic sign recognition [19].

Although ELM has been widely used, its robustness and sparseness are still active research topics. Huang et al. proposed RELM in [5], in which the \(L_{2}\)-norm was introduced to constrain both the loss function and the output weight matrix. Their experimental results showed that RELM was better than the original ELM. However, the square loss based on the \(L_{2}\)-norm amplifies the negative impact of noise and outliers and leads to inaccurate results. In [9], Li et al. introduced the L2,1-norm into ELM as both the loss function and the regularization constraint, and proposed a new method named LR21ELM. The classification results showed that the robustness of the L2,1-norm was significantly better than that of the \(L_{2}\)-norm.

As a local similarity measure, correntropy is proposed based on information theory and the kernel method [20]. Through a nonlinear feature mapping, correntropy projects the data from the input space into the feature space. It also computes the \(L_{2}\)-norm distance and defines a correntropy induced metric (CIM) in the feature space [21]. The correntropy induced loss [22] is defined as \({\text{C}} \left( {{\mathbf{t}}_{i} ,{\text{f}} \left( {{\mathbf{x}}_{i} } \right)} \right) = 1 - \exp \left( { - \frac{{\left( {{\mathbf{t}}_{i} - {\text{f}} \left( {{\mathbf{x}}_{i} } \right)} \right)^{2} }}{{2\sigma^{2} }}} \right),\) where \({\mathbf{t}}_{i}\) is the target vector, \({\text{f}} \left( {{\mathbf{x}}_{i} } \right)\) is the corresponding prediction, and \(\sigma\) is the kernel bandwidth. Figure 1 depicts the correntropy induced loss function for different kernel bandwidths within the same error range. We can observe that the correntropy induced loss is a non-convex, bounded, and robust loss function [23].

Fig. 1 Correntropy induced loss with different kernel bandwidths
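To make the boundedness of this loss concrete, the following minimal Python sketch evaluates the correntropy induced loss for a few kernel bandwidths. The function name `correntropy_induced_loss` and the sample error values are illustrative assumptions and are not taken from the original experiments.

```python
import numpy as np

def correntropy_induced_loss(error, sigma):
    """Elementwise C-loss: 1 - exp(-e^2 / (2 * sigma^2))."""
    error = np.asarray(error, dtype=float)
    return 1.0 - np.exp(-error**2 / (2.0 * sigma**2))

# Illustrative error values, including one large outlier-like error.
errors = np.array([0.0, 0.5, 1.0, 5.0, 50.0])

for sigma in (0.5, 1.0, 2.0):
    # The loss saturates near 1 for large errors, whatever the bandwidth.
    print(sigma, np.round(correntropy_induced_loss(errors, sigma), 4))

# For comparison, the squared loss grows without bound.
print("squared", errors**2)
```

A smaller bandwidth makes the loss saturate at smaller errors, so large errors contribute at most a constant to the objective instead of dominating it.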

The robustness of correntropy to noise and outliers has been proved theoretically and experimentally. Ren et al. [21] integrated the correntropy loss and hinge loss (CH-loss) into ELM and proposed a robust extreme learning machine with the CH-loss (CHELM). They verified the robustness of the method at different noise levels. The results showed that correntropy loss could effectively reduce the influence of noise on classification results. In [24], Zhao et al. proposed the C-loss based ELM (CELM) and applied their method to estimate the power of small-scale turbojet engines. Chen et al. [25] introduced the correntropy loss to the multilayer ELM and proposed a robust multilayer ELM auto-encoder. The results showed that the feature extraction ability of the method was improved with the improvement of robustness.

In this paper, by integrating the correntropy induced loss into the ELM in place of the original \(L_{2}\)-norm loss, an integrated model named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM) is proposed. Different from the traditional ELM, we use the L2,1-norm instead of the \(L_{2}\)-norm to constrain the output weight matrix and thereby reduce the complexity of the neural network model. Moreover, graph regularization is integrated into our method so that the neural network model can learn the local structural information of the data. The main contributions of this paper are as follows:

(1) A new correntropy induced loss based sparse robust graph regularized extreme learning machine is proposed. Compared with the original ELM, the correntropy induced loss improves robustness. The L2,1-norm is used as a sparse constraint to regularize the output weight matrix \({{\varvec{\upbeta}}}\), which reduces the complexity of the model. To preserve the manifold structure of the original data, graph regularization is introduced into our method.

(2) Based on the theory of [26], we design an iterative optimization method to handle the non-convex problem of CSRGELM. The convergence and the computational time complexity of the new method are analyzed. We also design experiments to verify the robustness of the method. The robustness and classification ability of CSRGELM are better than those of ELM based on the traditional \(L_{2}\)-norm loss function. Compared with other robust ELMs, CSRGELM also achieves competitive results.

(3) We first perform classification experiments on five benchmark datasets and evaluate the performance of CSRGELM with multiple evaluation measures. The results show that, on most datasets, the classification results of CSRGELM are superior to those of the other methods.

(4) The new method is applied to cancer sample classification on integrated TCGA datasets. On both the integrated binary datasets and the integrated multi-class datasets, the classification performance of CSRGELM is superior to that of the other methods. The experimental results indicate that CSRGELM can be a powerful tool for studying biological omics data.

Results

First, five benchmark datasets are used to evaluate the classification performance of RELM, L2,1-RFELM, LR21ELM, CELM [24], and CSRGELM. Then, CSRGELM is applied to the cancer sample classification tasks of the integrated TCGA datasets. In the experiments, the sigmoid function is chosen as the activation function. Classification performance is evaluated with four commonly used measures: Accuracy (Acc), Precision (Pre), Recall, and F-measure (F-mea). The experiments are described in detail below.

Evaluation criteria

According to Table 1, the measures are defined as follows:

$$Acc = \frac{TP + TN}{{TP + FN + FP + TN}},$$
(1)
$$Pre = \frac{TP}{{TP + FP}},$$
(2)
$$Recall = \frac{TP}{{TP + FN}},$$
(3)
$$F{-}mea = \frac{2 \times Pre \times Recall}{{Pre + Recall}}.$$
(4)
Table 1 Classification results confusion matrix

For a multi-class dataset, we treat each class in turn as the positive class and the remaining classes as the negative class to compute the accuracy, precision, recall, and F-measure. The average of each measure over all classes is then reported. All methods are run in MATLAB R2016a on a computer with 64 GB of memory and a 3.60-GHz processor.
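The measures in Eqs. (1) to (4) and the one-versus-rest averaging described above can be sketched in Python as follows. The function names and the convention of returning 0 when a denominator is zero are assumptions made for this illustration; the original experiments were run in MATLAB.

```python
import numpy as np

def binary_measures(y_true, y_pred, positive):
    """Acc, Pre, Recall and F-measure with `positive` as the positive class."""
    t = np.asarray(y_true) == positive
    p = np.asarray(y_pred) == positive
    tp = np.sum(t & p)
    tn = np.sum(~t & ~p)
    fp = np.sum(~t & p)
    fn = np.sum(t & ~p)
    acc = (tp + tn) / (tp + fn + fp + tn)
    # Assumption: a measure with a zero denominator is reported as 0.
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f_mea = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return acc, pre, rec, f_mea

def averaged_measures(y_true, y_pred):
    """Average each measure over all classes (one class versus the rest)."""
    classes = np.unique(y_true)
    per_class = np.array([binary_measures(y_true, y_pred, c) for c in classes])
    return per_class.mean(axis=0)

# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(averaged_measures(y_true, y_pred))
```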

Datasets

We use five popular benchmark datasets to test the classification performance, and every dataset has been widely applied in supervised problems [13, 27,28,29].

(a) Iris: Taken from the UCI database (https://archive.ics.uci.edu/ml/index.php), Iris is a multi-class classification dataset with 150 samples and 4 features. It is widely used in unsupervised learning [30, 31] and supervised learning [5].

(b) COIL20: The Columbia object image library is a multi-class image dataset that is often used as a benchmark to test the performance of machine learning methods. It has 1440 samples with 1024 features, all of which are grayscale images of 20 different objects.

(c) USPST: USPST is the testing set of the popular handwritten digit recognition dataset USPS. It has 2007 samples and 256 features.

(d) g50c: g50c is a binary dataset in which each class is generated by a 50-D multivariate Gaussian distribution [13].

(e) RNA-seq: A multi-class cancer dataset covering five tumor types: BRCA, KIRC, COAD, LUAD, and PRAD. It has 801 samples and 20,531 features, and every attribute is an RNA-Seq gene expression level measured by the Illumina HiSeq platform.

To evaluate the performance of CSRGELM in practical applications, we apply CSRGELM to cancer classification. In recent years, cancer has become one of the biggest threats to human health. Effective treatment depends on developing different treatments for different types of cancers. Therefore, improving cancer classification is crucial to the progress of cancer treatments [32]. In this paper, four integrated TCGA datasets are used in the experiments. Known as the world's largest cancer genome database, the TCGA database is of great value in the field of cancer research [33] and includes data for several types of cancer. The details of the benchmark datasets are listed in Table 2.

Table 2 Details of the benchmark datasets

In the experiments, each integrated dataset is a combination of data from two or more cancers. In the integration process, to reduce the sample imbalance rate and ensure the credibility of the experimental results, we remove all normal samples and integrate only the disease samples of each cancer for classification experiments. Tables 3 and 4 list the information about the cancer data used in our experiments.

Table 3 The full name, abbreviation, and symbol for each cancer
Table 4 Information of the integrated datasets

Convergence and sensitivity

There are four parameters (\(\sigma , \, \lambda , \, C, \, L\)) that need to be tuned in the experiments, and different combinations of parameters may produce different classification results. Hence, tenfold cross-validation and grid search are used to find the optimal combination of parameters. The selection range of the parameter \(\sigma\) is \(\left( {2^{ - 4.5} , \ldots ,2^{4.5} } \right),\) \(\lambda\) and \(C\) are selected from \(\left( {10^{ - 4} ,10^{ - 3} , \ldots ,10^{5} } \right),\) and \(L\) is selected from \(\left( {100,200, \ldots ,2000} \right)\). Taking the datasets COIL20 and CEHP as examples, Figs. 2 and 3 depict the sensitivity of CSRGELM to different parameters. Because there are so many combinations of parameters, we only show the first 180. In the 4-D figures, the X-axis represents the range of \(\lambda\), the Y-axis represents the range of \(\sigma\), and the Z-axis represents the range of \({\text{C}}\). Each point in the figure represents the classification accuracy obtained with one parameter combination. Figs. 2 and 3 show that CSRGELM is sensitive to \(\sigma\) and \({\text{C,}}\) while it is insensitive to \(\lambda .\) For the benchmark datasets, the classification performance of CSRGELM is better when \(\sigma > 2^{ - 2.5}\) and \(C < 10^{ - 1} .\) For the TCGA datasets, the classification performance of CSRGELM is better when \(\sigma \ge 2^{ - 2.5}\) and \(C \ge 10^{ - 4} .\)
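The search space described above can be sketched in Python as follows. The step of 1 in the exponent of \(\sigma\) is an assumption, since the text gives only the end points of that range, and the helper `tenfold_indices` is a hypothetical name for a simple random tenfold split.

```python
import itertools
import numpy as np

# Assumption: sigma exponents advance in steps of 1 from -4.5 to 4.5.
sigmas = 2.0 ** np.arange(-4.5, 5.0, 1.0)
lambdas = 10.0 ** np.arange(-4, 6)      # 10^-4, 10^-3, ..., 10^5
Cs = 10.0 ** np.arange(-4, 6)           # 10^-4, 10^-3, ..., 10^5
Ls = np.arange(100, 2001, 100)          # 100, 200, ..., 2000 hidden nodes

# Every (sigma, lambda, C, L) combination visited by the grid search.
grid = list(itertools.product(sigmas, lambdas, Cs, Ls))
print(len(grid))

def tenfold_indices(n_samples, seed=0):
    """Randomly split the sample indices into ten folds of near-equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), 10)

# Example: folds for a dataset with 150 samples (the size of Iris).
folds = tenfold_indices(150)
print([len(fold) for fold in folds])
```

Under this assumed step, the grid is large, which is consistent with the statement that only a subset of the combinations is shown in the sensitivity figures.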

Fig. 2 Parameter sensitivity of CSRGELM on COIL20

Fig. 3 Parameter sensitivity of CSRGELM on CEHP

Taking four datasets as examples, Fig. 4 shows the effect of the number of hidden layer nodes on classification performance. As the number of hidden layer nodes increases, the classification performance of CSRGELM on the benchmark datasets fluctuates considerably. On the TCGA datasets, however, CSRGELM obtains good classification results throughout.

Fig. 4 The influence of the number of hidden layer nodes on classification results

Besides, because the L2,1-norm and the correntropy induced loss are introduced into our method, its optimization is more complicated, so an iterative optimization algorithm is designed to solve the above optimization problem. Figures 5 and 6 plot the convergence curves of the method. In the experiments, the algorithm is run for 40 iterations. It is worth noting that CSRGELM converges within about 10 iterations, which indicates that the method converges relatively fast and that our iterative optimization algorithm is efficient.

Fig. 5 Convergence curves of CSRGELM on benchmark datasets: (a) Iris, (b) COIL20, (c) USPST, (d) g50c

Fig. 6 Convergence curves of CSRGELM on TCGA datasets: (a) CE, (b) EHP, (c) CEHP, (d) CEHPC2

Classification results on benchmark datasets and TCGA datasets

In this sub-section, the classification results of every method are provided. On every dataset, each method runs 20 times, and the average results and variance of the 20 classification results are listed in Tables 5 and 6. Besides, the running time of each method on different datasets is also listed in Tables 7 and 8. The best results are highlighted in italics.

Table 5 Classification results on benchmark datasets (± variance)
Table 6 Classification results on TCGA datasets (± variance)
Table 7 Training time of every method on benchmark datasets (± variance)
Table 8 Training time of every method on TCGA datasets (± variance)

On both the benchmark datasets and the integrated TCGA datasets, our method obtains better results than the other methods, or at least competitive ones, across the different evaluation measures. Compared with RELM, L2,1-RFELM, LR21ELM, and CELM, CSRGELM obtains better results in most cases. In terms of running time, RELM completes the training of the network model in the shortest time because it requires no iterative adjustment, whereas CSRGELM requires the most running time. Our analysis shows that, in addition to the repeated iterations that optimize the output weights, the calculation of \({\mathbf{H}}^{T} {\mathbf{ZH}}\) or \({\mathbf{ZHH}}^{T}\) also takes a lot of time. How to shorten the training time is a problem we need to study in the future.

As stated in the previous section, L2,1-norm is applied to the output weight matrix as a sparse regularization constraint. To prove the validity of the sparse constraint and the sparseness of the output weight matrix, we analyze the weight distribution of CSRGELM and CELM. Figures 7 and 8 show the output weight distribution of CELM and CSRGELM on CE and CEHP.

Fig. 7 Output weight distribution on CE: (a) CELM, (b) CSRGELM

Fig. 8 Output weight distribution on CEHP: (a) CELM, (b) CSRGELM

From Figs. 7 and 8, we can conclude that the elements of the output weight matrix are mostly concentrated around zero. This indicates that, by constraining \({{\varvec{\upbeta}}}\) with the L2,1-norm, we obtain a sparser network model, which is easier to interpret and saves storage space and resources. In the neural network model, a sparse output weight matrix acts as a form of feature selection: the hidden layer nodes whose output weights are close to zero can be removed to obtain a simpler and more efficient neural network model.
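The link between row sparsity and node removal can be illustrated with a small Python sketch. The L2,1-norm is the sum of the Euclidean norms of the rows of \({{\varvec{\upbeta}}}\), and each row corresponds to one hidden node. The helper names and the tolerance `tol` are assumptions for this illustration, and the matrix is a toy example, not a result from the paper.

```python
import numpy as np

def l21_norm(beta):
    """L2,1-norm: sum of the Euclidean norms of the rows of beta."""
    return np.sum(np.linalg.norm(beta, axis=1))

def active_hidden_nodes(beta, tol=1e-6):
    """Indices of hidden nodes whose output-weight row is not (near) zero."""
    return np.flatnonzero(np.linalg.norm(beta, axis=1) > tol)

# Toy output weight matrix: 4 hidden nodes (rows), 2 output units (columns).
beta = np.array([[3.0, 4.0],
                 [0.0, 0.0],
                 [1e-9, 0.0],
                 [0.0, 2.0]])

keep = active_hidden_nodes(beta)
beta_pruned = beta[keep]      # drop the rows of removable hidden nodes
print(l21_norm(beta), keep, beta_pruned.shape)
```

When a hidden node is pruned in this way, the matching column of the hidden layer output matrix \({\mathbf{H}}\) would be dropped as well, so that the product \({\mathbf{H}}{{\varvec{\upbeta}}}\) is essentially unchanged.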

Discussion

Our method is applied to sample classification problems, and its generalization performance is better than that of the other methods. The main reason is that the non-convex correntropy induced loss is introduced to improve the robustness. CSRGELM is more accurate than CELM because of the introduction of graph regularization. Moreover, the L2,1-norm regularization constraint also contributes to the improvement in classification performance. Although LR21ELM [9] also uses the L2,1-norm as a loss function to improve robustness, the experimental results show that, in most cases, the L2,1-norm loss is less robust than the correntropy induced loss. In other words, methods based on the correntropy induced loss can effectively reduce the negative influence of noise and outliers on classification results. At the same time, graph regularization preserves the local structural information of the data. Combining the two improves not only the classification performance but also the generalization ability of the model.

The L2,1-norm regularization tends to produce structural sparsity. It can reduce some rows of the output weight matrix to zero and thereby simplify the neural network model. The results in Figs. 7 and 8 also support the validity of the L2,1-norm regularization.

Conclusions

In this paper, we propose a new method named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM) and apply it to the classification of cancer samples. The correntropy induced loss weakens the influence of noise and outliers on the classification performance and improves the robustness of the method. As a powerful sparse regularization constraint, the L2,1-norm is used to constrain the output weight matrix, which reduces the complexity of the network model. Besides, graph regularization is introduced to preserve the local manifold structure of the data and reduce the loss of information. To solve the resulting optimization problem, we propose an efficient iterative optimization algorithm and analyze its computational complexity. On both the benchmark datasets and the integrated TCGA datasets, the classification and generalization performance of CSRGELM is competitive with or better than that of the other methods. In future work, we will continue to study the robustness of ELM in depth and apply it to the field of bioinformatics.

Methods

RELM

Huang et al. proposed the regularized extreme learning machine (RELM) in [5] and demonstrated its good performance in classification and regression problems. Consider a dataset \(\left\{ {{\mathbf{X}},{\mathbf{T}}} \right\} = \left\{ {{\mathbf{x}}_{i} ,{\mathbf{t}}_{i} } \right\}_{i = 1}^{N} \in {\mathbb{R}}^{N \times m} ,\) where \(N\) is the number of samples and \(m\) is the number of features. The objective function of RELM can be expressed as:

$$\mathop {\min }\limits_{{{{\varvec{\upbeta}}}, \, {{\varvec{\upxi}}}}} \frac{1}{2}\left\| {{\varvec{\upbeta}}} \right\|^{2} + \frac{\gamma }{2}\sum\nolimits_{i = 1}^{N} {\left\| {{{\varvec{\upxi}}}_{i} } \right\|^{2} } ,\,{\text{s.t.}}\,{{\varvec{\upxi}}}_{i}^{T} = {\mathbf{t}}_{i}^{T} - {\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}},\,i = 1, \ldots , \, N,$$
(5)

where \(\gamma\) is a regularization parameter, \({{\varvec{\upxi}}}_{i}\) is the error vector of the \(i\)-th sample, and \({\mathbf{T}}\) is the target label matrix. Substituting the constraints into Eq. (5), we get the following unconstrained optimization problem:

$$\mathop {\min }\limits_{{{\varvec{\upbeta}}}} \frac{1}{2}\left\| {{\varvec{\upbeta}}} \right\|^{2} + \frac{\gamma }{2}\left\| {{\mathbf{T}} - {\mathbf{H\beta }}} \right\|^{2} {.}$$
(6)

Let \(L\) be the number of hidden nodes. If \(N \ge L,\) the solution for \({{\varvec{\upbeta}}}\) can be obtained by taking the derivative of Eq. (6) with respect to \({{\varvec{\upbeta}}}\) and setting it to zero:

$${{\varvec{\upbeta}}} - \gamma {\mathbf{H}}^{T} \left( {{\mathbf{T}} - {\mathbf{H\beta }}} \right) = 0,$$
(7)

and

$${{\varvec{\upbeta}}} = \left( {\gamma {\mathbf{H}}^{T} {\mathbf{H}} + {\mathbf{I}}_{L} } \right)^{ - 1} \gamma {\mathbf{H}}^{T} {\mathbf{T}},$$
(8)

where \({\mathbf{I}}_{L}\) is an identity matrix with dimension \(L.\) If \(N < L,\) \({{\varvec{\upbeta}}}\) can be calculated as:

$${{\varvec{\upbeta}}} = {\mathbf{H}}^{T} \left( {\gamma {\mathbf{HH}}^{T} + {\mathbf{I}}_{N} } \right)^{ - 1} \gamma {\mathbf{T}},$$
(9)

where \({\mathbf{I}}_{N}\) is an identity matrix with dimension \(N.\) Finally, we get the solution of \({{\varvec{\upbeta}}}\):

$$\left\{ {\begin{array}{*{20}l} {{{\varvec{\upbeta}}} = \left( {\gamma {\mathbf{H}}^{T} {\mathbf{H}} + {\mathbf{I}}_{L} } \right)^{ - 1} \gamma {\mathbf{H}}^{T} {\mathbf{T}},} \hfill & {N \ge L.} \hfill \\ {{{\varvec{\upbeta}}} = {\mathbf{H}}^{T} \left( {\gamma {\mathbf{HH}}^{T} + {\mathbf{I}}_{N} } \right)^{ - 1} \gamma {\mathbf{T}},} \hfill & {N < L.} \hfill \\ \end{array} } \right.$$
(10)
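The two branches of Eq. (10) can be sketched in Python with NumPy as follows. The function names, the Gaussian random initialization, and the hidden bias vector `b` are assumptions for this illustration (the text above mentions only the random input weight matrix \({\mathbf{A}}\)), and linear systems are solved directly instead of forming explicit inverses.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_layer(X, A, b):
    """Hidden layer output matrix H (N x L) for inputs X (N x m)."""
    return sigmoid(X @ A + b)

def relm_output_weights(H, T, gamma):
    """Closed-form output weights following the two cases of Eq. (10)."""
    n_samples, n_hidden = H.shape
    if n_samples >= n_hidden:
        # beta = (gamma H^T H + I_L)^{-1} gamma H^T T
        return np.linalg.solve(gamma * H.T @ H + np.eye(n_hidden), gamma * H.T @ T)
    # beta = H^T (gamma H H^T + I_N)^{-1} gamma T
    return H.T @ np.linalg.solve(gamma * H @ H.T + np.eye(n_samples), gamma * T)

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))              # N = 20 samples, m = 4 features
labels = rng.integers(0, 3, size=20)
T = np.eye(3)[labels]                         # one-hot target label matrix
A = rng.standard_normal((4, 50))              # random input weights, kept fixed
b = rng.standard_normal(50)                   # assumed hidden biases
H = hidden_layer(X, A, b)                     # here N < L, so the second case applies
beta = relm_output_weights(H, T, gamma=10.0)
predicted = np.argmax(H @ beta, axis=1)
```

Both cases give the same \({{\varvec{\upbeta}}}\); the case is chosen so that the linear system being solved has the smaller of the two dimensions \(L\) and \(N\).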

L2,1-RFELM

Zhou et al. introduced the L2,1-norm as a regularization constraint on the output weight matrix \({{\varvec{\upbeta}}}\) [34]. L2,1-norm regularization generates row sparsity, which eliminates redundant hidden nodes and thereby achieves feature selection [35,36,37]. The mathematical model of L2,1-RFELM is:

$$\mathop {\min }\limits_{{{{\varvec{\upbeta}}}, \, {{\varvec{\upxi}}}}} \frac{1}{2}\left\| {{\varvec{\upbeta}}} \right\|_{2,1} + \frac{C}{2}\sum\nolimits_{i = 1}^{N} {\left\| {{{\varvec{\upxi}}}_{i} } \right\|^{2} } ,\,{\text{s.t.}}\,{{\varvec{\upxi}}}_{i}^{T} = {\mathbf{t}}_{i}^{T} - {\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}},\,i = 1, \ldots ,N,$$
(11)

where \(C\) is a parameter of the regularization term. Then, Eq. (11) can be rewritten as:

$$\ell = \frac{1}{2}{\text{Tr}} \left( {{{\varvec{\upbeta}}}^{T} {\mathbf{D\upbeta}}} \right) + \frac{C}{2}\left\| {{\mathbf{T}} - {\mathbf{H\upbeta }}} \right\|^{2} ,$$
(12)

where \({\mathbf{D}}\) is a diagonal matrix with \(d_{ii} = \frac{1}{{2\left\| {{{\varvec{\upbeta}}}_{i} } \right\|_{2} }}\) and \({{\varvec{\upbeta}}}_{i}\) is the \(i\)-th row of \({{\varvec{\upbeta}}}\). Taking the derivative of Eq. (12) with respect to \({{\varvec{\upbeta}}}\) and setting it equal to zero, we have:

$${\mathbf{D\upbeta }} - C{\mathbf{H}}^{T} \left( {{\mathbf{T}} - {\mathbf{H\upbeta }}} \right) = 0.$$
(13)

According to the relationship between the number of samples and hidden layer nodes, there are two analytic solutions for \({{\varvec{\upbeta}}}\):

$$\left\{ {\begin{array}{*{20}l} {{{\varvec{\upbeta}}} = \left( {{\mathbf{D}} + C{\mathbf{H}}^{T} {\mathbf{H}}} \right)^{ - 1} C{\mathbf{H}}^{T} {\mathbf{T}},} \hfill & {N \ge L,} \hfill \\ {{{\varvec{\upbeta}}} = C{\mathbf{D}}^{ - 1} {\mathbf{H}}^{T} \left( {{\mathbf{I}} + C{\mathbf{HD}}^{ - 1} {\mathbf{H}}^{T} } \right)^{ - 1} {\mathbf{T}},} \hfill & {N < L{.}} \hfill \\ \end{array} } \right.$$
(14)
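Because \({\mathbf{D}}\) depends on \({{\varvec{\upbeta}}}\), the solution is computed by alternating between the two in practice. The following Python sketch shows this reweighting loop for the case \(N \ge L\) only. The identity initialization of \({\mathbf{D}}\), the fixed number of iterations, and the small constant `eps` that guards against division by zero are assumptions for this illustration, not details taken from [34].

```python
import numpy as np

def l21_rfelm_weights(H, T, C, n_iter=30, eps=1e-8):
    """Alternate between beta and D for the N >= L case of Eq. (14)."""
    n_hidden = H.shape[1]
    D = np.eye(n_hidden)                  # assumed initialization
    HtH, HtT = H.T @ H, H.T @ T
    for _ in range(n_iter):
        # beta = (D + C H^T H)^{-1} C H^T T with the current D
        beta = np.linalg.solve(D + C * HtH, C * HtT)
        # d_ii = 1 / (2 ||beta_i||_2), clamped to avoid division by zero
        row_norms = np.linalg.norm(beta, axis=1)
        D = np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
    return beta

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
H = rng.standard_normal((40, 10))         # N = 40 samples, L = 10 hidden nodes
T = rng.standard_normal((40, 2))
beta = l21_rfelm_weights(H, T, C=1.0)
print(np.round(np.linalg.norm(beta, axis=1), 4))   # row norms of beta
```

Rows of \({{\varvec{\upbeta}}}\) with a small norm receive a large weight \(d_{ii}\) in the next pass, which pushes them further toward zero; this is how the reweighting produces row sparsity.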

LR21ELM

In [9], Li et al. introduced the L2,1-norm to constrain both the error matrix \({{\varvec{\upxi}}}\) and the output weight matrix \({{\varvec{\upbeta}}}\), and proposed a robust sparse ELM method named LR21ELM. The objective function of LR21ELM is:

$$\mathop {\min }\limits_{{{{\varvec{\upbeta}}}, \, {{\varvec{\upxi}}}}} \left\| {{\varvec{\upbeta}}} \right\|_{2,1} + C\left\| {{\varvec{\upxi}}} \right\|_{2,1} ,\,{\text{s.t.}}\,{{\varvec{\upxi}}}_{i}^{T} = {\mathbf{t}}_{i}^{T} - {\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}},\,i = 1, \ldots ,N{.}$$
(15)

Following the KKT theorem, the Lagrangian function of Eq. (15) is defined as:

$$\ell_{LR21ELM} = C\left\| {{\varvec{\upxi}}} \right\|_{2,1} + \left\| {{\varvec{\upbeta}}} \right\|_{2,1} - \sum\nolimits_{i = 1}^{N} {\sum\nolimits_{j = 1}^{m} {{{\varvec{\uptheta}}}_{ij} } } \left( {{\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}} - {\mathbf{t}}_{ij} + {{\varvec{\upxi}}}_{ij} } \right),$$
(16)

where \({{\varvec{\uptheta}}}_{ij}\) is the Lagrange multiplier. Based on the solution in [38], Eq. (16) is equivalent to:

$$\ell_{LR21ELM} = C{\text{Tr}} \left( {{{\varvec{\upxi}}}^{T} {\mathbf{D}}_{1} {{\varvec{\upxi}}}} \right) + {\text{Tr}} \left( {{{\varvec{\upbeta}}}^{T} {\mathbf{D\upbeta }}} \right) - \sum\nolimits_{i = 1}^{N} {\sum\nolimits_{j = 1}^{m} {{{\varvec{\uptheta}}}_{ij} } } \left( {{\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}} - {\mathbf{t}}_{ij} + {{\varvec{\upxi}}}_{ij} } \right),$$
(17)

where \({\mathbf{D}}_{1}\) and \({\mathbf{D}}\) are diagonal matrices whose \(i\)-th diagonal elements are \(\frac{1}{{2\left\| {{{\varvec{\upxi}}}_{i} } \right\|_{2} }}\) and \(\frac{1}{{2\left\| {{{\varvec{\upbeta}}}_{i} } \right\|_{2} }},\) respectively. According to Eq. (17), the optimality conditions can be written as:

$$\frac{{\partial \ell_{LR21ELM} }}{{\partial {{\varvec{\uptheta}}}_{i} }} = 0 \Rightarrow {\mathbf{H\upbeta }} - {\mathbf{T}} + {{\varvec{\upxi}}} = 0,$$
(18)
$$\frac{{\partial \ell_{LR21ELM} }}{{\partial {{\varvec{\upbeta}}}_{i} }} = 0 \Rightarrow {\mathbf{D\upbeta }} = {\mathbf{H}}^{T} {{\varvec{\uptheta}}},$$
(19)
$$\frac{{\partial \ell_{LR21ELM} }}{{\partial {{\varvec{\upxi}}}_{i} }} = 0 \Rightarrow {{\varvec{\uptheta}}} = C{\mathbf{D}}_{1} {{\varvec{\upxi}}}.$$
(20)

If \(N < L,\) by substituting Eq. (19) and Eq. (20) into Eq. (18), we have:

$${{\varvec{\uptheta}}} = \left( {{\mathbf{HD}}^{ - 1} {\mathbf{H}}^{T} + \frac{{{\mathbf{D}}_{1}^{ - 1} }}{C}} \right)^{ - 1} {\mathbf{T}}.$$
(21)

According to Eq. (19), we have:

$${{\varvec{\upbeta}}} = {\mathbf{D}}^{ - 1} {\mathbf{H}}^{T} \left( {{\mathbf{HD}}^{ - 1} {\mathbf{H}}^{T} + \frac{{{\mathbf{D}}_{1}^{ - 1} }}{C}} \right)^{ - 1} {\mathbf{T}}.$$
(22)

And if \(N \ge L,\) by combining Eq. (19) with Eq. (20), we have:

$${{\varvec{\upxi}}} = \frac{{\left( {{\mathbf{H}}^{T} {\mathbf{D}}_{1} } \right)^{\dag } {\mathbf{D\upbeta }}}}{C}.$$
(23)

Substituting Eq. (23) into Eq. (18), we obtain an alternative solution of \({{\varvec{\upbeta}}}\):

$${{\varvec{\upbeta}}} = \left( {{\mathbf{H}}^{T} {\mathbf{D}}_{1} {\mathbf{H}} + \frac{{\mathbf{D}}}{C}} \right)^{ - 1} {\mathbf{H}}^{T} {\mathbf{D}}_{1} {\mathbf{T}}.$$
(24)

So, the analytic solution of \({{\varvec{\upbeta}}}\) is:

$$\left\{ {\begin{array}{*{20}l} {{{\varvec{\upbeta}}} = \left( {{\mathbf{H}}^{T} {\mathbf{D}}_{1} {\mathbf{H}} + \frac{{\mathbf{D}}}{C}} \right)^{ - 1} {\mathbf{H}}^{T} {\mathbf{D}}_{1} {\mathbf{T}},} \hfill & {N \ge L,} \hfill \\ {{{\varvec{\upbeta}}} = {\mathbf{D}}^{ - 1} {\mathbf{H}}^{T} \left( {{\mathbf{HD}}^{ - 1} {\mathbf{H}}^{T} + \frac{{{\mathbf{D}}_{1}^{ - 1} }}{C}} \right)^{ - 1} {\mathbf{T}},} \hfill & {N < L{.}} \hfill \\ \end{array} } \right.$$
(25)
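Since both \({\mathbf{D}}\) and \({\mathbf{D}}_{1}\) depend on the current \({{\varvec{\upbeta}}}\), Eq. (25) is naturally applied in an alternating fashion. The Python sketch below does this for the case \(N \ge L\). The all-ones initialization, the fixed iteration count, and the guard `eps` are assumptions for this illustration and are not taken from [9].

```python
import numpy as np

def lr21elm_weights(H, T, C, n_iter=30, eps=1e-8):
    """Alternate between beta, D and D_1 for the N >= L case of Eq. (25)."""
    n_samples, n_hidden = H.shape
    d = np.ones(n_hidden)       # diagonal of D (one entry per row of beta)
    d1 = np.ones(n_samples)     # diagonal of D_1 (one entry per sample error)
    for _ in range(n_iter):
        HtD1 = H.T * d1         # H^T D_1 by scaling column i of H^T with d1[i]
        # beta = (H^T D_1 H + D / C)^{-1} H^T D_1 T
        beta = np.linalg.solve(HtD1 @ H + np.diag(d) / C, HtD1 @ T)
        d = 1.0 / (2.0 * np.maximum(np.linalg.norm(beta, axis=1), eps))
        residual = T - H @ beta
        d1 = 1.0 / (2.0 * np.maximum(np.linalg.norm(residual, axis=1), eps))
    return beta

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
H = rng.standard_normal((40, 10))
T = rng.standard_normal((40, 2))
beta = lr21elm_weights(H, T, C=1.0)
```

Samples with a large residual receive a small weight in \({\mathbf{D}}_{1}\), so they influence the next update of \({{\varvec{\upbeta}}}\) less; this is the mechanism by which the L2,1-norm loss reduces the effect of outliers.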

Graph regularization

The graph regularization framework [39] has been widely used in semi-supervised learning [13] and unsupervised learning [40,41,42,43]. During data processing, graph regularization preserves the local manifold structure of the data so that structural information can be extracted, which benefits clustering and classification. Mathematically, graph regularization is expressed as:

$${\mathbf{Q}}_{gL} = \frac{1}{2}\sum\nolimits_{i,j} {{\mathbf{W}}_{i,j} \left\| {{\text{P}} \left( {{\mathbf{t}}|{\mathbf{x}}_{i} } \right) - {\text{P}} \left( {{\mathbf{t}}|{\mathbf{x}}_{j} } \right)} \right\|}^{2} ,$$
(26)

where \({\text{P}} \left( {{\mathbf{t}}|{\mathbf{x}}_{i} } \right)\) and \({\text{P}} \left( {{\mathbf{t}}|{\mathbf{x}}_{j} } \right)\) are conditional probabilities, and \({\mathbf{W}} = \left[ {{\mathbf{W}}_{i,j} } \right]\) is the similarity matrix. Because the conditional probabilities are difficult to compute, Eq. (26) is usually approximated by

$${\mathbf{Q}}_{gL}^{^{\prime}} = \frac{1}{2}\sum\nolimits_{i,j} {{\mathbf{W}}_{i,j} \left\| {{\mathbf{t}}_{i} - {\mathbf{t}}_{j} } \right\|}^{2} ,$$
(27)

where \({\mathbf{t}}_{i}\) and \({\mathbf{t}}_{j}\) are predictions of \({\mathbf{x}}_{i}\) and \({\mathbf{x}}_{j}\), respectively. And the matrix form of Eq. (27) is:

$${\mathbf{Q}}_{gL}^{^{\prime}} = {\text{Tr}} \left( {{\mathbf{T}}^{T} {\mathbf{ZT}}} \right),$$
(28)

where \({\mathbf{T}}\) is the prediction matrix, \({\text{Tr}} ( \bullet)\) denotes the trace of a matrix, and \({\mathbf{Z}} = {\mathbf{D}} - {\mathbf{W}}\) is the graph Laplacian matrix. Here \({\mathbf{D}}\) is a diagonal matrix with \(d_{ii} = \sum\nolimits_{j} {{\mathbf{W}}_{i,j} .}\)
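The equivalence between the pairwise form in Eq. (27) and the trace form in Eq. (28) can be checked numerically. In the Python sketch below, the similarity matrix is built from class labels with \({\mathbf{W}}_{i,j} = 1\) when samples \(i\) and \(j\) share a label and 0 otherwise; this weighting and the function names are assumptions for illustration, since the exact construction of \({\mathbf{W}}\) is not specified here.

```python
import numpy as np

def label_graph(labels):
    """Similarity matrix W and graph Laplacian Z = D - W built from labels."""
    labels = np.asarray(labels)
    # Assumption: W_ij = 1 for samples with the same label, 0 otherwise.
    W = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))            # d_ii = sum_j W_ij
    return W, D - W

def pairwise_form(T, W):
    """Eq. (27): 0.5 * sum_ij W_ij * ||t_i - t_j||^2."""
    diff = T[:, None, :] - T[None, :, :]
    return 0.5 * np.sum(W * np.sum(diff**2, axis=2))

def trace_form(T, Z):
    """Eq. (28): Tr(T^T Z T)."""
    return np.trace(T.T @ Z @ T)

# Toy predictions for five samples from two classes (illustrative only).
rng = np.random.default_rng(0)
T = rng.standard_normal((5, 2))
W, Z = label_graph([0, 0, 1, 1, 1])
print(pairwise_form(T, W), trace_form(T, Z))
```

The two values agree for any symmetric \({\mathbf{W}}\), and the penalty is small when samples connected in the graph receive similar predictions.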

Proposed CSRGELM

In practical applications, datasets usually contain considerable noise and many outliers, which seriously interfere with the experiments and lead to inaccurate results [44]. Because of noise and outliers, the classification performance of ELM often fails to meet expectations. Therefore, it is necessary to develop a method that is both efficient and robust to outliers and noise. In addition, many studies have shown that introducing graph regularization into ELM can effectively improve the classification performance or feature extraction ability of the algorithm [45, 46].

In this section, we propose a novel method named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM). The correntropy induced loss function replaces the square loss, which effectively improves the robustness of the method. As an adaptive sparse regularization term, the L2,1-norm is used to constrain the output weight matrix \({{\varvec{\upbeta}}}\), which generates row sparsity, eliminates redundant hidden layer nodes, and simplifies the structure of the neural network. In recent years, how to exploit the local consistency of data to improve the performance of machine learning methods has attracted researchers' attention [45]. Based on the idea that similar samples should have similar properties, graph regularization is combined with our method to preserve the local structural information, which may improve the classification performance of the method [13, 47]. We use the label information of the training samples to construct an adjacency graph, and the graph regularization term is added to constrain the output weight matrix so that similar samples produce similar outputs.

The objective function of CSRGELM

This section introduces the objective function of CSRGELM. For a dataset \(\left\{ {{\mathbf{X}}_{train} , \, {\mathbf{T}}_{train} } \right\} = \left\{ {{\mathbf{x}}_{i} , \, {\mathbf{t}}_{i} } \right\}_{i = 1}^{N} \in {\mathbb{R}}^{N \times m} ,\) \({\mathbf{T}}_{train}\) is the label matrix of \({\mathbf{X}}_{train}\), \(N\) is the number of samples, and \(m\) is the number of features. The mathematical model of CSRGELM can be expressed as:

$$\begin{aligned} & {\text{F}} \left( {{\varvec{\upbeta}}} \right) = \mathop {\min }\limits_{{{\varvec{\upbeta}}}} \sum\nolimits_{i = 1}^{N} {\left( {1 - \exp \left( { - \frac{{{{\varvec{\upxi}}}_{i}^{2} }}{{2\sigma^{2} }}} \right)} \right)} + \frac{\lambda }{2}\left\| {{\varvec{\upbeta}}} \right\|_{2,1} + \frac{C}{2}{\text{Tr}} \left( {\left( {{\mathbf{H\upbeta }}} \right)^{T} {\mathbf{ZH\upbeta }}} \right), \\ & \quad s.t.\quad {\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}} = {\mathbf{t}}_{i}^{T} - {{\varvec{\upxi}}}_{i}^{T} ,i = 1, \ldots ,N. \\ \end{aligned}$$
(29)

In Eq. (29), \({{\varvec{\upxi}}}_{i}\) is the error vector, \(\sigma\) is the bandwidth, and \({\mathbf{Z}}\) is the graph Laplacian matrix; \(\lambda\) and \(C\) are regularization parameters. Since the objective in Eq. (29) is not convex, it cannot be solved directly by commonly used convex optimization methods. Following the solution process in [23], the non-convex problem can nevertheless be solved effectively.
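The objective in Eq. (29) can be evaluated directly once the constraint is substituted. The following Python sketch is one possible reading, assuming that \({{\varvec{\upxi}}}_{i}^{2}\) denotes the squared Euclidean norm of the i-th residual row; the function name `csrgelm_objective` and the toy data are hypothetical.

```python
import numpy as np

def csrgelm_objective(H, T, beta, Z, sigma, lam, C):
    """Evaluate the three terms of Eq. (29) for a given beta."""
    residual = T - H @ beta                      # row i is xi_i
    sq_err = np.sum(residual ** 2, axis=1)       # assumed meaning of xi_i^2
    c_loss = np.sum(1.0 - np.exp(-sq_err / (2.0 * sigma ** 2)))
    sparsity = 0.5 * lam * np.linalg.norm(beta, axis=1).sum()
    out = H @ beta
    smooth = 0.5 * C * np.trace(out.T @ Z @ out)
    return c_loss + sparsity + smooth

# Toy problem: identity hidden-layer output, so beta = T fits exactly.
H = np.eye(3)
T = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
Z = np.zeros((3, 3))
perfect = csrgelm_objective(H, T, T.copy(), Z, sigma=1.0, lam=0.0, C=0.0)

# A single extreme target adds at most 1 to the loss, however large it is.
T_outlier = T.copy()
T_outlier[0, 0] = 1e6
bounded = csrgelm_objective(H, T_outlier, T.copy(), Z, sigma=1.0, lam=0.0, C=0.0)
```

Because each sample contributes less than 1 to the loss term, a single outlier cannot dominate the objective the way it does under the square loss.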

The optimization of CSRGELM

Since the correntropy induced loss is a differentiable and smooth function, the gradient optimization algorithm can be employed [23]. However, the gradient-based optimization algorithm converges slowly, so we use the half-quadratic optimization algorithm to solve the optimization problem of CSRGELM.

First, we define a convex function:

$${\text{f}} \left( \tau \right) = - \tau \log \left( { - \tau } \right) + \tau ,$$
(30)

where \(\tau < 0.\) Following the definition of the conjugate function in [48], for a differentiable function \(\psi \left( x \right): \, {\mathbb{R}}^{n} \to {\mathbb{R}},\) the conjugate function \(\psi^{*} \left( x \right): \, {\mathbb{R}}^{n} \to {\mathbb{R}}\) is defined as \(\psi^{*} \left( x \right) = \mathop {\sup }\limits_{p} \left( {px - \psi \left( p \right)} \right).\) If \(\psi \left( x \right)\) is a convex function, then \(\left( {\psi^{*} \left( x \right)} \right)^{*} = \psi \left( x \right)\) [49]. The conjugate function of Eq. (30) is therefore:

$${\text{f}}^{*} \left( \upsilon \right) = \sup {\text{f}}^{{\prime }} \left( \tau \right),$$
(31)

and

$${\text{f}}^{{\prime }} \left( \tau \right) = \upsilon \tau - {\text{f}} \left( \tau \right) = \upsilon \tau + \tau \log \left( { - \tau } \right) - \tau .$$
(32)

By letting \({{d{\text{f}}^{{\prime }} \left( \tau \right)} \mathord{\left/ {\vphantom {{d{\text{f}}^{{\prime }} \left( \tau \right)} {d\tau }}} \right. \kern-\nulldelimiterspace} {d\tau }} = 0,\) the solution of Eq. (32) can be obtained:

$$\upsilon + \log \left( { - \tau } \right) = 0 \Rightarrow \tau = - \exp \left( { - \upsilon } \right) < 0.$$
(33)

Substituting Eq. (33) into Eq. (31), the conjugate function can be expressed as:

$${\text{f}}^{*} \left( \upsilon \right) = \exp \left( { - \upsilon } \right).$$
(34)

Setting \({{\varvec{\upupsilon}}} = {{{{\varvec{\upxi}}}_{i}^{2} } \mathord{\left/ {\vphantom {{{{\varvec{\upxi}}}_{i}^{2} } {2\sigma^{2} }}} \right. \kern-\nulldelimiterspace} {2\sigma^{2} }},\) we have

$${\text{f}}^{*} \left( {\frac{{{{\varvec{\upxi}}}_{i}^{2} }}{{2\sigma^{2} }}} \right) = \sup \left( {\frac{{{{\varvec{\upxi}}}_{i}^{2} }}{{2\sigma^{2} }}\tau + \tau \log \left( { - \tau } \right) - \tau } \right) = \exp \left( { - \frac{{{{\varvec{\upxi}}}_{i}^{2} }}{{2\sigma^{2} }}} \right).$$
(35)

As described in [23], the supremum is reached when \(\tau = - \exp \left( { - \left( {{{{{\varvec{\upxi}}}_{i}^{2} } \mathord{\left/ {\vphantom {{{{\varvec{\upxi}}}_{i}^{2} } {2\sigma^{2} }}} \right. \kern-\nulldelimiterspace} {2\sigma^{2} }}} \right)} \right) < 0.\)
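The identity in Eq. (35) can be checked numerically. The Python sketch below evaluates the expression of Eq. (32) on a grid of negative \(\tau\) values and compares the maximum with \(\exp \left( { - \upsilon } \right)\). The value \(\upsilon = 0.7\) and the grid are arbitrary choices made for illustration.

```python
import numpy as np

def inner(tau, v):
    """v*tau - f(tau) with f(tau) = -tau*log(-tau) + tau, for tau < 0."""
    return v * tau + tau * np.log(-tau) - tau

v = 0.7                                     # arbitrary non-negative test value
taus = -np.linspace(1e-6, 5.0, 200_001)     # grid of strictly negative tau
values = inner(taus, v)

sup_numeric = values.max()                  # grid approximation of the supremum
tau_best = taus[np.argmax(values)]          # grid maximizer
sup_closed_form = np.exp(-v)                # right-hand side of Eq. (34)
tau_closed_form = -np.exp(-v)               # maximizer from Eq. (33)
```

On this grid the numerical maximum and maximizer agree with the closed forms up to the grid resolution.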

Combining Eq. (35) with Eq. (29), we obtain the following mathematical model:

$$\begin{gathered} {\text{F}}^{^{\prime}} \left( {{\varvec{\upbeta}}} \right) = \mathop {\min }\limits_{{{{\varvec{\upbeta}}},{{\varvec{\uptau}}}}} \sum\nolimits_{i = 1}^{N} {\left( {1 - \sup \left( {\frac{{{{\varvec{\upxi}}}_{i}^{2} }}{{2\sigma^{2} }}{{\varvec{\uptau}}}_{i} - {\text{f}} \left( {{{\varvec{\uptau}}}_{i} } \right)} \right)} \right)} + \frac{\lambda }{2}\left\| {{\varvec{\upbeta}}} \right\|_{2,1} + \frac{C}{2}{\text{Tr}} \left( {\left( {{\mathbf{H\upbeta }}} \right)^{T} {\mathbf{ZH\upbeta }}} \right), \hfill \\ s.t. \, {\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}} = {\mathbf{t}}_{i}^{T} - {{\varvec{\upxi}}}_{i}^{T} , \, i = 1, \, \ldots , \, N, \hfill \\ \end{gathered}$$
(36)

where \({{\varvec{\uptau}}} = [{{\varvec{\uptau}}}_{1} , \, {{\varvec{\uptau}}}_{2} , \ldots {,}{{\varvec{\uptau}}}_{N} ]^{T} .\) Since \(- \sup \left( g \right) = \inf \left( { - g} \right)\) and the constant term does not affect the minimizer, Equation (36) can be rewritten as:

$$\begin{gathered} {\text{F}}^{^{\prime\prime}} \left( {{\varvec{\upbeta}}} \right) = \mathop {\min }\limits_{{{{\varvec{\upbeta}}},{{\varvec{\uptau}}}}} \left( {\sum\nolimits_{i = 1}^{N} {\left( { - \frac{{{{\varvec{\upxi}}}_{i}^{2} }}{{2\sigma^{2} }}{{\varvec{\uptau}}}_{i} + {\text{f}} \left( {{{\varvec{\uptau}}}_{i} } \right)} \right) + \frac{\lambda }{2}\left\| {{\varvec{\upbeta}}} \right\|_{2,1} + \frac{C}{2}{\text{Tr}} \left( {\left( {{\mathbf{H\upbeta }}} \right)^{T} {\mathbf{ZH\upbeta }}} \right)} } \right), \hfill \\ s.t. \, {\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}} = {\mathbf{t}}_{i}^{T} - {{\varvec{\upxi}}}_{i}^{T} , \, i = 1, \, \ldots , \, N. \hfill \\ \end{gathered}$$
(37)

Two variables need to be optimized: \({{\varvec{\uptau}}}\) and \({{\varvec{\upbeta}}}\). We solve Eq. (37) by alternating optimization, fixing one variable while optimizing the other.

(1) Fix \({{\varvec{\upbeta}}}^{n}\) and optimize \({{\varvec{\uptau}}}^{n + 1} .\)

For a given \({{\varvec{\upbeta}}}^{n} ,\) Eq. (37) can be expressed as:

$$\mathop {\min }\limits_{{{{\varvec{\uptau}}}^{n + 1} }} \sum\nolimits_{i = 1}^{N} {\left( { - \frac{{{{\varvec{\upxi}}}_{i}^{2} }}{{2\sigma^{2} }}{{\varvec{\uptau}}}_{i}^{n + 1} + {\text{f}} \left( {{{\varvec{\uptau}}}_{i}^{n + 1} } \right)} \right)} ,\,s.t.\,{\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}}^{n} = {\mathbf{t}}_{i}^{T} - {{\varvec{\upxi}}}_{i}^{T} ,\,i = 1, \ldots ,N{.}$$
(38)

Substituting constraints into Eq. (38), we can get:

$$\mathop {\min }\limits_{{{{\varvec{\uptau}}}^{n + 1} }} \sum\nolimits_{i = 1}^{N} {\left( { - \frac{{\left( {{\mathbf{t}}_{i}^{T} - {\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}}^{n} } \right)^{2} }}{{2\sigma^{2} }}{{\varvec{\uptau}}}_{i}^{n + 1} + {\text{f}} \left( {{{\varvec{\uptau}}}_{i}^{n + 1} } \right)} \right)} .$$
(39)

According to Eq. (33), the solution of Eq. (39) is:

$${{\varvec{\uptau}}}_{i}^{n + 1} = - \exp \left( { - \frac{{\left( {{\mathbf{t}}_{i}^{T} - {\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}}^{n} } \right)^{2} }}{{2\sigma^{2} }}} \right),\,i = 1, \ldots ,N,$$
(40)

where \({{\varvec{\uptau}}}_{i}^{n + 1} < 0.\)
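The update in Eq. (40) is a single vectorized operation. The Python sketch below assumes that the squared error of a sample is the squared Euclidean norm of its residual row; the function name `update_tau` and the toy residuals are hypothetical.

```python
import numpy as np

def update_tau(H, T, beta, sigma):
    """Eq. (40): tau_i = -exp(-||t_i - h(x_i) beta||^2 / (2 sigma^2))."""
    residual = T - H @ beta
    sq_err = np.sum(residual ** 2, axis=1)   # assumed meaning of the squared error
    return -np.exp(-sq_err / (2.0 * sigma ** 2))

# Toy case: residual norms 0, 1 and 10 for three samples.
H = np.eye(3)
beta = np.zeros((3, 2))
T = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [10.0, 0.0]])
tau = update_tau(H, T, beta, sigma=1.0)
weights = -tau   # per-sample weights that later appear on the diagonal of omega
```

A sample with a large residual receives a weight close to zero, so it barely influences the subsequent update of the output weights.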

(2) Fix \({{\varvec{\uptau}}}^{n + 1}\) and optimize \({{\varvec{\upbeta}}}^{n + 1} .\)

For a given \({{\varvec{\uptau}}}^{n + 1} ,\) we solve the following problem:

$$\begin{aligned} & \mathop {\min }\limits_{{{{\varvec{\upbeta}}}^{n + 1} }} \left( {\sum\nolimits_{i = 1}^{N} {\left( { - \frac{{{{\varvec{\uptau}}}_{i}^{n + 1} }}{{2\sigma^{2} }}{{\varvec{\upxi}}}_{i}^{2} } \right) + \frac{\lambda }{2}\left\| {{{\varvec{\upbeta}}}^{n + 1} } \right\|_{2,1} + \frac{C}{2}{\text{Tr}} \left( {\left( {{\mathbf{H\upbeta }}^{n + 1} } \right)^{T} {\mathbf{ZH\upbeta }}^{n + 1} } \right)} } \right), \\ & \quad s.t.\quad {\mathbf{h}}\left( {{\mathbf{x}}_{i} } \right){{\varvec{\upbeta}}}^{n + 1} = {\mathbf{t}}_{i}^{T} - {{\varvec{\upxi}}}_{i}^{T} ,\,i = 1, \ldots ,N. \\ \end{aligned}$$
(41)

By eliminating the constraints and rewriting Eq. (41) in matrix form, we get:

$$\ell_{CSRGELM} = - \frac{{{{\varvec{\uptau}}}^{n + 1} }}{{2\sigma^{2} }}\left( {{\mathbf{T}} - {\mathbf{H\upbeta }}^{n + 1} } \right)^{2} + \frac{\lambda }{2}\left\| {{{\varvec{\upbeta}}}^{n + 1} } \right\|_{2,1} + \frac{C}{2}{\text{Tr}} \left( {\left( {{\mathbf{H\upbeta }}^{n + 1} } \right)^{T} {\mathbf{ZH\upbeta }}^{n + 1} } \right).$$
(42)

Following the conclusion in [38], Eq. (42) can be rewritten as:

$$\begin{aligned} \ell_{CSRGELM} & = - \frac{{{{\varvec{\uptau}}}^{n + 1} }}{{2\sigma^{2} }}\left( {{\mathbf{T}} - {\mathbf{H\upbeta }}^{n + 1} } \right)^{2} + \frac{\lambda }{2}{\text{Tr}} \left( {\left( {{{\varvec{\upbeta}}}^{n + 1} } \right)^{T} {\mathbf{D}}^{n + 1} {{\varvec{\upbeta}}}^{n + 1} } \right) \\ & \quad + \frac{C}{2}{\text{Tr}} \left( {\left( {{\mathbf{H\upbeta }}^{n + 1} } \right)^{T} {\mathbf{ZH\upbeta }}^{n + 1} } \right), \\ \end{aligned}$$
(43)

where \({\mathbf{D}}^{n + 1}\) is a diagonal matrix with \(d_{ii} = {1 \mathord{\left/ {\vphantom {1 {\left( {2\left\| {{{\varvec{\upbeta}}}_{i}^{n + 1} } \right\|_{2} } \right)}}} \right. \kern-\nulldelimiterspace} {\left( {2\left\| {{{\varvec{\upbeta}}}_{i}^{n + 1} } \right\|_{2} } \right)}}.\) In theory, \(\left\| {{{\varvec{\upbeta}}}_{i}^{n + 1} } \right\|_{2}\) can be zero, which would make Eq. (43) non-differentiable. To prevent this, a small regularization term is added:

$$d_{ii}^{{\prime }} = \frac{1}{{2\left( {\sqrt {\left( {{{\varvec{\upbeta}}}_{i}^{n + 1} } \right)^{T} {{\varvec{\upbeta}}}_{i}^{n + 1} + \kappa } } \right)}},$$
(44)

where \(\kappa\) is a very small regularization constant; in the experiments, \(\kappa = 10^{ - 6} .\) It is clear that \(d_{ii}^{\prime } \to d_{ii}\) as \(\kappa \to 0.\)
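The regularized diagonal entries of Eq. (44) can be computed for all rows at once. In the Python sketch below, the helper name `row_weights` and the toy matrix are hypothetical; the default \(\kappa = 10^{-6}\) follows the value stated above.

```python
import numpy as np

def row_weights(beta, kappa=1e-6):
    """Eq. (44): d'_ii = 1 / (2 * sqrt(||beta_i||^2 + kappa)) for each row i."""
    return 1.0 / (2.0 * np.sqrt(np.sum(beta ** 2, axis=1) + kappa))

beta = np.array([[3.0, 4.0],    # row norm 5
                 [0.0, 0.0]])   # zero row: would divide by zero without kappa
d = row_weights(beta)
D = np.diag(d)                  # the diagonal matrix used in Eq. (43)
```

For the zero row the entry stays finite (500 with this \(\kappa\)), while for the non-zero row it is practically identical to \(1/(2 \cdot 5) = 0.1\).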

Setting the derivative of \(\ell_{CSRGELM}\) with respect to \({{\varvec{\upbeta}}}^{n + 1}\) to zero, we have:

$$\frac{{\partial \ell_{CSRGELM} }}{{\partial {{\varvec{\upbeta}}}^{n + 1} }} = 0 \Rightarrow - \frac{1}{{\sigma^{2} }}{\mathbf{H}}^{T} {{\varvec{\upomega}}}\left( {{\mathbf{T}} - {\mathbf{H\upbeta }}^{n + 1} } \right) + \lambda {\mathbf{D\upbeta }}^{n + 1} + C{\mathbf{H}}^{T} {\mathbf{ZH\upbeta }}^{n + 1} = 0,$$
(45)

where \({{\varvec{\upomega}}} = {\text{diag}} \left( { - {{\varvec{\uptau}}}_{1}^{n + 1} , \ldots , - {{\varvec{\uptau}}}_{N}^{n + 1} } \right).\)

When the number of hidden nodes is smaller than the number of training samples, the output weight matrix \({{\varvec{\upbeta}}}^{n + 1}\) can be obtained from:

$$\lambda \sigma^{2} {\mathbf{D}}^{n + 1} {{\varvec{\upbeta}}}^{n + 1} + {\mathbf{H}}^{T} {\mathbf{\omega H\upbeta }}^{n + 1} + C\sigma^{2} {\mathbf{H}}^{T} {\mathbf{ZH\upbeta }}^{n + 1} - {\mathbf{H}}^{T} {\mathbf{\omega T}} = 0,$$
(46)

that is

$${{\varvec{\upbeta}}}^{n + 1} = \left( {\lambda \sigma^{2} {\mathbf{D}}^{n + 1} + {\mathbf{H}}^{T} {\mathbf{\omega H}} + C\sigma^{2} {\mathbf{H}}^{T} {\mathbf{ZH}}} \right)^{ - 1} {\mathbf{H}}^{T} {\mathbf{\omega T}}.$$
(47)
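Equation (47) amounts to solving one linear system per iteration. The Python sketch below uses `numpy.linalg.solve` instead of forming the inverse explicitly and then verifies that the result satisfies the stationarity condition of Eq. (45). The function name `update_beta_primal`, the random test data and all parameter values are assumptions for illustration.

```python
import numpy as np

def update_beta_primal(H, T, tau, d, Z, sigma, lam, C):
    """Eq. (47): solve (eta*D + H^T w H + rho*H^T Z H) beta = H^T w T."""
    omega = np.diag(-tau)
    eta, rho = lam * sigma ** 2, C * sigma ** 2
    A = eta * np.diag(d) + H.T @ omega @ H + rho * H.T @ Z @ H
    return np.linalg.solve(A, H.T @ omega @ T)

rng = np.random.default_rng(0)
N, L, M = 12, 4, 2                           # samples, hidden nodes, classes
H = rng.normal(size=(N, L))
T = np.eye(M)[rng.integers(0, M, size=N)]    # one-hot targets
W = rng.random((N, N))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
Z = np.diag(W.sum(axis=1)) - W               # graph Laplacian
tau = -rng.uniform(0.1, 1.0, size=N)         # negative, as in Eq. (40)
d = rng.uniform(0.5, 2.0, size=L)            # positive diagonal of D
sigma, lam, C = 1.5, 0.3, 0.2

beta = update_beta_primal(H, T, tau, d, Z, sigma, lam, C)

# Left-hand side of Eq. (45); it should vanish at the solution.
omega = np.diag(-tau)
grad = (-(H.T @ omega @ (T - H @ beta)) / sigma ** 2
        + lam * np.diag(d) @ beta
        + C * H.T @ Z @ H @ beta)
```

Solving the system directly is generally preferable to computing the inverse, both for speed and for numerical stability.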

If the number of hidden nodes is larger than the number of training samples, \({{\varvec{\upbeta}}}^{n + 1}\) may have an unlimited number of solutions. Inspired by the solution of Huang et al. [13], and according to Eq. (46), we let:

$$\lambda \sigma^{2} {\mathbf{D}}^{n + 1} {{\varvec{\upbeta}}}^{n + 1} = {\mathbf{H}}^{T} {{\varvec{\upalpha}}} \Rightarrow {{\varvec{\upbeta}}}^{n + 1} = \frac{1}{{\lambda \sigma^{2} }}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} {{\varvec{\upalpha}}}.$$
(48)

Substituting Eq. (48) into Eq. (46), we have:

$${\mathbf{H}}^{T} {{\varvec{\upalpha}}} + \frac{1}{{\lambda \sigma^{2} }}{\mathbf{H}}^{T} {\mathbf{\omega H}}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} {{\varvec{\upalpha}}} + \frac{C}{\lambda }{\mathbf{H}}^{T} {\mathbf{ZH}}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} {{\varvec{\upalpha}}} - {\mathbf{H}}^{T} {\mathbf{\omega T}} = 0.$$
(49)

Multiplying both sides of Eq. (49) on the left by \(\left( {{\mathbf{HH}}^{T} } \right)^{ - 1} {\mathbf{H}},\) we get:

$${{\varvec{\upalpha}}} + \frac{1}{{\lambda \sigma^{2} }}{\mathbf{\omega H}}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} {{\varvec{\upalpha}}} + \frac{C}{\lambda }{\mathbf{ZH}}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} {{\varvec{\upalpha}}} - {\mathbf{\omega T}} = 0.$$
(50)

Then we obtain the solution of \({{\varvec{\upalpha}}}\):

$${{\varvec{\upalpha}}} = \left( {{\mathbf{I}} + \frac{1}{{\lambda \sigma^{2} }}{\mathbf{\omega H}}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} + \frac{C}{\lambda }{\mathbf{ZH}}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} } \right)^{ - 1} {\mathbf{\omega T}}.$$
(51)

And \({{\varvec{\upbeta}}}^{n + 1}\) can be computed as:

$${{\varvec{\upbeta}}}^{n + 1} = \frac{1}{{\lambda \sigma^{2} }}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} \left( {{\mathbf{I}} + \frac{1}{{\lambda \sigma^{2} }}{\mathbf{\omega H}}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} + \frac{C}{\lambda }{\mathbf{ZH}}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} } \right)^{ - 1} {\mathbf{\omega T}},$$
(52)

where \({\mathbf{I}}\) is the \(N \times N\) identity matrix. The analytical solution of \({{\varvec{\upbeta}}}^{n + 1}\) can finally be written as:

$$\left\{ {\begin{array}{*{20}l} {{{\varvec{\upbeta}}}^{n + 1} = \left( {\eta {\mathbf{D}}^{n + 1} + {\mathbf{H}}^{T} {\mathbf{\omega H}} + \rho {\mathbf{H}}^{T} {\mathbf{ZH}}} \right)^{ - 1} {\mathbf{H}}^{T} {\mathbf{\omega T}},} \hfill & {N \ge L} \hfill \\ {{{\varvec{\upbeta}}}^{n + 1} = \frac{1}{\eta }\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} \left( {{\mathbf{I}} + \frac{1}{\eta }{\mathbf{\omega H}}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} + \frac{C}{\lambda }{\mathbf{ZH}}\left( {{\mathbf{D}}^{n + 1} } \right)^{ - 1} {\mathbf{H}}^{T} } \right)^{ - 1} {\mathbf{\omega T}}.} \hfill & {N < L} \hfill \\ \end{array} } \right.$$
(53)
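The two cases of Eq. (53) differ in the size of the linear system: \(L \times L\) in the first case and \(N \times N\) in the second. Whenever both systems are invertible, the two expressions are algebraically identical, which the Python sketch below checks numerically for a case with \(N < L\). The function names, the random test data and the parameter values are hypothetical, and \(\eta = \lambda \sigma^{2}\), \(\rho = C\sigma^{2}\) are used in the code.

```python
import numpy as np

def beta_primal(H, T, tau, d, Z, sigma, lam, C):
    """First case of Eq. (53): an L x L system."""
    omega = np.diag(-tau)
    eta, rho = lam * sigma ** 2, C * sigma ** 2
    A = eta * np.diag(d) + H.T @ omega @ H + rho * H.T @ Z @ H
    return np.linalg.solve(A, H.T @ omega @ T)

def beta_dual(H, T, tau, d, Z, sigma, lam, C):
    """Second case of Eq. (53): an N x N system."""
    omega = np.diag(-tau)
    eta = lam * sigma ** 2
    HDinv = H / d                       # H D^{-1}: column j divided by d_j
    G = HDinv @ H.T                     # H D^{-1} H^T, shape N x N
    M = np.eye(H.shape[0]) + (omega @ G) / eta + (C / lam) * (Z @ G)
    alpha = np.linalg.solve(M, omega @ T)
    return (HDinv.T @ alpha) / eta      # (1/eta) D^{-1} H^T alpha

rng = np.random.default_rng(1)
N, L, M_cls = 5, 8, 2                   # fewer samples than hidden nodes
H = rng.normal(size=(N, L))
T = np.eye(M_cls)[rng.integers(0, M_cls, size=N)]
W = rng.random((N, N))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
Z = np.diag(W.sum(axis=1)) - W
tau = -rng.uniform(0.1, 1.0, size=N)
d = rng.uniform(0.5, 2.0, size=L)
args = (H, T, tau, d, Z, 1.0, 0.5, 0.1)  # sigma, lam, C

b_primal = beta_primal(*args)
b_dual = beta_dual(*args)
```

In practice the choice between the two forms is about cost: the smaller of the two systems is the cheaper one to solve.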

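Here \(\eta = \lambda \sigma^{2}\) and \(\rho = C\sigma^{2} ,\) so that \({C \mathord{\left/ {\vphantom {C \lambda }} \right. \kern-\nulldelimiterspace} \lambda } = {\rho \mathord{\left/ {\vphantom {\rho \eta }} \right. \kern-\nulldelimiterspace} \eta }.\) Note that \({{\varvec{\upbeta}}}^{n + 1}\) depends on \({\mathbf{D}}^{n + 1} ,\) so an iterative optimization algorithm is used to solve for \({{\varvec{\upbeta}}}^{n + 1}\) and \({\mathbf{D}}^{n + 1} .\) The flow of Algorithm 1 is as follows: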

[Algorithm 1: iterative optimization of CSRGELM]
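The updates of Eqs. (40), (44) and (47) can be combined into a single alternating loop. The Python sketch below is a minimal illustration under stated assumptions and is not the authors' Algorithm 1: the ridge initialization, the use of the previous iterate to build \({\mathbf{D}}\), the stopping rule, the sigmoid hidden layer, the label-based weight matrix and all default parameter values are choices made here for illustration only.

```python
import numpy as np

def fit_csrgelm(H, T, Z, sigma=1.0, lam=0.1, C=0.1, kappa=1e-6,
                max_iter=30, tol=1e-6):
    """Alternate the tau update (Eq. 40) and the beta update (Eq. 47)."""
    L = H.shape[1]
    # Assumption: start from a plain ridge-regression solution.
    beta = np.linalg.solve(H.T @ H + np.eye(L), H.T @ T)
    eta, rho = lam * sigma ** 2, C * sigma ** 2
    HtZH = H.T @ Z @ H
    for _ in range(max_iter):
        sq_err = np.sum((T - H @ beta) ** 2, axis=1)
        w = np.exp(-sq_err / (2.0 * sigma ** 2))      # w_i = -tau_i, Eq. (40)
        # Assumption: D is built from the previous iterate of beta, Eq. (44).
        d = 1.0 / (2.0 * np.sqrt(np.sum(beta ** 2, axis=1) + kappa))
        HtW = H.T * w                                 # H^T omega
        new_beta = np.linalg.solve(eta * np.diag(d) + HtW @ H + rho * HtZH,
                                   HtW @ T)
        change = np.linalg.norm(new_beta - beta)
        beta = new_beta
        if change <= tol * max(1.0, np.linalg.norm(beta)):
            break
    return beta

# Toy usage on two Gaussian blobs with a random sigmoid hidden layer.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
T = np.eye(2)[y]                                      # one-hot targets
A = rng.normal(size=(2, 20))                          # random input weights
b = rng.normal(size=20)                               # random biases
H = 1.0 / (1.0 + np.exp(-(X @ A + b)))                # hidden layer output
W = (y[:, None] == y[None, :]).astype(float)          # assumed label-based graph
np.fill_diagonal(W, 0.0)
Z = np.diag(W.sum(axis=1)) - W

beta = fit_csrgelm(H, T, Z)
pred = np.argmax(H @ beta, axis=1)                    # predicted class indices
```

Each pass costs one linear solve of size \(L \times L\), which matches the per-iteration cost discussed in the complexity analysis.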

Computational complexity analysis

The computational complexity of CSRGELM is analyzed in this subsection. We define \(M\) as the number of classes. In Eq. (47), we have to calculate \({\mathbf{D}}^{n + 1} ,\) \({\mathbf{H}}^{T} {\mathbf{\omega H}},\) \({\mathbf{H}}^{T} {\mathbf{ZH}},\) \({\mathbf{H}}^{T} {\mathbf{\omega T}},\) \({{\varvec{\upomega}}}\) and \(\left( {\lambda \sigma^{2} {\mathbf{D}}^{n + 1} + {\mathbf{H}}^{T} {\mathbf{\omega H}} + C\sigma^{2} {\mathbf{H}}^{T} {\mathbf{ZH}}} \right)^{ - 1} .\) The computational cost for \({\mathbf{D}}^{n + 1}\) is \(O\left( {LM} \right),\) and it needs \(O\left( {L^{2} N} \right)\) to compute \({\mathbf{H}}^{T} {\mathbf{\omega H}}\) and \({\mathbf{H}}^{T} {\mathbf{ZH}}.\) For \({\mathbf{H}}^{T} {\mathbf{\omega T}},\) the computational complexity is \(O\left( {LNM} \right),\) and the computational complexity for \(\left( {\lambda \sigma^{2} {\mathbf{D}}^{n + 1} + {\mathbf{H}}^{T} {\mathbf{\omega H}} + C\sigma^{2} {\mathbf{H}}^{T} {\mathbf{ZH}}} \right)^{ - 1}\) is \(O\left( {L^{3} } \right),\) while it needs \(O\left( {NM} \right)\) to compute \({{\varvec{\upomega}}}.\) In addition, the cost of multiplying \(\left( {\lambda \sigma^{2} {\mathbf{D}}^{n + 1} + {\mathbf{H}}^{T} {\mathbf{\omega H}} + C\sigma^{2} {\mathbf{H}}^{T} {\mathbf{ZH}}} \right)^{ - 1}\) by \({\mathbf{H}}^{T} {\mathbf{\omega T}}\) is \(O\left( {L^{2} M} \right).\) Since \(N > L,\) the computational cost of Eq. (47) is \(O\left( {L^{2} N} \right).\) Assuming that the method converges after \(K\) iterations, the final computational cost of CSRGELM is \(K \times O\left( {L^{2} N} \right).\)

Robustness analysis

An experiment is designed to demonstrate the robustness of CSRGELM to outliers and noise. Two groups of data are randomly generated from Gaussian distributions. Class 1 includes 300 samples with mean parameter \(\chi_{1} = \left[ { - 2, - 2} \right]\) and covariance matrix \(\phi_{1} = \left[ {1 0; 0 1} \right],\) while class 2 includes another 300 samples with mean parameter \(\chi_{2} = \left[ {2, 2} \right]\) and covariance matrix \(\phi_{2} = \left[ {1 0; 0 1} \right].\) RELM, L2,1-RFELM, LR21ELM and CSRGELM are each trained on this dataset, and the classification decision boundaries are shown in Fig. 9. Figure 9a shows the classification results without noise; the two classes are easily separated. Figure 9b shows the classification results with 50 noisy points, which originally belong to class 2 but are mixed into class 1. Under the interference of noise, the decision boundaries of all four methods change, and the changes for RELM and L2,1-RFELM are more obvious. A second dataset is then generated in which class 1 and class 2 each have 500 samples. First, the four methods are trained on this dataset, and the decision boundaries are shown in Fig. 10a. The data can clearly be separated by the four straight lines. Then, 100 points belonging to class 2 are mixed into class 1 as noise. The final classification results are shown in Fig. 10b. Clearly, RELM and L2,1-RFELM try to fit the noise, and their decision boundaries become unreliable. In contrast, owing to the robust loss function, the decision boundaries of CSRGELM and LR21ELM are hardly affected.
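A dataset of this kind can be generated in a few lines. The Python sketch below follows the stated means and covariances; how the noisy points are produced is not specified in detail, so relabeling randomly chosen class 2 samples as class 1 is an assumption, as are the function name `make_gaussian_classes` and the seed.

```python
import numpy as np

def make_gaussian_classes(n_per_class=300, n_noise=50, seed=0):
    """Two 2-D Gaussian classes; n_noise class 2 points are relabeled as class 1."""
    rng = np.random.default_rng(seed)
    cov = np.eye(2)                                   # covariance [1 0; 0 1]
    X1 = rng.multivariate_normal([-2.0, -2.0], cov, size=n_per_class)
    X2 = rng.multivariate_normal([2.0, 2.0], cov, size=n_per_class)
    X = np.vstack([X1, X2])
    y_clean = np.repeat([1, 2], n_per_class)
    y_noisy = y_clean.copy()
    # Assumption: noise means flipping the labels of randomly chosen class 2 samples.
    flip = rng.choice(np.arange(n_per_class, 2 * n_per_class),
                      size=n_noise, replace=False)
    y_noisy[flip] = 1
    return X, y_clean, y_noisy

X, y_clean, y_noisy = make_gaussian_classes()                 # first setting
X_big, y_big, y_big_noisy = make_gaussian_classes(500, 100)   # second setting
```

Training on `y_noisy` and evaluating against `y_clean` is one way to measure how far the label noise shifts a decision boundary.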

Fig. 9

Classification decision boundaries on the artificial Gaussian dataset with 300 samples per class and 50 noisy points

Fig. 10

Classification decision boundaries on the artificial Gaussian dataset with 500 samples per class and 100 noisy points

Availability of data and materials

The TCGA datasets that support the findings of this study are available at https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga. The UCI datasets that support the findings of this study are available at https://archive.ics.uci.edu/ml/datasets.php. The g50c, COIL20, and USPST datasets that support the findings of this study are described in [13].

Abbreviations

ELM: Extreme learning machine

CSRGELM: Correntropy induced loss based sparse robust graph regularized extreme learning machine

SLFNs: Single hidden layer feedforward neural networks

CIM: Correntropy induced metric

CH-loss: Correntropy loss and hinge loss

CHELM: Extreme learning machine with the CH-loss

TCGA: The Cancer Genome Atlas

UCI: University of California Irvine

Acc: Accuracy

Pre: Precision

F-mea: F-measure

References

1. Leshno M, Lin VY, Pinkus A, Schocken S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 1993;6(6):861–67.

2. Huang G-B, Zhu Q-Y, Siew C-K. Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat No 04CH37541); 2004. IEEE, pp. 985–990.

3. Huang GB, Zhu QY, Siew CK. Extreme learning machine: theory and applications. Neurocomputing. 2006;70(1–3):489–501.

4. Huang GB, Wang DH, Lan Y. Extreme learning machines: a survey. Int J Mach Learn Cybernet. 2011;2(2):107–22.

5. Huang GB, Zhou H, Ding X, Zhang R. Extreme learning machine for regression and multiclass classification. IEEE Trans Syst Man Cybernet Part B. 2012;42(2):513–529.

6. Huang G-B, Chen L, Siew CK. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans Neural Netw Learn Syst. 2006;17(4):879–92.

7. Huang G-B. An insight into extreme learning machines: random neurons, random features and kernels. Cognit Comput. 2014;6(3):376–90.

8. Huang GB. What are extreme learning machines? Filling the gap between Frank Rosenblatt’s Dream and John von Neumann’s Puzzle. Cognit Comput. 2015;7(3):263–78.

9. Li R, Wang X, Lei L, Song Y. L2,1-norm based loss function and regularization extreme learning machine. IEEE Access. 2018;7:6575–86.

10. Cilimkovic M. Neural networks and back propagation algorithm. Dublin: Institute of Technology Blanchardstown; 2015. p. 15.

11. Man Z, Wu HR, Liu S, Yu X. A new adaptive backpropagation algorithm based on Lyapunov stability theory for neural networks. IEEE Trans Neural Networks. 2006;17(6):1580–91.

12. Lu H, Zheng E, Lu Y, Ma X, Liu J. ELM-based gene expression classification with misclassification cost. Neural Comput Appl. 2014;25(3–4):525–31.

13. Huang G, Song S, Gupta JN, Wu C. Semi-supervised and unsupervised extreme learning machines. IEEE Trans Cybernet. 2014;44(12):2405–17.

14. Huang G, Huang GB, Song S, You K. Trends in extreme learning machines: a review. Neural Netw. 2015;61(C):32–48.

15. Cao F, Liu B, Park DS. Image classification based on effective extreme learning machine. Neurocomputing. 2013;102:90–7.

16. Ergul U, Bilgin G. MCK-ELM: multiple composite kernel extreme learning machine for hyperspectral images. Neural Comput Appl. 2020;32(11):6809–19.

17. Jiang M, Pan Z, Li N. Multi-label text categorization using L21-norm minimization extreme learning machine. Neurocomputing. 2017;261:4–10.

18. Deng C, Wang S, Bovik AC, Huang G-B, Zhao B. Blind noisy image quality assessment using sub-band kurtosis. IEEE Trans Cybernet. 2019;50(3):1146–56.

19. Huang Z, Yu Y, Gu J, Liu H. An efficient method for traffic sign recognition based on extreme learning machine. IEEE Trans Cybernet. 2016;47(4):920–33.

20. Liu W, Pokharel PP, Principe JC. Correntropy: a localized similarity measure. In: The 2006 IEEE international joint conference on neural network proceedings; 2006. IEEE, pp. 4919–4924.

21. Ren Z, Yang L. Correntropy-based robust extreme learning machine for classification. Neurocomputing. 2018;313:74–84.

22. Singh A, Pokharel R, Principe J. The C-loss function for pattern classification. Pattern Recognit. 2014;47(1):441–53.

23. Xu G, Hu B-G, Principe JC. Robust C-loss kernel classifiers. IEEE Trans Neural Netw Learn Syst. 2016;29(3):510–22.

24. Zhao Y-P, Tan J-F, Wang J-J, Yang Z. C-loss based extreme learning machine for estimating power of small-scale turbojet engine. Aerosp Sci Technol. 2019;89:407–19.

25. Liangjun C, Honeine P, Hua Q, Jihong Z, Xia S. Correntropy-based robust multilayer extreme learning machines. Pattern Recognit. 2018;84:357–70.

26. Allain M, Idier J, Goussard Y. On global and local convergence of half-quadratic algorithms. IEEE Trans Image Process. 2006;15(5):1130–42.

27. Sindhwani V, Niyogi P, Belkin M. Beyond the point cloud: from transductive to semi-supervised learning. In: Proceedings of the 22nd international conference on machine learning; 2005, pp. 824–831.

28. Sindhwani V, Rosenberg DS. An RKHS for multi-view learning and manifold co-regularization. In: Proceedings of the 25th International Conference on Machine Learning; 2008, pp. 976–983.

29. Melacci S, Belkin M. Laplacian support vector machines trained in the primal. J Mach Learn Res. 2011;12(3):1149–84.

30. Lekamalage CKL, Liu T, Yang Y, Lin Z, Huang G-B. Extreme learning machine for clustering. In: Proceedings of ELM-2014 Volume 1. Springer; 2015, pp. 435–444.

31. Liu T, Lekamalage CKL, Huang G-B, Lin Z. Extreme learning machine for joint embedding and clustering. Neurocomputing. 2018;277:78–88.

32. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–37.

33. Hao Y-J, Gao Y-L, Hou M-X, Dai L-Y, Liu J-X. Hypergraph regularized discriminative nonnegative matrix factorization on sample classification and co-differentially expressed gene selection. Complexity. 2019;2019:7081674.

34. Zhou S, Liu X, Liu Q, Wang S, Zhu C, Yin J. Random Fourier extreme learning machine with L2,1-norm regularization. Neurocomputing. 2016;174:143–53.

35. Lu Y, Gao Y-L, Liu J-X, Wen C-G, Wang Y-X, Yu J. Characteristic gene selection via L2,1-norm sparse principal component analysis. In: 2016 IEEE international conference on bioinformatics and biomedicine (BIBM); 2016. IEEE, pp. 1828–1833.

36. Ding C, Zhou D, He X, Zha H. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In: Proceedings of the 23rd international conference on machine learning; 2006. ACM, pp. 281–288.

37. Yang Y, Shen HT, Ma Z, Huang Z, Zhou X. L21-norm regularized discriminative feature selection for unsupervised learning. In: International joint conference on artificial intelligence; 2011.

38. Nie F, Huang H, Cai X, Ding CH. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In: Advances in neural information processing systems; 2010, pp. 1813–1821.

39. Belkin M, Niyogi P, Sindhwani V. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res. 2006;7(1):2399–434.

40. Yu N, Liu J-X, Gao Y-L, Zheng C-H, Wang J, Wu M-J. Graph regularized robust non-negative matrix factorization for clustering and selecting differentially expressed genes. In: 2017 IEEE international conference on bioinformatics and biomedicine (BIBM); 2017. IEEE, pp. 1752–1756.

41. He Q, Jin X, Du C, Zhuang F, Shi Z. Clustering in extreme learning machine feature space. Neurocomputing. 2014;128:88–95.

42. Yu N, Gao Y-L, Liu J-X, Shang J, Zhu R, Dai L-Y. Co-differential gene selection and clustering based on graph regularized multi-view NMF in cancer genomic data. Genes. 2018;9(12):586.

43. Gao M-M, Cui Z, Gao Y-L, Liu J-X, Zheng C-H. Dual-network sparse graph regularized matrix factorization for predicting miRNA–disease associations. Mol Omics. 2019;15(2):130–37.

44. Horata P, Chiewchanwattana S, Sunat K. Robust extreme learning machine. Neurocomputing. 2013;102:31–44.

45. Peng Y, Wang S, Long X, Lu B-L. Discriminative graph regularized extreme learning machine and its application to face recognition. Neurocomputing. 2015;149:340–53.

46. Huang G, Liu T, Yang Y, Lin Z, Song S, Wu C. Discriminative clustering via extreme learning machine. Neural Netw. 2015;70:1–8.

47. Yi Y, Qiao S, Zhou W, Zheng C, Liu Q, Wang J. Adaptive multiple graph regularized semi-supervised extreme learning machine. Soft Comput. 2018;22(11):3545–62.

48. Boyd S, Vandenberghe L. Convex optimization. Cambridge: Cambridge University Press; 2004.

49. He R, Zheng W-S, Tan T, Sun Z. Half-quadratic-based iterative minimization for robust sparse representation. IEEE Trans Pattern Anal Mach Intell. 2013;36(2):261–75.

Acknowledgements

Not applicable.

Funding

This work was supported in part by the NSFC under Grant Nos. 61872220, and 61873001. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Contributions

LRR and JXL proposed the CSRGELM method and performed the experiments. CHZ and JLS contributed to the data analysis. YLG and LRR drafted the manuscript and improved the writing of manuscripts. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jin-Xing Liu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Ren, L., Gao, Y., Liu, J. et al. Correntropy induced loss based sparse robust graph regularized extreme learning machine for cancer classification. BMC Bioinformatics 21, 445 (2020). https://doi.org/10.1186/s12859-020-03790-1

Keywords

  • Extreme learning machine
  • Correntropy induced loss
  • Supervised learning
  • Bioinformatics