 Methodology article
 Open Access
Correntropy induced loss based sparse robust graph regularized extreme learning machine for cancer classification
BMC Bioinformatics volume 21, Article number: 445 (2020)
Abstract
Background
As a machine learning method with high performance and excellent generalization ability, the extreme learning machine (ELM) is gaining popularity in various studies, and ELM-based methods have been proposed for many fields. However, robustness to noise and outliers remains the main problem affecting the performance of ELM.
Results
In this paper, an integrated method named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM) is proposed. The correntropy induced loss improves the robustness of ELM and weakens the negative effects of noise and outliers. Using the \(L_{2,1}\)-norm to constrain the output weight matrix tends to yield a sparse output weight matrix and hence a simpler single hidden layer feedforward neural network model. Graph regularization preserves the local structural information of the data, which further improves the classification performance of the new method. In addition, we design an iterative optimization method based on half-quadratic optimization to solve the nonconvex problem of CSRGELM.
Conclusions
The classification results on the benchmark datasets show that CSRGELM obtains better classification results than the other methods. More importantly, we also apply the new method to the classification of cancer samples and achieve good classification performance.
Background
Universal approximation capability plays a crucial role in solving regression and classification problems. Because of this capability, the single hidden layer feedforward neural network (SLFN) has long been a focus of research [1]. As a method for training SLFNs [2], the extreme learning machine (ELM) [3,4,5,6,7,8] has attracted the attention of researchers in recent decades [9]. Unlike traditional neural network training methods, such as the backpropagation (BP) algorithm [10, 11], the training process of ELM is implemented in one step rather than iteratively [12]. In the original ELM, the first step is to randomly initialize an input weight matrix \({\mathbf{A}}\), which remains fixed throughout the process. Then, by using a nonlinear piecewise continuous activation function \({\text{g}} \left( x \right)\), the data of the input layer is mapped into the feature space of the ELM, and a hidden layer output matrix \({\mathbf{H}} = \left[ {{\mathbf{h}}\left( {{\mathbf{x}}_{{\mathbf{1}}} } \right),{\mathbf{h}}\left( {{\mathbf{x}}_{{\mathbf{2}}} } \right), \ldots ,{\mathbf{h}}\left( {{\mathbf{x}}_{{\mathbf{N}}} } \right)} \right]^{T}\) is obtained. Finally, by solving a ridge regression problem [13], the output weights \({{\varvec{\upbeta}}} = \left[ {{{\varvec{\upbeta}}}_{1} , \, {{\varvec{\upbeta}}}_{2} , \ldots , \, {{\varvec{\upbeta}}}_{L} } \right]^{T}\) connecting the hidden layer and the output layer can be determined [14]. Since there is no need to solve for the output weight matrix iteratively, ELM can achieve better generalization performance at a faster speed than the traditional backpropagation algorithm [2, 3, 7]. Because of its simple theory, high efficiency, and low manual intervention, ELM has been used as a tool for various applications, such as image classification [15, 16], label learning [17], image quality assessment [18], and traffic sign recognition [19].
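The one-step training procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the random hidden biases `b`, the standard normal initialization, and the default `ridge` value are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_hidden_output(X, A, b):
    """Map inputs X (N x m) to the hidden-layer output matrix H (N x L)."""
    return sigmoid(X @ A + b)

def train_elm(X, T, n_hidden, ridge=1e-3, seed=0):
    """Train a basic ELM: random fixed input weights, ridge-regressed output weights."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[1], n_hidden))  # random input weights, never updated
    b = rng.standard_normal(n_hidden)                # random hidden biases (assumed)
    H = elm_hidden_output(X, A, b)
    # Ridge regression for the output weights: (H^T H + ridge * I) beta = H^T T
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ T)
    return A, b, beta

def predict_elm(X, A, b, beta):
    """Network output for new inputs; take argmax over columns for class labels."""
    return elm_hidden_output(X, A, b) @ beta
```

Only `beta` is learned, and it comes from a single linear solve, which is why no iterative weight updates are needed.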
Although ELM has been widely used, its robustness and sparseness are still hot topics. Huang et al. proposed RELM in [5], in which the \(L_{2}\)-norm was introduced to constrain both the loss function and the output weight matrix. Their experimental results showed that RELM was better than the original ELM. However, the square loss based on the \(L_{2}\)-norm amplifies the negative impact of noise and outliers and leads to inaccurate results. In [9], Li et al. introduced the \(L_{2,1}\)-norm into ELM as both the loss function and the regularization constraint, and proposed a new method named LR21ELM. The classification results showed that the \(L_{2,1}\)-norm was significantly more robust than the \(L_{2}\)-norm.
As a local similarity measure, correntropy is proposed based on information theory and the kernel method [20]. Through a nonlinear feature mapping, correntropy can project the data from the input space into the feature space. It also computes the \(L_{2}\)-norm distance and defines a correntropy induced metric (CIM) in the feature space [21]. The correntropy induced loss [22] is defined as \({\text{C}} \left( {{\mathbf{t}}_{i} ,{\text{f}} \left( {{\mathbf{x}}_{i} } \right)} \right) = 1 - \exp \left( { - \frac{{\left( {{\mathbf{t}}_{i} - {\text{f}} \left( {{\mathbf{x}}_{i} } \right)} \right)^{2} }}{{2\sigma^{2} }}} \right),\) where \({\mathbf{t}}_{i}\) is the target vector, \({\text{f}} \left( {{\mathbf{x}}_{i} } \right)\) is the corresponding prediction and \(\sigma\) is the kernel bandwidth. Figure 1 depicts the correntropy induced loss function for different kernel bandwidths within the same error range. We can observe that the correntropy induced loss is a nonconvex, bounded, and robust loss function [23].
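To make the boundedness concrete, the loss defined above can be evaluated directly. The sketch below is illustrative only; the function name `c_loss` and the sample error values are assumptions chosen for the example.

```python
import numpy as np

def c_loss(error, sigma):
    """Correntropy induced loss 1 - exp(-e^2 / (2 sigma^2)) for a scalar or array error."""
    error = np.asarray(error, dtype=float)
    return 1.0 - np.exp(-error**2 / (2.0 * sigma**2))

errors = np.array([0.0, 0.5, 1.0, 5.0, 50.0])
for sigma in (0.5, 1.0, 2.0):
    # The loss grows like a squared error near zero but saturates at 1,
    # so a single very large error cannot dominate the total loss.
    print(sigma, np.round(c_loss(errors, sigma), 4))
print("squared loss:", errors**2)
```

A smaller bandwidth makes the loss saturate sooner, which is one way to read the sensitivity of the method to \(\sigma\).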
The robustness of correntropy to noise and outliers has been proved theoretically and experimentally. Ren et al. [21] integrated the correntropy loss and hinge loss (CH-loss) into ELM and proposed a robust extreme learning machine with the CH-loss (CHELM). They verified the robustness of the method at different noise levels. The results showed that correntropy loss could effectively reduce the influence of noise on classification results. In [24], Zhao et al. proposed the C-loss based ELM (CELM) and applied their method to estimate the power of small-scale turbojet engines. Chen et al. [25] introduced the correntropy loss to the multilayer ELM and proposed a robust multilayer ELM autoencoder. The results showed that the feature extraction ability of the method improved as its robustness improved.
In this paper, by integrating the correntropy induced loss into ELM in place of the original \(L_{2}\)-norm loss, an integrated model named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM) is proposed. Different from the traditional ELM, we use the \(L_{2,1}\)-norm instead of the \(L_{2}\)-norm to constrain the output weight matrix and thereby reduce the complexity of the neural network model. Moreover, graph regularization is integrated into our method so that the neural network model can learn the local structural information of the data. The main contributions of this paper are as follows:

(1)
A new correntropy induced loss based sparse robust graph regularized extreme learning machine is proposed. Compared with the original ELM, the introduction of the correntropy induced loss improves robustness. The \(L_{2,1}\)-norm is used as a sparse constraint to regularize the output weight matrix \({{\varvec{\upbeta}}}\), which can reduce the complexity of the model. To fully preserve the manifold structure information of the original data, graph regularization is introduced into our method.

(2)
Based on the theory of [26], we design an iterative optimization method to cope with the nonconvex problem of CSRGELM. The convergence and the computational complexity of the new method are analyzed. We also design experiments to demonstrate the robustness of the method. It is observed that the robustness and classification ability of CSRGELM are better than those of ELM based on the traditional \(L_{2}\)-norm loss function. Compared with other robust ELMs, CSRGELM also achieves competitive results.

(3)
We first perform the classification experiments on five benchmark datasets and evaluate the performance of CSRGELM through multiple evaluation measures. The results show that in most datasets, the classification results of CSRGELM are superior to other methods.

(4)
The new method is applied to the cancer sample classification problems of integrated TCGA datasets. Whether on integrated binary datasets or integrated multiclass classification datasets, the classification performance of CSRGELM is superior to other methods. The experimental results prove that CSRGELM can be a powerful tool for studying biological omics data.
Results
First, five benchmark datasets are used to evaluate the classification performance of RELM, \(L_{2,1}\)-RFELM, LR21ELM, CELM [24], and CSRGELM. Then, CSRGELM is applied to the cancer sample classification tasks on the integrated TCGA datasets. In the experiments, the sigmoid function is chosen as the activation function. Classification performance is evaluated with commonly used measures: accuracy (Acc), precision (Pre), recall, and F-measure (F-mea). The experiments are described in detail below.
Evaluation criteria
According to Table 1, each measure is defined as follows:
For a multiclass dataset, we treat one class as the positive class and the remaining classes as the negative class to compute the accuracy, precision, recall, and F-measure, and then average each measure over all classes. All methods are implemented in MATLAB R2016a on a 3.60 GHz computer with 64 GB of memory.
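The one-versus-rest averaging described above can be sketched as follows. The sketch assumes the standard confusion-matrix definitions of the four measures (with the F-measure as the harmonic mean of precision and recall) and returns zero for an undefined ratio; both choices, as well as the function name, are assumptions of this example.

```python
import numpy as np

def one_vs_rest_metrics(y_true, y_pred):
    """Average accuracy, precision, recall and F-measure over all classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for c in np.unique(y_true):
        # Treat class c as positive and every other class as negative.
        tp = np.sum((y_pred == c) & (y_true == c))
        tn = np.sum((y_pred != c) & (y_true != c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        acc = (tp + tn) / (tp + tn + fp + fn)
        pre = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        per_class.append((acc, pre, rec, f))
    return tuple(np.mean(per_class, axis=0))

acc, pre, rec, fmea = one_vs_rest_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(acc, pre, rec, fmea)
```

Note that the averaged accuracy computed this way is a per-class binary accuracy, so it is generally higher than the plain fraction of correctly labeled samples.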
Datasets
We use five popular benchmark datasets to test the classification performance, and every dataset has been widely applied in supervised problems [13, 27,28,29].

(a)
Iris: Taken from the UCI database (https://archive.ics.uci.edu/ml/index.php), Iris is a multiclass classification dataset with 150 samples and 4 features, which is already widely used in unsupervised learning [30, 31] and supervised learning [5].

(b)
COIL20: As a multiclass classification image dataset, the Columbia object image library is often used as a benchmark dataset to test the performance of machine learning methods. With 1024 features, it has 1440 samples, all of which are grayscale images of 20 different objects.

(c)
USPST: USPST is the testing set of the popular handwritten digit recognition dataset USPS. It has 2007 samples and 256 features.

(d)
g50c: g50c is a binary dataset, and each class is generated by a 50D multivariate Gaussian distribution [13].

(e)
RNAseq: A multiclass cancer dataset covering five tumor types: BRCA, KIRC, COAD, LUAD, and PRAD. It has 801 samples and 20,531 features, and each attribute is an RNA-Seq gene expression level measured by the Illumina HiSeq platform.
To evaluate the performance of CSRGELM in practical applications, we apply CSRGELM to cancer classification. In recent years, cancer has become the biggest threat to human health. The most effective way to treat cancers has always been to develop different treatments for different types of cancers. Therefore, improving cancer classification is crucial to the progress of cancer treatments [32]. In this paper, four integrated TCGA datasets are used in the experiments. Known as the world's largest cancer genome database, the TCGA database has immeasurable value in the field of cancer research [33]. Several types of cancer data are included in the TCGA database. The details of the benchmark datasets are listed in Table 2.
In the experiments, each integrated dataset is a combination of data from two or more cancers. In the integration process, to reduce the sample imbalance rate and ensure the credibility of the experimental results, we remove all normal samples and integrate only the disease samples of each cancer for classification experiments. Tables 3 and 4 list the information about the cancer data used in our experiments.
Convergence and sensitivity
There are four parameters (\(\sigma , \, \lambda , \, C, \, L\)) that need to be tuned in the experiments, and different combinations of parameters may produce different classification results. Hence, ten-fold cross-validation and grid search are used to find the optimal combination of parameters. The parameter \(\sigma\) is selected from \(\left( {2^{ - 4.5} , \ldots ,2^{4.5} } \right),\) \(\lambda\) and \(C\) are selected from \(\left( {10^{ - 4} ,10^{ - 3} , \ldots ,10^{5} } \right),\) and \(L\) is selected from \(\left( {100,200, \ldots ,2000} \right)\). Taking the COIL20 and CEHP datasets as examples, Figs. 2 and 3 depict the sensitivity of CSRGELM to different parameters. Because there are so many parameter combinations, we only show the first 180. In the 4D figures, the X-axis represents the range of \(\lambda\), the Y-axis represents the range of \(\sigma\), and the Z-axis represents the range of \(C\). Each point in the figure represents the classification accuracy obtained by one parameter combination. Figs. 2 and 3 show that CSRGELM is sensitive to \(\sigma\) and \(C\), while it is insensitive to \(\lambda\). For the benchmark datasets, the classification performance of CSRGELM is better when \(\sigma > 2^{ - 2.5}\) and \(C < 10^{ - 1}\). For the TCGA datasets, it is better when \(\sigma \ge 2^{ - 2.5}\) and \(C \ge 10^{ - 4}\).
Taking four datasets as the examples, we also show the effect of the number of hidden layer nodes on classification performance in Fig. 4. It is obvious that with the increase of the number of hidden layer nodes, the classification performance of CSRGELM on the benchmark dataset fluctuates greatly. On the TCGA dataset, however, CSRGELM can obtain good classification results.
Because the \(L_{2,1}\)-norm and the correntropy induced loss are introduced into our method, the resulting optimization problem is more complicated, so an iterative optimization algorithm is designed to solve it. As shown in Figs. 5 and 6, we plot the convergence curves to verify the convergence of the method. In the experiments, we assume that the method converges within 40 iterations. It is worth noting that CSRGELM converges after 10 iterations, which indicates that the convergence rate of the method is relatively fast and that our iterative optimization algorithm is efficient.
Classification results on benchmark datasets and TCGA datasets
In this subsection, the classification results of every method are provided. On every dataset, each method runs 20 times, and the average results and variance of the 20 classification results are listed in Tables 5 and 6. Besides, the running time of each method on different datasets is also listed in Tables 7 and 8. The best results are highlighted in italics.
On both the benchmark datasets and the integrated TCGA datasets, our method obtains better results than the other methods, or at least competitive results. Under every evaluation measure, our method remains competitive. Compared with RELM, \(L_{2,1}\)-RFELM, LR21ELM, and CELM, CSRGELM obtains better results in most cases. In terms of running time, RELM completes the training of the network model in the shortest time because it involves no iterative adjustment, whereas CSRGELM requires the most running time of all the methods. According to our analysis, in addition to the repeated iterations used to optimize the output weights, the calculation of \({\mathbf{H}}^{T} {\mathbf{ZH}}\) or \({\mathbf{ZHH}}^{T}\) also takes a lot of time. How to shorten the training time is a problem we need to study in the future.
As stated in the previous section, the \(L_{2,1}\)-norm is applied to the output weight matrix as a sparse regularization constraint. To verify the validity of the sparse constraint and the sparseness of the output weight matrix, we analyze the weight distributions of CSRGELM and CELM. Figures 7 and 8 show the output weight distributions of CELM and CSRGELM on CE and CEHP.
From Figs. 7 and 8, we can conclude that the elements of the output weight matrix are concentrated almost entirely around zero. This shows that, by constraining \({{\varvec{\upbeta}}}\) with the \(L_{2,1}\)-norm, we can obtain a sparser network model, which makes the model easier to interpret and saves storage space and resources. In the neural network model, a sparse network can achieve feature selection, after which the unrelated hidden layer nodes can be removed to obtain a simpler and more efficient neural network model.
Discussion
Our method is applied to sample classification problems, and its generalization performance is better than that of the other methods. The main reason is that the nonconvex correntropy induced loss is introduced to improve robustness. CSRGELM is more accurate than CELM because of the introduction of graph regularization. Moreover, the \(L_{2,1}\)-norm regularization constraint also contributes to the improvement in classification performance. Although the \(L_{2,1}\)-norm is also used as a loss function to improve robustness in LR21ELM [9], the experimental results show that, in most cases, the \(L_{2,1}\)-norm is less robust than the correntropy induced loss. In other words, methods based on the correntropy induced loss can effectively reduce the negative influence of noise and outliers on classification results. At the same time, graph regularization preserves the local structural information of the data. Combining the two improves not only the classification performance but also the generalization ability of the model.
\(L_{2,1}\)-norm regularization tends to produce structural sparsity. It can reduce some rows of the output weight matrix to zero and simplify the neural network model. The results in Figs. 7 and 8 also support the validity of the \(L_{2,1}\)-norm regularization.
Conclusions
In this paper, we propose a new method named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM) and apply it to the classification of cancer samples. The correntropy induced loss weakens the influence of noise and outliers on the classification performance and improves the robustness of the method. As a powerful sparse regularization constraint, the \(L_{2,1}\)-norm is used to constrain the output weight matrix, which can reduce the complexity of the network model. In addition, graph regularization is introduced to preserve the local manifold structure of the data and reduce the loss of information. To solve the resulting optimization problem, we propose an efficient iterative optimization algorithm and analyze its computational complexity. On both the benchmark datasets and the integrated TCGA datasets, the classification performance and generalization performance of CSRGELM are comparable to those of other methods. In future work, we will continue in-depth research on the robustness of ELM and apply it to the field of bioinformatics.
Methods
RELM
Huang et al. proposed the regularized extreme learning machine (RELM) in [5] and demonstrated its good performance in classification and regression problems. Consider a dataset \(\left\{ {{\mathbf{X}},{\mathbf{T}}} \right\} = \left\{ {{\mathbf{x}}_{i} ,{\mathbf{t}}_{i} } \right\}_{i = 1}^{N} \in {\mathbb{R}}^{N \times m} ,\) where \(N\) is the number of samples and \(m\) is the number of features. The objective function of RELM can be expressed as:
where \(\gamma\) is a regularization parameter, and \({{\varvec{\upxi}}}_{i}\) is the error vector of \(i\)th sample. \({\mathbf{T}}\) is the target label matrix. Substituting constraints into Eq. (5), we get the following unconstrained optimization problem:
Let \(L\) be the number of hidden nodes, if \(N \ge L,\) the solution of \({{\varvec{\upbeta}}}\) can be obtained by calculating the partial derivative of Eq. (6) and setting it to zero:
and
where \({\mathbf{I}}_{L}\) is an identity matrix with dimension \(L.\) If \(N < L,\) \({{\varvec{\upbeta}}}\) can be calculated as:
where \({\mathbf{I}}_{N}\) is an identity matrix with dimension \(N.\) Finally, we get the solution of \({{\varvec{\upbeta}}}\):
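The two cases can be illustrated numerically. The sketch below assumes the common RELM objective \(\tfrac{1}{2}\left\| {{\varvec{\upbeta}}} \right\|^{2} + \tfrac{\gamma }{2}\sum\nolimits_{i} {\left\| {{{\varvec{\upxi}}}_{i} } \right\|^{2} }\), for which the stationarity condition gives an \(L \times L\) system when \(N \ge L\) and an \(N \times N\) system when \(N < L\); the exact scaling of \(\gamma\) and the function names are assumptions of this example.

```python
import numpy as np

def relm_beta_primal(H, T, gamma):
    """N >= L: solve the L x L system (I_L / gamma + H^T H) beta = H^T T."""
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / gamma + H.T @ H, H.T @ T)

def relm_beta_dual(H, T, gamma):
    """N < L: beta = H^T (I_N / gamma + H H^T)^{-1} T, an N x N system."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(N) / gamma + H @ H.T, T)

def relm_beta(H, T, gamma):
    """Pick whichever form inverts the smaller matrix."""
    N, L = H.shape
    return relm_beta_primal(H, T, gamma) if N >= L else relm_beta_dual(H, T, gamma)
```

Under this assumed objective the two expressions are algebraically identical, so the choice between them only affects the size of the linear system that has to be solved.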
\(L_{2,1}\)-RFELM
Zhou et al. introduced the \(L_{2,1}\)-norm as a regularization constraint on the output weight matrix \({{\varvec{\upbeta}}}\) [34]. \(L_{2,1}\)-norm regularization generates row sparsity, which can eliminate redundant nodes and achieve feature selection [35,36,37]. The mathematical model of \(L_{2,1}\)-RFELM is:
where \(C\) is a parameter of the regularization term. Then, Eq. (11) can be rewritten as:
where \({\mathbf{D}}\) is a diagonal matrix with \(d_{ii} = 1 / \left( {2\left\| {{{\varvec{\upbeta}}}_{i} } \right\|_{2} } \right).\) By computing the derivative with respect to \({{\varvec{\upbeta}}}\) and setting it equal to zero, we have:
According to the relationship between the number of samples and hidden layer nodes, there are two analytic solutions for \({{\varvec{\upbeta}}}\):
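Because \({\mathbf{D}}\) depends on \({{\varvec{\upbeta}}}\), the solution is usually computed by alternating between the two. The following sketch shows this reweighting for the case \(N \ge L\), assuming the objective \(\left\| {{\mathbf{H}}{{\varvec{\upbeta}}} - {\mathbf{T}}} \right\|_{F}^{2} + C\left\| {{\varvec{\upbeta}}} \right\|_{2,1}\); with a different scaling of the objective, the constant in front of \({\mathbf{D}}\) changes. The ridge initialization, the small constant `eps`, and the function names are assumptions made for this example.

```python
import numpy as np

def l21_norm(B):
    """L_{2,1}-norm: sum of the Euclidean norms of the rows of B."""
    return np.sum(np.linalg.norm(B, axis=1))

def l21_elm_beta(H, T, C, n_iter=30, eps=1e-8):
    """Iteratively reweighted solve of beta = (H^T H + C D)^{-1} H^T T."""
    L = H.shape[1]
    beta = np.linalg.solve(H.T @ H + C * np.eye(L), H.T @ T)  # ridge start (assumed)
    for _ in range(n_iter):
        row_norms = np.linalg.norm(beta, axis=1)
        # d_ii = 1 / (2 ||beta_i||_2); eps keeps the entry finite for zero rows.
        D = np.diag(1.0 / (2.0 * row_norms + eps))
        beta = np.linalg.solve(H.T @ H + C * D, H.T @ T)
    return beta
```

Rows of `beta` whose norms shrink receive ever larger penalties in the next iteration, which is how the row sparsity arises; hidden nodes with (near) zero rows contribute nothing to the output and can be pruned.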
LR21ELM
In [9], Li et al. introduced the \(L_{2,1}\)-norm to constrain both the error matrix \({{\varvec{\upxi}}}\) and the output weight matrix \({{\varvec{\upbeta}}}\), and proposed a robust sparse ELM method named LR21ELM. The objective function of LR21ELM is:
Following the KKT theorem, the Lagrangian function of Eq. (15) is defined as:
where \({{\varvec{\uptheta}}}_{ij}\) is the Lagrange multiplier. Based on the solution in [38], Eq. (16) is equivalent to:
where \({\mathbf{D}}_{1}\) and \({\mathbf{D}}\) are diagonal matrices whose \(i\)th diagonal entries are \(1 / \left( {2\left\| {{{\varvec{\upxi}}}_{i} } \right\|_{2} } \right)\) and \(1 / \left( {2\left\| {{{\varvec{\upbeta}}}_{i} } \right\|_{2} } \right),\) respectively. According to Eq. (17), the optimal conditions can be written as:
If \(N < L,\) by substituting Eq. (19) and Eq. (20) into Eq. (18), we have:
According to Eq. (19), we have:
And if \(N \ge L,\) by combining Eq. (19) with Eq. (20), we have:
Substituting Eq. (23) into Eq. (18), we obtain an alternative solution of \({{\varvec{\upbeta}}}\):
So, the analytic solution of \({{\varvec{\upbeta}}}\) is:
Graph regularization
The graph regularization framework [39] has been widely used in semi-supervised learning [13] and unsupervised learning [40,41,42,43]. During data processing, graph regularization preserves the local manifold structure of the data so that structural information can be extracted, which is beneficial to clustering and classification problems. Mathematically, graph regularization is expressed as follows:
where \({\text{P}} \left( {{\mathbf{t}}|{\mathbf{x}}_{i} } \right)\) and \({\text{P}} \left( {{\mathbf{t}}|{\mathbf{x}}_{j} } \right)\) are conditional probabilities, and \({\mathbf{W}} = \left[ {{\mathbf{W}}_{i,j} } \right]\) is the similarity matrix. Equation (26) is equivalent to
where \({\mathbf{t}}_{i}\) and \({\mathbf{t}}_{j}\) are predictions of \({\mathbf{x}}_{i}\) and \({\mathbf{x}}_{j}\), respectively. And the matrix form of Eq. (27) is:
where \({\mathbf{T}}\) is the prediction matrix, \({\text{Tr}} ( \cdot )\) denotes the trace of a matrix and \({\mathbf{Z}} = {\mathbf{D}} - {\mathbf{W}}\) is the graph Laplacian matrix. \({\mathbf{D}}\) is a diagonal matrix with \(d_{ii} = \sum\nolimits_{j} {{\mathbf{W}}_{i,j} .}\)
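The relation between the pairwise form and the trace form can be checked numerically. The sketch below assumes a symmetric 0/1 similarity matrix built from class labels (samples with the same label are connected, with a zero diagonal) and the usual factor of one half in the pairwise sum; both are assumptions of this example, as are the function names.

```python
import numpy as np

def label_graph_laplacian(labels):
    """Build W from labels (1 if same class, 0 otherwise) and return Z = D - W and W."""
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))  # d_ii = sum_j W_ij
    return D - W, W

def pairwise_smoothness(T, W):
    """0.5 * sum_ij W_ij * ||t_i - t_j||^2 computed directly from the pairs."""
    diff = T[:, None, :] - T[None, :, :]
    return 0.5 * np.sum(W * np.sum(diff**2, axis=2))

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=12)
T = rng.standard_normal((12, 3))
Z, W = label_graph_laplacian(labels)
print(np.trace(T.T @ Z @ T), pairwise_smoothness(T, W))  # the two values agree
```

The trace form is what makes the regularizer convenient in practice: it is quadratic in the predictions, so it adds a single matrix term to the linear system for the output weights.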
Proposed CSRGELM
In practical applications, datasets usually include a lot of noise and outliers, which seriously interfere with the experiments and lead to inaccurate results [44]. Because of noise and outliers, the classification performance of ELM often fails to meet expectations. Many studies have shown that introducing graph regularization into ELM can effectively improve the classification performance or feature extraction ability of the algorithm [45, 46]. Therefore, it is necessary to develop a method that is both efficient and robust to outliers and noise.
In this section, we propose a novel method named correntropy induced loss based sparse robust graph regularized extreme learning machine (CSRGELM). The correntropy induced loss function is introduced to replace the square loss, which can effectively improve the robustness of the method. As an adaptive sparse regularization term, the \(L_{2,1}\)-norm is used to constrain the output weight matrix \({{\varvec{\upbeta}}}\), which can generate row sparsity, eliminate redundant hidden layer nodes and simplify the structure of the neural network. In recent years, how to use the local consistency of data to improve the performance of machine learning methods has attracted researchers' attention [45]. Based on the idea that similar samples should have similar properties, graph regularization is combined with our method to preserve the local structural information, which may improve the classification performance of the method [13, 47]. We use the label information of the training samples to construct an adjacency graph, and the graph regularization term is integrated to constrain the output weight matrix, so that similar samples produce similar outputs.
The objective function of CSRGELM
This section introduces the objective function of CSRGELM. For a dataset \(\left\{ {{\mathbf{X}}_{train} , \, {\mathbf{T}}_{train} } \right\} = \left\{ {{\mathbf{x}}_{i} , \, {\mathbf{t}}_{i} } \right\}_{i = 1}^{N} \in {\mathbb{R}}^{N \times m} ,\) \({\mathbf{T}}_{train}\) is the label matrix of \({\mathbf{X}}_{train}\), \(N\) is the number of samples, and \(m\) is the number of features. The mathematical model of CSRGELM can be expressed as:
In Eq. (29), \({{\varvec{\upxi}}}_{i}\) is the error vector, \(\sigma\) is the bandwidth and \({\mathbf{Z}}\) is the graph Laplacian matrix. \(\lambda\) and \(C\) are regularization parameters, respectively. Since Eq. (29) is not a convex function, it can’t be solved by a commonly used optimization method. According to the solution process in [23], we can effectively solve the optimization problem of nonconvex functions.
The optimization of CSRGELM
Since the correntropy induced loss is a differentiable and smooth function, a gradient-based optimization algorithm can be employed [23]. However, gradient-based optimization converges slowly, so we use the half-quadratic optimization algorithm to solve the optimization problem of CSRGELM.
Firstly, we should define a convex function as:
where \(\tau < 0.\) Following the definition of the conjugate function in [48], for a differentiable function \(\psi \left( x \right): \, {\mathbb{R}}^{n} \to {\mathbb{R}},\) the conjugate function \(\psi^{*} \left( x \right): \, {\mathbb{R}}^{n} \to {\mathbb{R}}\) can be expressed as \(\psi^{*} \left( x \right) = \mathop {\sup }\limits_{p} \left( {px - \psi \left( p \right)} \right).\) If \(\psi \left( x \right)\) is a convex function, then \(\left( {\psi^{*} \left( x \right)} \right)^{*} = \psi \left( x \right)\) [49]. We can therefore obtain the conjugate function of Eq. (30):
and
By letting \({d{\text{f}}^{\prime} \left( \tau \right)} / {d\tau } = 0,\) the solution of Eq. (32) can be obtained:
Substituting Eq. (33) into Eq. (31), Eq. (34) can be expressed as:
Letting \({{\varvec{\upupsilon}}} = {{\varvec{\upxi}}}_{i}^{2} / \left( {2\sigma^{2} } \right),\) we have
As described in [23], the supremum is reached when \(\tau = - \exp \left( { - {{\varvec{\upxi}}}_{i}^{2} / \left( {2\sigma^{2} } \right)} \right) < 0.\)
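The location of the supremum can be verified numerically. The sketch below assumes the convex function commonly used in half-quadratic treatments of the correntropy loss, \({\text{f}}\left( \tau \right) = - \tau \log \left( { - \tau } \right) + \tau\) for \(\tau < 0\); under that assumption, \(\mathop {\sup }\nolimits_{\tau < 0} \left( {\upsilon \tau - {\text{f}}\left( \tau \right)} \right) = \exp \left( { - \upsilon } \right)\), attained at \(\tau = - \exp \left( { - \upsilon } \right)\). The function names and the search grid are hypothetical.

```python
import numpy as np

def f_convex(tau):
    """Assumed convex function f(tau) = -tau * log(-tau) + tau, defined for tau < 0."""
    return -tau * np.log(-tau) + tau

def sup_over_tau(upsilon, grid=None):
    """Brute-force search for the tau < 0 that maximizes upsilon * tau - f(tau)."""
    if grid is None:
        grid = -np.logspace(-6, 0, 200001)  # tau values in [-1, -1e-6]
    values = upsilon * grid - f_convex(grid)
    k = np.argmax(values)
    return grid[k], values[k]

for upsilon in (0.0, 0.5, 2.0):
    tau_star, value = sup_over_tau(upsilon)
    # Both printed pairs should match: maximizer vs -exp(-v), maximum vs exp(-v).
    print(upsilon, tau_star, -np.exp(-upsilon), value, np.exp(-upsilon))
```

This is the property that turns the exponential term of the loss into a weighted quadratic: once \(\tau\) is fixed at its optimal value, each sample's squared error is simply multiplied by \(- \tau \in (0, 1]\), so samples with large errors are down-weighted.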
Combining Eq. (35) with Eq. (29), we obtain the following mathematical model:
where \({{\varvec{\uptau}}} = [{{\varvec{\uptau}}}_{1} , \, {{\varvec{\uptau}}}_{2} , \ldots {,}{{\varvec{\uptau}}}_{N} ]^{T} .\) Equation (36) can be rewritten as:
There are two variables to be optimized: \({{\varvec{\uptau}}}\) and \({{\varvec{\upbeta}}}\). To solve Eq. (37), we alternately fix one variable and optimize the other.

(1)
Fix \({{\varvec{\upbeta}}}^{n}\) to optimize \({{\varvec{\uptau}}}^{n + 1} .\)
For a given \({{\varvec{\upbeta}}}^{n} ,\) Eq. (37) can be expressed as:
Substituting constraints into Eq. (38), we can get:
According to Eq. (32), the solution of Eq. (39) is:
where \({{\varvec{\uptau}}}_{i}^{n + 1} < 0.\)

(2)
Fix \({{\varvec{\uptau}}}^{n + 1}\) to optimize \({{\varvec{\upbeta}}}^{n + 1} .\)
For a given \({{\varvec{\uptau}}}^{n + 1} ,\) we focus on solving the problem as:
By eliminating the constraint conditions and rewriting the Eq. (41) into a matrix form, we can get:
Following the conclusion in [38], Eq. (42) can be rewritten as:
where \({\mathbf{D}}^{n + 1}\) is a diagonal matrix with \(d_{ii} = 1 / \left( {2\left\| {{{\varvec{\upbeta}}}_{i}^{n + 1} } \right\|_{2} } \right).\) In theory, the value of \(\left\| {{{\varvec{\upbeta}}}_{i}^{n + 1} } \right\|_{2}\) can be zero, which would make Eq. (43) non-differentiable. To prevent this, a regularization term is added:
where \(\kappa\) is a very small constant; in the experiments, \(\kappa = 10^{ - 6} .\) It is clear that \(d_{ii} = d_{ii}^{\prime}\) when \(\kappa \to 0.\)
Computing the derivative of \(\ell_{CSRGELM}\) with respect to \({{\varvec{\upbeta}}}^{n + 1}\), we have:
where \({{\varvec{\upomega}}} = {\text{diag}} \left( { - {{\varvec{\uptau}}}_{1}^{n + 1} , \ldots , - {{\varvec{\uptau}}}_{N}^{n + 1} } \right).\)
When the number of hidden nodes is less than the number of training samples, the output weight matrix \({{\varvec{\upbeta}}}^{n + 1}\) can be solved as:
that is
If the number of hidden nodes is larger than the number of training samples, \({{\varvec{\upbeta}}}^{n + 1}\) may have an infinite number of solutions. Inspired by the solution of Huang et al. [13], and according to Eq. (46), we set:
Substituting Eq. (48) into Eq. (46), we have:
Multiplying both sides of Eq. (49) by \(\left( {{\mathbf{HH}}^{T} } \right)^{ - 1} {\mathbf{H}},\) we get:
Then we obtain the solution of \({{\varvec{\upalpha}}}\):
And \({{\varvec{\upbeta}}}^{n + 1}\) can be computed as:
where \({\mathbf{I}}\) is an identity matrix with dimension of \(N\). The analytical solution of \({{\varvec{\upbeta}}}^{n + 1}\) can be finally determined as:
Here \(\eta = \lambda \sigma^{2}\) and \(\rho = C\sigma^{2} .\) It is worth noting that \({{\varvec{\upbeta}}}^{n + 1}\) depends on \({\mathbf{D}}^{n + 1} ,\) so an iterative optimization algorithm is proposed for solving \({{\varvec{\upbeta}}}^{n + 1}\) and \({\mathbf{D}}^{n + 1} .\) The flow of Algorithm 1 is as follows:
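The alternating updates described in this section can be sketched as follows for the case \(N \ge L\). This is a hedged illustration assembled from the quantities named in the text (\({{\varvec{\uptau}}}\), \({{\varvec{\upomega}}}\), \({\mathbf{D}}\), \(\eta = \lambda \sigma^{2}\), \(\rho = C\sigma^{2}\)) and should not be read as the authors' Algorithm 1. The ridge initialization, the smoothed form \(1/(2\sqrt {\left\| {{{\varvec{\upbeta}}}_{i} } \right\|_{2}^{2} + \kappa } )\) for the diagonal of \({\mathbf{D}}\), computing \({\mathbf{D}}\) from the current iterate, and the fixed iteration count are all assumptions.

```python
import numpy as np

def csrgelm_fit(H, T, Z, sigma, lam, C, n_iter=40, kappa=1e-6):
    """Alternate between the tau update and the beta update (sketch, N >= L)."""
    N, L = H.shape
    eta, rho = lam * sigma**2, C * sigma**2
    HZH = H.T @ Z @ H  # graph term, constant across iterations
    # Start from a ridge solution (the initialization is an assumption).
    beta = np.linalg.solve(H.T @ H + np.eye(L), H.T @ T)
    for _ in range(n_iter):
        # Step 1: fix beta, set tau_i = -exp(-||xi_i||^2 / (2 sigma^2)).
        err = T - H @ beta
        tau = -np.exp(-np.sum(err**2, axis=1) / (2.0 * sigma**2))
        omega = -tau  # diagonal of omega, entries in (0, 1]
        # Step 2: fix tau, rebuild D from the current beta and solve for beta.
        d = 1.0 / (2.0 * np.sqrt(np.sum(beta**2, axis=1) + kappa))
        A = eta * np.diag(d) + H.T @ (omega[:, None] * H) + rho * HZH
        beta = np.linalg.solve(A, H.T @ (omega[:, None] * T))
    return beta
```

Each pass solves one weighted, regularized least-squares problem: samples with large residuals get small weights in `omega`, rows of `beta` with small norms get large penalties in `d`, and the Laplacian term pulls the outputs of connected samples together.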
Computational complexity analysis
The computational complexity of CSRGELM is analyzed in this subsection. We define \(M\) as the number of classes. In Eq. (47), we have to calculate \({\mathbf{D}}^{n + 1} ,\) \({\mathbf{H}}^{T} {\mathbf{\omega H}},\) \({\mathbf{H}}^{T} {\mathbf{ZH}},\) \({\mathbf{H}}^{T} {\mathbf{\omega T}},\) \({{\varvec{\upomega}}}\) and \(\left( {\lambda \sigma^{2} {\mathbf{D}}^{n + 1} + {\mathbf{H}}^{T} {\mathbf{\omega H}} + C\sigma^{2} {\mathbf{H}}^{T} {\mathbf{ZH}}} \right)^{ - 1} .\) The computational cost of \({\mathbf{D}}^{n + 1}\) is \(O\left( {LM} \right),\) and it takes \(O\left( {L^{2} N} \right)\) to compute \({\mathbf{H}}^{T} {\mathbf{\omega H}}\) and \({\mathbf{H}}^{T} {\mathbf{ZH}}.\) The computational complexity of \({\mathbf{H}}^{T} {\mathbf{\omega T}}\) is \(O\left( {LNM} \right),\) that of \(\left( {\lambda \sigma^{2} {\mathbf{D}}^{n + 1} + {\mathbf{H}}^{T} {\mathbf{\omega H}} + C\sigma^{2} {\mathbf{H}}^{T} {\mathbf{ZH}}} \right)^{ - 1}\) is \(O\left( {L^{3} } \right),\) and it takes \(O\left( {NM} \right)\) to compute \({{\varvec{\upomega}}}.\) In addition, multiplying \(\left( {\lambda \sigma^{2} {\mathbf{D}}^{n + 1} + {\mathbf{H}}^{T} {\mathbf{\omega H}} + C\sigma^{2} {\mathbf{H}}^{T} {\mathbf{ZH}}} \right)^{ - 1}\) by \({\mathbf{H}}^{T} {\mathbf{\omega T}}\) costs \(O\left( {L^{2} M} \right).\) Since \(N > L,\) the computational cost of Eq. (47) is \(O\left( {L^{2} N} \right).\) Assuming that the method converges after \(K\) iterations, the final computational cost of CSRGELM is \(K \times O\left( {L^{2} N} \right).\)
Robustness analysis
An experiment is designed to demonstrate the robustness of CSRGELM to outliers and noise. Two groups of data following Gaussian distributions are randomly generated. Class 1 includes 300 samples with mean \(\chi_{1} = \left[ { - 2, - 2} \right]\) and covariance matrix \(\phi_{1} = \left[ {1 0; 0 1} \right],\) while class 2 includes another 300 samples with mean \(\chi_{2} = \left[ {2, 2} \right]\) and covariance matrix \(\phi_{2} = \left[ {1 0; 0 1} \right].\) RELM, L_{2,1}RFELM, LR21ELM and CSRGELM are each trained on this dataset, and the resulting classification decision boundaries are shown in Fig. 9. Figure 9a shows the classification results without noise; the two classes are easily separated. Figure 9b shows the results with 50 noisy points, which originally belong to class 2 but are mixed into class 1. Under the interference of noise, the decision boundaries of all four methods change, and the changes for RELM and L_{2,1}RFELM are the most obvious. A second dataset is then generated in which class 1 and class 2 each contain 500 samples. First, the four methods are trained on this dataset, and the decision boundaries are shown in Fig. 10a; each of the four methods yields a straight line that separates the data. Then, 100 points belonging to class 2 are mixed into class 1 as noise, and the resulting classification is shown in Fig. 10b. Clearly, RELM and L_{2,1}RFELM try to fit the noise, and their decision boundaries are no longer reliable. In contrast, owing to the constraint of the robust loss function, the decision boundaries of CSRGELM and LR21ELM are hardly affected.
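The first synthetic dataset described above could be generated along the following lines. This is a minimal sketch, not the authors' script. The helper name `make_two_gaussians` and the random seed are hypothetical, the class 1 mean is taken to be \(\left[ { - 2, - 2} \right]\), and "noise" is interpreted as class 2 points whose training labels are switched to class 1; both of the latter are assumptions about the experimental setup.

```python
import numpy as np

def make_two_gaussians(n_per_class=300, n_flipped=50, seed=0):
    """Two 2-D Gaussian classes plus label noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    cov = np.eye(2)                                   # covariance [1 0; 0 1]

    # Assumption: class 1 is centred at [-2, -2], class 2 at [2, 2].
    X1 = rng.multivariate_normal([-2.0, -2.0], cov, size=n_per_class)
    X2 = rng.multivariate_normal([2.0, 2.0], cov, size=n_per_class)
    X = np.vstack([X1, X2])
    y = np.concatenate([np.full(n_per_class, 1), np.full(n_per_class, 2)])

    # Assumption: noise = class-2 samples relabelled as class 1.
    y_noisy = y.copy()
    class2_idx = np.arange(n_per_class, 2 * n_per_class)
    flipped = rng.choice(class2_idx, size=n_flipped, replace=False)
    y_noisy[flipped] = 1
    return X, y, y_noisy

X, y, y_noisy = make_two_gaussians()
print(X.shape, int(np.sum(y != y_noisy)))
```

Calling `make_two_gaussians(n_per_class=500, n_flipped=100)` would give a dataset of the size used in the second experiment, under the same assumptions.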
Availability of data and materials
The TCGA datasets that support the findings of this study are available at https://www.cancer.gov/aboutnci/organization/ccg/research/structuralgenomics/tcga. The UCI datasets that support the findings of this study are available at https://archive.ics.uci.edu/ml/datasets.php. The g50c, COIL20, and USPST datasets that support the findings of this study are included in the published article [13].
Abbreviations
ELM: Extreme learning machine
CSRGELM: Correntropy induced loss based sparse robust graph regularized extreme learning machine
SLFNs: Single hidden layer feedforward neural networks
CIM: Correntropy induced metric
CHloss: Correntropy loss and hinge loss
CHELM: Extreme learning machine with the CHloss
TCGA: The Cancer Genome Atlas
UCI: University of California Irvine
Acc: Accuracy
Pre: Precision
Fmea: F-measure
References
1. Leshno M, Lin VY, Pinkus A, Schocken S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 1993;6(6):861–67.
2. Huang GB, Zhu QY, Siew CK. Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat No 04CH37541); 2004. IEEE, pp. 985–990.
3. Huang GB, Zhu QY, Siew CK. Extreme learning machine: theory and applications. Neurocomputing. 2006;70(1–3):489–501.
4. Huang GB, Wang DH, Lan Y. Extreme learning machines: a survey. Int J Mach Learn Cybernet. 2011;2(2):107–22.
5. Huang GB, Zhou H, Ding X, Zhang R. Extreme learning machine for regression and multiclass classification. IEEE Trans Syst Man Cybernet Part B. 2012;42(2):513–529.
6. Huang GB, Chen L, Siew CK. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans Neural Netw Learn Syst. 2006;17(4):879–92.
7. Huang GB. An insight into extreme learning machines: random neurons, random features and kernels. Cognit Comput. 2014;6(3):376–90.
8. Huang GB. What are extreme learning machines? Filling the gap between Frank Rosenblatt’s Dream and John von Neumann’s Puzzle. Cognit Comput. 2015;7(3):263–78.
9. Li R, Wang X, Lei L, Song Y. L2,1-norm based loss function and regularization extreme learning machine. IEEE Access. 2018;7:6575–86.
10. Cilimkovic M. Neural networks and back propagation algorithm. Dublin: Institute of Technology Blanchardstown; 2015. p. 15.
11. Man Z, Wu HR, Liu S, Yu X. A new adaptive backpropagation algorithm based on Lyapunov stability theory for neural networks. IEEE Trans Neural Networks. 2006;17(6):1580–91.
12. Lu H, Zheng E, Lu Y, Ma X, Liu J. ELM-based gene expression classification with misclassification cost. Neural Comput Appl. 2014;25(3–4):525–31.
13. Huang G, Song S, Gupta JN, Wu C. Semi-supervised and unsupervised extreme learning machines. IEEE Trans Cybernet. 2014;44(12):2405–17.
14. Huang G, Huang GB, Song S, You K. Trends in extreme learning machines: a review. Neural Netw. 2015;61(C):32–48.
15. Cao F, Liu B, Park DS. Image classification based on effective extreme learning machine. Neurocomputing. 2013;102:90–7.
16. Ergul U, Bilgin G. MCK-ELM: multiple composite kernel extreme learning machine for hyperspectral images. Neural Comput Appl. 2020;32(11):6809–19.
17. Jiang M, Pan Z, Li N. Multi-label text categorization using L21-norm minimization extreme learning machine. Neurocomputing. 2017;261:4–10.
18. Deng C, Wang S, Bovik AC, Huang GB, Zhao B. Blind noisy image quality assessment using sub-band kurtosis. IEEE Trans Cybernet. 2019;50(3):1146–56.
19. Huang Z, Yu Y, Gu J, Liu H. An efficient method for traffic sign recognition based on extreme learning machine. IEEE Trans Cybernet. 2016;47(4):920–33.
20. Liu W, Pokharel PP, Principe JC. Correntropy: a localized similarity measure. In: The 2006 IEEE international joint conference on neural network proceedings; 2006. IEEE, pp. 4919–4924.
21. Ren Z, Yang L. Correntropy-based robust extreme learning machine for classification. Neurocomputing. 2018;313:74–84.
22. Singh A, Pokharel R, Principe J. The C-loss function for pattern classification. Pattern Recognit. 2014;47(1):441–53.
23. Xu G, Hu BG, Principe JC. Robust C-loss kernel classifiers. IEEE Trans Neural Netw Learn Syst. 2016;29(3):510–22.
24. Zhao YP, Tan JF, Wang JJ, Yang Z. C-loss based extreme learning machine for estimating power of small-scale turbojet engine. Aerosp Sci Technol. 2019;89:407–19.
25. Liangjun C, Honeine P, Hua Q, Jihong Z, Xia S. Correntropy-based robust multilayer extreme learning machines. Pattern Recognit. 2018;84:357–70.
26. Allain M, Idier J, Goussard Y. On global and local convergence of half-quadratic algorithms. IEEE Trans Image Process. 2006;15(5):1130–42.
27. Sindhwani V, Niyogi P, Belkin M. Beyond the point cloud: from transductive to semi-supervised learning. In: Proceedings of the 22nd international conference on machine learning; 2005, pp. 824–831.
28. Sindhwani V, Rosenberg DS. An RKHS for multi-view learning and manifold co-regularization. In: Proceedings of the 25th International Conference on Machine Learning; 2008, pp. 976–983.
29. Melacci S, Belkin M. Laplacian support vector machines trained in the primal. J Mach Learn Res. 2011;12(3):1149–84.
30. Lekamalage CKL, Liu T, Yang Y, Lin Z, Huang GB. Extreme learning machine for clustering. In: Proceedings of ELM-2014 Volume 1. Springer; 2015, pp. 435–444.
31. Liu T, Lekamalage CKL, Huang GB, Lin Z. Extreme learning machine for joint embedding and clustering. Neurocomputing. 2018;277:78–88.
32. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–37.
33. Hao YJ, Gao YL, Hou MX, Dai LY, Liu JX. Hypergraph regularized discriminative nonnegative matrix factorization on sample classification and co-differentially expressed gene selection. Complexity. 2019;2019:7081674.
34. Zhou S, Liu X, Liu Q, Wang S, Zhu C, Yin J. Random Fourier extreme learning machine with L2,1-norm regularization. Neurocomputing. 2016;174:143–53.
35. Lu Y, Gao YL, Liu JX, Wen CG, Wang YX, Yu J. Characteristic gene selection via L2,1-norm sparse principal component analysis. In: 2016 IEEE international conference on bioinformatics and biomedicine (BIBM); 2016. IEEE, pp. 1828–1833.
36. Ding C, Zhou D, He X, Zha H. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In: Proceedings of the 23rd international conference on machine learning; 2006. ACM, pp. 281–288.
37. Yang Y, Shen HT, Ma Z, Huang Z, Zhou X. L21-norm regularized discriminative feature selection for unsupervised learning. In: International joint conference on artificial intelligence; 2011.
38. Nie F, Huang H, Cai X, Ding CH. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In: Advances in neural information processing systems; 2010, pp. 1813–1821.
39. Belkin M, Niyogi P, Sindhwani V. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res. 2006;7(1):2399–434.
40. Yu N, Liu JX, Gao YL, Zheng CH, Wang J, Wu MJ. Graph regularized robust nonnegative matrix factorization for clustering and selecting differentially expressed genes. In: 2017 IEEE international conference on bioinformatics and biomedicine (BIBM); 2017. IEEE, pp. 1752–1756.
41. He Q, Jin X, Du C, Zhuang F, Shi Z. Clustering in extreme learning machine feature space. Neurocomputing. 2014;128:88–95.
42. Yu N, Gao YL, Liu JX, Shang J, Zhu R, Dai LY. Co-differential gene selection and clustering based on graph regularized multi-view NMF in cancer genomic data. Genes. 2018;9(12):586.
43. Gao MM, Cui Z, Gao YL, Liu JX, Zheng CH. Dual-network sparse graph regularized matrix factorization for predicting miRNA–disease associations. Mol Omics. 2019;15(2):130–37.
44. Horata P, Chiewchanwattana S, Sunat K. Robust extreme learning machine. Neurocomputing. 2013;102:31–44.
45. Peng Y, Wang S, Long X, Lu BL. Discriminative graph regularized extreme learning machine and its application to face recognition. Neurocomputing. 2015;149:340–53.
46. Huang G, Liu T, Yang Y, Lin Z, Song S, Wu C. Discriminative clustering via extreme learning machine. Neural Netw. 2015;70:1–8.
47. Yi Y, Qiao S, Zhou W, Zheng C, Liu Q, Wang J. Adaptive multiple graph regularized semi-supervised extreme learning machine. Soft Comput. 2018;22(11):3545–62.
48. Boyd S, Vandenberghe L. Convex optimization. Cambridge: Cambridge University Press; 2004.
49. He R, Zheng WS, Tan T, Sun Z. Half-quadratic-based iterative minimization for robust sparse representation. IEEE Trans Pattern Anal Mach Intell. 2013;36(2):261–75.
Acknowledgements
Not applicable.
Funding
This work was supported in part by the NSFC under Grant Nos. 61872220 and 61873001. The funding bodies played no role in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.
Author information
Contributions
LRR and JXL proposed the CSRGELM method and performed the experiments. CHZ and JLS contributed to the data analysis. YLG and LRR drafted the manuscript and improved its writing. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ren, LR., Gao, YL., Liu, JX. et al. Correntropy induced loss based sparse robust graph regularized extreme learning machine for cancer classification. BMC Bioinformatics 21, 445 (2020). https://doi.org/10.1186/s12859020037901
Keywords
 Extreme learning machine
 Correntropy induced loss
 Supervised learning
 Bioinformatics