Binary particle swarm optimization
Particle swarm optimization (PSO) is a population-based optimization algorithm in search for the best solution by simulating the movement of flock of birds [6]. The binary PSO [21] which is used for discrete problem was proposed. Its general steps are described as follows.
The velocity of the i − th particle is represented by vi = (vi1, vi2, ..., viD) and the position of the i − th particle is represented by xi = (xi1, xi2, ..., xiD), i = 1, 2, ..., n, where n is the size of population. Once the adaptive values personal best position (pbest) and gbest are obtained, the features of the pbest and gbest particle can be traced with regard to their position and velocity. Each particle is updated according to the following equation:
$$ {\displaystyle \begin{array}{l}{v}_{ij}\left(t+1\right)=\omega \ast {v}_{ij}(t)+{c}_1\ast {r}_1\ast \left( pbes{t}_{ij}(t)-{x}_{ij}(t)\right)\\ {}+c2\ast {r}_2\ast \left( gbes{t}_j(t)-{x}_{ij}(t)\right)\end{array}} $$
(1)
$$ {x}_{ij}=\Big\{{\displaystyle \begin{array}{c}1\kern2.25em \mathit{\operatorname{rand}}\left(\right)<s\left({v}_{ij}\right)\\ {}0\kern2.25em \mathit{\operatorname{rand}}\left(\right)\ge s\left({v}_{ij}\right)\end{array}} $$
(2)
$$ s\left({v}_{ij}\right)=\frac{1}{1+\mathit{\exp}\left(-{v}_{ij}\right)} $$
(3)
where j = (1, 2, ..., D); pbestij = (pbesti1, pbesti2, ..., pbestiD) is the personal best position of the i − th particle and gbest(t) = (gbest1, gbest2, ..., gbestD) is the global best position in the t-th iteration; ω is the inertial weight of BPSO; t denotes the iteration number; c1 and c2 are two acceleration factors which can balance the impact of pbest and gbest; r1 and r2 are two numbers randomly generated in [0, 1].
Least absolute shrinkage and selection operator
To improve variable selection, Tibshirani [18] developed the least absolute shrinkage and selection operator (LASSO). LASSO is a combination of ridge regression. It can automatically select a set of informative variables through the regression coefficients in the linear regression model shrinking to zero [22].
Suppose that the data (xi, yi) contains n samples and m features, x = (x1, x2, ..., xm),where xj = (x1j, x2j, ...xnj)T are the predictor variables, y = (y1, y2, ...ynj)T and yi is the responses. Assume that the xij are standardized and the yi are centralization, there is:
$$ \sum \limits_{i=1}^n{y}_i=0,\sum \limits_{i=1}^n{x}_i=0,\sum \limits_{i=1}^n{x_{ij}}^2=1,j=1,2,...m $$
(4)
Letting regression coefficients β = (β1, β2, ...βm), the LASSO estimate is defined as follows:
$$ {\displaystyle \begin{array}{l}\underset{\beta }{argmin}\left\{\sum \limits_{i=1}^n{\left({y}_i-\sum \limits_{j=1}^m{x}_{ij}{\beta}_j\right)}^2\right\}\kern0.5em \\ {}\kern3.25em subject\ to\sum \limits_{j=1}^m\mid {\beta}_j\mid \le t\end{array}} $$
(5)
where t ≥ 0 is a tuning parameter.
Extreme learning machine
To solve the problem of gradient-based learning algorithms, a learning algorithm for single-hidden layer feedforward neural networks (SLFNs) called extreme learning machine (ELM) was proposed in [23]. In ELM, the input weights and hidden biases are randomly selected, and then the output weights are calculated by generalized inverse of hidden output matrix. ELM has much better generalization performance with much faster learning speed than gradient-based algorithms [24, 25, 26]. For N arbitrary distinct samples (xi, ti) (i = 1, 2, ..., N), where xi = [xi1, xi2, ..., xin] ∈ Rn, ti = [ti1, ti2, ..., tim] ∈ Rm. A SLFN with NH hidden nodes and activation function g( ) can approximate these N samples with zero error. This means that:
$$ H\omega \mathrm{o}=T $$
(6)
where T represents the target matrix vectors, H is the hidden output matrix:
$$ {\displaystyle \begin{array}{l}H\left(\omega {h}_1,...,\omega {h}_{N_H},{b}_1,...,{b}_{N_H},{x}_1,...,{x}_N\right)\\ {}=\mid \begin{array}{c}g\left(w{h}_1\bullet {x}_1+{b}_1\right)\kern1.5em ...\kern0.75em g\left(w{h}_{N_H}\bullet {x}_1+{b}_{N_H}\right)\\ {}g\left(w{h}_1\bullet {x}_2+{b}_1\right)\kern1.5em ...\kern0.75em g\left(w{h}_{N_H}\bullet {x}_2+{b}_{N_H}\right)\\ {}...\kern3.25em ...\kern3em ...\\ {}g\left(w{h}_1\bullet {x}_N+{b}_1\right)\kern1.5em ...\kern0.75em g\left(w{h}_{N_H}\bullet {x}_N+{b}_{N_H}\right)\end{array}\mid \end{array}} $$
$$ \omega o={\left|\begin{array}{c}\omega {o}_1^T\\ {}\omega {o}_2^T\\ {}\vdots \\ {}\omega {o}_{N_H}^T\end{array}\right|}_{N_H\times m},T={\left|\begin{array}{c}{t}_1^T\\ {}{t}_2^T\\ {}\vdots \\ {}{t}_N^T\end{array}\right|}_{N\times m} $$
(7)
where the ωhi = [ωhi1, ωhi2.., ωhin]T is the input weight vector connecting the i − th hidden neuron and input neurons, the ωoi = [ωoi1, ωoi2.., ωoim] is the output weight vector connecting the i − th hidden neuron and the output neurons.
In the process of learning, the input weight and the hidden biases are arbitrarily chosen and need not be adjusted at all. Secondly, the smallest norm least-squares solution of the Eq. (6) is obtained as follows:
$$ \omega o={H}^{+}T $$
(8)
where H+ is the Moore-Penrose inverse of H.
The proposed gene selection method based on LASSO and BPSO
The proposed method is aimed to deal with the two problems on how to take advantage of intrinsic gene structure information, avoid overfitting with less complexity and how to select the optimal gene subsets to improve the classification accuracy without filtering out key genes. In the selection process, geodesic distance is calculated as the measurement between genes, which can preserve the intrinsic geometry of high dimensional microarray data. To decrease the complexity of the model and avoid overfitting of LASSO, the initial gene pool data are divided into clusters by using the K-medoids approach. The clustering process based on geodesic distance could solve the former problem. To solve the latter problem, the improved BPSO is proposed to select the possible gene subsets, including encoding gene contribution information and defining a new mapping function, which could help particles converge to the optimal with higher possibility. The gene contribution information and the improved BPSO are depicted in detail in the following subsections.
The contribution value of each gene obtained by LASSO
As mentioned above, the candidate elite gene pool is established by LASSO. However, LASSO is an ordinary least squares cost function extended with a L1 penalty on the regression coefficients. Since the interval of parameter t is crucial for LASSO and hard to be determined in practical, in this study, Least angle regression (LAR) is used for LASSO. It would return the entire solution path directly for any fixed sample set. So the LASSO process can be described as follows:
Step 1: Normalize all the gene variables and centralize all the predictors:
$$ \sum \limits_{i=1}^n{y}_i=0,\sum \limits_{i=1}^n{x}_i=0,\sum \limits_{i=1}^n{x_{ij}}^2=1,j=1,2,...m $$
Set residual,the coefficients β = (β1, β2, ..., βm) are initialized to 0, where \( \hat{y}= X\beta \);
Step 2: Find the xj which is most correlated with r, the current correlation coefficient \( c=c\left(\hat{y}\right)={X}^T\left(y-\hat{y}\right) \);
Step 3: Move βj from 0 towards the inner product of the xj and r until some other variable xk has as much correlation with the current residual r;
Step 4: Move βj and βk the inner product of the α = (xj, xk) and r until some other variable xp has as much correlation with the current residual r; If the coefficient βm is decreased to 0, then delete the corresponding variable xm and recalculate the r;
Step 5: Repeat step 2 to step 4 until all variables have been calculated by the model.
After the LASSO method, top contributing genes are selected and the corresponding regression coefficients are obtained. Furthermore, the value of the regression coefficients β = (β1, β2, ...βm) are the contribution values of genes to the class.
The improved BPSO
In this study, BPSO is modified from two aspects. One is to encode the contribution value to the BPSO to select those genes which are much related to samples’ classes. The other is to modify the mapping function of velocity in BPSO for increasing the probability of finding the optimal with fewer iterations.
Generally, a variable with larger coefficient value makes more contribution to regression model. Consequently, it is convincing that gene with a large contribution value contributes more to samples’ classification than one with small contribution value, so it is reasonable to select those genes with large contribution values to achieve higher classification accuracy. In this study, the contribution value is encoding into the BPSO in the swarm initialization and update.
It is possible that the particles initialized by the traditional BPSO randomly are far from the global minima or near some local minima, which may lead to slow convergence and premature. In this study, to make the initial swarm near the global minima with high probability, the particles are initialized according to the contribution values of genes selected. Hence, the top twenty percentages of the genes with the largest contribution values are randomly initialized as 1 s or 0 s after all genes sorted in descending order according to their contribution values. The rest components related to the eighty percentages of the genes with the smallest contribution values are initialized as 0, which indicates that those genes are not selected.
In the swarm update process, given that the high contribution value indicating the high possibility to be selected, the formula of updating the particles is modified according to the contribution value as follows:
$$ {x}_{ij}=\Big\{{\displaystyle \begin{array}{c}\begin{array}{l}1\kern2.25em \mathit{\operatorname{rand}}\left(\right)+ avg(Contribution)\\ {}\kern2.5em \le s\left({v}_{ij}\right)+ Contribution(j)\end{array}\\ {}\begin{array}{l}0\kern2.25em \mathit{\operatorname{rand}}\left(\right)+ avg(Contribution)\\ {}\kern2.75em >s\left({v}_{ij}\right)+ Contribution(j)\end{array}\end{array}} $$
(9)
where Contribution(j) is the contribution value of the j-th gene. s( ) is the mapping function and avg(Contribution) is the average contribution value of all genes. xij and vij are the j-th component of the position and velocity respectively of the i-th particle. Under the effective guidance it directly searches the optimal gene subset sensitive to the class from the candidate elite gene pool.
To make sure the particle could converge to the global best position with higher possibility, the mapping function in BPSO is defined as follows:
$$ s\left({v}_{ij}\right)=\Big\{{\displaystyle \begin{array}{c}1-\frac{2-\frac{t}{T}}{1+{e}^{-{v}_{ij}}},\kern3em {v}_{ij}<0\\ {}1-\frac{2-\frac{t}{T}}{1+{e}^{v_{ij}}},\kern3.25em {v}_{ij}\ge 0\end{array}} $$
(10)
where T is the maximum iteration number.
The interpretation for the new mapping function can be described as follows:
In binary PSO, each particle consists of binary code, bit change probability is first proposed in [21], it represent the change probability of every bit in binary code. According to the analysis in [27], if the bit in (t-1)-th iteration is 0, the changing probability of bit in t-th iteration is s(vij(t)); Similarly, if bit is 1 in (t-1)-th iteration, the changing probability of bit in t-th iteration is (1 − s(vij(t))). Thus, the change probability of bit in t t-th iteration is calculated as follows:
$$ {\displaystyle \begin{array}{l}p(t)=s\left({v}_{ij}(t)\right)\ast \left(1-s\left({v}_{ij}\left(t-1\right)\right)\right)+\\ {}\kern2.5em s\left({v}_{ij}\left(t-1\right)\right)\ast \left(1-s\left({v}_{ij}(t)\right)\right)\\ {}\kern0.85em =\frac{1}{1+\mathit{\exp}\left(-{v}_{ij}(t)\right)}\ast \left(1-\frac{1}{1+\mathit{\exp}\left(-{v}_{ij}\left(t-1\right)\right)}\right)+\\ {}\kern0.72em \frac{1}{1+\mathit{\exp}\left(-{v}_{ij}\left(t-1\right)\right)}\ast \left(1-\frac{1}{1+\mathit{\exp}\left(-{v}_{ij}(t)\right)}\right)\end{array}} $$
(11)
The relation of the bit change probability to vij can be simply characterized in Fig. 1. As can be seen, when particle xij converge to global location gbestj, the change rate is 0.5 which is up to maximum. That is, if BPSO converges to global optimal particle, its velocity is 0 which means that the rate of bit changing is maximum, so BPSO is more stochastic and lacks search directionality thus it may can not converge to the global best position.
From the above idea, different from the mapping function in [27], the new mapping function differs from the original sigmoid in two respects. First consideration is the difference between the probability mapping function and the sigmoid function on velocity. The purpose is to make the probability function value is 0 when the speed is 0. Second consideration is from the iteration number aspect to make sure the BPSO can convergence to global optimal. The function curve is shown in Figs. 2 and 3.
From the Fig. 2, under the new mapping function, the mapping value is 0 when the velocity is 0 so the change rate is 0 which meets the requirements. Besides, as can be seen in Fig. 3, in addition to the different mappings of bit velocity, the mapping function also take the iteration number into consideration, with the iteration number increasing, the mapping value is more closer to 1, which means the gene selected with higher probability.
The steps of the proposed gene selection method
Since the proposed method combines the LASSO with BPSO based on K-medoids and ELM to perform gene selection, it is referred to as the KL-IBPSO-ELM method. Figure 4 depicts the frame of the proposed gene selection method, and the detailed steps are described as follows:
Step 1: Form an initial gene pool. The dataset is divided into training and testing datasets. Selecting 200 genes from all original genes by using SNR method [28] on the training data. Then, the training dataset is further divided into the training and validation datasets.
Step 2: Establish the candidate elite gene pool. First, calculate the geodesic distance between every two genes in initial pool. The geodesic distance can reflect flow structure of the high dimensional gene data more precisely [29]. Then, employ the K-medoids to cluster the genes. The purpose of clustering before LASSO is to give full consideration of gene structure as well as decrease the computational complexity. Finally, the top contributing genes are obtained by LASSO selecting in every clusters. Moreover, the contribution values of the elite genes are gained.
Step 3: Use the improved BPSO to select the optimal gene subsets from the candidate elite gene pool. Initialize all particles according to the contribution Initialization rule. The position xij can be coded to 0 or 1. 1 means the i − th gene is selected and 0 means the i − th gene is not selected. Set the current position of each particle as its pbest, and compute the gbest. Update the particle according to contribution updating rule and new mapping function. Compute the fitness value of each particle. In this study, the selected gene number is not fixed so as to further avoid filter out key genes. Therefore, the fitness value of the i-th particle is adopted by the corresponding accuracy obtained by ELM denoted by the i-th particle.
The KL-IBPSO-ELM method firstly filter out the irrelevant genes by the SNR method. Then the LASSO selects the candidate elite genes in every clusters obtained through K-medoids method based on the geodesic distance. Finally, to obtain the optimal gene subsets, the BPSO is modified to improve its convergence by encoding the contribution value from LASSO and changing the new mapping function. It could select the most predictive gens with low redundancy effectively.
Additionally, the proposed gene selection method contains filtering irrelevant genes to establish the gene pool and using PSO to select functional gene subsets from the gene pool, and its computational complexity is at the same order of magnitude of that of the PSO-based [16, 17, 30] gene selection methods.