An efficient gene selection method for microarray data based on LASSO and BPSO

Background The main goal of successful gene selection for microarray data is to find compact and predictive gene subsets which could improve the accuracy. Though a large pool of available methods exists, selecting the optimal gene subset for accurate classification is still very challenging for the diagnosis and treatment of cancer. Results To obtain the most predictive genes subsets without filtering out critical genes, a gene selection method based on least absolute shrinkage and selection operator (LASSO) and an improved binary particle swarm optimization (BPSO) is proposed in this paper. To avoid overfitting of LASSO, the initial gene pool is divided into clusters based on their structure. LASSO is then employed to select high predictive genes and further calculate the contribution value which indicates the genes’ sensitivity to samples’ classes. With the second-level gene pool established by double filter strategy, the BPSO encoding the contribution information obtained from LASSO is improved to perform gene selection. Moreover, from the perspective of the bit change probability, a new mapping function is defined to guide the updating of the particle to select the more predictive genes in the improved BPSO. Conclusions With the compact gene pool obtained by double filter strategies, the improved BPSO could select the optimal gene subsets with high probability. The experimental results on several public microarray data with extreme learning machine verify the effectiveness of the proposed method compared to the relevant methods.

Background DNA microarray datasets have been used to identify the optimal gene subset and perform sample classification between different disease phenotypes, for diagnostic and prognostic purposes [1]. However, many computational methods have difficulties in selecting the optimal set of genes as a result of the small number of samples compared to the huge number of genes, irrelevant genes, and noisy genes [2], which leads poor generalization in the classification process. As a data preprocessing technique, gene selection is a key step for classification [3]. Selecting a critical gene subset could not only decrease the computational complexity and gene redundancy, but also increase the classification accuracy. However, gene selection is a tough task for the high-dimensional microarray data.
Fortunately, the development of swarm intelligence optimization algorithm offers great advantages for microarray data [4]. Due to its simple operation, fast convergence, good global search ability, the swarm intelligence optimization algorithm has been widely accepted and successfully applied to solve a lot of problems.
As an efficient global search technique, particle swarm optimization (PSO) [5,6] has been widely applied to microarray data. Precisely because of its fast convergence speed and good convergence accuracy, PSO has attracted much more attention [7,8] in gene selection. In [9], a combination of teaching learning-based optimization (TLBO) and particle swarm optimization was proposed to find the small optimal gene subset. In [10], the binary PSO (BPSO) coupled with filter method was implemented in searching optimal gene subsets. Sahu et al. [11] proposed a novel feature selection algorithm using PSO for microarray data, which used filtering technique such as signal-to-noise ratio (SNR) score combined with PSO to select key genes for classification and achieved a better classification accuracy than other non-PSO algorithms. In the Kmeans-PSO-ELM [12], the initial gene pool was firstly grouped into several clusters by the the K-means method, and then a compact set of informative genes were obtained after combining the standard PSO with extreme learning machine. These hybrid methods mentioned above had the ability of searching a small predictive gene subset for sample classification. However, the genes selected by these methods were not easily interpretable. Moreover, despite the fact that PSO shows superior performance for selecting optimal feature subsets, it still suffers from the drawback that it is easy to converge to local minima and lead to premature convergence. To overcome the deficiencies of the above PSO based gene selection methods, a modified discrete PSO combined support vector machines (SVM) was proposed in [13] for tumor classification, which verified that the modified PSO was a useful tool for gene selection. In [14], an improved PSO (PSO-RG) with a new global best position (gbest) updating mechanism was proposed to avoid being trapped in a local optimum and achieved superior classification performance. In [15], a gene selection method based on hybrid model BPSO and Bayesian linear discriminant analysis (BLDA) was proposed to select genes with lower redundancy and high classification accuracy. Although the method could relieve the premature problem of PSO and select compact gene subsets, the proposed method selects genes that are not easily interpretable and it may also filtering out some critical genes.
To obtain predictive genes with more interpretability, two gene selection methods based on binary PSO and gene-to-class sensitivity (GCS) information were proposed in [16,17]. In the KMeans-GCSI-MBPSO-ELM [16], a modified BPSO coupling GCS information (GCSI) combined with ELM was used to select smallest possible gene subsets. Although it could obtain predictive genes with lower redundancy and better interpretability, it might filter out a few critical genes highly related to sample classification in some cases and thus lead into worse classification accuracy.
Least absolute shrinkage and selection operator (LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces [18]. Since it can typically extremely sparse, leading to interpretable models with only very few predictor variables, LASSO become another powerful feature selection method [19]. In [20], LASSO was used to select the top key variables in the regression process and achieved a superior performance in gene selection. However, the limitation of the method is that the gene structure in microarray data is not taken into account enough. Furthermore, it is computational costly and may cause overfitting problem.
According to the above analysis, some current PSObased gene selection methods lack interpretability as well as filtering out some key genes. Some LASSObased methods do not consider gene structure and have overfitting problem with high computational cost. To overcome those deficiencies, an efficient gene method combining LASSO with an improved BPSO is proposed in this paper. Firstly, the signal-to-noise ratio (SNR) filter method is employed to filter out some genes in order to establish the initial gene pool. The genes in initial pool are divided into different clusters based on their true geometric structure. Then, LASSO is conducted to select the top contributing genes in each cluster individually to establish the second level gene pool. Finally, an improved BPSO is proposed to select the optimal gene subset. In the improved BPSO, to obtain predictive genes with better interpretability, the contribution values from the LASSO process, indicating the genes' sensitivity to samples' classes, are encoded into the initial and update process of the BPSO. Moreover, from the perspective of the bit change probability, a new mapping function is defined to guide the updating of the particle in order that the swarm can converge to the global optimum with high possibility. With the compact gene pool obtained by double filter strategies, the improved BPSO could select the optimal gene subsets with high probability. Experimental results on several public microarray data verify the effectiveness and efficiency of the proposed hybrid gene selection method.
The remainder of this paper is organized as follows. The related preliminaries and the proposed gene selection method are described in "Methods" section. Simulations are carried out and results are analyzed in "Results" section. Finally, the concluding remarks are offered in "Conclusions" section.

Binary particle swarm optimization
Particle swarm optimization (PSO) is a population-based optimization algorithm in search for the best solution by simulating the movement of flock of birds [6]. The binary PSO [21] which is used for discrete problem was proposed. Its general steps are described as follows.
The velocity of the i − th particle is represented by v i = (v i1 , v i2 , ..., v iD ) and the position of the i − th particle is represented by x i = (x i1 , x i2 , ..., x iD ), i = 1, 2, ..., n, where n is the size of population. Once the adaptive values personal best position (pbest) and gbest are obtained, the features of the pbest and gbest particle can be traced with regard to their position and velocity. Each particle is updated according to the following equation: x where j = (1, 2, ..., D); pbest ij = (pbest i1 , pbest i2 , ..., pbest iD ) is the personal best position of the i − th particle and gbest(t) = (gbest 1 , gbest 2 , ..., gbest D ) is the global best position in the t-th iteration; ω is the inertial weight of BPSO; t denotes the iteration number; c1 and c2 are two acceleration factors which can balance the impact of pbest and gbest; r 1 and r 2 are two numbers randomly generated in [0, 1].

Least absolute shrinkage and selection operator
To improve variable selection, Tibshirani [18] developed the least absolute shrinkage and selection operator (LASSO). LASSO is a combination of ridge regression. It can automatically select a set of informative variables through the regression coefficients in the linear regression model shrinking to zero [22]. Suppose that the data (x i , y i ) contains n samples and m features, x = (x 1 , x 2 , ..., x m ),where x j = (x 1j , x 2j , ...x nj ) T are the predictor variables, y = (y 1 , y 2 , ...y nj ) T and y i is the responses. Assume that the x ij are standardized and the y i are centralization, there is: Letting regression coefficients β = (β 1 , β 2 , ...β m ), the LASSO estimate is defined as follows: where t ≥ 0 is a tuning parameter.

Extreme learning machine
To solve the problem of gradient-based learning algorithms, a learning algorithm for single-hidden layer feedforward neural networks (SLFNs) called extreme learning machine (ELM) was proposed in [23]. In ELM, the input weights and hidden biases are randomly selected, and then the output weights are calculated by generalized inverse of hidden output matrix. ELM has much better generalization performance with much faster learning speed than gradient-based algorithms [24,25,26]. For N arbitrary distinct samples ( A SLFN with N H hidden nodes and activation function g( ) can approximate these N samples with zero error. This means that: where T represents the target matrix vectors, H is the hidden output matrix: H ωh 1 ; :::; ωh N H ; b 1 ; :::; b N H ; x 1 ; :: ::: ::: where the ωh i = [ωh i1 , ωh i2 .., ωh in ] T is the input weight vector connecting the i − th hidden neuron and input neurons, the ωo i = [ωo i1 , ωo i2 .., ωo im ] is the output weight vector connecting the i − th hidden neuron and the output neurons.
In the process of learning, the input weight and the hidden biases are arbitrarily chosen and need not be adjusted at all. Secondly, the smallest norm least-squares solution of the Eq. (6) is obtained as follows: where H + is the Moore-Penrose inverse of H.

The proposed gene selection method based on LASSO and BPSO
The proposed method is aimed to deal with the two problems on how to take advantage of intrinsic gene structure information, avoid overfitting with less complexity and how to select the optimal gene subsets to improve the classification accuracy without filtering out key genes. In the selection process, geodesic distance is calculated as the measurement between genes, which can preserve the intrinsic geometry of high dimensional microarray data. To decrease the complexity of the model and avoid overfitting of LASSO, the initial gene pool data are divided into clusters by using the Kmedoids approach. The clustering process based on geodesic distance could solve the former problem. To solve the latter problem, the improved BPSO is proposed to select the possible gene subsets, including encoding gene contribution information and defining a new mapping function, which could help particles converge to the optimal with higher possibility. The gene contribution information and the improved BPSO are depicted in detail in the following subsections.
The contribution value of each gene obtained by LASSO As mentioned above, the candidate elite gene pool is established by LASSO. However, LASSO is an ordinary least squares cost function extended with a L 1 penalty on the regression coefficients. Since the interval of parameter t is crucial for LASSO and hard to be determined in practical, in this study, Least angle regression (LAR) is used for LASSO. It would return the entire solution path directly for any fixed sample set. So the LASSO process can be described as follows: Step 1: Normalize all the gene variables and centralize all the predictors: Set residual,the coefficients β = (β 1 , β 2 , ..., β m ) are initialized to 0, whereŷ ¼ Xβ; Step 2: Find the x j which is most correlated with r, the current correlation coefficient c ¼ cðŷÞ ¼ X T ðy−ŷÞ; Step 3: Move β j from 0 towards the inner product of the x j and r until some other variable x k has as much correlation with the current residual r; Step 4: Move β j and β k the inner product of the α = (x j , x k ) and r until some other variable x p has as much correlation with the current residual r; If the coefficient β m is decreased to 0, then delete the corresponding variable x m and recalculate the r; Step 5: Repeat step 2 to step 4 until all variables have been calculated by the model.
After the LASSO method, top contributing genes are selected and the corresponding regression coefficients are obtained. Furthermore, the value of the regression coefficients β = (β 1 , β 2 , ...β m ) are the contribution values of genes to the class.

The improved BPSO
In this study, BPSO is modified from two aspects. One is to encode the contribution value to the BPSO to select those genes which are much related to samples' classes. The other is to modify the mapping function of velocity in BPSO for increasing the probability of finding the optimal with fewer iterations. Generally, a variable with larger coefficient value makes more contribution to regression model. Consequently, it is convincing that gene with a large contribution value contributes more to samples' classification than one with small contribution value, so it is reasonable to select those genes with large contribution values to achieve higher classification accuracy. In this study, the contribution value is encoding into the BPSO in the swarm initialization and update.
It is possible that the particles initialized by the traditional BPSO randomly are far from the global minima or near some local minima, which may lead to slow convergence and premature. In this study, to make the initial swarm near the global minima with high probability, the particles are initialized according to the contribution values of genes selected. Hence, the top twenty percentages of the genes with the largest contribution values are randomly initialized as 1 s or 0 s after all genes sorted in descending order according to their contribution values. The rest components related to the eighty percentages of the genes with the smallest contribution values are initialized as 0, which indicates that those genes are not selected.
In the swarm update process, given that the high contribution value indicating the high possibility to be selected, the formula of updating the particles is modified according to the contribution value as follows: where Contribution(j) is the contribution value of the jth gene. s( ) is the mapping function and avg(Contribution) is the average contribution value of all genes. x ij and v ij are the j-th component of the position and velocity respectively of the i-th particle. Under the effective ※also selected in [16]; # also selected in [17];❖also selected in [30] [16]; # also selected in [17]; ❖also selected in [31]; ✺also selected in [32];✱also selected in [30] guidance it directly searches the optimal gene subset sensitive to the class from the candidate elite gene pool. To make sure the particle could converge to the global best position with higher possibility, the mapping function in BPSO is defined as follows: where T is the maximum iteration number. The interpretation for the new mapping function can be described as follows: In binary PSO, each particle consists of binary code, bit change probability is first proposed in [21], it represent the change probability of every bit in binary code. According to the analysis in [27], if the bit in (t-1)-th iteration is 0, the changing probability of bit in t-th iteration is s(v ij (t)); Similarly, if bit is 1 in (t-1)-th iteration, the changing probability of bit in t-th iteration is (1 − s(v ij (t))). Thus, the change probability of bit in t t-th iteration is calculated as follows: The relation of the bit change probability to v ij can be simply characterized in Fig. 1. As can be seen, when particle x ij converge to global location gbest j , the change rate is 0.5 which is up to maximum. That is, if BPSO converges to global optimal particle, its velocity is 0 which means that the rate of bit changing is maximum,  [16]; # also selected in [30]  ※also selected in [16]; # also selected in [30] so BPSO is more stochastic and lacks search directionality thus it may can not converge to the global best position. From the above idea, different from the mapping function in [27], the new mapping function differs from the original sigmoid in two respects. First consideration is the difference between the probability mapping function and the sigmoid function on velocity. The purpose is to make the probability function value is 0 when the speed is 0. Second consideration is from the iteration number aspect to make sure the BPSO can convergence to global optimal. The function curve is shown in Figs. 2 and 3.
From the Fig. 2, under the new mapping function, the mapping value is 0 when the velocity is 0 so the change rate is 0 which meets the requirements. Besides, as can be seen in Fig. 3, in addition to the different mappings of bit velocity, the mapping function also take the iteration number into consideration, with the iteration number increasing, the mapping value is more closer to 1, which means the gene selected with higher probability.

The steps of the proposed gene selection method
Since the proposed method combines the LASSO with BPSO based on K-medoids and ELM to perform gene selection, it is referred to as the KL-IBPSO-ELM method. Figure 4 depicts the frame of the proposed gene selection method, and the detailed steps are described as follows: Step 1: Form an initial gene pool. The dataset is divided into training and testing datasets. Selecting 200 genes from all original genes by using SNR method [28] on the training data. Then, the training dataset is further divided into the training and validation datasets.
Step 2: Establish the candidate elite gene pool. First, calculate the geodesic distance between every two genes in initial pool. The geodesic distance can reflect flow structure of the high dimensional gene data more precisely [29]. Then, employ the K-medoids to cluster the genes. The purpose of clustering before LASSO is to give full consideration of gene structure as well as decrease the computational complexity. Finally, the top contributing genes are obtained by LASSO selecting in every clusters. Moreover, the contribution values of the elite genes are gained.
Step 3: Use the improved BPSO to select the optimal gene subsets from the candidate elite gene pool. Initialize all particles according to the contribution Initialization rule. The position x ij can be coded to 0 or 1. 1 means the i − th gene is selected and 0 means the i − th gene is not selected. Set the current position of each particle as its pbest, and compute the gbest. Update the particle according to contribution updating rule and new mapping function. Compute the fitness value of each particle. In this study, the selected gene number is not  [16]; # also selected in [17]; ⊙also selected in [33]; ✱also selected in [34]; ✺also selected in [30] Fig. 1 Relation between change rate and bit velocity fixed so as to further avoid filter out key genes. Therefore, the fitness value of the i-th particle is adopted by the corresponding accuracy obtained by ELM denoted by the i-th particle.
The KL-IBPSO-ELM method firstly filter out the irrelevant genes by the SNR method. Then the LASSO selects the candidate elite genes in every clusters obtained through K-medoids method based on the geodesic distance. Finally, to obtain the optimal gene subsets, the BPSO is modified to improve its convergence by encoding the contribution value from LASSO and changing the new mapping function. It could select the most predictive gens with low redundancy effectively.
Additionally, the proposed gene selection method contains filtering irrelevant genes to establish the gene pool and using PSO to select functional gene subsets from the gene pool, and its computational complexity is at the same order of magnitude of that of the PSO-based [16,17,30] gene selection methods.
In the experiments on all data, the swarm size is 30, the maximum iteration number is selected as 100, the acceleration constants c1 and c2 are both selected as 1.49445, and the inertial weight varies from 0.9 to 0.4. The size of the second-level gene pool is 25 on all data. The parameter of cluster number is fixed as 5 on all data. The values of these parameters are determined by the cross-validation runs on the training datasets. All the experiments are run in MATLAB 8.1 environment.
The biological and functional analysis of the selected gene subsets The experiments are carried out 500 times on each microarray data, and the top ten frequently selected genes by the proposed method are listed in Tables 2, 3, 4, 5, and 6. Many genes selected by the KL-IBPSO-ELM method were also selected by one or more methods proposed in [16,17] [ [30][31][32][33][34].
The heatmap with top ten frequently selected genes for the five data is shown in Fig. 5. From Fig. 5a, the expression levels of genes 765, 14, 493, 377, 175 are distinct in two classes. From Fig. 5b, the only expression level of gene 4309 has a distinct expression level in Desmoplastic class. From Fig. 5c, most of ten genes expression levels clearly differentiate between AML and ALL. From Fig. 5d, there has no single gene whose expression levels are distinct between the two classes. From Fig. 5e, The expression levels of genes 3178 are distinct from SQ and other classes, the ones of gene 2672,1974 and 580 are distinct from PC and other classes, the ones of gene 2264, 1974 and 235 are distinct from SM and other classes, and the gene 295 has distinct expression level in ADE. According to Table 4 and   Table 7. The proposed mehod in this study outperform other four methods on the Brain cancer data. The SC-IPSO-ELM achives better performance than other methods on the Colon, Lymphoma and LUNG data, and the KMeans-GCSI-MBPSO-ELM method achieves better performance than the KL-IBPSO-ELM method on the Colon and LUNG data. On the Leukemia data, the KL-IBPSO-ELM achieves 100% 5fold CV accuracy as well as the KMeans-GCSI-MBPSO-ELM and SC-IPSO-ELM methods. These results indicate that the KL-IBPSO-ELM is also capable of selecting those predictive genes highly related to samples' classes.

The performance comparison between the original BPSO and the improved BPSO
To illustrate the performance of the improved BPSO, the experiments conducted by the KL + BPSO+ELM frame wtih the improved BPSO and original BPSO, respectively. Figure 6 shows the 5-fold CV accuracy on the test data on the five data versus the iteration number of the original BPSO compared with that of the improved BPSO. From  Fig. 6, the improved BPSO finds the optimal gene subsets with only 30, 40, 38, 31 and 32 epochs on the Colon, Brain cancer, Leukemia, Lymphoma, LUNG data, respectively, whereas the original BPSO require 42, 50, 42, 38 and 40 epochs, on the above five data, respectively, which shows the improved BPSO could find the optimal with less iteration number than the original BPSO. Furthermore, on each specific epoch, the 5-fold CV accuracy of the improved BPSO is always higher than that of the original BPSO. These results indicate that the improved BPSO has the ability to converges slightly faster than the original BPSO and could selected the optimal gene subsets. Figure 7 shows the contribution values of selected genes in every iteration of IBPSO on the five data. In the process of selecting the optimal gene subset, the KL-IBPSO-ELM is apt to select those genes with high contribution values, so the subsets' contribution value has a increase trend as the iteration increases. The KL-IBPSO-ELM method does not always select those genes with the highest contribution values, and it also selects those critical genes with comparatively low contribution values to form the predictive gene subsets for achieving higher classification accuracy. Hence, the contribution curve fluctuates at the early iterations.

Discussion on the parameter selection
To cluster the genes in initial gene pool, it is critical to determine the number of the clusters. Fig. 8 shows the relationship between the classification accuracy on the test data obtained by ELM and the number of the clusters. From Fig. 8, the 5-fold CV accuracy does not have a specific trend as the values of the parameter k increases, and the accuracy is highest when the k is selected as 5 on the Colon, Brain cancer, Leukemia, Lymphoma and LUNG data. Thus the clusters number k is fixed as 5 in the experiments.

Conclusions
In this study, a gene selection method based on LASSO and BPSO was proposed to obtain the most predictive genes subsets. To give full consideration of gene structure as well as avoid LASSO overfitting, the candidate elite genes were selected by double filter method. Then by encoding the contribution value into the BPSO and defining a new mapping function, the improved BPSO was able to select a highly predictive and compact gene subset. Experimental results verified that the proposed method outperformed other PSObased and GCSI-based gene selection methods. Although the proposed could avoid filter out some of the key genes and reduce the rate at which the selection of new important is ignored by other relevant method, the proposed method may increase the computational cost because of complex establishment process of the candidate elite gene pool. Future work will include how to simplify the model for gene selection and apply the new method to more complex microarray data including RNA-seq data. Fig. 8 The number of the clusters versus the 5-fold CV accuracy on the test data obtained by ELM