
A hybrid gene selection method based on gene scoring strategy and improved particle swarm optimization

Abstract

Background

Gene selection is one of the critical steps in the classification of microarray data. Because particle swarm optimization has no complicated evolutionary operators and requires few parameters to be adjusted, it has been used increasingly as an effective technique for gene selection. However, particle swarm optimization is apt to converge to local minima, which leads to premature convergence, so some particle swarm optimization based gene selection methods are likely to select non-optimal genes. Selecting predictive genes with low redundancy while not filtering out key genes remains a challenge.

Results

To obtain predictive genes with lower redundancy and to overcome the deficiencies of traditional particle swarm optimization based gene selection methods, a hybrid gene selection method based on a gene scoring strategy and an improved particle swarm optimization is proposed in this paper. To select the genes highly related to the samples’ classes, a gene scoring strategy based on randomization and extreme learning machine is proposed to filter out many irrelevant genes. With the third-level gene pool established by the multiple filter strategy, an improved particle swarm optimization is proposed to perform gene selection. In the improved particle swarm optimization, to decrease the likelihood of premature convergence of the swarm, the Metropolis criterion of the simulated annealing algorithm is introduced to update the particles, and half of the swarm is reinitialized when the swarm is trapped in local minima.

Conclusions

Combining the gene scoring strategy with the improved particle swarm optimization, the new method can select functional gene subsets that are significantly sensitive to the samples’ classes. With the few discriminative genes selected by the proposed method, extreme learning machine and support vector machine classifiers achieve high prediction accuracy on several public microarray datasets, which in turn verifies the efficiency and effectiveness of the proposed gene selection method.

Background

One of the major applications of microarray data analysis is to classify samples between different disease phenotypes for diagnostic and prognostic purposes [1]. However, because of the small number of samples in comparison to the high dimensionality, along with experimental variations in measured gene expression levels, it is difficult to solve a particular biological classification problem, let alone gain a deeper understanding of the functions of particular genes [1]. Gene selection is one of the critical steps in the classification of microarray data [2]. Selecting a useful gene subset not only decreases the computational complexity, but also increases the classification accuracy.

The methods for gene selection are broadly divided into three categories: filter, wrapper and embedded methods [3]. A filter method relies on general characteristics of the training data to select genes without involving any classifier for evaluation. Most filter methods consider each feature separately and ignore feature dependencies, which may lead to worse classification performance than other types of feature selection methods [4]. In addition to considering feature dependencies, wrapper methods take into account the interaction between feature subset search and model selection. However, wrapper methods have a higher risk of overfitting than filter methods and are very computationally intensive [5]. Embedded methods have the advantage that they include the interaction with the classification model, while being far less computationally intensive than wrapper methods [6].

Because it has no complicated evolutionary operators and requires few parameters to be adjusted [7, 8], particle swarm optimization (PSO) [9, 10] has been used increasingly as an effective technique for global optimization in the past decades. In recent years, PSO has also been applied to gene selection. In [11], a combination of integer-coded GA (ICGA) and particle swarm optimization, coupled with an extreme learning machine (ELM), was used to select an optimal set of genes. In [12, 13], binary PSO (BPSO) combined with a filter method was applied to search for optimal gene subsets. The method in [12] simplified gene selection and obtained higher classification accuracy than some similar GA-based gene selection methods, while the method in [13] could determine the appropriate number of genes and obtained high classification accuracy with a support vector machine. In [14], the Kmeans-PSO-ELM method used the K-means method to group the initial gene pool into several clusters, and the standard PSO combined with ELM was used to perform gene selection, which could obtain a compact set of informative genes. Since traditional PSO is apt to converge to local minima, which leads to premature convergence, the above PSO based gene selection methods still have much room for improvement.

To overcome the deficiencies of the above PSO based gene selection methods and obtain predictive genes with more interpretability, two gene selection methods based on binary PSO and gene-to-class sensitivity (GCS) information were proposed in [15, 16]. In the KMeans-GCSI-MBPSO-ELM method [16], GCS information combined with the K-means method was used to identify relevant genes for subsequent sample classification, and a modified BPSO coupling GCS information (GCSI), combined with ELM, was used to select the smallest possible gene subsets. Although the KMeans-GCSI-MBPSO-ELM could obtain predictive genes with lower redundancy and better interpretability, it might filter out a few critical genes highly related to the samples’ classes in some cases and thus lead to worse classification accuracy [16]. To overcome this weakness, the BPSO-GCSI-ELM method [15] also encoded GCS information into binary PSO to perform gene selection, by initializing the particles, updating the particles, modifying the maximum velocity, and adopting a mutation operation adaptively. Although the BPSO-GCSI-ELM method could avoid filtering out some critical genes, it may increase the computational cost because of the large initial gene pool.

To obtain predictive genes with lower redundancy and to overcome the deficiencies of the above mentioned gene selection methods, a hybrid gene selection method based on a gene scoring strategy and an improved particle swarm optimization (PSO) is proposed in this paper. First, with the initial gene pool obtained by double filter strategies, a randomization method combined with ELM is proposed to score each gene, and the third-level gene pool for further gene selection is established. Second, an improved PSO aimed at improving the search ability of the swarm is proposed to perform gene selection. In the improved PSO, to decrease the probability of converging to local minima, the Metropolis criterion of the simulated annealing (SA) algorithm is introduced to update the particles, and half of the swarm is reinitialized when the swarm is trapped in local minima. With the compact and relevant gene pool obtained by the multiple filter strategies, the improved PSO can select the optimal gene subsets with high probability. Finally, experimental results on six public microarray datasets verify the effectiveness and efficiency of the proposed hybrid gene selection method.

The remainder of this paper is organized as follows. The related preliminaries are briefly described in the “Background” section. The proposed gene selection method is introduced in the “Methods” section. The “Results” section gives the experimental results on six public microarray datasets. Finally, concluding remarks are offered in the “Conclusions” section.

Methods

Particle swarm optimization

Particle swarm optimization (PSO) is a population-based stochastic optimization technique developed by Eberhart and Kennedy [9]. PSO works by initializing a flock of birds randomly over the search space, where each bird, called a particle, has neither mass nor volume. Each particle flies with a certain velocity according to its momentum and the influence of both its own previous best position (Pib) and the best position of all particles (Pg). Assume that the dimension of the search space is D and the total number of particles is n. Then the original PSO is described as follows:

$$ v_{id}(t+1)=v_{id}(t)+c_{1}\times r_{1}()\times \left[p_{ibd}(t)-x_{id}(t)\right]+c_{2}\times r_{2}()\times \left[p_{gd}(t)-x_{id}(t)\right] $$
(1)
$$ x_{id}(t+1)=x_{id}(t)+v_{id}(t+1),\quad 1\leq i\leq n,\ 1\leq d\leq D $$
(2)

where vi(t) and xi(t) denote the velocity vector and the position of the i-th particle, respectively, at the t-th iteration; Pib(t) and Pg(t) denote the previous best position of the i-th particle and the best position of all particles, respectively; c1 and c2 are positive acceleration constants; and r1() and r2() are random numbers between 0 and 1. In addition, a limit (the maximum velocity) is placed on each velocity component.

To improve the convergence performance of the original PSO, a modified particle swarm optimization [10] was proposed. An inertial weight was introduced in the velocity vector evolution equation described as follows:

$$ v_{id}(t+1)=w(t)\times v_{id}(t)+c_{1}\times r_{1}()\times \left[p_{ibd}(t)-x_{id}(t)\right]+c_{2}\times r_{2}()\times \left[p_{gd}(t)-x_{id}(t)\right] $$
(3)

where w(t) is the inertial weight. Shi & Eberhart [10] suggested adjusting the weight with the following linearly decreasing schedule:

$$ w(t)=w_{ini}-\frac{w_{ini}-w_{end}}{T_{max}}\times t $$
(4)

where t is the current iteration number, and wini, wend and Tmax are the initial inertial weight, the final inertial weight and the maximum number of iterations, respectively.
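For concreteness, the following is a minimal NumPy sketch of the modified PSO of Eqs. 1–4, with a linearly decreasing inertia weight and a velocity limit. It minimizes a generic objective function (in this paper’s setting the objective would instead be a classification accuracy to maximize); all names and default values are illustrative rather than taken from the paper.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=30, t_max=100, c1=1.49445, c2=1.49445,
                 w_ini=0.9, w_end=0.4, x_bounds=(-5.0, 5.0), v_max=1.0, seed=0):
    """Minimal PSO with the linearly decreasing inertia weight of Eq. 4."""
    rng = np.random.default_rng(seed)
    lo, hi = x_bounds
    x = rng.uniform(lo, hi, (n_particles, dim))         # particle positions
    v = rng.uniform(-v_max, v_max, (n_particles, dim))  # particle velocities
    p_ib = x.copy()                                     # personal best positions
    f_ib = np.apply_along_axis(f, 1, x)                 # personal best values
    g = p_ib[np.argmin(f_ib)].copy()                    # global best position
    f_g = f_ib.min()
    for t in range(t_max):
        w = w_ini - (w_ini - w_end) / t_max * t         # Eq. 4
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (p_ib - x) + c2 * r2 * (g - x)  # Eq. 3
        v = np.clip(v, -v_max, v_max)                   # velocity limit
        x = np.clip(x + v, lo, hi)                      # Eq. 2
        f_x = np.apply_along_axis(f, 1, x)
        better = f_x < f_ib                             # update personal bests
        p_ib[better], f_ib[better] = x[better], f_x[better]
        if f_ib.min() < f_g:                            # update global best
            f_g, g = f_ib.min(), p_ib[np.argmin(f_ib)].copy()
    return g, f_g

# Example: minimize the 5-dimensional sphere function
best_x, best_f = pso_minimize(lambda z: float(np.sum(z ** 2)), dim=5)
```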

Extreme learning machine

In [17], a learning algorithm for single-hidden-layer feedforward neural networks (SLFN) called extreme learning machine (ELM) was proposed to overcome the problems of gradient-based learning algorithms. ELM randomly chooses the input weights and hidden biases, and analytically determines the output weights of the SLFN. ELM has much better generalization performance and much faster learning speed than gradient-based algorithms [18, 19].

For N arbitrary distinct samples (XXi, Ti) (i = 1, 2, …, N), where XXi = [xxi1, xxi2, …, xxin]T ∈ Rn and Ti = [ti1, ti2, …, tim]T ∈ Rm, a SLFN with NH hidden neurons and activation function g() can approximate these N samples with zero error. This means that

$$ H w_{o}=T $$
(5)

where

$$ \begin{aligned} & H\left({wh}_{1},...,{wh}_{N_{H}},b_{1},...,b_{N_{H}},{XX}_{1},...,{XX}_{N}\right)\\ &=\left[ \begin{array}{ccc} g\left({wh}_{1}\cdot {XX}_{1}+{b}_{1}\right) & \cdots & g\left({wh}_{N_{H}}\cdot {XX}_{1}+{b}_{N_{H}}\right)\\ \vdots & \ddots & \vdots \\ g\left({wh}_{1}\cdot {XX}_{N}+{b}_{1}\right) & \cdots & g\left({wh}_{N_{H}}\cdot {XX}_{N}+{b}_{N_{H}}\right)\\ \end{array} \right]\\ \end{aligned} $$
$$ w_{o}= \left[ \begin{array}{c} {w_{o1}}^{T} \\ \vdots \\ {w_{o_{N_{H}}}}^{T}\\ \end{array} \right] \quad\text{and}\quad T= \left[ \begin{array}{c} {t_{1}}^{T} \\ \vdots \\ {t_{N}}^{T} \\ \end{array} \right]. $$

Here whi=[whi1,whi2,...,whin]T is the input weight vector connecting the i-th hidden neuron to the input neurons, woi=[woi1,woi2,...,woim]T is the output weight vector connecting the i-th hidden neuron to the output neurons, and bi is the bias of the i-th hidden neuron.

In the course of learning, the input weights and the hidden biases are first chosen arbitrarily and need not be adjusted at all. Then the smallest norm least-squares solution of Eq. 5 is obtained as follows:

$$ w_{o}=H^{+}T $$
(6)

where H+ is the Moore-Penrose (MP) generalized inverse of matrix H.

It has been shown that ELM achieves the minimum training error and the smallest norm of weights [18, 19], and the smallest norm of weights tends to yield the best generalization performance [18, 19]. Since the solution is obtained analytically and none of the parameters of the SLFN need to be adjusted iteratively, ELM converges much faster than gradient-based algorithms.
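Because training reduces to one random initialization and one least-squares solve, ELM is easy to sketch. The following minimal NumPy version assumes a sigmoid activation and one-hot-encoded targets; the function names and the weight initialization range are illustrative.

```python
import numpy as np

def elm_train(X, T, n_hidden, seed=0):
    """Fit an ELM: random input weights and biases, then w_o = H^+ T (Eq. 6).
    X: (N, n) inputs; T: (N, m) one-hot targets."""
    rng = np.random.default_rng(seed)
    wh = rng.uniform(-1.0, 1.0, (X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-1.0, 1.0, n_hidden)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ wh + b)))              # hidden-layer output matrix
    w_o = np.linalg.pinv(H) @ T                          # Moore-Penrose solution
    return wh, b, w_o

def elm_predict(X, wh, b, w_o):
    """Network output; the predicted class is the argmax over the m columns."""
    H = 1.0 / (1.0 + np.exp(-(X @ wh + b)))
    return H @ w_o
```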

The proposed gene selection method

Gene selection generally consists of two steps: identifying relevant genes, and selecting the smallest possible subsets from the relevant genes. Different from the KMeans-GCSI-MBPSO-ELM [16] and BPSO-GCSI-ELM [15] methods, a scoring criterion following a double filter strategy is proposed in this paper to select highly relevant genes, which may decrease the size of the gene pool dramatically. To select compact gene subsets from the refined gene pool, an improved PSO with new strategies for reinitializing the swarm and updating Pg is proposed.

Since the proposed method combines the scoring criterion with the improved PSO, coupled with ELM, to perform gene selection, it is referred to as the SC-IPSO-ELM method. The rough frame of the proposed method is shown in Fig. 1, and the detailed steps are described as follows.

Fig. 1 The frame of the proposed hybrid gene selection method

Step 1: Form the first-level initial gene pool. The dataset is divided into training and testing datasets. Select 200–400 genes from all original genes by using the information index to classification (IIC) method [16, 20] as follows:

$$ d(g)=\sum_{j=1}^{c}\sum_{k=1,k\neq j}^{c}\left[\frac{1}{2}\frac{|\mu_{g^{j}}-\mu_{g^{k}}|}{\sigma_{g^{j}}+\sigma_{g^{k}}}+\frac{1}{2}\ln\left(\frac{\sigma_{g^{j}}^{2}+\sigma_{g^{k}}^{2}}{2\sigma_{g^{j}}\sigma_{g^{k}}}\right)\right] $$
(7)

where \(\mu_{g^{j}}\) and \(\mu_{g^{k}}\) are the means of the expression values of gene g in the j-th and k-th classes, respectively, \(\sigma_{g^{j}}\) and \(\sigma_{g^{k}}\) are the corresponding standard deviations, and c is the total number of classes. From [16, 20], the higher the value of d(g), the more classification information the gene g contains, and thus the more relevant gene g is to the sample categories. High classification accuracy will be obtained with high probability by a classifier if the microarray data are projected onto a gene whose IIC value d(g) is high. The genes are ranked by their IIC values on the training dataset, and those with higher IIC values are chosen to establish the first-level gene pool.
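As a concrete illustration of Step 1, the IIC value of Eq. 7 can be computed for every gene at once. The sketch below assumes X is a samples-by-genes expression matrix and y a vector of class labels; the small eps guarding against zero standard deviations is an implementation detail not specified in the paper.

```python
import numpy as np

def iic_scores(X, y, eps=1e-12):
    """IIC value d(g) of Eq. 7 for each gene (column of X)."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])       # class means
    sd = np.array([X[y == c].std(axis=0) for c in classes]) + eps  # class std devs
    d = np.zeros(X.shape[1])
    for j in range(len(classes)):
        for k in range(len(classes)):
            if k != j:
                d += 0.5 * np.abs(mu[j] - mu[k]) / (sd[j] + sd[k]) \
                   + 0.5 * np.log((sd[j] ** 2 + sd[k] ** 2) / (2 * sd[j] * sd[k]))
    return d

# First-level pool: e.g. the 300 genes with the highest IIC values
# pool = np.argsort(iic_scores(X_train, y_train))[::-1][:300]
```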

Step 2: Establish the second-level initial gene pool. Randomly generate different gene subsets from the first-level gene pool. Each gene subset’s predictive ability is then evaluated by the 5-fold cross validation (CV) classification accuracy obtained by ELM on the training dataset projected onto the gene subset. When the 5-fold CV classification accuracy is less than a predetermined value (θac), the corresponding gene subset is deleted. Thus, the genes in the remaining gene subsets have comparatively high predictive ability and form the second-level initial gene pool. The number of gene subsets in the second-level gene pool is denoted by lse. Each gene subset is assigned an integer rank (from 1 to lse) according to its 5-fold CV classification accuracy: the higher the classification accuracy, the smaller the rank number of the corresponding gene subset.
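The sketch below is one possible realization of Step 2. Since ELM is not available in standard libraries, a scikit-learn SVC stands in as the 5-fold CV scorer, and the subset count, subset size and threshold defaults are placeholders rather than the paper’s settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC  # stand-in scorer; the paper uses ELM

def second_level_pool(X, y, pool, n_subsets=200, subset_size=20,
                      theta_ac=0.85, seed=0):
    """Keep random subsets of the first-level pool whose 5-fold CV accuracy
    reaches theta_ac, ranked so that rank 1 has the highest accuracy."""
    rng = np.random.default_rng(seed)
    kept = []
    for _ in range(n_subsets):
        subset = rng.choice(pool, size=subset_size, replace=False)
        acc = cross_val_score(SVC(), X[:, subset], y, cv=5).mean()
        if acc >= theta_ac:                 # delete low-accuracy subsets
            kept.append((acc, subset))
    kept.sort(key=lambda t: -t[0])          # rank 1 = best accuracy
    return [s for _, s in kept]             # l_se ranked subsets
```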

Step 3: Establish the third-level initial gene pool by the scoring strategy. The pseudo-code of the scoring rule for the i-th gene in the second-level gene pool is listed as Algorithm 1.

where Rj is the rank number of the j-th gene subset in the second-level gene pool. After the scores of all genes in the second-level gene pool are obtained, they are normalized into the interval [0, 1] with a linear transformation. Obviously, the higher a gene’s score, the more relevant the gene is to the samples’ classes. Those genes with much lower score values are then filtered out, and the remaining genes in the second-level pool form the third-level gene pool.
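Algorithm 1 itself is not reproduced in this version of the text. Purely as a speculative reconstruction consistent with the surrounding description (rank-weighted credit from each subset containing the gene, followed by normalization to [0, 1]), the scoring rule might look as follows; the exact contribution of Rj is an assumption.

```python
import numpy as np

def score_genes(ranked_subsets, pool_genes):
    """Hypothetical rank-weighted scoring: the subset with rank R_j (1 = best)
    contributes (l_se - R_j + 1) to each of its genes; scores are then
    min-max normalized into [0, 1]."""
    l_se = len(ranked_subsets)
    score = {g: 0.0 for g in pool_genes}
    for rank, subset in enumerate(ranked_subsets, start=1):  # rank = R_j
        for g in subset:
            score[g] += l_se - rank + 1
    vals = np.array([score[g] for g in pool_genes], dtype=float)
    vals = (vals - vals.min()) / (vals.max() - vals.min() + 1e-12)
    return dict(zip(pool_genes, vals))
```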

Step 4: Use the improved PSO to select the optimal gene subsets from the third-level initial gene pool. The i-th particle Xi=(xi1,xi2,…,xiD) represents a candidate gene subset, and each element xij is the serial number of a selected gene. The dimension of the particles equals the number of genes to be selected from the third-level initial gene pool, which is predetermined according to [15, 16]. The fitness function of the i-th particle, f(Xi), is the 5-fold cross validation classification accuracy obtained by ELM on the training dataset projected onto the gene subset represented by the particle. The optimization process of the improved PSO is the same as that of the traditional PSO except in the following respects.
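A particle’s fitness evaluation can be sketched directly from this description; as before, a scikit-learn SVC stands in for the ELM scorer, and the third-level pool is assumed to be an array of gene column indices.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC  # stand-in for ELM

def fitness(particle, X, y, third_level_pool):
    """f(X_i): 5-fold CV accuracy on the training data projected onto the
    genes whose serial numbers the particle holds."""
    genes = third_level_pool[np.asarray(particle, dtype=int)]
    return cross_val_score(SVC(), X[:, genes], y, cv=5).mean()
```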

One is the strategy for updating the best position of the swarm. To decrease the probability of premature convergence, the Metropolis criterion of SA [21] is introduced to update the best position of the swarm. In the (i+1)-th optimization generation, the best position of the swarm, pg, is updated by Eq. 8 as follows:

$$ p_{g}(i+1)=\begin{cases} X_{j}, & f(X_{j})-f(p_{g}(i))\geq \varepsilon\\ X_{j}\ \text{with probability}\ P = e^{-\frac{|f(X_{j})-f(p_{g}(i))|}{T(i+1)}}, & |f(X_{j})-f(p_{g}(i))|< \varepsilon \end{cases} $$
(8)

where T(i+1) is the annealing temperature which decreases linearly as the following equation

$$ T(i+1)=T_{0}-\frac{T_{0}-T_{end}}{{It}_{max}}\times (i+1) $$
(9)

In Eq. 9, T0, Tend and Itmax are the initial annealing temperature, the final annealing temperature and the maximum number of optimization generations, respectively.

The other is the strategy of mutating the swarm. When the swarm converges to local minima, the particles are close to each other and the swarm loses its diversity. Mutating the swarm makes the particles repel each other and improves the diversity of the swarm, so the swarm escapes the local minima with high probability. In the improved PSO, the mutation operation is performed if the global best fitness value of the swarm does not change for a predetermined number of generations (Nmu); it randomly selects half of the particles in the swarm and reinitializes them.
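Both modifications fit in a few lines. The sketch below assumes a fitness to be maximized, integer-coded particles drawn from the third-level gene pool, and illustrative defaults for T0, Tend, ε and Nmu.

```python
import numpy as np

def anneal_temperature(i, T0=100.0, T_end=1.0, it_max=20):
    """Linearly decreasing annealing temperature of Eq. 9."""
    return T0 - (T0 - T_end) / it_max * (i + 1)

def update_global_best(p_g, f_g, X_j, f_j, T, eps=1e-3, rng=None):
    """Metropolis-style update of the swarm's best position (Eq. 8)."""
    if rng is None:
        rng = np.random.default_rng()
    if f_j - f_g >= eps:                                  # clear improvement
        return X_j, f_j
    if abs(f_j - f_g) < eps:                              # near tie: accept
        if rng.random() < np.exp(-abs(f_j - f_g) / T):    # with probability P
            return X_j, f_j
    return p_g, f_g                                       # otherwise keep p_g

def mutate_swarm(X, gene_pool, stagnant_generations, N_mu=3, rng=None):
    """Reinitialize a random half of the swarm once the global best fitness
    has not changed for N_mu generations."""
    if rng is None:
        rng = np.random.default_rng()
    if stagnant_generations >= N_mu:
        half = rng.choice(len(X), size=len(X) // 2, replace=False)
        for i in half:
            X[i] = rng.choice(gene_pool, size=X.shape[1], replace=False)
    return X
```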

The SC-IPSO-ELM method first identifies the relevant genes by the randomization method combined with ELM. Then, with the proposed gene scoring criterion, a much more relevant and compact gene pool is obtained. Finally, to obtain the optimal gene subsets, the traditional PSO is modified to improve its global search ability. Although the SC-IPSO-ELM method does not encode prior information into gene selection as the KMeans-GCSI-MBPSO-ELM [16] and BPSO-GCSI-ELM [15] methods do, it can still select the most predictive genes with low redundancy effectively. Moreover, the multiple filter strategies produce a much more compact gene pool than the methods in [15, 16], which decreases the computational cost of the PSO search for the optimal gene subsets. Compared to gene-to-class sensitivity information, the genes’ rank information obtained by the scoring strategy is more robust, so the SC-IPSO-ELM method is less likely to filter out predictive genes than the methods in [15, 16].

The proposed gene selection method contains filtering irrelevant genes to establish the gene pool and using PSO to select functional gene subsets from the gene pool, and its computational complexity can be calculated as follows:

$$ {CC}_{SC\text{-}IPSO\text{-}ELM}=O(N_{TG}\times N_{Train})+O(l\times N_{g1})+O(l_{se}\times N_{g2})+O(N_{PSO}\times {Iter}_{PSO}) $$
(10)

where NTG, NTrain, l, Ng1, lse, Ng2, NPSO and IterPSO are the number of original genes, the number of training samples, the number of gene subsets initially generated at random in Step 2, the size of the first-level gene pool, the number of selected gene subsets in Step 2, the size of the second-level gene pool, the swarm size, and the maximum iteration number of the improved PSO, respectively. The four terms on the right side of Eq. 10 are the computational complexities of Steps 1, 2, 3 and 4 of the proposed method, respectively. The first and fourth terms are the same as those of the methods in [15, 16]. Ng1 and Ng2 are both much smaller than NTG, and generally l and lse are not greater than NTrain. The computational complexity of the SC-IPSO-ELM method can therefore be approximated as the sum of the first and fourth terms on the right side of Eq. 10, similar to the methods in [15, 16], so the time complexity of the proposed method is of the same order of magnitude as that of the methods in [15, 16]. Since the third-level gene pool is established by the multiple filter strategy, its size is small. The small third-level gene pool leads to small NPSO and IterPSO, which may decrease the computational cost of Step 4.

Results

Datasets

To verify the effectiveness and efficiency of the proposed gene selection method, we conduct experiments on six public microarray datasets: the Leukemia, Colon, SRBCT, Brain cancer, Lung and Lymphoma data. A detailed description of the datasets is listed in Table 1.

Table 1 Six microarray datasets

The Leukemia data [22] contains a total of 72 samples in two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), which contain 47 and 25 samples, respectively. Each sample contains 7129 gene expression values. The Leukemia data are available at https://link.springer.com/article/10.1186/1471-2105-7-228#SupplementaryMaterial.

The Brain cancer data contains 60 samples in two classes: 46 patients with classic and 14 patients with desmoplastic brain cancer. The Lymphoma data includes 58 samples, of which 32 patients were cured and 26 were not. Each sample in the Brain cancer and Lymphoma data has 7129 genes. These two datasets are available at http://linus.nci.nih.gov/~brb/DataArchive_New.html.

The Colon data consists of the expression levels of 62 samples, of which 40 are colon cancer samples and the remaining are normal samples. Although expression levels were originally measured for 6000 genes, 4000 of the 6000 genes were removed on account of the low reliability of their measured values. The measured expression values of the remaining 2000 genes are publicly available at http://microarray.princeton.edu/oncology/.

The entire SRBCT data [23] includes the expression data of 2308 genes. There are 63 training samples and 25 testing samples, five of the testing samples being non-SRBCT. The 63 training samples contain 23 Ewing family of tumors (EWS), 20 rhabdomyosarcoma (RMS), 12 neuroblastoma (NB), and 8 Burkitt lymphoma (BL) samples. The 20 testing samples contain 6 EWS, 5 RMS, 6 NB, and 3 BL samples. The data are available at https://link.springer.com/article/10.1186/1471-2105-7-228#SupplementaryMaterial.

The LUNG data [24, 25] contains a total of 203 samples in five classes: adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, with 139, 21, 20, 6 and 17 samples, respectively. Each sample has 12600 genes. The genes with standard deviations smaller than 50 expression units were removed, yielding a dataset with 203 samples and 3312 genes [24, 25]. The data are also available at https://link.springer.com/article/10.1186/1471-2105-7-228#SupplementaryMaterial.

In the experiments on all data, the swarm size is 60, the maximum iteration number is selected as 20, the acceleration constants c1 and c2 are both selected as 1.49445, and the inertial weight varies from 0.9 to 0.4. The size of the third-level gene pool is 40 on all data. The parameter Nmu is fixed as 3 on all data. The values of these parameters are determined by the cross-validation runs on the training datasets and according to [15, 16].

The prediction ability of the selected gene subsets

To verify the prediction ability of the selected gene subsets obtained by the proposed method, ELM is used to perform sample classification with some gene subsets selected by the SC-IPSO-ELM method on the six datasets. Each experiment is conducted 100 times, and the mean classification accuracies are listed in Table 2.

Table 2 The classification accuracy obtained by ELM with different gene subsets selected by the SC-IPSO-ELM method on the six microarray data

From Table 2, with the small gene subsets selected by the proposed approach, ELM obtains 100% 5-fold cross validation and test accuracies on both the Leukemia and SRBCT data. With about five and thirteen genes selected by the SC-IPSO-ELM method on the Brain cancer and Colon data, respectively, ELM obtains high prediction accuracies. These results indicate that the SC-IPSO-ELM method is able to select predictive genes highly related to the samples’ classes.

Biological and functional analysis of the selected gene subsets

The experiment on each microarray data is conducted 500 times, and the top ten frequently selected genes are listed in Tables 3, 4, 5, 6, 7 and 8 for the six microarray data.

Table 3 The top ten frequently selected genes with the SC-IPSO-ELM method on the Leukemia data
Table 4 The top ten frequently selected genes with the SC-IPSO-ELM method on the Brain cancer data
Table 5 The top ten frequently selected genes with the SC-IPSO-ELM method on the Colon data
Table 6 The top ten frequently selected genes with the SC-IPSO-ELM method on the SRBCT data
Table 7 The top ten frequently selected genes with the SC-IPSO-ELM method on the Lung data
Table 8 The top ten frequently selected genes with the SC-IPSO-ELM method on the Lymphoma data

From Tables 3, 4, 5, 6, 7 and 8, many genes selected by the SC-IPSO-ELM method were also selected by one or more of the methods proposed in [15, 16, 22, 23, 26–31]. On the Leukemia data, gene U05259, a B lymphocyte antigen receptor, encodes cell surface proteins for which monoclonal antibodies have been demonstrated to be useful in distinguishing lymphoid from myeloid lineage cells [18]. Gene M63138 is a member of the peptidase C1 family involved in the pathogenesis of breast cancer and possibly Alzheimer’s disease [18]. On the Colon data, a muscle index can be calculated based on the average intensity of 17 ESTs in the array that are homologous to smooth muscle genes, which include gene H20709. Although the SC-IPSO-ELM method does not encode gene-to-class sensitivity (GCS) information extracted from the microarray data, it can still select some genes with comparatively high GCS values that are also selected by the GCSI-based methods. Since the expression levels of all genes in the Brain cancer and Lymphoma data are not distinct between the two classes, different approaches considering different factors may select different discriminative gene subsets. Thus, the genes selected by the SC-IPSO-ELM naturally differ from those selected by other gene selection methods, which is verified by Tables 4 and 8.

Figure 2 shows the heatmap of the top ten frequently selected genes for the six datasets. It can be found that the expression levels of most frequently selected genes clearly differentiate the classes on all data except the Brain cancer and Lymphoma data. From Fig. 2b and e, there is no single gene whose expression levels are distinct between the two classes, which was verified in [15, 16]. Hence, the proposed method is capable of selecting predictive genes whose expression levels are distinct among different classes in most cases.

Fig. 2 The heatmap of expression levels based on the top ten frequently selected genes on the six data

Comparison with the GCSI based gene selection methods

In [15, 16], two effective gene selection methods considering GCS information were proposed. Experimental results on several public microarray data verified that the two methods, KMeans-GCSI-MBPSO-ELM and BPSO-GCSI-ELM, outperformed some PSO-based methods and other classical gene selection methods such as GS2, GS1, Cho’s and F-test. To avoid repeating the comparison with the PSO-based and other classical gene selection methods, the SC-IPSO-ELM method is compared with only the KMeans-GCSI-MBPSO-ELM and BPSO-GCSI-ELM methods on the six datasets, using ELM and support vector machine (SVM) classifiers, and the corresponding results, averaged over 100 trials, are listed in Tables 9 and 10.

Table 9 The 5-fold CV classification accuracies of ELM based on the three gene selection methods on the six microarray data
Table 10 The classification accuracies of SVM based on the three gene selection methods on the six microarray data

From Tables 9 and 10, the SC-IPSO-ELM method selects almost the same number of genes as the two GCSI based methods on the Leukemia, Brain cancer, SRBCT, LUNG and Lymphoma data, while it selects the largest number of genes among the three methods on the Colon data. ELM achieves 100% 5-fold CV accuracy on the Leukemia and SRBCT data with the genes selected by all three methods, and SVM achieves the same 5-fold CV accuracy on the Leukemia data with the genes selected by the three methods. ELM and SVM both obtain the highest 5-fold CV accuracy on the Brain cancer, Colon and Lymphoma data with the genes selected by the SC-IPSO-ELM method; SVM obtains slightly higher 5-fold CV accuracy on the SRBCT data with the SC-IPSO-ELM than with the two GCSI based methods; and SVM obtains the highest 5-fold CV accuracy on the LUNG data with the BPSO-GCSI-ELM. On the whole, the SC-IPSO-ELM can select more predictive gene subsets than the two GCSI based methods.

Discussion on the parameter selection

To establish the second-level gene pool, it is critical to determine the value of the parameter θac. Figure 3 shows the relationship between the classification accuracy on the training data obtained by ELM and the parameter θac. On the Leukemia, Colon, LUNG and Lymphoma data, the 5-fold CV and test accuracies both trend upward as the value of the parameter increases, while they trend downward on the Brain cancer data. On the SRBCT data, the test accuracy decreases as the value of the parameter increases, while the 5-fold CV accuracy increases.

Fig. 3 The parameter θac versus the classification accuracy on the training dataset obtained by ELM

For the improved PSO to select a gene subset, the dimension of the particles is the number of selected genes. Figure 4 shows the effect of different numbers of selected genes. The 5-fold CV accuracy obtained by ELM trends upward as the number of selected genes increases on all datasets except the Colon data, while the curves of the test accuracy obtained by ELM fluctuate as the number of selected genes increases on all six datasets.

Fig. 4 The number of the selected genes versus the classification accuracy on the training dataset obtained by ELM

Figures 3 and 4 provide a guide on how to select the values of the parameter θac and the number of selected genes in the SC-IPSO-ELM. In general, these parameters should be selected empirically for particular applications.

Conclusions

To obtain predictive genes with lower redundancy, a hybrid gene selection method based on a gene scoring strategy and an improved PSO was proposed in this paper. To decrease the computational cost, irrelevant genes are filtered out through different strategies to establish a more compact gene pool for further gene selection. Then, the improved PSO was proposed to select the most predictive gene subsets from the gene pool. Experimental results verified that the proposed method could select highly predictive and compact gene subsets and outperformed other PSO-based and GCSI-based gene selection methods. However, the genes selected by the proposed method lack interpretability. Future work will include encoding prior information into the proposed method for gene selection and applying it to RNA-Seq data analysis.

References

1. Maulik U. Analysis of gene microarray data in a soft computing framework. Appl Soft Comput. 2011; 11:4152–60.

2. Cao HB, Lei SF, Deng HW, Wang YP. Identification of genes for complex diseases using integrated analysis of multiple types of genomic data. PLoS One. 2012; 7(9):e42755.

3. Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell. 1997; 97(1–2):273–324.

4. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23(19):2507–17.

5. Maldonado S, Weber R. A wrapper method for feature selection using support vector machines. Inf Sci. 2009; 179(13):2208–17.

6. Bermejo P, Puerta JM. A GRASP algorithm for fast hybrid (filter-wrapper) feature subset selection in high-dimensional datasets. Pattern Recog. 2011; 32:701–11.

7. Lee CM, Ko CN. Time series prediction using RBF neural networks with a nonlinear time-varying evolution PSO algorithm. Neurocomputing. 2009; 73(1):449–60.

8. Yu JB, Wang SJ, Xi LF. Evolving artificial neural networks using an improved PSO and DPSO. Neurocomputing. 2008; 71(4):1054–60.

9. Kennedy J, Eberhart R. Particle swarm optimization. In: IEEE International Conference on Neural Networks. Perth: IEEE; 1995. p. 1942–8.

10. Shi YH, Eberhart RC. A modified particle swarm optimizer. In: IEEE World Congress on Computational Intelligence. Anchorage: IEEE; 1998. p. 69–73.

11. Saraswathi S, Sundaram S, Sundararajan N, Zimmermann M, Nilsen-Hamilton M. ICGA-PSO-ELM approach for accurate multiclass cancer classification resulting in reduced gene sets in which genes encoding secreted proteins are highly represented. IEEE/ACM Trans Comput Biol Bioinforma. 2011; 8(2):452–63.

12. Yang C, Chuang LY, Ke CH, Yang C. A hybrid feature selection method for microarray classification. Int J Comput Sci. 2008; 35(3):285–90.

13. Shen Q, Shi WM, Kong W, Ye BX. A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification. Talanta. 2007; 71(4):1679–83.

14. Yang S, Han F, Guan J. A hybrid gene selection and classification approach for microarray data based on clustering and PSO. Commun Comput Inf Sci. 2013; 375:88–93.

15. Han F, Yang C, Wu YQ, Zhu JS, Ling QH, Song YQ, Huang DS. A gene selection method for microarray data based on binary PSO encoding gene-to-class sensitivity information. IEEE/ACM Trans Comput Biol Bioinforma. 2017; 14(1):85–96.

16. Han F, Sun W, Ling QH. A novel strategy for gene selection of microarray data based on gene-to-class sensitivity information. PLoS One. 2014; 9(5):e97530.

17. Huang GB, Zhu QY, Siew CK. Extreme learning machine: a new learning scheme of feedforward neural networks. In: IEEE International Joint Conference on Neural Networks. Budapest: IEEE; 2004. p. 985–90.

18. Soria-Olivas E, Gomez-Sanchis J, Martin JD, Vila-Frances J, Martinez M, Magdalena JR, Serrano AJ. BELM: Bayesian extreme learning machine. IEEE Trans Neural Netw. 2011; 22(3):505–9.

19. Han F, Huang DS. Improved extreme learning machine for function approximation by encoding a priori information. Neurocomputing. 2006; 69(16–18):2369–73.

20. Li YX. Feature selection for cancer classification based on support vector machine. J Comput Res Dev. 2005; 42(10):1796–801.

21. Strobl MA, Barker D. On simulated annealing phase transitions in phylogeny reconstruction. Mol Phylogenet Evol. 2016; 101:46–55.

22. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999; 286(5439):531–7.

23. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001; 7(6):673–9.

24. Yang K, Li J, Cai Z. A stable gene selection in microarray data analysis. BMC Bioinformatics. 2006; 7:228.

25. Bhattacharjee A, Staunton J, Richards WG. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci U S A. 2001; 98:13790–5.

26. Tong DL. Hybridising genetic algorithm-neural network (GANN) in marker genes detection. In: International Conference on Machine Learning and Cybernetics. Baoding: IEEE; 2009. p. 1082–7.

27. Lee KE, Sha N, Dougherty ER, Vannucci M, Mallick BK. Gene selection: a Bayesian variable selection approach. Bioinformatics. 2003; 19(1):90–7.

28. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A. 1999; 96(12):6745–50.

29. Huang TM, Kecman V. Gene extraction for cancer diagnosis by support vector machines. In: International Conference on Artificial Neural Networks. Warsaw: Springer; 2005. p. 617–24.

30. Kar S, Sharma KD, Maitra M. Gene selection from microarray gene expression data for classification of cancer subgroups employing PSO and adaptive K-nearest neighborhood technique. Expert Syst Appl. 2015; 42(1):612–27.

31. Chu F, Wang L. Applications of support vector machines to cancer classification with microarray data. Int J Neural Syst. 2005; 15(6):475–84.


Acknowledgements

The authors would like to thank the anonymous reviewers for their time and their valuable comments.

Funding

Publication costs are funded by the National Natural Science Foundation of China [Nos. 61572241 and 61271385], the National Key R&D Program of China [No. 2017YFC0806600], the Foundation of the Peak of Six Talents of Jiangsu Province [No. 2015-DZXX-024], and the Fifth “333 High Level Talented Person Cultivating Project” of Jiangsu Province [No. (2016) III-0845].

Availability of data and materials

Not applicable.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 20 Supplement 8, 2019: Decipher computational analytics in digital health and precision medicine. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-8.

Author information


Contributions

FH proposed the frame and wrote the manuscript. DT and YS conducted the experiments. ZC, JJ and QL designed the experiments. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Fei Han.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Han, F., Tang, D., Sun, YWT. et al. A hybrid gene selection method based on gene scoring strategy and improved particle swarm optimization. BMC Bioinformatics 20 (Suppl 8), 289 (2019). https://doi.org/10.1186/s12859-019-2773-x

