Deep gene selection method to select genes from microarray datasets for cancer classification

Background Microarray datasets consist of complex and high-dimensional samples and genes, and generally the number of samples is much smaller than the number of genes. Due to this data imbalance, gene selection is a demanding task for microarray expression data analysis. Results The gene set selected by DGS has shown its superior performances in cancer classification. DGS has a high capability of reducing the number of genes in the original microarray datasets. The experimental comparisons with other representative and state-of-the-art gene selection methods also showed that DGS achieved the best performance in terms of the number of selected genes, classification accuracy, and computational cost. Conclusions We provide an efficient gene selection algorithm can select relevant genes which are significantly sensitive to the samples’ classes. With the few discriminative genes and less cost time by the proposed algorithm achieved much high prediction accuracy on several public microarray data, which in turn verifies the efficiency and effectiveness of the proposed gene selection method.


Background
Studying the correlation between microarray data and diseases such as cancer plays an important role in biomedical applications [1]. Microarray data contains gene expressions extracted from tissues (samples). We can obtain more information about the disease pathology by comparing the gene expressions of the normal tissues with the ones of the diseased tissues [1]. Exploring the difference between the cancerous gene expression in tumor cells and the gene expression in normal tissues can reveal important information from microarray datasets, based on which a number of classification techniques have been used to classify tissues into cancerous / normal or into types/subtypes [2][3][4][5][6]. However, microarray data generally has its own high dimensionality problem, i.e., usually there are thousands of genes/attributes but a few samples in a dataset. Moreover, most of these attributes are irrelevant to the classification problem. Therefore, reducing the attribute dimensionality and meanwhile ensuring that the selected attributes still contain rich and relevant information could address this data imbalance problem, although it remains a big challenge. In addition, small sample set makes the problem much harder to solve because the Machine Learning (ML) algorithms do not have enough space to learn (training examples) and this will increase the risk of over fitting. Moreover, microarray data is known as of highly complicated because most of the attributes (genes) in microarray data are directly or indirectly correlated with each other [7]. Selecting a small relevant attribute subset can solve many problems related to microarray data [8,9]. By removing irrelevant and redundant attributes, we can reduce the dimensionality of the data, simplify the learning model, speed up the learning process and increase the classification accuracy. Several studies have developed and validated a novel gene expression signature and used it as a biomarker to predict cancer in clinical trials [10,11]. Cancer-associated microarray biomarkers allow less-invasive monitoring and can facilitate patient diagnosis, prognosis, monitoring, and treatment in the oncology field [12,13].
Several gene selection methods have been developed to select the genes that are directly related to the disease diagnosis, prognosis, and therapeutic targets [14]. In addition to statistical methods, recently data mining and machine learning solutions have been widely used in genomic data analysis [9,15]. However, still most of the existing gene selection approaches are suffering from several problems such as the stagnation in local optima and the high computational cost [16][17][18]. Therefore, to solve these problems an efficient new selection approach is needed.
Evolutionary Algorithms (EA) have recently played an important role in gene selection field due to their ability in global search [19]. Besides, many hybrid EA have been proposed to improve the accuracy of the classification methods [20][21][22][23]. Various evolutionary algorithms aim to find an optimal sub-set of features by using bioinspired solutions (such as Genetic Algorithm (GA) [24], Genetic programming (GP) [25], particle swarm optimization (PSO) [26], and Honey Bee [27]). These kinds of algorithms have shown appropriate performances over various problems but are dependent on expert's intervention to obtain the desired performance.
Recently, a new gene selection method called Gene Selection Programming (GSP) [28] was proposed which showed good results in terms of accuracy, the number of selected genes and time cost. However, the problem of search space is still unsolved.
Gene Expression Programming (GEP) [29] is a new evolutionary algorithm, which was widely used for classification and gene selection [30][31][32][33][34][35]. GEP has two merits: flexibility which makes it easy to implement, and the capability of getting the best solution, which is inspired by the ideas of genotype and phenotype. In this paper, we use GEP to construct our algorithm.
The purpose (and contribution) of this paper is to present a simple and thus computational efficient algorithm to solve the problem of attribute selection from microarray gene expression data. To this end we explore how to extract the important features from massive datasets.
The rest of this paper is organized as follows: In Gene Expression Program a brief background of GEP is presented. The proposed gene selection algorithm DGS is presented in Results. Evaluation results and discussions, as well as statistical analysis, are presented in Discussion. Finally, Conclusion gives the conclusions.

Gene expression program
Gene Expression Program (GEP) [36] is an evolution algorithm that creates a computer programing/ model from two parts. The first part, which is also known as genotype, is the characteristic linear chromosomes with a fixed length. Each chromosome consists of one or more genes and each gene consists of a head (h) and a tail (t). The head consists of terminals (attributes) and functions while the tail consists of attributes only, and the head length and tail length follow the rule t = h (n-1) + 1 where n is the maximum number of parameters required in the used functions. The second part is the expression tree (ET) which is also known as phenotype. For example, suppose h = 5 and the chromosome has only one gene. The function set is {+, Q, /} where Q is the square root and the terminals set (the attributes in the data) is coded as {a 0 ,…, a 6 } then an example of chromosome could be. +/a 4 Qa 2 a 1 a 5 a 6 a 3 a 0 a 3 ,(Genotype) where the bold part represents the head and the rest represents the tail. The ET is. (Phenotype) The basic GEP algorithm consists of four steps: creating the chromosomes to initialise the population, evaluating the fitness of each individual/ chromosome by using a predefined fitness function, identifying a suitable stop condition/s and applying the genetic operations to modify the individuals for the next generation. GEP was successfully applied on microarray data to find different biological characteristics [30,37]. More details about GEP algorithm and process can be found in [29,36,38]. The data include various prognosis information, we used lung cancer recurrence information to predict the lung cancer recurrence. To this end, we extracted only the samples with recurrence or free survival (non-recurrence) and delete all the unrelated information such as the dead patients and the disease-free patients. After the preparation the total number of the patients in the dataset was 362. The number of cancer recurrence patients was 205 while the number of free survival patients was 157. The total number of attributes (probe sets) was 22, 283. Regarding the training and testing of the method, we used 10-fold cross-validation method. The 9 folds were used for training DGS while the left fold was used for testing. For more reliability we repeated the experiment ten times and obtained the average results of these experiments.

Materials
To make the evaluations more reliable, we validated the prediction model using another independent dataset with the same statistical measures. The validation dataset from South Korea (GSE8894) can be downloaded from NCBI. GSE8894 dataset had 138 NSCLC samples from Affymetrix Hu133-plus2 platform microarray chips. It had an equal number of samples for two classes, i.e. 69 samples were labelled 'recurrence' and 69 samples were labelled 'nonrecurrence'.
The best setting for the number of chromosome (CH) and the number of genes (N) To find out the best settings for the number of chromosomes in each generation (CH) and the number of genes (N) in each chromosome, we did experiments with different values of CH and N. To show the effect of CH and N on the DGS classification performance, we selected nine different settings. Three different values for CH, 100, 200 and 300, and for each CH value, three different N values are selected: 1, 2 and 3. The values of CH are increased by 100 to make the effect of CH values clear, especially when the effect of increasing CH is very slight. To make the experiments more reliable, we repeated the experiment 10 times and took the average as a final result. The parameters used in DGS, which is based on gene expression programming (GEP) algorithm, are showed in Table 1.
The average experimental results are presented in Table 2. AC avg , I avg , S avg and TM avg represent the average accuracy, the number of iterations, the number of selected attributes and CPU time respectively for ten runs, while AC std , I std , S std . and TM std. represent the standard deviation of the classification accuracy, the number of iterations, the number of selected attributes and CPU time respectively.
We observed from Table 2   (PPV), Negative predictive value (NPV), the number of selected genes (S), and processing time (TM) with confidence intervals (CI 95%).
To make the evaluations more reliable, we compared DGS with five representative models on the integrated lung cancer dataset. These five gene selection algorithms were Correlation-based Feature Selection (CFS), Consistency Subset Feature Selection (CSFS), Wrapper Subset (WS), Support Vector Machine (SVM) which applied using WEKA with their default configurations, and Gene Expression Programming (GEP) using GEP4J package. All the values are the average (avg) values over ten runs of the models. Table 3 gives the performance evaluation values for all the prediction models.
In term of AC, the experimental results showed that the DGS method achieved the highest average accuracy result DGS achieves the smallest number of selected genes (3.9) which is almost half of the number of genes selected by other comparison methods.
Regarding TM, the less processing time was for DGS (218.85) while the average time results of other models were 600.12, 600.02, 600.01, 600.21 and 620.51 for CSF, CSFS, WS, SVM, GEP respectively. Figure 1 shows the effectiveness of DGS method in term of AC, SN, SP, PPV, NPV, S, TM and AUC.
For more reliability, we validated the prediction model using an independent dataset (GSE8894). The selected genes were used as biomarkers to classify the recurrence/ non-recurrence patients. The evaluation results for DGS on the validation dataset in terms of AC, SN, SP, PPV, NPV and AUC are presented in Table 4, which show the effectiveness of the proposed gene selection algorithm DGS that enabled the prediction model to achieve the accuracy of 87.68%. Figure 2 shows that the selected genes are able to separate risk groups (recurrence/non-recurrence) characterized by differences in their gene expressions.

The biological meaning for the selected genes from DGS method
In this section we present the biological meanings of the selected genes obtained from "Expression Atlas" database of EMBL-EBI (http://www.ebi.ac.uk/gxa/). Table 5 shows the genes that were selected by DGS method for the ten runs.
We used the OMIM, Expression Atlas and NCBI websites to find the biological meanings of the selected microarray probe-ids and list their corresponding genes. The specifications are shown in Table 6.

DGS comparison with up-to-date models
We also compared DGS method with models recently proposed, which are IBPSO [39], IG-GA [40], IG-ISSO [41], EPSO [42], mABC [43] and IG-GEP [32]. The comparison results were based on two criteria: the classification accuracy and the number of the selected genes regardless of the methods of data processing.
We used the same datasets that were used by these up-to-date models to compare DGS results. A brief description of these datasets is presented in Table 7.
The comparison results are presented in Table 8. Across the ten datasets used in the comparison, DGS  Table 8.

Discussion
We improve the genetic operations that can improve the generation quality effectively. The experimental results show that the proposed DGS can provide a small set of reliable genes and achieve higher classification accuracies in less processing time. These superior achievements are due to the following DGS features - (see Generation size controlling) 2-The ability to select the related genes. In each generation DGS removes the unrelated genes to increase the probability of choosing related genes for generating 200 chromosomes, and after several generations DGS can finally find the most related genes. Table 5 shows the gene selection process and results. 3-DGS is faster compared with other comparative methods. This feature comes from the DGS's abilities.
The ability of narrowing the search space. The ability of resizing the chromosomes in each iteration Table 9 shows the differences between DGS and the related methods GA and GEP.

Conclusion
In this paper, an innovative DGS algorithm is proposed for selecting informative and relevant genes from microarray data sets to improve cancer classifications. The proposed method inherits the evolutionary process from GEP. DGS has the ability of reducing the size of attribute space iteratively and achieve the optimal solution. We applied this method on an integrated dataset and selected 4 genes which can achieve better classification results.

Proposed method
A novel evolutionary method named Deep Gene Selection (DGS) is presented in this section, which is based on the gene expression programming (GEP) algorithm.   DGS is developed to explore the subset of highly relevant genes. The proposed evolutionary method consists of several steps as depicted in Fig. 3. According to Fig. 3, the attributes/genes are coded as a 0 , ----, a m where m represents the number of attributes in the dataset. T is the size of the terminal set which is used to create a population of chromosomes. In the first-generation T = m. The length of each chromosome (L) is defined based on the dimensionality of the dataset. Furthermore, the minimum length of L could also be defined. Next, the population is evaluated using a fitness function that employs a classifier and the number of the attributes. After being assigned fitness values, all chromosomes of the population are sorted to find the best individuals that have the higher fitness values. Improved genetic operators are then applied to the selected population individuals and accordingly the top individuals (the individuals with the highest fitness values) are selected to generate the next generation. Then a new attribute subset with new T is extracted from these best individuals of the new generation. In other words, the output (new attribute set) of previous generation is the input of the next generation. After several generations, the attribute set will represent the minimum genes that can achieve the highest fitness values, because in each generation only the attributes that can achieve the highest fitness values will be selected. One termination condition of this iteration process is that there is no change in the top fitness values. This means the selected genes are the same (same attribute set) and the classification results are the same. Another termination condition is the number of generations reaches the maximum number although the program cannot reach the ideal solution. The selection operation will stop once one of these two termination conditions is met. The application of this algorithm on real data sets is presented in Materials. It is worth noting that the proposed method is taking the advantages of evaluation algorithms and dynamic attribute extraction to reach the optimal solution in a very simple and effective way.
Overall, the proposed method focuses on searching for superior solutions with the smallest number of attributes by using the evolutionary structures to evaluate the best solution and using the dynamic attribute extraction approach to narrow the search space. With the progress of iteration, the cost of search will decrease, and the quality of the solution will increase until the optimal solution (or the solution close to the optimal one) in the smallest space is achieved. DGS was implemented using Java. To implement the expression tree (ET), we used GEP4J package [54]. The DGS flowchart is presented in Fig. 3.
The detailed descriptions of the proposed method, including chromosome representation, initial DGS population, DGS fitness function and improved genetic operations, are presented in the following sub-sections.

DGS population generation
DGS population is the base of the proposed method. The chromosome concept and representation of DGS population are inherited from gene expression programming (GEP) algorithm (see section 2.2). The chromosomes are constructed from two sets: terminal set (ts) and function set (fs). The function set can be a set of any mathematic operators such as {−, +, /, *, sqr, log}. Terminal set in this paper represents the attribute set of the microarray dataset. The first generation is generated from all attributes in the microarray dataset. Each individual (chromosome) of the generation is evaluated by the fitness function and assigned a fitness value. All the individuals are then sorted in a descending order from the highest individuals (the individual with the highest fitness value) to the lowest individual. Then the attributes of the first 50% individuals are extracted to generate a new terminal set (ts) for generating the next generation. This means the attribute output of an iteration will be the input of the next iteration for generating a new generation. This iterative population generation process will continue until one of the program termination conditions is met. In this way, DGS is able to reduce the dimension of the attribute search space by extracting the attributes that can achieve the high fitness values.
The details of this population generation process are outlined in Algorithm.1.
The following simulation example illustrates the generation of a DGS population.

Example 1
If we have a dataset that has13 attributes, then. ts = {a 1 , a 2 , a 3 , a 4 , a 5 , a 6 , a 7 , a 8 , a 9, a 10 , a 11    10 DLBCL 77 5469 2 [53] individuals/chromosomes, as well as their fitness values, is listed below: Take chromosome 0 as an example to show how to calculate the fitness function. +,-,a12 is the head, and a9,a3,a11 , a7 is the tail of chromosome 0.
The Phenotype/ET of chromosome 0 is.
DGS will use the gene expression of a 12 , a 9 , a 3 genes to calculate the fitness.
DGS sorts the individuals in a descending order based on their fitness values, then selects the top 50% individuals from them (the highlighted individuals in the above example). DGS then extracts the attributes from these selected individuals to form a new terminal set which is {a3, a4, a5, a6, a7, a8, a9, a11, a12}.  DGS will use this new terminal set which is smaller than the original one and the function set to generate a new population. This process will continue until the program reaches the best solution (e.g., Accuracy = 100%) with no changes to the consecutive terminal sets, or the program reaches the maximum number of generations.

Generation size controlling
The generation size is determined by three values: the number of individuals/ chromosomes (CH) in a generation, the length of each chromosome (L) and the size of the terminal set (T). The generation size must be properly defined. If the size is too big, it will lead to the increment of the computational time, and if it's too small, the generation may not cover all attributes /terminals. In the original evolution algorithms, the number of chromosomes in each generation (i.e., the generation size) is fixed, so the other values that are suitable for the first generation, are also suitable for all other generations. However, in our method, the first generation is generated from all attributes, and the number of attributes may be thousands in the big datasets. The attributes used for generating the second generation are a subset of the attributes of the first generation as we see in example 1. Usually, the number of attributes used for generating a generation is dynamic, i.e. it decreases or non-decreases with the progress of the evolution program. Therefore, the values of CH and L that are suitable for a generation may not be suitable for other generations. To ensure the generation size is properly defined, we define the following rule in Eq. (1) for these three values.
Actually L*CH is the overall size of a generation in terms attributes and functions. The constant 2 in Eq. (1) is to ensure that each attribute in the terminal set has nearly a double chance to be selected to generate a generation.
Our previous experiments [32] showed that the value of L has more impact on classification results and computational time than CH. So usually we use a fixed CH value (200) for all generations and changeable values for L.
In fact, let N be the number of genes of a chromosome/individual, then where h is the length of gene head and t is the length of gene tail, and where n represents the maximum number of parameters needed in the function set.
From our experiments, we found that N = 2 can provide the best classification results from microarray data sets. If we choose N = 2, then Usually n = 2 for commonly used functions, therefore h can be defined as the integer number of (T/CH-1)/n, i.e.
On the other hand, it is necessary to set a minimum value of h (h = 3 which is a commonly used value) to guarantee the genes of a chromosome contain enough information for evolution.
Based on the above rules and the minimum requirement, we can define the head size (h) of each gene in a chromosome as: Since CH is fixed (e,g. 200) and the number of genes in a chromosome is set as 2, once the value of h is defined according to (3), the overall size of a generation is defined. The following simulation example shows different h values with different sizes (T) of terminal set.

Example 2
If a microarray dataset originally has 2200 attributes and we set CH = 150, the values of h and T are listed in Table 10.

Fitness function
The purpose of using gene selection methods is to obtain a smallest gene subset that can provide the best classification results. To this end, a new fitness function is proposed to enable DGS to select the best individuals/ chromosomes. The fitness value of an individual i can be calculated by the following equation This fitness function consists of two parts. The first part is based on the classification accuracy AC(i) of the individual i. We use support vector machine (SVM) as a classification method to calculate the accuracy of an individual/chromosome because it is a powerful classification algorithm which is widely used to solve the binary and multi-classification problems [55,56] and can achieve a high classification accuracy. To calculate the AC, we use the following Eq. (5), which is widely used in cancer classification.
where TP, TN, FP and FN represent True Positive, True Negative, False Positive and False Negative respectively. The second part is based on the number of selected genes, specifically t is the total number of attributes in the terminal set and s i is the selected number of attributes in the individual/chromosome i, rϵ [0,0.5) is a predefined weight controlling the importance of AC(i) and s i . Table 10 The results of example 2   Generation  T  h  Generation  T  h   1  2200  7  11  650  3   2  2000  6  12  402  3   3  1852  6  13  254  3   4  1723  5  14  102  3   5  1583  5  15  79  3   6  1296  4

Improved genetic operations and DGS algorithm
The reason of using genetic operations is to improve the individuals for achieving the optimal solution. In this paper, we improve two genetic operations: Mutation and Recombination. The improved genetic operations depend more on the weight of genes, as we explain below.

Attribute weight
The weight (w) of each attribute (i) is calculated based on Eq. (6) where sum ¼ X i k i i∈ts, k i is the rank value of the attribute i, and X i w i ¼ 1 .
In this study we used Gain Ratio to calculate the rank of the individual i as follow: The details of calculating the information gain and the intrinsic information can be found in [57][58][59].
The attributes with a higher weight contain more information for classification.

Mutation
Mutation is an important genetic operator which can significantly affect the individual's development. It marks a minor variation in the genomes by exchanging one component with another. In evolution algorithms, the changes made by mutation might bring substantial differences to chromosomes. For example, a mutation might make a chromosome better in terms of fitness, or the important attributes might be lost due to a random mutation which could result in the decreasing of accuracy and the increasing of processing time.
The critical question is which attribute/terminal should be added or deleted when performing a mutation. Ideally, a weak terminal deleted by the mutation operation should be replaced by a strong one. This can be achieved by using the following improved mutation operation.
To clarify the DGS mutation operation, we provide a simple example shown in Fig. 4. In the example, the chromosome consists of a single gene (− / a6 a2 a0 a9 a7). The gene head size (h) is 3. The function set is {Q, +, −, *, /} which means n = 2. According to Eq. (2), the gene tail size (t) is 4 and the chromosome length is (3 + 4) =7.
All the terminals in the database are weighed once at the beginning of the program and sorted in a descending order based on their weights as shown at the top of Fig. 4. In this example a 3 has the highest weight while a 8 has the lowest weight. Terminal a 6 is identified by the DGS mutation as the weakest terminal as it has the lowest weight among all terminals in the example chromosome.
For this weak terminal a 6 , DGS mutation has two options to replace it: either it is replaced by a function such as (+) or by a terminal. In the latter option, the replacing terminal should have a weight higher than that of a 6 . In this example terminal a 7 is selected as a replacing terminal. With the stronger terminals/ attributes after mutation, the new chromosome might achieve a higher fitness value than the previous one. The details of this mutation operator are outlined in Algorithm 2.

Recombination
The second genetic operation we used in this proposed method is the recombination operation.
Generally, in the recombination operation pairs of chromosomes (parents) are randomly selected and combined to generate new pair. To generate the new chromosomes, the parents will exchange one or more parts (short sequences) with each other. The exchanging part can also be the entire gene from one parent with the equivalent gene from the other parent.
In this study, we replace the random exchange process with a new controlling process. To clarify DGS recombination process we use the example in Fig. 5. DGS program records all the fitness functions for all the chromosomes. The program selects two chromosomes. In this example, the fitness value of chromosome1 is 80% and the fitness value of chromosome2 is 70%. DGS recombination gene operation selects the "strong" gene (gene with the highest weight summation ∑w i ) from the chromosome that has a lower fitness value (lc) and exchanges it with the "weak" gene (gene with the lowest weight summation) from another chromosome that has a higher fitness value (hc). The process is repeated until the program obtain a new chromosome (hc') with a higher fitness value than both parents (the original chromosomes). This idea comes from the gene structure [60].
Based on the above improvements and innovations, the deep gene selectin (DGS) algorithm is presented as pseudocode in Algorithm 3 below.