Discovering biomarkers from gene expression data for predicting cancer subgroups using neural networks and relational fuzzy clustering

Background The four heterogeneous childhood cancers, neuroblastoma, non-Hodgkin lymphoma, rhabdomyosarcoma, and Ewing sarcoma present a similar histology of small round blue cell tumor (SRBCT) and thus often leads to misdiagnosis. Identification of biomarkers for distinguishing these cancers is a well studied problem. Existing methods typically evaluate each gene separately and do not take into account the nonlinear interaction between genes and the tools that are used to design the diagnostic prediction system. Consequently, more genes are usually identified as necessary for prediction. We propose a general scheme for finding a small set of biomarkers to design a diagnostic system for accurate classification of the cancer subgroups. We use multilayer networks with online gene selection ability and relational fuzzy clustering to identify a small set of biomarkers for accurate classification of the training and blind test cases of a well studied data set. Results Our method discerned just seven biomarkers that precisely categorized the four subgroups of cancer both in training and blind samples. For the same problem, others suggested 19–94 genes. These seven biomarkers include three novel genes (NAB2, LSP1 and EHD1 – not identified by others) with distinct class-specific signatures and important role in cancer biology, including cellular proliferation, transendothelial migration and trafficking of MHC class antigens. Interestingly, NAB2 is downregulated in other tumors including Non-Hodgkin lymphoma and Neuroblastoma but we observed moderate to high upregulation in a few cases of Ewing sarcoma and Rabhdomyosarcoma, suggesting that NAB2 might be mutated in these tumors. These genes can discover the subgroups correctly with unsupervised learning, can differentiate non-SRBCT samples and they perform equally well with other machine learning tools including support vector machines. These biomarkers lead to four simple human interpretable rules for the diagnostic task. Conclusion Although the proposed method is tested on a SRBCT data set, it is quite general and can be applied to other cancer data sets. Our scheme takes into account the interaction between genes as well as that between genes and the tool and thus is able find a very small set and can discover novel genes. Our findings suggest the possibility of developing specialized microarray chips or use of real-time qPCR assays or antibody based methods such as ELISA and western blot analysis for an easy and low cost diagnosis of the subgroups.


Background
Gene expression data are ever increasingly being used in the fields of medicine including categorization of cancers into different diagnostic subgroups, which often appear similar in routine histology [1][2][3]. We focus our attention on one such important problem, proper classification of various groups of childhood cancers, collectively known as small round blue cell tumors (SRBCTs). The worldwide incidence rates of childhood cancers vary widely from as high as 155 per million persons in Nigeria to 40 per million persons in the Indian population of Fiji. The incidence rate of childhood cancer in the USA is approximately 125 per million persons [4,5]. Amongst the various childhood cancers, SRBCT is the third most frequently occurring type (18%) and consists of neuroblastoma (NB, 7%), non-Hodgkin lymphoma (NHL, 6%), rhabdomyosarcoma (RMS, 3%), and Ewing sarcoma (EWS, 2%) [4,5]. These heterogeneous types of cancer present a similar histology of small blue tumor cell and thus often leads to misdiagnosis. Accurate diagnosis of these subgroups is important as the treatment options, monitoring the response and prognosis may vary widely between these subgroups. Development of cancer is a complex reorganization and remodeling of multiple cellular pathways affecting thousands of genes. Thus the identification of global gene expression signatures rather than depending on a particular gene marker for a specific type of cancer may be ideally suited in categorizing various subgroups of SRBCTs. Our objective here is to identify a small set of biomarkers to design useful diagnostic prediction systems for accurate classification of the four categories of SRBCTs.
Typically gene expression data suffer from three problems : (i) limited number of available examples, (ii) very high dimensional nature of the data, and (iii) noisy characteristics of the data. Moreover, for a given problem, usually a few genes are required. So we face two challenging tasks: finding a minimal set of genes that has an adequate discriminating power to categorize the subgroups and designing of a prediction system using the selected genes to accurately classify unseen examples. Use of a minimal number of genes is consistent with the principal of minimum description length (MDL). Systems designed keeping in mind MDL are likely to yield better generalization (less number of free variables, and hence less likely to result in poor generalization). Moreover, with a small set of genes, it is easy to optimize the diagnostic system better. Model selection can also be done using Akaike Information Criterion (AIC). Several attempts have been made to solve these problems. The different machine learning tools used by researchers include multilayer perceptron (MLP) networks [3], Self-organizing maps [1], nearest centroid classifiers [2], support vector machines (SVMs) [6,7]. Similarly, many gene selection methods have also been used. The gene selection methods often ignore the learning machine used to design the prediction system. Some methods, although, take into account the learning machines, they remove one (or a set of features) at a time in a stepwise manner. Such a method cannot capture the subtle nonlinear interactions between different genes and consequently, one ends up with more genes than what is needed to solve the problem.
To overcome this problem, we use a neural network, which can pick up what is needed (select the required genes) when it learns the diagnostic classification task. This helps the network to honor possible nonlinear interactions between genes. The set of selected genes is further reduced with clustering of correlated genes based on fuzzy sets theory. We apply our method on the same SRBCT data set used by Khan et al. [3] and other researchers [2,6,7]. This data set consists of expression levels of 2,308 genes, which were obtained from glass-slide cDNA microarrays, prepared in accordance with the standard protocol of the National Human Genome Research Institute. We have identified only seven (7) genes that can do the diagnostic classification task with 100% accuracy both on the training data and the blind test data. This set of genes includes three novel genes, NAB2, EHD1, and LSP1, that are not identified by others as important. Our method clearly outperforms other results because for the same data set, the number of marker genes reported by other researchers vary between 19 and 96. Moreover, we have demonstrated that these seven genes perform equally well with other machine learning tools like radial basis function network and support vector machines.

Data Set
Khan et al. [3] considered the 2308 genes that passed a filter requiring a minimum expression level and all other processing was done on this set of 2308 genes. We also use the same set of 2308 genes. There are 88 samples of which 63 (EWS: 23 Table 1). The data set is available at [8].

Neural Networks with online Gene Selection Ability Select Good Biomarkers
We have used a modified multilayered perceptron (MLP) network [9] with online feature selection capability for identification of biomarkers. We call it Feature Selection MLP (FSMLP). Conceptually, each input node (hence each gene) of the FSMLP has a gate associated with it. At the beginning of training, these gates are kept almost closed, and the learning algorithm opens the required gates (allows genes to enter the network) depending on the ability of genes to reduce the training error. This is our first stage. In this stage, we have selected only twenty genes based on the importance of genes as defined by the gate opening values (see Materials and Methods). For the FSMLP network to reduce the chances of bad generalization, we have used just one hidden layer with 150 nodes. The set of selected twenty genes are listed in Table 2. These twenty genes have enough characteristic signatures to discriminate the four categories of tumors with 100% accuracy both in the training and test data sets. With these twenty genes, we have trained neural networks 20 times with different initializations, and in each case the system is found to achieve 100% accuracy on the training data as well as on the twenty blind test data.

Relational Fuzzy Clustering and Neural Networks Together Select Only Seven Biomarkers
In order to further reduce the number of biomarkers, we have proceeded as follows: Again we have used the FSMLP to select the best ten marker genes from among the selected twenty genes. We have ensured with repeated trials that these ten biomarkers {CDH2, FGFR4, EHD1, LSP1, FVT1, FCGRT, NAB2, AF1Q, PMS2L12, HCLS1} can do the intended job of classifying both training and blind test data with 100% accuracy. These ten genes are marked with asterisk in Table 2. We have then used the non-Euclidean relational fuzzy c-means (NERFCM) clustering algorithm [10] to cluster the twenty selected genes (not the samples). We have not used Euclidean distance to generate the dissimilarity relation for NERFCM because our objective is to eliminate positively correlated genes, if any, in the selected genes. Note that, two highly correlated genes may have a higher distance than two uncorrelated genes. The dissimilarity relation, R, to be used for clustering is This partition is found to be very consistent between different runs of the algorithm. We have also experimented with five clusters. Typically, when NERFCM is used to find 5 clusters, the fourth and sixth clusters listed above are merged together. The following analysis reveals that with 6 clusters, the selection of genes becomes easy. For the first cluster the gene with Image ID 450152 is not in the list of selected ten and the gene PMS2L12 has the least gate opening in the list of ten genes, so we have removed both from the list.  We used a network with 2308 input nodes, 150 hidden nodes and 4 output nodes. These twenty genes are selected based on the gate opening values. We made several runs of ordinary MLP using these twenty genes and in each case the network was able to correctly classify all training and blind test examples. In the second stage, the FSMLP network is used to select ten best genes from amongst the twenty genes selected in the first stage. These ten genes are marked by asterisks. This set of ten genes has adequate cancer specific signatures to categorize the four types of SRBCTs.

The identified Biomarkers are Important in Cancer Biology
The set of seven genes that our system identified is involved in the biological process of cancer. For example, this set of seven genes includes an interesting gene NAB2 (EGR1 binding protein 2) which neither Khan et al. [3] and nor Fu & Fu-Liu [7] had found important. Typically, this gene is downregulated in various tumors. NAB2 is a corepressor of EGR (early growth response gene) and its expression depends on tumor types and stages. For example, NAB2 is often downregulated in prostate cancer but upregulated in malignant melanoma [11,12]. In our analysis we observed that NAB2 is moderate to highly upregulated in EWS and in a few cases of RMS; while for the NHL and NB cases it is practically absent (see Figure 1D). Interestingly, a search in GEO profiles (Gene Expression Omnibus, NCBI) also showed a moderate expression of NAB2 in few cases of Ewing sarcoma (GPL1977 1934). Thus, not only the involvement of NAB2 in tumorigenesis but also its distinct signature in various types of tumors are singled out by our analysis.
EH domain containing 1 (EHD1) is another novel gene in our list of biomarkers which others could not find important [3,7]. EHD1 protein is involved in endocytosis and trafficking of various membrane protein including MHC class proteins, insulin-growth factors and secretion of glucose transporter 4 (GLTU-4) [13]. Although EHD1 has not been studied in the context of cancer biology, it seems to be highly expressed in metastatic colon cancer than in tumor of the colon as per the GEO profiles (GEO: GPL96 208112). However, in EWS (GEO: GPL1977 1465), breast tumor (GEO: GPL180 3948) and in B-cell lymphoma (GEO: GPL176 5453) the gene expression is downregulated but slightly upregulated in RMS (GEO: GPL1977 1465). We observed that EHD1 upregulated in Non-Hodgkin Lymphoma (NHL) and in a few cases of RMS; while in a majority of RMS and EWS cases it is moderately expressed.
CDH2 belongs to the family of cell-cell adhesion molecules and mostly their reduced expression leads to tumor invasiveness [14]. Loss of or impaired cell adhesion are important determinants in epithelial neoplasia [15]. In pancreatic cancer CDH2 expression is silenced [16]. We observed that CDH2 expression is practically absent in the EWS and NHL groups of tumors, while for the NB class its expression varied from moderate to very high levels. For a few RMS cases also it is found to be moderately expressed. This might be an indicator that plausibly CDH2 had either acquired mutation or protein truncation in RMS. Also a search in the GEO profiles showed that CDH2 is largely downregulated in EWS (GPL1977 7918). Therefore, it seems that CDH2 expression is tumor specific. And in case of SRBCT family of tumors CDH2 provides us a distinctive signature for categorizing the various tumor classes.
A fourth relevant gene inferred by our system to have class-specific signature is fibroblast growth factor receptor 4 (FGFR4). This tyrosine kinase receptor binds to fibroblast growth factor, a mitogenic ligand, and carry out the signal transduction to the intracellular environment in cellular proliferation, differentiation and migration [17]. In normal tissues, FGFR4 expression is hardly detectable. However, overexpression of FGFR4 has been shown in various cancers, including pituitary, prostate, thyroid [18][19][20]. In these cases either mutation in FGFR4 (Gly338Arg) or truncation in its protein was involved resulting in deregulated FGFR4 mediated signaling. However, in lung adenoarcinoma, FGFR4 is downregulated [21]. Curation from GEO profiles of NCBI also supports our observation that shows downregulation of FGFR4 in EWS but moderate expression in RMS (GPL1977 11439). We observed that for the RMS group, it is significantly upregulated but for NB, EWS and NHL groups the FGFR4 expression is practically absent revealing a remarkable RMS-specific signature.
The gene LSP1 is involved in transendothelial migration of neutrophil and actin cytoskeleton organization through MEK1 and ERK2 pathways [22,23]. We observed downregulation of LSP1 in EWS, NHL and NB but reasonably higher expression in the RMS group of tumors. Interestingly, in diffuse large B-cell lymphoma (DLBL) LSP1 is The NERFCM algorithm is used to find 6 clusters in the 20 genes selected in the first stage of the scheme. The partition obtained by the relational clustering and the results of stage two are combined to select just 7 marker Genes.
also very much downregulated (GEO: GPL176 13667). However, LSP1 expression has been found to increase in NHL class of B-cell lymphoma [24]. Thus, LSP1 regulation is tumor class specific and useful for subcategorization of tumors.
The ALL1-fused gene from chromosome 1q (AF1Q) is known to play important roles in leukemia. In particular, it was detected as a mixed-lineage leukemia (MLL) fusion partner from infant acute myelomonocytic leukemia carrying the t(1;11)(q21;q23) translocation [25]. This MLL fusion partner also plays an important role in acute myeloid leukemia (AML). The expression level of AF1Q is shown to be correlated with the clinical outcome in pediatric patients with AML. The elevated expression of AF1Q is found to be an independent adverse prognostic factor in pediatric AML [26]. Further AF1Q is found to be expressed in high metastatic potential breast cancer cells rather than low metastatic potential breast cancer cells and overexpression of AF1Q in the later cell renders it highly metastatic [27]. Thus AF1Q plays important roles in different types of cancer. We found that AF1Q is very much downregulated for the NHL, EWS and RMS groups of tumors and is moderate to highly upregulated for the neuroblastoma group.
The gene FVT1 also an important one with cancer specific characteristic. Its expression is practically absent for RMS, NB and NHL groups of tumors, while for the EWS group it is highly expressed signifying a very distinct tumor-specific signature.

The Seven Biomarkers Can Detect Non-SRBCT samples
In the original SRBCT data set [3] there were five non-SRBCT samples : two normal muscle tissues, three cell lines consisting of an undifferentiated sarcoma, an osteosarcoma, and a prostate carcinoma. It is surprising to note that for these five non-SRBCT examples almost all of the seven genes are downregulated (see Figure 2). Although, we did not use any non-SRBCT examples to decide on the biomarkers, our genes can detect non-SRBCT examples, at least for four samples in Figure 2.   Figure 2 shows that for the test case 5 (sarcoma) only NAB2 is highly expressed. This example may be confused as Ewing Sarcoma. For the test case 3 (osteosarcoma), EHD1 is upregulated along with moderately expressed NAB2. This is not typical of the SRBCT classes. The upregulation of EHD1 and NAB2 clearly reveals the cancer-specific characteristic signatures of the identified genes. This is also revealed by Figure 3, which displays the expression levels of different genes as an image for the five outliers. Figure 4 has four panels, each displaying the expression values of the training samples of a particular class. Comparison of these four panels reveals four simple approximate diagnostic rules, one for each of the four SRBCT groups:

Simpler techniques may be used for easy diagnosis of SRBCTs with visual inspection
EWS : Moderate to high upregulation of NAB2 or FVT1 and downregulation of other five genes.
NHL : Very high upregulation of EHD1 and downregulation of other six genes.
NB : Moderate to high upregulation of AF1Q and CDH2 and downregulation of FGFR4 and FVT1.
RMS : Upregulation of FGFR 4 or LSP1 and downregulation of the FVT1.
Note that, in these (approximate) rules, the upregulation of a particular gene is associated with only one rule or group. This mutual exclusion further suggests that the set of identified biomarkers are essential. Figure 4 and the above rules bespeak the possibility of developing simpler and low cost methods for an easy diagnosis of the subgroups. For example, based on antibodies of the respective protein products of the genes we can use western blot analysis or Enzyme-Linked Immu-noadSorbent Assay (ELISA) to classify the SRBCT samples. One can also use real-time qPCR assays.
Another possibility would be to design specialized microarray chips. For example, one may design microarray chips only for these seven genes with replication. More specifically, chips with 7 × T probes, where each row will represent T probes for only one of the seven genes. Thus, the expression level of each gene will be replicated T times in a row. This replication will make the visual assessment easy and will help to eliminate the effect of noise that is typically encountered in gene expression values. We think that (T =) 6-7 times replication of each probe would be enough for visual inspection because human eyes can easily perceive the contrast between lines with thickness of 6-7 pixels.

The identified Biomarkers are Universal in Nature (Essential and Sufficient)
We want to emphasize that typically the discriminating ability of a feature or gene should be evaluated keeping in mind the machine learning tool that will be used to design the diagnostic system because the best set of genes for a neural network may not necessarily be the best for a decision tree or for nearest centroids classifiers. On the other hand, a set of good biomarkers should be able to do a good job of prediction using different machine learning tools. Thus, if the selected set is essential and sufficient then it should have "universal" characteristics and hence should be able to do a good job with different tools. To assess this universal character, we evaluated the seven genes with RBF net [28], support vector machines (SVMs) [29], and the nearest centroid classifier. The RBF net with 12 Scatterplot of the 7 genes for the 5 nonSRBCT samples  Support vector machines finds a separating hyperplane between two classes either in the original input space or in some high dimensional projected feature space. For this four-class problem, we used one-vs-one (OVA) strategy which trains one classifier (SVM) for each pair of classes. Then voting is used to decide the class label of a data point. Use of Gaussian (RBF) kernels with a very low spread for all SVMs results in zero error both on the training data as well as on the unseen test data. It is worth mentioning here that Fu and Fu-Liu [7] required nineteen (19) genes selected using and for SVMs to achieve zero training and test errors. This reconfirms the usefulness and universal characteristics of the identified biomarkers. Note that, we use the word "essential and sufficient" with respect to a set as a whole and hence there could be other such sets.
Like any other data driven approach we assume that the training data set is representative of the four categories.
We have also analyzed the seven genes using unsupervised learning. In particular, we used the single linkage clustering algorithm to cluster the relation (1-R), where R is the Pearson's correlation matrix between the samples, each sample is treated as a sequence and 1 is a 63 × 63 matrix with each entry equal to unity (1). The four natural groups (clusters) found by single linkage algorithm match exactly with the four classes of the SRBCT (see figure 5).

Discussion
We have identified a set of just seven genes using an online feature selection method based on neural networks and designed a diagnostic system that can classify both the training and unseen test examples with 100% accuracy. To establish that the selected marker genes indeed constitute a minimal set, we made 20 runs of for each possible subset of six genes from the set of selected genes. The networks are trained with the 63 training data points and tested with the same blind set of 20 examples. For each subset of six genes we report the maximum, minimum and the average number of misclassifications on the training as well as on the blind test data in Table 5.
Of the 140 trials, only in one run the error on the blind test data was zero. The average test error is quite high, the minimum value of the average test error rate is more than 20%. Similarly, for the training data also the minimum value of the average training error is more than 10%. This suggests that if we remove any other gene from this set, we shall not be able to distinguish between the four categories of tumors.
To find how confident our network is in making decisions, we analyzed the output of a typical run of network for the blind test samples. For each example, each output node of the network produces a value between 0 and 1. Except for two blind test examples, the network produced almost crisp output (full confidence of 1 only for the correct class and for other classes it is practically zero, see Table 1. For test case 16, although the output was not crisp but the strength for the correct class is about 12 times more than that of its closest competitor, and hence this is a prediction with a high confidence too. The test example 5, which belongs to RMS, although is correctly classified by the MLP, it suggests a good resemblance with the EWS tumor group (The network outputs for the two classes are very close, 0.151 for EWS and 0.157 for RMS). This signals that this case should be looked at more carefully, probably we should look for further information. Due to nonavailability of the patient's identity, we could not make a follow up study on it. This test case is also one of the two cases that are confused by the nearest centroids classifier.
Pseudo color image of the 7 genes for the 5 nonSRBCT sam-ples Figure 3 Pseudo color image of the 7 genes for the 5 nonSR-BCT samples. The dark blue to dark red colors correspond to low to high expression values. For each of test cases 5 and 3 the expression level of only one gene is high, for all other genes and cases the expression levels are quite low.
A RBF network with just five Gaussian nodes can yield zero training error and can classify correctly all but this example in the blind test set.
In order to make a fair comparison of performance of our method with other methods in the literature, we have used the same training-test partition as used by others and like others achieved 100% training and test accuracies. It may not always be possible to achieve such a high accuracy with classification of other types of cancer or even with new data sets for the same problem. In general, to avoid the dependency of the classifier on the particular training data set used (in other words, to reduce the variance part of the classification error), one should use mul- Ensemble classification methods such as bagging [30], boosting [31,32] help us to reduce the classification error by reducing the variance.

Pseudo color image of the training data for the 4 SRBCT classes
Khan et al. [3] used PCA and then used a single layer feedforward network. They analyzed the sensitivity of the network output with respect to changes in the expression level (input) and used this information to rank the genes.
Dendrogram of the 63 samples in the training data set Figure 5 Dendrogram of the 63 samples in the training data set. The dendogram is obtained by the single linkage clustering algorithm. We have used the Pearson's correlation matrix computed using only the selected seven genes. If we cut the dendrogram for 4 clusters, training data from different groups are found to form separate clusters as shown by different colors.
They used 96 top ranked genes (see Table 4) because the training error was reduced to zero with 96 genes. We think, one of the reasons their method ended up having so many genes is that it used a single layer network that cannot capture nonlinear boundaries. The other reason may be that their gene selection method looked at one gene at a time and hence could not exploit possible subtle nonlinear interactions between genes. We have demonstrated that the same task can be done using the same tool with just seven marker genes. These seven genes are equally effective with other machine learning tools also. Of these seven genes only four are included in the list identified by Khan et al.
Tibshirani et al. [2] used a nearest centroid method with shrunken centroids. The nearest shrunken centroids method identifies subsets of genes that best characterize each class. This method shrinks the centroid of each class towards the overall centroid using the within-class standard deviation of each gene. A higher importance is given to a gene whose expression is stable within samples from the same class. As a result, many genes are effectively eliminated from the centroids. The shrunken centroid method yielded 43 genes that include only four of the genes identified by us.
Ramaswamy et al. [6] used support vector machine (SVM) for gene selection and multi-class cancer diagnosis.
Recently, Fu and Fu-Liu [7] applied the method of Ramaswamy et al. on the SRBCT data set and could find a set of 8 genes that can yield zero training error but 90% accuracy on the blind test data.
Then Fu and Fu-Liu [7] proposed a modified method based on SVM for gene selection and classification. This method is also iterative in nature that finds the least important feature, eliminates it and reevaluates the rest. Using the SRBCT data set, they came up with a set of nineteen (19) genes that can achieve 100% accuracy both on the training data and on the unseen test data samples. This list of 19 genes includes four of the marker genes identified by our method.
Of the seven genes that we found, only four (FVT1, CDH2, FGFR4, AF1Q) are included in the set of genes identified by Kahn et al. [3], Tibshirani et al. [2] and Fu and Fu-Liu [7]. The role of these four genes in cancer has been well demonstrated [3,15,16,[18][19][20]25]. Expressions of the remaining three genes (NAB2, EHD1 and LSP1) are either upregulated or downregulated in the various tumors and also in SRBCTs depending on the tumor subgroups [11,12,24]. It is worthwhile to discuss the role of these three genes in cancer biology. Cancer is a sequential process, involving breaking off cells from the primary tumor sites, migration through bloodstream (transendothelial migration), and setting in new places of the organs. NAB2 is a corepressor of transcription factor family EGR (early growth response gene) and inhibits cellular proliferation and growth. Dysregulation of NAB2 will involve unregulated activity of EGR resulting in tumor growth [11,12]. The primary role of EHD1 is endocytosis and protein trafficking such as MHC class molecules that participate in antigen presentation and destruction of abnormal cells [13]. Therefore, aberrant regulation of EHD1 likely would give rise to tumors. The gene LSP1, on the other hand, plays an important part in transendothelial migration of tumor cells in the bloodstream and thus helps in cancer development [22,23]. It, therefore, appears that these three genes are involved in the tumor and cancer progression pathways. This does not necessarily mean that these genes will have the same discriminating power for all types of cancers. In order to find the best set of discriminating biomarkers for different cancers, one needs to use our scheme to analyze data on those types of cancer.
Recently, Lee et al. [33] made an extensive study to compare three feature selection methods in conjunction with eleven classification schemes. These 11 methods include The maximum, minimum and the average training and test errors obtained using MLP networks with all possible subsets of size six genes (i.e., after removing one gene at a time from the set of identified marker genes). With each combination of 6 genes, the network was trained 20 times with different initializations and the statistics are computed based on the results of these 20 trials six classical methods like k-nearest neighbor, Fisher's linear discriminant analysis and five tree methods such as CART, Bagging, Boosting. Lee et al. used 50 top ranked genes to evaluate the performance of the 11 classifiers. The average accuracy on the test examples by the five classical methods varied between 11% to 37% while that for the tree methods varied between 1% to 37%. Considering the performance of classifiers and the fact that authors used 50 genes, it is reasonable to conclude that all selected features were not biomarkers.

Conclusion
We proposed a computational intelligence based scheme for identification marker genes for distinguishing cancer subgroups and tested it on a well known data set on small round blue cell tumors. All methods that we have discussed identified between 19-96 biomarkers to classify the training and test data with 100% accuracy, while our method found only seven genes to do the same task with neural networks. These seven genes include three novel genes which are not found by other researchers. The main reason for this is that our method (see Methods) looks at all genes together and picks up whatever is needed. The relational fuzzy clustering has helped to reduce the number of correlated genes. The genes identified by us are equally good with other tools like RBF network and SVM, although they were identified keeping in view an MLP. Even unsupervised clustering using these seven genes can discover the actual class structure. These seven genes bear distinct cancer specific attributes and as a group plays important roles in cancer biology. In this investigation although we have analyzed the SRBCTs, the proposed methodology can be used for knowledge discovery related to other diseases Figure 6 depicts the overall flow of the method. First we train a FSMLP with 2308 input nodes and 150 hidden nodes using the 63 training samples as mentioned above. We did not try to optimize the network size because in this stage our problem is not to find an optimal network for prediction but a set of useful genes (not necessarily the minimal set). Since the usefulness of features is not likely to depend much on the size of the network (size can, of course, influence the learning speed and the minimum where the network lands at, and hence the feature set; but we are interested in one such set), the choice of the number of hidden nodes is not a critical issue at this stage. The choice should be adequate so that the data can be learnt. So, to decide on the number of nodes we made a few trails and keeping in mind the principle of minimum description length (smaller network, less degrees of freedom, lesser chance of poor generalization) decided on 150. Based on the gate opening values in the first stage we select twenty genes as listed in Table 2. In the next stage of the procedure, as shown in Figure 6, we again use FSMLP to select ten genes from the set of 20. Now we apply the NERFCM algorithm to cluster the set of twenty genes selected in the first stage. A natural question comes : why are we not clustering the 10 genes extracted in stage two? We cluster the 20 genes for two reasons: (1) we wanted to make sure that no interesting gene in the list of twenty is excluded in the set of ten genes selected in stage two; i.e., we do not want to exclude a gene with no correlated counterpart in the list of ten genes. This check is important as our analysis is based on a limited training data. For example, if we could have found a cluster with no member from the list of 10, then we would have included one of the genes from that cluster into the final list of selected genes. (2) The other reason is that finding 5-6 clusters in a set of ten data points may not be informative. In this case the average number of points per cluster is two and hence assessing the quality of clusters would be difficult. Moreover, since the list of ten is expected to have less number of correlated genes, good clusters are not expected to be found.

Outline of the Method
The results of the relational clustering and that of the second stage of gene selection are combined to pick seven genes with sufficient cancer specific signatures to discriminate between the four types of SRBCTs. This final set of seven genes are now used to train MLPs, RBF networks and SVMs. The trained machines are then used for classification of the blind test samples. Note that, the last stage in Fig. 6 shows only MLP and RBF, but the system, as we have discussed, can be augmented with other machine learning tools like SVMs.
Since FSMLP looks at all genes together and uses the same machine learning tool that is finally used to design the classifier, our method takes into account the interaction between genes and between genes and the tool. Moreover, since we use gradient descent, genes having higher discriminating power (i.e., which can reduce error faster) would open their associated gates faster. Thus, FSMLP is likely to select a small but adequate set of genes. The relational cluster analysis further reduces the number of correlated genes and hence the selected set of genes is unlikely to have redundant ones. So, although we cannot guarantee, compared to the techniques discussed here, our method is expected to select less number of genes.

The Online Feature Selection Net
Accurate prediction of the diagnostic category of a tumor based on gene expression data is a very important problem because this can help planning of better treatment of patients. Moreover, identification of specific gene expression patterns that are linked to metabolic characteristics which contribute to different diseases is also very impor-Flow-chart of the entire process Figure 6 Flow-chart of the entire process. First the data set is divided into training and test sets. Then in the first stage, FSMLP is used to select 20 important genes. In the next stage, again FSMLP is used to select 10 useful genes from the 20 genes selected in the first stage. Now, NERFCM is applied to cluster the 20 genes selected in the first stage considering only the training data. Six clusters are found using NERFCM. The results of NERFCM and the second stage of gene selection are combined to choose just seven genes with sufficient cancer-specific signatures to distinguish between the 4 types of SRBCTs. Now different classifiers such as MLP, SVMs, RBF nets are trained using the training set with the selected 7 genes. The trained system (MLP/RBF net/ SVM) is then used for classification of blind test data.
tant. But the very high dimensional nature of the microarray data combined with the fact that such data are usually contaminated with noise, makes such a task very difficult one. This problem involve two tasks : selection of genes and designing of the diagnostic system. Most approaches to solve these problems, view them as two separate problems ignoring the fact that a set of genes best for one type of machine learning tool may not necessarily be the best for another kind of tool. Often one gene is evaluated and removed at a time in a stepwise manner and such methods may not capture subtle non-linear interactions among a set of genes. Hence the best strategy would be to select the genes simultaneously with the main learning task, the designing of the prediction system. We call such a system, "online" system. We use an online feature selection multilayer perceptron network for gene selection and designing of the diagnostic system. This novel scheme picks up the necessary genes online when the system learns the diagnostic prediction task.
For a multilayer perceptron network, the effect of some genes can be eliminated by not allowing them into the network, i.e., by equipping each input node (hence each gene/feature) with a gate and closing the gate.
For useful genes, the associated gates should be completely opened and for a partially important genes the gates should be partially opened. Pal and Chintalapudi [9] suggested a mechanism for realizing such a gate so that not useful features/genes can be identified and attenuated.
To model the gates we associate a gate function to each node in the input layer. Each input node computes the product of the input (expression value of a gene) and the gate function value as its output. This modulated output is passed on to the next layers of the network. The gate function should produce high values for marker genes (useful features); while for a redundant/not important gene, it should be nearly 0. The gate functions attenuate the genes before they propagate through the net, so we call these gate functions attenuating functions. As an attenuation function, we can use any function, the j th node of the first hidden layer to the i th node of the input layer; and be the error term for the j th node of the first hidden layer [9].
It is straightforward to show that except for , the update rules for all weights remain the same as that for an ordinary MLP trained with backpropagation. Assuming that the first hidden layer has q nodes, the update rules for and m i are: The gate parameters for all genes are so initialized that when the training starts F(m) is practically zero for all gates, i.e., no gene is allowed to enter the network. As the learning proceeds, gates for the genes that can reduce the error faster are opened faster. The learning of the gate function continues along with other weights of the network. At the end of the training we pick up important genes based on the values of the gate opening. The training can be stopped when the training error is reduced to zero or to an acceptable level or after a fixed number of iterations. Note that, different initializations of the network may lead to different subsets of important genes. If this happens, this indicates that there are different sets of features that can do the prediction task equally well. Since we do not use any regularization on w ij and m i , given a set of w ij and m i , there exists many sets of scaled version of connection weights and gate opening parameters with the same network behavior. However, since we use gradient descent with a low learning rate and keep all gates almost closed at the beginning of training, on termination we get a useful network without any problem. We use a small set of genes for which gates are reasonably open.
In this investigation, in the first stage we used an MLP with 2308 nodes in the input layer, 150 nodes in the hidden layer, and 4 nodes in the output layer. The network was trained till the misclassification on the training set reduced to zero or it attained a maximum number of 5000 iteration. As discussed earlier, we used the backpropagation learning algorithm for this phase. Based on the trained network, we selected twenty genes as listed in Table 2.
To decide on the number of features to be selected based on the gate opening one can proceed as follows. Let G be the maximum gate opening of a feature in a trial. Then select all features having gate opening more than P% (say P = 80) of G. Train a net with these features. If it can learn the data satisfactorily (can achieve the same or almost same level of classification accuracy as can be obtained using all features), you increase P by, say, 2%; otherwise, reduce it by 2% and repeat the process of training. This can be easily automated. If enough data are available, we can use training-validation-test partition to get better choices of such thresholds. The other alternative is to look at the gate openings. Typically useful gates are opened much more than redundant ones and based on that one can decide on a threshold. In the present case, analyzing the gate opening values we simply experimented with a few possibilities like top 18, 20 and 25 features and all sets were found to have adequate discriminative power and we decided on 20.
For a very high dimensional data set, two reasonably correlated genes x and y in conjunction with some other genes may be able to reduce the error faster, but both x and y may not be needed. Thus, this set of selected genes may contain genes which are to some extent correlated and hence redundant. If it is so, it may be possible to remove further some of them. To guard against this we do two things: (i) we use the FSMLP again on the set of twenty selected genes; and (ii) we use the non-Euclidean relational fuzzy c-means clustering algorithm (see next section) to cluster the twenty selected genes. Unlike hierarchical clustering algorithm, the NERFCM algorithm produces a set of membership values with each data point that helps us to assess how strongly a point belongs to a cluster. Moreover, different runs of NERFCM can produce different partitions which can be used to check the consistency of clusters. We want to exploit this information.
In the second phase of gene selection we apply FSMLP on the twenty genes from phase 1 and select ten genes based on the gate openings. We made many runs of MLP using the selected ten genes to ensure that this set has the required discriminating signatures. Then we use the results of NERFCM clustering to remove 3 more genes from this set of ten genes. Finally, in the third phase, an MLP is trained with seven input nodes, six nodes in the hidden layer and four output nodes. We used the Levenberg-Marquardt search method as implemented in Matlab neural network toolbox.
We have decided the desired number of clusters (c) by analyzing the consistency of partitions generated by NER-FCM for different initializations and choices of c. In order to automate this process, we may need to define a cluster validity index [34,35] for relational clustering. Our broad cluster analysis guidelines are : if a cluster does not intersect the list of top ten, then select the gene with the highest gate opening value, else from the cluster members belonging to the top ten, select the one with the highest gate opening. Further, if more than one cluster member belong to the top ten, and their gate openings are very high and very close, we retain them. Although not a trivial task, such a process may be automated if we have sufficient data. In that case we can partition the data into training, validation and test sets and use the validation data to decide the thresholds on the gate opening values. The system can then be tested on the test set.

NonEuclidean Relational Fuzzy c-means (NERFCM) Clustering of the selected genes
If R = [r ij ] n × n is any relation satisfying the conditions :(i) r ij ≥ 0∀ i,j ; (ii) r ii = 0; and (iii) r ij = r ji , we call R a dissimilarity relation. The relation R may not satisfy the triangle inequality, if it does, then it is a distance function. A dissimilarity relation R = [r ij ] n × n is Euclidean, if there exists a set of vectors {x 1 , ..., x n } ⊂ R n-1 such that r ij = the Euclidean distance between x i and x j ; otherwise, it is non-Euclidean.
In the present case, our objective is to find whether there exists some genes which are to some extent positively correlated. If there are m samples and n genes, then we have n vectors, each in R m , {x 1 , ..., x n } ⊂ R m . Note that, we are not talking about distance between such vectors because two highly correlated vectors may have a very high distance. Even two positively correlated genes and two negatively correlated genes may result in similar distances. Hence, we compute the Pearson's correlation coefficient between pairs of vectors : R = [r ij ], r ij = correlation coefficient between gene i and gene j, i.e., between x i and x j . This R can have both positive and negative values and hence is not a dissimilarity relation. To convert it to a dissimilarity relation we use the following transformation : R = 1 -R, i.e., r ij = 1 -r ij . This transformation will make r ii = 0, r ij ≥ 0.
The symmetry property of R will be maintained and the negatively correlated vectors will show higher dissimilarity while the positively correlated vectors will show lower dissimilarity. This is a valid dissimilarity relation, not necessarily Euclidean. We want to cluster this relation into c clusters. The NERFCM algorithm finds a fuzzy partition matrix U = [u i,k ] c,n . Here u i,k gives the membership value (the degree) with which x k belongs to the i th cluster, u i,k ≥ 0; . This membership values help us to assess how compact a cluster is. The NERFCM algorithm [10] converts a non-Euclidean relation to an Euclidean one u i k i i c , = = ∑ = 1 1 using a transformation, known as β-transformation and then uses an iterative algorithm to estimate the membership matrix U [36]. The beta transformation is R ⇒ R β = R + β (M -I n ), where M i,j = 1∀i,j, I n is the n × n identity matrix and β is suitably chosen scaler.

The RBF network and Support Vector Machines
For the Radial basis function network, we have used the Matlab neural network toolbox and used the function newrb. We did not use a fully optimized network because we have used the same spread for all Gaussian nodes.
The SVM [29] maximizes the margin of separation between two classes to find a separating hyperplane f(x) = = w'x + b, where p is the number of genes.
The SVM finds w = (w 1 , w 2 , ..., w p )' minimizing ||w|| 2 subject to the constraints y i (w'x + b) ≥ 1 ∀ i = 1, ..., n, where y i = +1 or -1 depending on the class. When the training data are not linearly separable, SVM minimizes ||w|| 2 + C subject to the constraints y i (w'x + b The ξ i 's are the slack variables. More often than not, the inputs are implicitly projected into a high dimensional space to make them more separable and the separating hyperplane is then found in the projected space. To realize this projection, different kernels can be used. In this investigation, we have used the Gaussian (also known as RBF) kernels with same spread for all cases. All experiments are done using the SVM tool available at [37].