Feature selection environment for genomic applications
 Fabrício Martins Lopes^{1, 2}Email author,
 David Corrêa MartinsJr^{1}Email author and
 Roberto M CesarJr^{1}
DOI: 10.1186/147121059451
© Lopes et al; licensee BioMed Central Ltd. 2008
Received: 30 May 2008
Accepted: 22 October 2008
Published: 22 October 2008
Abstract
Background
Feature selection is a pattern recognition approach to choose important variables according to some criteria in order to distinguish or explain certain phenomena (i.e., for dimensionality reduction). There are many genomic and proteomic applications that rely on feature selection to answer questions such as selecting signature genes which are informative about some biological state, e.g., normal tissues and several types of cancer; or inferring a prediction network among elements such as genes, proteins and external stimuli. In these applications, a recurrent problem is the lack of samples to perform an adequate estimate of the joint probabilities between element states. A myriad of feature selection algorithms and criterion functions have been proposed, although it is difficult to point the best solution for each application.
Results
The intent of this work is to provide an opensource multiplataform graphical environment for bioinformatics problems, which supports many feature selection algorithms, criterion functions and graphic visualization tools such as scatterplots, parallel coordinates and graphs. A feature selection approach for growing genetic networks from seed genes (targets or predictors) is also implemented in the system.
Conclusion
The proposed feature selection environment allows data analysis using several algorithms, criterion functions and graphic visualization tools. Our experiments have shown the software effectiveness in two distinct types of biological problems. Besides, the environment can be used in different pattern recognition applications, although the main concern regards bioinformatics tasks.
Background
Pattern recognition methods allow the classification of objects or patterns in a number of classes [1, 2]. In statistical pattern recognition, given a set of classes and an unknown object, a pattern recognition system associates the object to one of the classes, based on defined measures in a feature space. In many applications, specifically in bioinformatics, the feature space dimension tends to be very large, making difficult both classification tasks and network inference. In order to overcome this inconvenient situation, the study of dimensionality reduction problem in pattern recognition is very important.
The so called "curse of dimensionality" [3] is a phenomenon in which the number of training samples required to a satisfactory classification performance or network inference is given by an exponential function of the feature space dimension. This is the main motivation by which performing dimensionality reduction is important in problems with large number of features and small number of training samples. Many applications in bioinformatics belong to this context. Data sets containing mRNA transcription expressions from microarray or SAGE, for example, present thousands of genes (features) and only some dozens of samples that may come from cell states or types of tissues.
In this context, this work presents an opensource feature selection environment, which includes different search algorithms and criterion functions. Due to the curse of dimensionality phenomenon, error estimation is a crucial issue. We have developed two ways to embed error estimation in the criterion functions, one based on classification error and another based on classification information. The main idea is based on the penalization of nonobserved or rarely observed instances. After the feature selection, it is possible to apply classical error estimation methods like resubstitution and holdout crossvalidation. Illustrative results obtained on gene regulation networks inference and on classification of breast cancer cells using the proposed software are included in the present paper.
Methods
Data preprocessing
A critical issue in gene expression data analysis is the preprocessing of the data. Depending on the criterion function used, a step of data quantization may be required if the original input data is not discrete. The quantization usually requires a normalization of the data. In the software, the normal transformation is implemented. It consists in subtracting the signal (expression profile) by its mean and then dividing by its standard deviation. Formally, for every signal g(t), its equation is given by $\eta [g(t)]=\frac{g(t)E[g(t)]}{\sigma [g(t)]}$, where E[g(t)] and σ[g(t)] are, respectively, the expectation and standard deviation of g(t). There are two options: 1 – normalization by feature (g is a feature and t is a sample); 2 – normalization by sample (g is a sample and t is a feature).
The quantization of a normalized signal η[g(t)] (g being feature or sample depending on the chosen option for normalization) is a mapping from the continuous expression into k qualitative expressions {0, 1, ..., k  1}. It is performed by a threshold mapping that divides the interval given by the minimum and the maximum values of g in k intervals of equal size. A given continuous value r of η[g(t)] is mapped to an integer i  1, where i is the ith interval to which r belongs.
Implemented feature selection algorithms
The simplest implemented feature selection algorithm is the exhaustive search. This algorithm searches the whole search space. As a result, the selected features are optimal according to the criterion function chosen to guide the algorithm. However, the computational cost often makes this approach inadequate, thus requiring alternative (nonoptimal) algorithms.
In this work we have implemented two suboptimal approaches with unique solution, being also known as wrappers [4]. In the first one, the selected subset starts empty and features are inserted by optimizing a criterion function until a stop condition is satisfied, which is often based on the subset size or a threshold. This method is known as Sequential Forward Selection (SFS) [5]. Although simple and efficient, SFS presents an undesirable drawback known as nesting effect. This effect arises because the selected features are never discarded.
Considering the adopted feature selection and criterion function, the intermediate results are stored in a list. The best sets of this list are displayed as the algorithm result.
Other feature selection methods will be implemented, both wrappers (e.g., "plus ltake away r", simulated annealing, genetic algorithm) and filters (e.g., correlationbased, consistencybased, entropybased), and embedded methods (e.g., CART) [3, 4, 6, 7]. Also, methods for feature extraction, such as PCA and ICA [8, 9], will be implemented. We hope that the availability of the software as opensource will help the inclusion of other functions by the international community.
Implemented criterion functions
We implemented criterion functions based on classification information (mean conditional entropy) and classification error (Coefficient of Determination [10]), introducing some penalization on poorly or nonobserved patterns. Introducing penalization on rarely observed instances embeds the error estimation into the criterion functions. This approach has two main advantages. First, the selected features are obtained not only by their informative content about the classes, but also by the number of well observed instances they have (large number of well observed instances means small error estimation). Another advantage is that the right dimension of the feature subset is also estimated (the dimension parameter is not required). In the software, the implemented feature selection algorithms (SFS and SFFS) requires the dimension parameter just to limit the execution time, performing the procedures until the subset size reaches that parameter. According to the criterion function values, the resulting subsets can have smaller dimension than the specified by the parameter.
Mean conditional entropy
where P(Y  x) is the conditional probability of Y given the observation of the instance x, H(Y  x) = Σ_{y∈Y}P(y  x) log P(y  x). Shannon's entropy can be generalized by using Tsallis Entropy [12] with a parameter q. We have that q = 1 is the Shannon's Entropy (extensive entropy), q < 1 is the subextensive entropy and q > 1 is the superextensive entropy. The Tsallis Entropy is given by $H(Yx)=\frac{1}{q1}(1{\displaystyle {\sum}_{y\in Y}{(P(Yx))}^{q})}$ Lower values of H yield better feature subspaces (the lower H, the larger is the information gained about Y by observing X).
Coefficient of Determination
where ε_{ Y }= 1  max_{y∈Y}P(y) is the prior error, i.e., the error by predicting Y in the absence of other observations, and ε_{ Y }(X) = Σ_{ x∈X }P(x) (1  max_{y∈Y}P(y  x)) is the average error by predicting Y based on the observation of X. Larger values of CoD yield to better feature subspaces (CoD = 0 means that the feature subspace does not improve the prior error and CoD = 1 means that the error is fully eliminated).
Penalization of nonobserved instances
where M is the number of possible instances of the feature vector X, N is the number of observed instances (so, the number of nonobserved instances is given by M  N), f_{ i }is the absolute frequency (number of observations) of x_{ i }and s is the number of samples.
Penalization of rarely observed instances
and N is the number of instances x with $P(x)>\frac{1}{s}$ (more than one observation).
Classifier design and generalization
After the feature selection, the classifier is designed from the table of conditional probabilities of the classes (Y) given the patterns (X). This table is used as a Bayesian classifier where, for each given instance, the chosen label is the one with maximum conditional probability for the considered instance. In case of ties, i.e., instances that have two or more labels of maximum probability (including nonobserved instances), generalization according to some criterion is carried out. A commonly used criterion is the nearest neighbors with some distance metric [1, 2]. The nearest neighbors using Euclidean distance has been implemented for generalization. In this implementation, the nearest neighbors with equal distances are taken and the occurrences of each label given these patterns are summed. If the tie persists, the next nearest neighbors are taken and the process is repeated until only one of such labels has the maximum number of occurrences (i.e., until the tie is broken). The label with the maximum number of occurrences at the end of this process is chosen as the class to which the considered instance belongs.
There are many classifiers described in the literature, such as Parzen, Fisher Linear Discriminant, Binary Decision Tree, Perceptron, Support Vector Machines and so on [3]. The software has been designed so that such additional classifiers may be incorporated. One of the objectives of the software is to implement as many classifiers as possible. However, the primary focus is given on feature selection algorithms and criterion functions.
Error estimation methods
In order to validate and compare feature selection algorithms, error estimation methods are required, especially for data with high dimensionality and small number of samples which is usually the case in bioinformatics problems.
The first method implemented in the software is the simple resubstitution in which the training set is also used as testing set. For many bioinformatics problems, the resubstitution error estimation works fine since usually there is not enough data samples to perform a good estimation, and partitioning the data leads to an even smaller training sample size (see [16] for a detailed discussion).
The Holdout cross validation is also available in the software. It consists in partitioning the data in two sets: training and test sets. The size of the training set can be defined in terms of percentage of total data samples. Thus, the feature selection method is applied to the training set, the classifier is designed taking into account the resulting feature set and the conditional probability distributions given by the training set. Finally, the designed classifier is applied to the test set and the classification error rate is obtained. Alternative error estimation methods like Kfold crossvalidation, leaveoneout (a particular kind of crossvalidation) and bootstrap will be implemented in the software.
Network inference
If one considers the training set as time series data, i.e., each sample representing a state of the system at time t, it is possible to generate networks of prediction based on the Probabilistic Genetic Network model described in [15]. This model is a Markov chain that mimics the properties of a gene as a nonlinear stochastic gate. Given a fixed gene Y as target, the general idea of this model is to determine the subset of genes Z ⊆ X that makes the best prediction of the Y value in the next instant of time. In order to achieve this intent, the system is considered translation invariant, i.e., the transition function is the same for every time instant t. In this way, it is possible to apply a feature selection method for every target, since a sample is represented by the pair (X[t], Y[t + 1]) and the training set is composed by a collection of these samples along the time domain (so, the number of samples is given by t  1). After the application of the feature selection process for every target, the best feature subsets obtained for each target gene define the network connections.
Software description
There are other implemented utilities, including the visualization of the results of the feature selection as scatterplots, parallel coordinates [18] and graphs. Every panel has its own help button that activates a popup window explaining the functionalities and options regarding the corresponding panel. Also, description boxes are displayed when the mouse cursor is placed over some option in order to quickly recall the user about the functionalities of the system.
Results and discussion
This section presents some illustrative results in two different typical situations in bioinformatics. All results shown here were obtained by using the default values in the software: SFFS with mean conditional entropy and penalization of nonobserved instances, α = 1, Tsallis parameter q = 1, and maximum set size = 10.
Conclusion
The proposed feature selection environment allows data analysis using several algorithms, criterion functions and graphic visualization tools. Our experiments have shown the software effectiveness in two distinct types of biological problems. Besides, the environment can be used in different pattern recognition applications, although the main concern regards bioinformatics tasks, especially those involving highdimensional data (large number of genes, for example) with small number of samples. Users without programming skills are allowed to manipulate the software in an easy way, just by clicking to select file inputs, quantization, algorithms, criterion functions, error estimation methods and visualization of the results, while an intuitive help system quickly presents the instructions of the present functionalities that the user may look for. On the other hand, the availability of the software as opensource allows programmers to explore the implemented methods as libraries in their own programs. The environment is implemented in "wizard style", i.e., with tabs delimiting each procedure. A complete technical report that complements the help information is available at the project web site. Two video demos of the software in use are available at http://dimreduction.incubadora.fapesp.br/.
This software opens space for future work. The next steps consist in the implementation of other classical feature selection algorithms, including wrappers, filters and embedded methods, criterion functions (e.g., based on distances between classes [1]); classifier design methods like Parzen, Fisher Linear Discriminant, Binary Decision Tree, Perceptron and Support Vector Machines; error estimation methods such as leaveoneout, Kfold crossvalidation and bootstrap; and the inclusion of classical methods of feature extraction, such as PCA and ICA [8, 9].
Availability and requirements

Project name: DimReduction

Project home pages: http://dimreduction.incubadora.fapesp.br and http://sourceforge.net/projects/dimreduction

Operating system: Platform independent

Programming language: Java

Other requirements: Java Runtime Environment (JRE) 1.5 or higher

License: code available; GNU Lesser General Public License

Video demos: http://dimreduction.incubadora.fapesp.br

Complete technical report: http://arxiv.org/abs/0805.3964
Declarations
Acknowledgements
This work was supported by FAPESP, CNPq and CAPES.
Authors’ Affiliations
References
 Theodoridis S, Koutroumbas K: Pattern Recognition. 1st edition. USA: Academic Press; 1999.
 Duda RO, Hart PE, Stork DG: Pattern Classification. 2nd edition. NY: WileyInterscience; 2000.
 Jain AK, Duin RP, Mao J: Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000, 22: 4–37. 10.1109/34.824819View Article
 Guyon I, Elisseeff A: An introduction to variable and feature selection. Journal of Machine Learning Research 2003, 3: 1157–1182. 10.1162/153244303322753616
 Pudil P, Novovičová J, Kittler J: Floating search methods in feature selection. Pattern Recogn Lett 1994, 15(11):1119–1125. 10.1016/01678655(94)901279View Article
 Hall MA, Smith LA: Feature Selection for Machine Learning: Comparing a CorrelationBased Filter Approach to the Wrapper. Proc FLAIRS Conference AAAI Press; 1999, 235–239. [http://www.aaai.org/Library/FLAIRS/1999/flairs99–042.php]
 Kohavi R, John G: Wrappers for feature selection. Artificial Intelligence 1997, 97: 273–324. 10.1016/S00043702(97)00043XView Article
 Jolliffe IT: Principal component analysis. New York: SpringerVerlag; 1986.View Article
 Comon P: Independent Component Analysis, a New Concept? Signal Processing 1994, 36(3):287–314. 10.1016/01651684(94)900299View Article
 Dougherty ER, Kim S, Chen Y: Coefficient of determination in nonlinear signal processing. Signal Processing 2000, 80: 2219–2235. 10.1016/S01651684(00)000797View Article
 Shannon CE: A mathematical theory of communication. Bell System Technical Journal 1948, 27: 379–423. 623–656 [http://plan9.belllabs.com/cm/ms/what/shannonday/]View Article
 Tsallis C: Nonextensive Statistical Mechanics and its Applications. Lecture Notes in Physics, Berlin: Springer; 2001. 10.1007/354040919X_1
 Hsing T, Liu L, Brun M, Dougherty ER: The coefficient of intrinsic dependence (feature selection using el CID). Pattern Recognition 2005, 38(5):623–636. 10.1016/j.patcog.2004.09.002View Article
 Martins DC Jr, Cesar RM Jr, Barrera J: Woperator window design by minimization of mean conditional entropy. Pattern Analysis & Applications 2006, 9: 139–153. 10.1007/s1004400600310View Article
 Barrera J, Cesar RM Jr, Martins DC Jr, Vencio RZN, Merino EF, Yamamoto MM, Leonardi FG, Pereira CAB, Portillo HA:Methods of Microarray Data Analysis V, SpringerVerlag 2006 chap. Constructing probabilistic genetic networks of Plasmodium falciparum from dynamical expression signals of the intraerythrocytic development cycle. [http://www.springer.com/life+sci/bioinformatics/book/978–0387–34568–0]
 BragaNeto U, Dougherty ER: Is crossvalidation valid for smallsample microarray classification? Bioinformatics 2004, 20(3):374–380. 10.1093/bioinformatics/btg419View ArticlePubMed
 Kohavi R: A study of crossvalidation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th International Joint Conference on Artificial Intelligence 1995, 2: 1137–1145.
 Inselberg A: The Plane with Parallel Coordinates. Journal The Visual Computer 1985, 1(4):69–91. 10.1007/BF01898350View Article
 Asuncion A, Newman D: UCI Machine Learning Repository.2007. [http://mlearn.ics.uci.edu/MLRepository.html]
 Lopes FM, Cesar RM Jr, da F Costa L: AGN Simulation and Validation Model. In Advances in Bioinformatics and Computational Biology, Third Brazilian Symposium on Bioinformatics. Volume 5167. LNBI, Springer; 2008:169–173. 10.1007/9783540855576_17View Article
 de Campos T: Técnicas de Seleção de Características com Aplicações em Reconhecimento de Faces. In Master's thesis. University of Sao Paulo; 2001. [Awarded as the Best Computer Science MSc. Thesis in the CTD/SBC 2002]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.