- Methodology Article
- Open Access
Structured feature selection using coordinate descent optimization
- Mohamed F. Ghalwash^{1, 2},
- Xi Hang Cao†^{1},
- Ivan Stojkovic†^{1, 3} and
- Zoran Obradovic^{1}
https://doi.org/10.1186/s12859-016-0954-4
© Ghalwash et al. 2016
- Received: 14 July 2015
- Accepted: 16 February 2016
- Published: 8 April 2016
Abstract
Background
Existing feature selection methods typically do not consider prior knowledge in the form of structural relationships among features. In this study, the features are structured into groups based on prior knowledge. The problem addressed in this article is how to select one representative feature from each group such that the selected features jointly discriminate the classes.
The problem is formulated as a binary constrained optimization, and the combinatorial problem is relaxed into a convex-concave problem, which is then transformed into a sequence of convex optimization problems so that it can be solved by any standard optimization algorithm. Moreover, a block coordinate gradient descent optimization algorithm is proposed for high-dimensional feature selection, which in our experiments was four times faster than a standard optimization algorithm.
Results
In order to test the effectiveness of the proposed formulation, we used microarray analysis as a case study, where genes with similar expressions or similar molecular functions were grouped together. In particular, the proposed block coordinate gradient descent feature selection method was evaluated on five benchmark microarray gene expression datasets, and evidence is provided that it gives more accurate results than state-of-the-art gene selection methods. Out of 25 experiments, the proposed method achieved the highest average AUC in 13 experiments, while each of the other methods achieved the highest average AUC in no more than 6 experiments.
Conclusion
A method is developed to select one feature from each group. When the features are grouped based on similarity in gene expression, we showed that the proposed algorithm is more accurate than state-of-the-art gene selection methods that were developed specifically to select highly discriminative and less redundant genes. In addition, the proposed method can exploit any grouping structure among features, while alternative methods are restricted to similarity-based grouping.
Keywords
- Structured feature selection
- Block coordinate gradient descent
- Gene expression
- Microarray analysis
- Prior knowledge
Background
The objective of supervised feature selection methods is to select a discriminative but concise list of features, from a possibly large set, in order to differentiate between classes. Using only a small set of features improves the accuracy and increases the interpretability of the classification model [1–3]. Several types of feature selection methods have been developed to address this problem. Filter-type methods select features independently of a classification model, whereas wrapper and embedded methods use feature selection as part of training the classifier, which typically involves fitting more hyperparameters and requires nested cross-validation [4]. Therefore, wrapper and embedded types typically suffer from increased computational cost and possible overfitting, especially when only a small number of examples is available. Note that filter-type methods may still utilize the labels of the subjects. The outcome of a filter-type method is the list of selected features, without associated weights; the selected features can be used later to learn a classifier. In this paper we focus on filter-type feature selection.
In general, feature selection methods do not consider structure among the features. For example, the features may be clustered such that features in the same cluster are more similar to each other than to features in different clusters. In many applications, the requirement is to select one feature from each group such that all selected features are jointly discriminative. This problem arises in many applications (see Additional file 1 for more details):
- Analytics of sports: One major objective of analytics in sports is to enhance team performance by selecting the best possible players and making the best possible decisions on the field or court [5]. Imagine that a coach needs to select the best set of players for a team. Intuitively, the set of all possible players can be grouped (based on their positions on the field) into G groups, where each group contains all players who play in that position. Since the objective is to select the best team, one may claim that the problem can be solved by selecting the best player for each position separately. However, this approach ignores synergy among the players. For example, players 1 and 2 might be the best players for positions A and B, respectively, yet not cooperate well on the same team. Therefore, the idea is to select one player from each group such that the selected team has the best performance.
- Multivariate time series classification: This problem can be addressed by using discriminative multivariate temporal patterns extracted from each class [6, 7]. One example of such an interpretable multivariate pattern: if gene X and gene Y are up-regulated at the same time, followed by down-regulation of gene Z, then the patient is developing the condition. To discover such patterns, one can extract all patterns from gene X as one group, all patterns from gene Y as another group, and so on; in other words, the grouping structure is defined by the variable (gene) from which the patterns are extracted. The problem is then to select one pattern from each gene. The resulting list can be analyzed by another method to extract a low-dimensional multivariate pattern.
- Dummy variables: A dummy variable is an artificial variable created to represent a categorical variable. The coefficients of the dummy variables are therefore naturally partitioned into groups, and it is natural to select only one variable from each group.
- Microarray analysis: Genes can be grouped based on their correlation or similarity, based on prior knowledge about their molecular functions or a cellular pathway, or based on annotation by a specific term of the Gene Ontology [8]. It would therefore be enough to choose only one gene from each group.
The main advantage of performing analysis on groups of features is the compactness and improved interpretability of the results, due to the smaller number of groups and the greater prior knowledge available for such groups. In this study, we address a novel problem where the objective is to select a representative feature from each group such that the selected features are jointly discriminative. Our contributions can be summarized as follows. (1) We formulate the feature selection problem of selecting a representative feature from each group, simultaneously and jointly, as a convex-concave optimization, which is transformed into a sequence of convex optimization problems that can be solved using any standard efficient optimization algorithm; (2) We develop a block coordinate gradient descent (BCGD) algorithm that, in our experiments, is four times faster than a standard optimization algorithm for the proposed feature selection method; (3) The experimental results show evidence of the efficiency and scalability of the proposed algorithm. In order to evaluate the proposed method, we applied it to feature selection for microarray analysis as a case study.

Related work in feature selection for microarray analysis: Feature selection for microarray analysis has been extensively studied [9–12], and many of the proposed methods can be categorized as filter-based approaches, in which genes are selected prior to learning the classification model. Attempts to address similar problems include clustering genes by utilizing their biological relevance and then using the representative medoid from each biologically enriched cluster separately [13, 14]. This clearly leads to a sub-optimal solution because it does not consider the interaction among genes from different clusters. That problem is addressed by an efficient double sparsity optimization formulation that simultaneously identifies mutually exclusive groups of correlated low-level features (genes) and then generalizes the groups to higher-level features [15]. The high-level features (metagenes) are constructed as linear combinations of the low-level genes in each group. The drawback of that method is that the metagenes might not be readily interpretable [16].
A Maximum Relevance Minimum Redundancy (mRMR) method was developed for feature selection on microarray data [17]. The method is based on a mutual information criterion that maximizes the relevance of a feature to the target while minimizing its redundancy with other features. The features are then ranked by that criterion, so that high-ranked features are the more informative ones. Another method selects the most informative features while minimizing the redundancy among the selected features [18]. That problem is formulated as a quadratic program, which can be solved by any standard efficient optimization algorithm. However, the formulation involves a matrix that is not positive semi-definite; hence, it might lead to a poor local optimum. A more recent method [19] uses a convex formulation with two terms, one to select features with maximum class separation and the other to select non-redundant features. The redundancy among features is computed based on Pearson correlation, which is encoded as a positive semi-definite matrix. In order to apply the method to high-dimensional data, the authors applied a low-rank approximation to the quadratic term so that the solution can be found efficiently. Although those studies [17–19] look similar to our proposed method, they were developed specifically to minimize redundancy among features, whereas our method is general enough to exploit any structure among features. In other words, the features can be grouped based on similarity, such as Pearson correlation as in [19] or mutual information as in [17, 18], or based on any other prior knowledge about the genes, such as molecular function. Therefore, our work can be applied in any setting where the features can be grouped in advance using prior knowledge.
The importance of selecting features from gene subsets or groups was recently studied [20]. That method first partitions the features into smaller blocks; then, in each block, a small subset of r features is chosen based on classification accuracy. Once the top-r features from each block are obtained, they are mutually compared to obtain the best feature subset. We note that the interactions among features are not fully considered, only those among the top-r features from each block. In addition, that method follows a wrapper-based approach, while the main focus of this paper is filter-type feature selection.
Methods
Problem definition
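With \(w_{i}^{g}\) denoting the selection weight of feature \(i\) in group \(g\), a minimal sketch of the structured selection problem, reconstructed from the constraint descriptions below after relaxing binary indicators \(w_{i}^{g}\in\{0,1\}\) to the interval [0, 1] (the exact form and the labels (3a)–(3d) are assumptions), is

$$ \begin{aligned} \min_{\boldsymbol{w}}\;\; & \ell(D;\boldsymbol{w}) && (3a)\\ \text{s.t.}\;\; & 0\le w_{i}^{g}\le 1 \quad \forall g,i && (3b)\\ & \textstyle\sum_{i} w_{i}^{g}=1 \quad \forall g && (3c)\\ & \max_{i} w_{i}^{g}=1 \quad \forall g && (3d) \end{aligned} $$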
where ℓ(D) is the loss induced by the dataset D. We can use any loss function as long as it is convex, to ensure a global solution. To show that our formulation can incorporate different loss functions, we utilized the class-separation loss and the logistic loss in experiments on gene expression and synthetic data, respectively; see Additional file 1 for details.
The constraint (3d) ensures that the maximum weight within each group is 1. Therefore, constraints (3c) and (3d) jointly ensure that all weights within each group are 0 except for exactly one weight with value 1, which means that we select one feature from each group. Importantly, all these prototypes are selected simultaneously, so the joint effects among them are considered.
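A sketch of the corresponding penalized relaxation, reconstructed from the description that follows (its exact form is an assumption), moves constraints (3c) and (3d) into the objective while keeping the box constraint:

$$ \min_{\boldsymbol{w}}\;\; \ell(D;\boldsymbol{w}) + \lambda_{1}\sum_{g}\Big(\textstyle\sum_{i} w_{i}^{g}-1\Big)^{2} + \lambda_{2}\sum_{g}\Big(1-\max_{i} w_{i}^{g}\Big) \qquad \text{s.t.}\;\; 0\le w_{i}^{g}\le 1 \;\;\forall g,i \quad (4b) $$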
where λ _{1}>0 and λ _{2}>0 are the Lagrangian multipliers. The first penalization term involves the difference between the sum of weights and 1. Since the sum of weights can be larger or smaller than 1, this difference can be positive or negative; we therefore penalize its square. The second penalization term penalizes the difference between the maximum and 1, which cannot be negative because the maximum cannot be larger than 1 according to constraint (4b). A higher value of λ _{2} forces the weight of the representative feature to reach the maximum and, therefore, enforces the equality constraint (3d). Since the main objective is to force one of the weights to be large (not necessarily reaching the maximum) and the remaining weights to be very close to zero, the value of λ _{2} is not set very high (and similarly for λ _{1}). As explained in Additional file 1, these two parameters are set to λ _{1}=λ _{2}=100 to balance the two constraints.
\(\mathcal {L}_{1}\) is a convex loss function, \(\mathcal {L}_{2}\) is a quadratic function and therefore convex, and \(\mathcal {L}_{3}\) is convex because log-sum-exp is a convex function. The objective function (6) therefore becomes a difference of two convex functions. To solve this problem we applied the convex-concave procedure (CCCP) [23, 24]. CCCP linearizes the concave function around the current iterate with a tangent hyperplane, which serves as an upper bound for the concave function. This leads to a sequence of convex programs for which convergence is guaranteed [25].
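Writing \(\mathcal{J}=\mathcal{L}_{1}+\lambda_{1}\mathcal{L}_{2}-\lambda_{2}\mathcal{L}_{3}\), with \(\mathcal{L}_{3}\) the log-sum-exp surrogate for the max terms, the CCCP subproblem at iteration t can be sketched (a reconstruction of (7), up to additive constants) as

$$ \boldsymbol{w}^{t+1} = \operatorname*{arg\,min}_{0\le\boldsymbol{w}\le 1}\;\; \mathcal{L}_{1}(\boldsymbol{w}) + \lambda_{1}\mathcal{L}_{2}(\boldsymbol{w}) - \lambda_{2}\left(\frac{d\mathcal{L}_{3}}{d\boldsymbol{w}}\right)_{\boldsymbol{w}=\boldsymbol{w}^{t}}^{\top}\boldsymbol{w} $$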
where the term \((d\mathcal {L}_{3}/d\boldsymbol {w})_{\boldsymbol {w}=\boldsymbol {w}^{t}}\) is the derivative of \(\mathcal {L}_{3}\) at the current iterate \(\boldsymbol{w}^{t}\).
The application of CCCP is shown in Algorithm 1. An advantage of CCCP is that no additional hyperparameters are needed. Furthermore, each update is a convex minimization problem and can be solved with classical, efficient convex solvers. Since we now have a smooth, differentiable objective function \(\mathcal {J}\) with only inequality constraints, any standard optimization algorithm can be used. To solve the problem efficiently, we compute the first derivatives of the objective function with respect to the weights w and approximate the Hessian with a diagonal matrix. In Additional file 1, we show the derivation of the Jacobian and Hessian matrices for the logistic loss [26] and the class-separable loss [19].
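To make the CCCP loop concrete, here is a minimal generic sketch in Python; the loss callables, SciPy's L-BFGS-B solver, and the toy usage are illustrative assumptions rather than the article's implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def cccp(f, grad_f, g, grad_g, w0, bounds, tol=1e-6, max_iter=50):
    """Minimize J(w) = f(w) - g(w) with f, g convex: at each iterate,
    replace the concave part -g with its tangent (an upper bound) and
    solve the resulting convex, bound-constrained subproblem."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        gw, dgw = g(w), grad_g(w)  # linearization point w^t
        obj = lambda v, w=w, gw=gw, dgw=dgw: f(v) - (gw + dgw @ (v - w))
        jac = lambda v, dgw=dgw: grad_f(v) - dgw
        res = minimize(obj, w, jac=jac, bounds=bounds, method="L-BFGS-B")
        if np.linalg.norm(res.x - w) < tol:
            return res.x
        w = res.x
    return w

# Toy usage: f(w) = ||w||^2 and g(w) = log-sum-exp(w) are both convex,
# so f - g is a difference of convex functions, as in objective (6).
f, grad_f = lambda w: w @ w, lambda w: 2 * w
g, grad_g = lambda w: logsumexp(w), lambda w: softmax(w)
w_star = cccp(f, grad_f, g, grad_g, np.full(5, 0.5), bounds=[(0, 1)] * 5)
```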
The trust-region-reflective algorithm [27] is the fastest standard optimization algorithm for solving (7). However, in our application it is not efficient for large-scale problems. In the next section, we develop a customized optimization algorithm based on coordinate descent that is four times faster than this standard solver.
Block coordinate gradient descent
Coordinate gradient descent is a simple technique that is surprisingly efficient and scalable [28]. In general, given a convex and differentiable function, the coordinate descent algorithm minimizes the function along each coordinate axis \({w_{i}^{g}}\) and is nonetheless guaranteed to converge to the global optimal solution [29]. Moreover, in many cases individual coordinates can be replaced with blocks of coordinates, e.g. the coordinates \(\boldsymbol{w}^{g}\) for a group g [30].
In order to develop a block coordinate gradient descent (BCGD) algorithm for solving (7), we build on the seminal work of [31, 32], which developed an algorithm for minimizing a smooth function with bound constraints, as in (7). The key idea is to iteratively combine a quadratic approximation of the objective function \(\mathcal {J}\) at w, used to generate a feasible direction d, with a line search to find the best move along that direction. The procedure continues iteratively until convergence. Precisely, the BCGD algorithm (Algorithm 2) iteratively runs over four steps. In the first step, the algorithm identifies a set of features (coordinates) to optimize (the iterations are arranged such that every T consecutive iterations cover the entire w). Typically, the algorithm iterates over the non-zero weights (the active set) and optimizes the corresponding features. In the second step, the algorithm builds a quadratic approximation of the objective function over the active set, and in step 3 it performs a line search to find the best step size along the direction given by the quadratic approximation. This ensures a feasible move towards the minimum. Finally, it updates only the active-set weights. The key issue in Algorithm 2 is how to identify the active set so that the algorithm runs efficiently and optimizes the active weights.
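A minimal Python sketch of one such iteration, in the spirit of [31, 32]; the diagonal Hessian model, the Armijo parameters, and all names are assumptions.

```python
import numpy as np

def bcgd_step(w, J, grad, hess_diag, active, lo=0.0, hi=1.0,
              beta=0.5, sigma=1e-4):
    """One BCGD iteration on the active coordinates: minimize a diagonal
    quadratic model of J, project onto the box [lo, hi] to get a feasible
    direction d, then backtrack (Armijo) along d."""
    d = np.zeros_like(w)
    h = np.maximum(hess_diag[active], 1e-8)          # keep the model convex
    target = w[active] - grad[active] / h            # model minimizer
    d[active] = np.clip(target, lo, hi) - w[active]  # feasible direction
    J0, slope = J(w), grad @ d                       # slope < 0 for descent
    t = 1.0
    while J(w + t * d) > J0 + sigma * t * slope and t > 1e-10:
        t *= beta                                    # backtracking line search
    return w + t * d
```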
Active weights for solving (7)
Cyclically updating one coordinate at a time in coordinate descent might slow down the optimization for large-scale problems. Therefore, in our approach the update is performed on blocks of coordinates [30]. Our application fits naturally into this setting, as the blocks can be chosen based on the groups of features: in each iteration we update the weights of all features within one group.
Nevertheless, in our application we did not see a benefit from iterating over each group. Instead, we initially set the active set to the entire w and update all coordinates at once. Such an update was successfully used in a previous study, which noted that "after a complete cycle through all the variables, we iterate on only the active set till convergence" [28]. After a few iterations, some of the groups become stable (i.e., one of the features becomes close to 1 and the rest become close to 0). In this case, we no longer need to optimize such a group and can exclude it from the active set. If we kept that group in the working set, the algorithm would try to fit the weights of the features within that group precisely while the selected feature remained stable; in other words, the optimizer would try to move the maximum weight closer to 1 while keeping the rest of the weights closer to 0. Therefore, including those groups in the working set would only slow down the optimization without changing the representatives of those groups. This resembles the well-known active-set method, which iteratively predicts a correct split of zero and non-zero elements in w and optimizes the function over only the non-zero weights [33].
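A sketch of this stability test in Python; the tolerance `eps` and the group representation are hypothetical.

```python
import numpy as np

def prune_stable_groups(w, groups, active_groups, eps=1e-3):
    """Remove groups whose representative feature is already decided:
    the largest weight is near 1 and the remaining weights are near 0,
    so further optimization cannot change the selected feature."""
    still_active = []
    for g in active_groups:
        wg = w[groups[g]]               # weights of the features in group g
        if wg.max() > 1 - eps and wg.sum() - wg.max() < eps:
            continue                    # group g is stable; freeze it
        still_active.append(g)
    return still_active
```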
Results and discussion
Microarray gene expression
We compared the proposed feature selection formulation using the block coordinate gradient descent (BCGD) algorithm to two baseline and two state-of-the-art filter-type feature selection methods. (1) The Pearson Correlation (PC) method ranks features by their correlation with the target and selects the top m features; (2) Relief is one of the most successful strategies in feature selection [34, 35]; it chooses instances randomly and updates the feature relevance weights based on nearest neighbors, giving higher weights to features that discriminate an instance from neighbors of different classes; (3) mRMR ranks the features according to the minimal-redundancy-maximal-relevance criterion [17, 36], which is based on mutual information; (4) STBIP formulates feature selection as a quadratic objective function to select m features with maximal discriminative power and minimal redundancy [19], where the redundancy among features is computed based on Pearson correlation. We note that all methods we compare to, including our method, are filter-type feature selection methods, where the objective is to rank or select features without learning a classifier.
In order to apply our method, we need to cluster the genes, which can be done in different ways. For example, each cluster may include genes that encode similar polypeptides or proteins, which are often located a few thousand base pairs apart from each other. Gene Ontology (GO) has been utilized to cluster genes based on their common function, where the clusters are not constrained by gene expression or other properties [37]. Another way is to group co-expressed genes in the same cluster, even though they do not necessarily have similar functions [38]. For a survey on clustering genes, the reader is referred to [39, 40] and references therein. Our method is decoupled from the clustering step. However, in order to have a fair comparison with the baseline methods, which select the top m features, we clustered the genes based on Pearson correlation into m clusters and applied our method to select one gene from each cluster.
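As an illustration of this preprocessing step, the following sketch uses SciPy's hierarchical clustering with a 1 - Pearson-correlation distance; the average-linkage choice is an assumption, since the article specifies only correlation-based clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def correlation_groups(X, m):
    """Cluster genes (columns of X, shape n_samples x n_genes) into m
    groups using 1 - Pearson correlation as the distance, so that
    co-expressed genes end up in the same group."""
    dist = pdist(X.T, metric="correlation")             # 1 - Pearson correlation
    tree = linkage(dist, method="average")              # linkage choice assumed
    labels = fcluster(tree, t=m, criterion="maxclust")  # labels in 1..m
    return [np.flatnonzero(labels == k) for k in range(1, m + 1)]
```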
For each dataset, we sampled training data from each class (as indicated in the last column in Table 1) for training the feature selection method, and the remaining samples were used as test data. Using only the selected features, a linear SVM was trained on the training data using the LIBLINEAR package [42], where the parameter C∈{10^{−3},10^{−2},…,10^{3}} was chosen by nested 3-fold cross-validation on the training set. Note that the test data were never used for training either the feature selection method or the SVM. Since the microarray datasets are imbalanced, we used the area under the ROC curve (AUC) as the evaluation metric. The average AUC is computed over 40 repetitions of random splits into training and test data.
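For the binary-label case, the protocol could be sketched as follows with scikit-learn's LinearSVC (a wrapper around LIBLINEAR); dataset variables and the training size are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

def auc_one_split(X, y, selected, n_train, seed):
    """One repetition: random stratified split, tune C on the training set
    only via 3-fold CV, then report test AUC (binary labels assumed)."""
    Xtr, Xte, ytr, yte = train_test_split(
        X[:, selected], y, train_size=n_train, stratify=y, random_state=seed)
    grid = {"C": [10.0 ** k for k in range(-3, 4)]}  # C in {1e-3, ..., 1e3}
    clf = GridSearchCV(LinearSVC(), grid, cv=3).fit(Xtr, ytr)
    return roc_auc_score(yte, clf.decision_function(Xte))

# Average AUC over 40 random train/test splits, mirroring the protocol.
# aucs = [auc_one_split(X, y, selected, n_train, s) for s in range(40)]
# print(np.mean(aucs), np.std(aucs))
```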
Evaluation of gene selection methods on 5 benchmark datasets using the top m genes
Dataset | Method | m=20 | m=50 | m=100 | m=200 | m=1000
---|---|---|---|---|---|---
Tumor14 | BCGD | 0.786 ±0.036 | 0.797 ±0.049 | 0.821 ±0.041 | 0.825 ±0.036 | 0.846 ±0.042
 | PC | 0.766 ±0.047 | 0.786 ±0.041 | 0.793 ±0.041 | 0.805 ±0.045 | 0.830 ±0.033
 | Relief | 0.748 ±0.069 | 0.788 ±0.050 | 0.803 ±0.033 | 0.822 ±0.036 | 0.844 ±0.038
 | mRMR | 0.785 ±0.041 | 0.803 ±0.036 | 0.813 ±0.038 | 0.817 ±0.039 | 0.824 ±0.033
 | STBIP | 0.672 ±0.054 | 0.733 ±0.035 | 0.761 ±0.044 | 0.795 ±0.038 | 0.837 ±0.047
Lung | BCGD | 0.762 ±0.180 | 0.806 ±0.168 | 0.789 ±0.183 | 0.789 ±0.163 | 0.785 ±0.169
 | PC | 0.732 ±0.200 | 0.777 ±0.189 | 0.756 ±0.185 | 0.777 ±0.188 | 0.789 ±0.153
 | Relief | 0.652 ±0.243 | 0.689 ±0.245 | 0.732 ±0.206 | 0.752 ±0.224 | 0.797 ±0.147
 | mRMR | 0.721 ±0.207 | 0.750 ±0.195 | 0.755 ±0.195 | 0.788 ±0.184 | 0.783 ±0.156
 | STBIP | 0.637 ±0.212 | 0.721 ±0.198 | 0.739 ±0.192 | 0.767 ±0.170 | 0.771 ±0.165
Myeloma | BCGD | 0.662 ±0.077 | 0.706 ±0.061 | 0.712 ±0.062 | 0.709 ±0.058 | 0.717 ±0.061
 | PC | 0.675 ±0.071 | 0.694 ±0.068 | 0.705 ±0.058 | 0.709 ±0.055 | 0.713 ±0.056
 | Relief | 0.583 ±0.085 | 0.624 ±0.085 | 0.650 ±0.075 | 0.679 ±0.064 | 0.709 ±0.058
 | mRMR | 0.647 ±0.077 | 0.691 ±0.058 | 0.702 ±0.061 | 0.715 ±0.061 | 0.721 ±0.057
 | STBIP | 0.565 ±0.093 | 0.619 ±0.079 | 0.648 ±0.082 | 0.672 ±0.079 | 0.701 ±0.062
DLBCL | BCGD | 0.970 ±0.031 | 0.971 ±0.034 | 0.975 ±0.024 | 0.981 ±0.023 | 0.987 ±0.017
 | PC | 0.947 ±0.041 | 0.957 ±0.042 | 0.963 ±0.047 | 0.964 ±0.043 | 0.980 ±0.027
 | Relief | 0.947 ±0.062 | 0.974 ±0.027 | 0.983 ±0.020 | 0.988 ±0.016 | 0.990 ±0.013
 | mRMR | 0.962 ±0.055 | 0.981 ±0.025 | 0.987 ±0.019 | 0.985 ±0.021 | 0.980 ±0.025
 | STBIP | 0.808 ±0.101 | 0.905 ±0.060 | 0.925 ±0.063 | 0.943 ±0.066 | 0.978 ±0.029
Colon | BCGD | 0.874 ±0.110 | 0.878 ±0.101 | 0.879 ±0.095 | 0.886 ±0.096 | 0.858 ±0.119
 | PC | 0.886 ±0.164 | 0.879 ±0.150 | 0.863 ±0.150 | 0.868 ±0.125 | 0.856 ±0.115
 | Relief | 0.896 ±0.113 | 0.888 ±0.098 | 0.877 ±0.104 | 0.859 ±0.128 | 0.859 ±0.113
 | mRMR | 0.874 ±0.115 | 0.889 ±0.104 | 0.872 ±0.120 | 0.870 ±0.093 | 0.850 ±0.118
 | STBIP | 0.781 ±0.181 | 0.821 ±0.143 | 0.847 ±0.128 | 0.847 ±0.139 | 0.862 ±0.115
Average | BCGD | 0.811 ±0.148 | 0.832 ±0.132 | 0.835 ±0.133 | 0.838 ±0.127 | 0.840 ±0.137
 | PC | 0.801 ±0.158 | 0.819 ±0.146 | 0.816 ±0.144 | 0.825 ±0.137 | 0.834 ±0.131
 | Relief | 0.765 ±0.191 | 0.792 ±0.179 | 0.809 ±0.159 | 0.820 ±0.158 | 0.841 ±0.132
 | mRMR | 0.798 ±0.160 | 0.823 ±0.145 | 0.826 ±0.145 | 0.835 ±0.133 | 0.832 ±0.131
 | STBIP | 0.693 ±0.167 | 0.760 ±0.153 | 0.784 ±0.148 | 0.805 ±0.141 | 0.830 ±0.138
Top 10 GO terms enriched in the BCGD-selected genes from the Myeloma dataset
GO ID | Ontology | GO term | Selected genes annotated (%) | P-value
---|---|---|---|---
GO:0044444 | Cellular Component | cytoplasmic part | 45.8 | 2.0E-4 |
GO:0015629 | Cellular Component | actin cytoskeleton | 8.3 | 6.8E-4 |
GO:0044449 | Cellular Component | contractile fiber part | 5.2 | 3.4E-3 |
GO:0043412 | Biological Process | biopolymer modification | 19.8 | 4.3E-3 |
GO:0032991 | Cellular Component | macromolecular complex | 30.2 | 4.3E-3 |
GO:0043292 | Cellular Component | contractile fiber | 5.2 | 4.4E-3 |
GO:0005622 | Cellular Component | intracellular | 75.0 | 6.1E-3 |
GO:0005515 | Molecular Function | protein binding | 60.4 | 6.2E-3 |
GO:0005737 | Cellular Component | cytoplasm | 55.2 | 6.7E-3 |
GO:0008081 | Molecular Function | phosphoric ester hydrolase activity | 7.3 | 1.1E-2 |
Top 10 diseases associated with the BCGD selected genes
Disease | # Genes | P-value
---|---|---
Stress | 8 | 1.6E-3 |
Nevi and Melanomas | 6 | 1.6E-3 |
Large granular lymphocytic leukemia | 3 | 1.7E-3 |
cancer or viral infections | 10 | 2.1E-3 |
Hemoglobinuria | 3 | 2.1E-3 |
Neuroendocrine Tumors | 5 | 2.1E-3 |
HIV | 9 | 2.1E-3 |
Leukemia, T-Cell | 5 | 2.1E-3 |
Corneal Neovascularization | 3 | 2.1E-3 |
Leukemia | 7 | 2.1E-3 |
Efficiency of BCGD on synthetic data
The proposed feature selection formulation can be solved using any standard optimization algorithm. We used the trust-region-reflective (TR) algorithm, as it is the fastest algorithm implemented in Matlab for the proposed constrained optimization problem (7). However, we developed an efficient block coordinate gradient descent (BCGD) algorithm that is four times faster than the standard algorithm for high-dimensional applications (millions of features). In order to show the efficiency of BCGD, we conducted several synthetic experiments, where all synthetic datasets were generated using the process described in Additional file 1. First, we conducted experiments to show the efficacy of utilizing the active set and how it reduces the computational cost of the BCGD algorithm. Then, we compared the computational time of BCGD and TR in 42 settings, with the number of features N ∈ {10e3, 100e3, 200e3, 400e3, 600e3, 800e3, 1e6} distributed over the number of groups G ∈ {100, 200, 400, 600, 800, 1000}.
Utilization of active set
The BCGD algorithm does not update the weights of all groups at each iteration. Instead, it updates the entire weight vector (i.e., all groups) in the first few iterations, and in each subsequent iteration it identifies the non-stable groups and optimizes only those. We hypothesize that a group with clearly discriminative features becomes stable in early iterations, while groups whose features are harder to distinguish require more iterations.
Scalability
To show the efficiency of the proposed BCGD algorithm, experiments were conducted to compare the running time of both algorithms, TR and BCGD. Ten datasets were generated with 100 samples and N features distributed over G groups. We varied the number of groups G ∈ {100, 200, 400, 600, 800, 1000} and the number of features N ∈ {10e3, 100e3, 200e3, 400e3, 600e3, 800e3, 1e6}. Then, both algorithms were applied to each dataset and the results were computed as the average over all 10 datasets.
Conclusion
A feature selection method was proposed to select features that jointly maximize discriminative power. This is achieved by considering the structural relationships among features, where the features are grouped based on prior knowledge; the feature selection problem is then formulated as selecting one feature from each group. We developed a block coordinate gradient descent algorithm to solve the optimization problem. Comparing the proposed method with four state-of-the-art methods on five benchmark gene expression datasets showed that the proposed method achieved the highest average AUC in 13 of 25 experiments, while each of the other methods did so in no more than 6 of 25 experiments. In addition, several synthetic experiments were conducted to show the efficiency of the proposed BCGD algorithm over the standard optimization algorithm: BCGD was four times faster, indicating its applicability to high-dimensional data. In future work, we will investigate convergence properties of the proposed method. In addition, it might be interesting to learn the clusters of genes simultaneously with the feature selection.
Declarations
Acknowledgements
This work was funded, in part, by DARPA grant [DARPAN66001-11-1-4183] negotiated by SSC Pacific. The authors thank the anonymous reviewers for their valuable comments, which improved the presentation of the paper.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
- Dramiński M, Rada-Iglesias A, Enroth S, Wadelius C, Koronacki J, Komorowski J. Monte Carlo feature selection for supervised classification. Bioinformatics. 2008; 24(1):110–7.
- Marczyk M, Jaksik R, Polanski A, Polanska J. Adaptive filtering of microarray gene expression data based on Gaussian mixture decomposition. BMC Bioinformatics. 2013; 14(1):101.
- Su Y, Murali T, Pavlovic V, Schaffer M, Kasif S. RankGene: identification of diagnostic genes based on expression data. Bioinformatics. 2003; 19(12):1578–9.
- Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23(19):2507–17.
- Fry MJ, Ohlmann JW. Introduction to the special issue on analytics in sports, part I: General sports applications. Interfaces. 2012; 42(2):105–8. doi:10.1287/inte.1120.0633.
- Ghalwash MF, Obradovic Z. Early classification of multivariate temporal observations by extraction of interpretable shapelets. BMC Bioinformatics. 2012; 13. doi:10.1186/1471-2105-13-195.
- Ghalwash MF, Radosavljevic V, Obradovic Z. Extraction of interpretable multivariate patterns for early diagnostics. In: IEEE 13th International Conference on Data Mining (ICDM). Dallas, TX, USA: IEEE; 2013. p. 201–10.
- Holec M, Kléma J, Železný F, Tolar J. Comparative evaluation of set-level techniques in predictive classification of gene expression samples. BMC Bioinformatics. 2012; 13(Suppl 10):15.
- Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002; 46(1–3):389–422.
- Mamitsuka H. Selecting features in microarray classification using ROC curves. Pattern Recognit. 2006; 39(12):2393–404.
- Sharma A, Paliwal K. Cancer classification by gradient LDA technique using microarray gene expression data. Data Knowl Eng. 2008; 66(2):338–47.
- Sharma A, Imoto S, Miyano S, Sharma V. Null space based feature selection method for gene expression data. Int J Mach Learn Cybern. 2012; 3(4):269–76.
- Swift S, Tucker A, Vinciotti V, Martin N, Orengo C, Liu X, Kellam P. Consensus clustering and functional interpretation of gene-expression data. Genome Biol. 2004; 5(11):94.
- Mitra S, Ghosh S. Feature selection and clustering of gene expression profiles using biological knowledge. IEEE Trans Syst Man Cybern C Appl Rev. 2012; 42(6):1590–9. doi:10.1109/TSMCC.2012.2209416.
- Zhou J, Lu Z, Sun J, Yuan L, Wang F, Ye J. FeaFiner: biomarker identification from medical data through feature generalization and selection. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Chicago, IL, USA: ACM; 2013. p. 1034–42.
- Brunet JP, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci. 2004; 101(12):4164–9.
- Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005; 27(8):1226–38.
- Liu S, Liu H, Latecki LJ, Yan S, Xu C, Lu H. Size adaptive selection of most informative features. In: Proceedings of the AAAI Conference on Artificial Intelligence. San Francisco, CA, USA: AAAI; 2011.
- Lan L, Vucetic S. Multi-task feature selection in microarray data by binary integer programming. In: BMC Proceedings, vol. 7. BioMed Central; 2013. p. 50.
- Sharma A, Imoto S, Miyano S. A top-r feature selection algorithm for microarray gene expression data. IEEE/ACM Trans Comput Biol Bioinform. 2012; 9(3):754–64.
- Adams WY, Su H, Fei-Fei L. Efficient Euclidean projections onto the intersection of norm balls. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12). Edinburgh, Scotland; 2012. p. 433–40.
- Boyd S, Vandenberghe L. Convex Optimization. Cambridge, UK: Cambridge University Press; 2004.
- Collobert R, Sinz F, Weston J, Bottou L. Trading convexity for scalability. In: Proceedings of the International Conference on Machine Learning (ICML). Pittsburgh, PA, USA; 2006.
- Yuille A, Rangarajan A. The concave-convex procedure (CCCP). Neural Comput. 2003; 15:915–36.
- Lanckriet GR, Sriperumbudur BK. On the convergence of the concave-convex procedure. In: Advances in Neural Information Processing Systems (NIPS). Vancouver, BC, Canada; 2009. p. 1759–67.
- Rosasco L, De Vito E, Caponnetto A, Piana M, Verri A. Are loss functions all the same? Neural Comput. 2004; 16(5):1063–76.
- Coleman TF, Li Y. An interior trust region approach for nonlinear minimization subject to bounds. SIAM J Optim. 1996; 6:418–55.
- Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010; 33(1):1.
- Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun Pure Appl Math. 2004; 57(11):1413–57.
- Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. J Optim Theory Appl. 2001; 109(3):475–94.
- Tseng P, Yun S. A coordinate gradient descent method for nonsmooth separable minimization. Math Program. 2009; 117(1–2):387–423.
- Tseng P, Yun S. A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training. Comput Optim Appl. 2010; 47(2):179–206.
- Meier L, van de Geer S, Bühlmann P. The group lasso for logistic regression. J R Stat Soc Ser B Stat Methodol. 2008; 70(1):53–71.
- Kira K, Rendell LA. A practical approach to feature selection. In: Proceedings of the Ninth International Workshop on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann; 1992. p. 249–56.
- A Feature Selection Toolbox for C and Matlab. http://www.cs.man.ac.uk/~gbrown/fstoolbox/. v1.03, accessed June 2015.
- mRMR: minimum Redundancy Maximum Relevance Feature Selection. http://penglab.janelia.org/proj/mRMR/. v.09, accessed June 2015.
- Yi G, Sze SH, Thon MR. Identifying clusters of functionally related genes in genomes. Bioinformatics. 2007; 23(9):1053–60.
- Loganantharaj R. Beyond clustering of array expressions. Int J Bioinform Res Appl. 2009; 5(3):329–48.
- Jiang D, Tang C, Zhang A. Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data Eng. 2004; 16(11):1370–86.
- Nagi S, Bhattacharyya DK, Kalita JK. Gene expression data clustering analysis: a survey. In: 2nd National Conference on Emerging Trends and Applications in Computer Science (NCETACS). Shillong, Meghalaya, India: IEEE; 2011. p. 1–12.
- The gene expression datasets were downloaded either from the respective website or from https://github.com/ramhiser/datamicroarray/blob/master/README.md. Accessed June 2015.
- Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ. LIBLINEAR: a library for large linear classification. J Mach Learn Res. 2008; 9:1871–4.
- Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2008; 4(1):44–57.
- Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009; 37(1):1–13.
- Desouza M, Gunning PW, Stehn JR. The actin cytoskeleton as a sensor and mediator of apoptosis. BioArchitecture. 2012; 2(3):75–87.
- Zhang B, Kirov S, Snoddy J. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 2005; 33(suppl 2):741–8.
- Wang J, Duncan D, Shi Z, Zhang B. Web-based gene set analysis toolkit (WebGestalt): update 2013. Nucleic Acids Res. 2013; 41(W1):77–83.
- Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF. GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform. 2005; 74(7):491–503.
- Tian E, Zhan F, Walker R, Rasmussen E, Ma Y, Barlogie B, Shaughnessy Jr JD. The role of the Wnt-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. N Engl J Med. 2003; 349(26):2483–94.
- Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002; 8(1):68–74.
- Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci. 1999; 96(12):6745–50.