Sparse logistic regression with a L_{1/2} penalty for gene selection in cancer classification
- Yong Liang^{1},
- Cheng Liu^{1},
- Xin-Ze Luan^{1},
- Kwong-Sak Leung^{2},
- Tak-Ming Chan^{2},
- Zong-Ben Xu^{3} and
- Hai Zhang^{3}
https://doi.org/10.1186/1471-2105-14-198
© Liang et al.; licensee BioMed Central Ltd. 2013
Received: 4 July 2012
Accepted: 30 May 2013
Published: 19 June 2013
Abstract
Background
Microarray technology is widely used in cancer diagnosis. Successfully identifying gene biomarkers will significantly help to classify different cancer types and improve the prediction accuracy. The regularization approach is one of the effective methods for gene selection in microarray data, which generally contain a large number of genes and a small number of samples. In recent years, various approaches have been developed for gene selection of microarray data. Generally, they are divided into three categories: filter, wrapper and embedded methods. Regularization methods are an important embedded technique and perform continuous shrinkage and automatic gene selection simultaneously. Recently, there has been growing interest in applying regularization techniques to gene selection. A popular regularization technique is the Lasso (L_{1}), and many L_{1}-type regularization terms have been proposed in recent years. Theoretically, L_{q}-type regularization with a lower value of q leads to sparser solutions. Moreover, the L_{1/2} regularization can be taken as representative of the L_{q} (0 < q < 1) regularizations and has been shown to have many attractive properties.
Results
In this work, we investigate sparse logistic regression with the L_{1/2} penalty for gene selection in cancer classification problems, and propose a coordinate descent algorithm with a novel univariate half thresholding operator to solve the L_{1/2} penalized logistic regression. Experimental results on artificial and microarray data demonstrate the effectiveness of our proposed approach compared with other regularization methods. In particular, on 4 publicly available gene expression datasets, the L_{1/2} regularization method achieved its success using only about 2 to 14 predictors (genes), compared to about 6 to 38 genes for the ordinary L_{1} and elastic net regularization approaches.
Conclusions
From our evaluations, it is clear that sparse logistic regression with the L_{1/2} penalty achieves higher classification accuracy than the ordinary L_{1} and elastic net regularization approaches, while fewer but informative genes are selected. This is an important consideration for screening and diagnostic applications, where the goal is often to develop an accurate test using as few features as possible in order to control cost. Therefore, sparse logistic regression with the L_{1/2} penalty is an effective technique for gene selection in real classification problems.
Keywords
Gene selection; Sparse logistic regression; Cancer classification

Background
With the development of DNA microarray technology, researchers can analyze the expression levels of thousands of genes simultaneously. Many studies have demonstrated that microarray data are useful for the classification of many cancers. However, from the biological perspective, only a small subset of genes is strongly indicative of a targeted disease, and most genes are irrelevant to cancer classification. The irrelevant genes may introduce noise and decrease classification accuracy. Moreover, from the machine learning perspective, too many genes may lead to overfitting and can negatively influence the classification performance. Due to the significance of these problems, effective gene selection methods are desirable to help to classify different cancer types and improve prediction accuracy.
In recent years, various approaches have been developed for gene selection of microarray data. Generally, they are divided into three categories: filter, wrapper and embedded methods. Filter methods evaluate a gene based on its discriminative power without considering its correlations with other genes [1-4]. The drawback of filter methods is that they examine each gene independently, ignoring the possibility that groups of genes may have a combined effect which is not necessarily reflected by the individual performance of the genes in the group. This is a common issue with statistical tests such as the t-test, which examine each gene in isolation.
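As an illustration of how simple such a filter is, the following sketch ranks genes by a per-gene two-sample t-statistic (Welch form) and keeps the top-ranked ones; this is an illustrative example, not code from the paper, and `t_filter` is our own name.

```python
import numpy as np

def t_filter(X, y, top=50):
    """Rank genes by a per-gene two-sample t-statistic computed in
    isolation, and return the indices of the `top` highest-ranked genes."""
    a, b = X[y == 0], X[y == 1]
    # Welch standard error of the difference in class means, per gene
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(t)[::-1][:top]
```

Note that, exactly as criticized above, each gene's score is computed without reference to any other gene.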
Wrapper methods utilize a particular learning method as the feature evaluation measure to select gene subsets in terms of the estimated classification errors and build the final classifier. Wrapper approaches can obtain a small subset of relevant genes and can significantly improve classification accuracy [5, 6]. For example, Guyon et al. [7] proposed a gene selection approach utilizing support vector machines (SVM) based on recursive feature elimination. However, wrapper methods require extensive computational time.
The third group of gene selection procedures is embedded methods, which perform the variable selection as part of the statistical learning procedure. Embedded methods have drawn much attention recently in the literature: they are computationally much more efficient and less prone to overfitting than wrapper methods, with similar performance [8].
Regularization methods are an important embedded technique that performs continuous shrinkage and automatic gene selection simultaneously. Recently, there has been growing interest in applying regularization techniques to logistic regression models. Logistic regression is a powerful discriminative method with a direct probabilistic interpretation, which yields classification probabilities in addition to the class label information. In order to extract key features in classification problems, a series of regularized logistic regression methods have been proposed. For example, Shevade and Keerthi [9] proposed a sparse logistic regression based on the Lasso regularization [10] and Gauss-Seidel methods. Glmnet is a general approach for L_{1}-type regularized (including Lasso and elastic net) linear models using a coordinate descent algorithm [11, 12]. Similarly, Cawley and Talbot [13] investigated sparse logistic regression with Bayesian regularization. Inspired by the aforementioned methods, we investigate the sparse logistic regression model with an L_{1/2} penalty, in particular for gene selection in cancer classification. The L_{1/2} penalty can be taken as representative of the L_{q} (0 < q < 1) penalties and has been shown to have many attractive properties, such as unbiasedness, sparsity and oracle properties [14].
In this paper, we develop a coordinate descent algorithm for the L_{1/2} regularization in the sparse logistic regression framework. The approach is applicable to biological data with high dimensions and low sample sizes. Empirical comparisons with sparse logistic regression under the L_{1} penalty and the elastic net penalty demonstrate the effectiveness of the proposed L_{1/2} penalized logistic regression for gene selection in cancer classification problems.
Methods
Sparse logistic regression with the L_{1/2} penalty
In the sparse logistic regression framework, the coefficient vector β is estimated by minimizing the penalized negative log-likelihood (reconstructed here from the surrounding definitions): $\hat{\beta} = \arg\min_{\beta}\{-\sum_{i=1}^{n}[Y_i X_i\beta - \log(1 + \exp(X_i\beta))] + \lambda P(\beta)\}$, where λ > 0 is a tuning parameter and P(β) is a regularization term. A popular regularization technique is the Lasso (L_{1}) [10], which has the regularization term $P(\beta) = \sum_j |\beta_j|$. Many L_{1}-type regularization terms have been proposed in recent years, such as SCAD [15], the elastic net [16], and MC+ [17].
The L_{1/2} regularization has been shown to have many attractive properties, such as unbiasedness, sparsity and oracle properties. Theoretical and experimental analyses show that the L_{1/2} regularization is a competitive approach. Our work in this paper also reveals the effectiveness of the L_{1/2} regularization in solving nonlinear logistic regression problems with a small number of predictive features (genes).
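To make the objective concrete, the quantity being minimized can be written down in a few lines. This is a minimal sketch assuming the standard binomial log-likelihood; the function name `penalized_nll` is ours, not the paper's.

```python
import numpy as np

def penalized_nll(beta, X, y, lam):
    """Negative log-likelihood of logistic regression plus the L_{1/2}
    penalty lam * sum_j |beta_j|^{1/2}.  X is (n, p), y has entries in {0, 1}."""
    z = X @ beta
    # per-sample negative log-likelihood: log(1 + exp(z_i)) - y_i * z_i
    nll = np.sum(np.logaddexp(0.0, z) - y * z)
    penalty = lam * np.sum(np.abs(beta) ** 0.5)
    return nll + penalty
```

The only change relative to the Lasso objective is the exponent 0.5 in the penalty term, which is what produces the stronger sparsity discussed above.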
A coordinate descent algorithm for the L_{1/2} penalized logistic regression
The coordinate descent algorithm [11, 12] is a “one-at-a-time” approach, and its basic procedure can be described as follows: for each coefficient β_{ j } (j = 1, 2, …, p), partially optimize the target function with respect to β_{ j } with the remaining elements of β fixed at their most recently updated values.
where I is the indicator function. This formula is equivalent to the hard thresholding operator [17].
There are three cases: $\omega_j < 0$, $0 < \omega_j < \frac{3}{4}\lambda^{\frac{2}{3}}$, and $\omega_j > \frac{3}{4}\lambda^{\frac{2}{3}}$, respectively.
(i) If $\omega_j < 0$, the three roots of equation (11) can be expressed as follows: $\mu_1 = -2r\sin\frac{\phi}{3}$, $\mu_2 = r\sin\frac{\phi}{3} + i\sqrt{3}\,r\cos\frac{\phi}{3}$ and $\mu_3 = r\sin\frac{\phi}{3} - i\sqrt{3}\,r\cos\frac{\phi}{3}$, where $r = \sqrt{\frac{|\omega_j|}{3}}$ and $\phi = \arccos\left(\frac{\lambda}{8r^3}\right)$. When r > 0, none of the roots satisfies $\mu > 0$. Thus, there is no solution to equation (11) when $\omega_j < 0$.
and $\mu_3 = r\cos\frac{\phi}{3} - i\sqrt{3}\,r\sin\frac{\phi}{3}$.
There is still no solution to equation (11) in this case.
and $\beta_j = -(\mu_2)^2 = -\frac{2}{3}|\omega_j|\left(1 + \cos\left(\frac{2(\pi - \phi(\omega_j))}{3}\right)\right)$.
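Putting the case analysis together, the resulting univariate half thresholding rule can be sketched as follows. This is our reading of the equations above (zero below the threshold $\frac{3}{4}\lambda^{2/3}$, otherwise the trigonometric closed form), offered as an illustrative sketch rather than the authors' code.

```python
import numpy as np

def half_threshold(omega, lam):
    """Univariate half thresholding operator from the case analysis above:
    returns 0 when |omega| <= 0.75 * lam**(2/3), otherwise the trigonometric
    closed-form solution of the cubic equation."""
    if abs(omega) <= 0.75 * lam ** (2.0 / 3.0):
        return 0.0
    # phi = arccos(lam / (8 r^3)) with r = sqrt(|omega| / 3)
    phi = np.arccos((lam / 8.0) * (abs(omega) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * omega * (1.0 + np.cos(2.0 * (np.pi - phi) / 3.0))
```

For large |ω| the shrinkage factor tends to 1, reflecting the near-unbiasedness of the L_{1/2} penalty for large coefficients, in contrast to the constant shift of soft thresholding.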
where $Z_i = X_i\tilde{\beta} + \frac{Y_i - f(X_i\tilde{\beta})}{f(X_i\tilde{\beta})(1 - f(X_i\tilde{\beta}))}$ is an estimated (working) response, $W_i = f(X_i\tilde{\beta})(1 - f(X_i\tilde{\beta}))$ is a weight, and $f(X_i\tilde{\beta}) = \exp(X_i\tilde{\beta})/(1 + \exp(X_i\tilde{\beta}))$ is the value evaluated at the current parameters. Redefining the partial residual for fitting the current $\tilde{\beta}_j$ via $\tilde{Z}_i^{(j)} = \sum_{k \neq j} x_{ik}\tilde{\beta}_k$ and $\omega_j = \sum_{i=1}^{n} W_i x_{ij}\left(Z_i - \tilde{Z}_i^{(j)}\right)$, we can directly apply the coordinate descent algorithm with the L_{1/2} penalty to sparse logistic regression. The details are given as follows:
The coordinate descent algorithm for the L_{1/2} penalized logistic regression works well on sparse problems, because the procedure does not need to change many irrelevant parameters or recalculate partial residuals at each update step.
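The whole procedure, IRLS re-weighting in the outer loop and half-thresholding coordinate updates in the inner loop, can be sketched as below. This is a simplified, self-contained reading of the algorithm, not the authors' implementation; in particular, the per-coordinate rescaling of λ by the weighted column norm is our choice to keep the univariate subproblem in the form solved by the operator.

```python
import numpy as np

def _half_threshold(omega, lam):
    # univariate half thresholding: zero below 0.75 * lam**(2/3)
    if abs(omega) <= 0.75 * lam ** (2.0 / 3.0):
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(omega) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * omega * (1.0 + np.cos(2.0 * (np.pi - phi) / 3.0))

def l12_logistic(X, y, lam, n_iter=30):
    """Coordinate descent for L_{1/2}-penalized logistic regression (sketch).
    Outer loop: IRLS working response Z and weights W at the current beta.
    Inner loop: cycle through coordinates, half-thresholding each update."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        f = 1.0 / (1.0 + np.exp(-np.clip(eta, -30, 30)))
        w = np.clip(f * (1.0 - f), 1e-5, None)      # weights W_i
        z = eta + (y - f) / w                        # working response Z_i
        for j in range(p):
            # partial residual with coordinate j removed
            r = z - X @ beta + X[:, j] * beta[j]
            a = np.sum(w * X[:, j] ** 2)
            omega = np.sum(w * X[:, j] * r) / a
            beta[j] = _half_threshold(omega, lam / a)
    return beta
```

As the text notes, most coordinates of a sparse solution are set to zero by the operator and their updates are essentially free.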
Results
Analyses of simulated data
In this section, we evaluate the performance of sparse logistic regression with the L_{1/2} penalty in a simulation study. We generate high-dimensional, low-sample-size data which contain many irrelevant features. Two methods are compared with our proposed approach: sparse logistic regression with the elastic net penalty (L_{EN}) and sparse logistic regression with the Lasso penalty (L_{1}).
where ϵ is an independent random error generated from N(0,1) and σ is a parameter which controls the signal-to-noise ratio. In every simulation, the dimension p of the predictor vector is 1000, and the first five true coefficients are nonzero: β_{1} = 1, β_{2} = 1, β_{3} = -1, β_{4} = -1, β_{5} = 1, and β_{ j } = 0 (6 ≤ j ≤ 1000).
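A generator for this kind of setup can be sketched as follows. Since the paper's exact generating equation is not shown here, this is our reconstruction under two stated assumptions: predictors are equicorrelated Gaussians with pairwise correlation ρ (the ρ values used in the tables), and σϵ is added to the linear predictor before the logistic link.

```python
import numpy as np

def simulate(n, p=1000, rho=0.1, sigma=0.2, seed=0):
    """Simulated high-dimensional data: equicorrelated Gaussian predictors
    (correlation rho via a shared factor) and logistic labels with noise
    sigma * eps added to the linear predictor.  Assumptions noted above."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n, 1))  # shared factor induces correlation rho
    X = np.sqrt(rho) * u + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:5] = [1, 1, -1, -1, 1]     # the five relevant coefficients
    eta = X @ beta + sigma * rng.standard_normal(n)  # eps ~ N(0, 1)
    y = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(int)
    return X, y, beta
```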
The optimal tuning parameter λ in the sparse logistic regression models can be estimated in many ways and is often chosen by k-fold cross-validation (CV). Note that the choice of k will depend on the size of the training set. In our experiments, we use 10-fold cross-validation (k = 10). The elastic net method has two tuning parameters, so we need to cross-validate on a two-dimensional surface [16].
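The k-fold procedure itself is simple enough to sketch. Here `fit` stands for any solver with the hypothetical interface `(X, y, lam) -> beta` (e.g. the penalized logistic solver above); the function is illustrative, not the paper's code.

```python
import numpy as np

def kfold_cv_error(X, y, lam, fit, k=10, seed=0):
    """k-fold cross-validated misclassification rate for one value of lam.
    `fit` is any function (X, y, lam) -> coefficient vector beta."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)       # all indices not in the held-out fold
        beta = fit(X[train], y[train], lam)
        pred = (X[f] @ beta > 0).astype(int)
        errs.append(np.mean(pred != y[f]))
    return float(np.mean(errs))
```

λ is then chosen as the grid value minimizing this error, e.g. `min(grid, key=lambda l: kfold_cv_error(X, y, l, fit))`; for the elastic net the same loop runs over a two-dimensional grid.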
The average errors (%) for the test data sets obtained by the sparse logistic regressions with the L_{1/2}, L_{EN} and L_{1} penalties in 30 runs

| Setting | Sample size | L_{1/2} | L_{EN} | L_{1} |
|---|---|---|---|---|
| ρ = 0.1, σ = 0.2 | n=50 | 28.2 | 31.8 | 31.2 |
| | n=80 | 10.7 | 23.1 | 22.2 |
| | n=100 | 8.1 | 16.9 | 15.7 |
| ρ = 0.1, σ = 0.6 | n=50 | 31.4 | 33.1 | 33.3 |
| | n=80 | 18.4 | 27.1 | 26.6 |
| | n=100 | 14.2 | 22.4 | 21.3 |
| ρ = 0.4, σ = 0.2 | n=50 | 30.1 | 32.6 | 33.0 |
| | n=80 | 11.1 | 23.3 | 22.9 |
| | n=100 | 9.1 | 19.0 | 16.4 |
| ρ = 0.4, σ = 0.6 | n=50 | 35.1 | 35.5 | 36.3 |
| | n=80 | 20.5 | 27.2 | 26.9 |
| | n=100 | 15.1 | 22.7 | 22.9 |
The average number of variables selected by the sparse logistic regressions with the L_{1/2}, L_{EN} and L_{1} penalties in 30 runs

| Setting | Sample size | L_{1/2} | L_{EN} | L_{1} |
|---|---|---|---|---|
| ρ = 0.1, σ = 0.2 | n=50 | 7.5 | 31.6 | 27.1 |
| | n=80 | 8.8 | 43.1 | 40.3 |
| | n=100 | 8.9 | 49.7 | 45.7 |
| ρ = 0.1, σ = 0.6 | n=50 | 8.3 | 33.6 | 29.2 |
| | n=80 | 10.6 | 45.7 | 41.9 |
| | n=100 | 10.8 | 54.4 | 50.1 |
| ρ = 0.4, σ = 0.2 | n=50 | 7.8 | 33.5 | 28.3 |
| | n=80 | 8.9 | 44.5 | 41.8 |
| | n=100 | 9.0 | 51.2 | 46.6 |
| ρ = 0.4, σ = 0.6 | n=50 | 8.6 | 41.3 | 29.9 |
| | n=80 | 10.7 | 45.9 | 44.1 |
| | n=100 | 11.2 | 56.4 | 53.4 |
The frequencies of the relevant variables (β_{1}–β_{5}) obtained by the sparse logistic regressions with the L_{1/2}, L_{EN} and L_{1} penalties in 30 runs

| Setting | Sample size | Method | β_{1} | β_{2} | β_{3} | β_{4} | β_{5} |
|---|---|---|---|---|---|---|---|
| ρ = 0.1, σ = 0.2 | n=50 | L_{1/2} | 21 | 22 | 19 | 15 | 15 |
| | | L_{EN} | 24 | 25 | 21 | 17 | 17 |
| | | L_{1} | 22 | 24 | 20 | 15 | 17 |
| | n=80 | L_{1/2} | 30 | 30 | 30 | 30 | 30 |
| | | L_{EN} | 30 | 29 | 30 | 30 | 30 |
| | | L_{1} | 30 | 29 | 30 | 30 | 30 |
| | n=100 | L_{1/2} | 30 | 30 | 30 | 30 | 30 |
| | | L_{EN} | 30 | 30 | 30 | 30 | 30 |
| | | L_{1} | 30 | 30 | 30 | 30 | 30 |
| ρ = 0.1, σ = 0.6 | n=50 | L_{1/2} | 17 | 17 | 17 | 14 | 14 |
| | | L_{EN} | 18 | 19 | 17 | 16 | 14 |
| | | L_{1} | 18 | 18 | 18 | 16 | 15 |
| | n=80 | L_{1/2} | 30 | 29 | 30 | 28 | 28 |
| | | L_{EN} | 30 | 28 | 30 | 28 | 27 |
| | | L_{1} | 30 | 28 | 30 | 27 | 26 |
| | n=100 | L_{1/2} | 30 | 30 | 30 | 30 | 30 |
| | | L_{EN} | 30 | 30 | 30 | 30 | 30 |
| | | L_{1} | 30 | 30 | 30 | 28 | 30 |
| ρ = 0.4, σ = 0.2 | n=50 | L_{1/2} | 19 | 18 | 18 | 16 | 15 |
| | | L_{EN} | 21 | 22 | 21 | 17 | 17 |
| | | L_{1} | 18 | 21 | 19 | 16 | 17 |
| | n=80 | L_{1/2} | 30 | 30 | 30 | 30 | 30 |
| | | L_{EN} | 30 | 28 | 30 | 29 | 29 |
| | | L_{1} | 30 | 27 | 30 | 29 | 29 |
| | n=100 | L_{1/2} | 30 | 30 | 30 | 30 | 30 |
| | | L_{EN} | 30 | 30 | 30 | 30 | 30 |
| | | L_{1} | 30 | 30 | 30 | 29 | 29 |
| ρ = 0.4, σ = 0.6 | n=50 | L_{1/2} | 14 | 16 | 15 | 12 | 12 |
| | | L_{EN} | 17 | 17 | 17 | 12 | 14 |
| | | L_{1} | 17 | 15 | 14 | 9 | 13 |
| | n=80 | L_{1/2} | 29 | 25 | 26 | 28 | 29 |
| | | L_{EN} | 28 | 24 | 24 | 27 | 24 |
| | | L_{1} | 27 | 24 | 24 | 23 | 23 |
| | n=100 | L_{1/2} | 30 | 29 | 30 | 30 | 30 |
| | | L_{EN} | 30 | 27 | 28 | 28 | 30 |
| | | L_{1} | 29 | 27 | 27 | 28 | 30 |
Analyses on microarray data
Four publicly available gene expression datasets used in the experiments
| Dataset | No. of genes | No. of samples | Classes |
|---|---|---|---|
| Leukaemia | 3571 | 72 | ALL/AML |
| Prostate | 5966 | 102 | Normal/Tumor |
| Colon | 2000 | 62 | Normal/Tumor |
| DLBCL | 6285 | 77 | DLBCL/FL |
Leukaemia dataset
The original dataset was provided by Golub et al. [7], and contains the expression profiles of 7,129 genes for 47 patients of acute lymphoblastic leukaemia (ALL) and 25 patients of acute myeloid leukaemia (AML). For data preprocessing, we followed the protocol detailed in the supplementary information to Dudoit et al. [1]. After thresholding, filtering, applying a logarithmic transformation and standardizing each expression profile to zero mean and unit variance, a dataset comprising 3,571 genes remained.
Prostate dataset
This original dataset contains the expression profiles of 12,600 genes for 50 normal tissues and 52 prostate tumor tissues. For data preprocessing, we adopt the pretreatment method of [20] to obtain a dataset with 102 samples, each containing 5,966 genes.
Colon dataset
The colon microarray data set in Alon et al. [21] has 2000 genes per sample and 62 samples, which consist of 22 normal tissues and 40 cancer tissues. The Colon dataset is available at http://microarray.princeton.edu/oncology.
DLBCL dataset
This dataset contains 77 microarray gene expression profiles of the 2 most prevalent adult lymphoid malignancies: 58 samples of diffuse large B-cell lymphoma (DLBCL) and 19 samples of follicular lymphoma (FL). Each sample contains 7,129 gene expression values. More information on these data can be found in Shipp MA et al. [22]. For data preprocessing, we followed the protocol detailed in the supplementary information to Dudoit et al. [1], and a dataset comprising 6,285 genes remained.
Detailed information on the 4 microarray datasets used in the experiments

| Dataset | No. of training samples (class 1/class 2) | No. of testing samples (class 1/class 2) |
|---|---|---|
| Leukaemia | 50 (32 ALL/18 AML) | 22 (15 ALL/7 AML) |
| Prostate | 71 (35 Normal/36 Tumor) | 31 (15 Normal/16 Tumor) |
| Colon | 42 (14 Normal/28 Tumor) | 20 (8 Normal/12 Tumor) |
| DLBCL | 60 (45 DLBCL/15 FL) | 17 (13 DLBCL/4 FL) |
The classification performances of different methods for the 4 gene expression datasets

| Dataset | Method | Cross-validation error | Test error | No. of selected genes |
|---|---|---|---|---|
| Leukaemia | L_{1/2} | 2/50 | 1/22 | 2 |
| | L_{EN} | 1/50 | 1/22 | 9 |
| | L_{1} | 1/50 | 1/22 | 6 |
| Prostate | L_{1/2} | 5/71 | 3/31 | 5 |
| | L_{EN} | 5/71 | 4/31 | 34 |
| | L_{1} | 5/71 | 3/31 | 25 |
| Colon | L_{1/2} | 4/42 | 3/20 | 5 |
| | L_{EN} | 5/42 | 4/20 | 13 |
| | L_{1} | 5/42 | 4/20 | 7 |
| DLBCL | L_{1/2} | 3/60 | 2/17 | 14 |
| | L_{EN} | 2/60 | 1/17 | 38 |
| | L_{1} | 3/60 | 3/17 | 23 |
Brief biological analyses of the selected genes
The 10 top-ranked informative genes found by the three sparse logistic regression methods from the Leukaemia dataset

| Rank | L_{1/2} | L_{EN} | L_{1} |
|---|---|---|---|
| 1 | CST3 cystatin C * | CFD complement factor D (adipsin) * | CST3 cystatin C * |
| 2 | MPO myeloperoxidase * | CST3 cystatin C * | CFD complement factor D (adipsin) * |
| 3 | IL8 interleukin 8 | MPO myeloperoxidase * | MPO myeloperoxidase * |
| 4 | GYPB glycophorin B (MNS blood group) | DNTT deoxynucleotidyltransferase, terminal * | IL8 interleukin 8 * |
| 5 | IGL immunoglobulin lambda locus | TCL1A T-cell leukemia/lymphoma 1A * | DNTT deoxynucleotidyltransferase, terminal * |
| 6 | DNTT deoxynucleotidyltransferase, terminal | IGL immunoglobulin lambda locus * | TCL1A T-cell leukemia/lymphoma 1A * |
| 7 | LOC100437488 interleukin-8-like | IL8 interleukin 8 * | IGL immunoglobulin lambda locus |
| 8 | LTB lymphotoxin beta (TNF superfamily, member 3) | ZYX zyxin * | LTB lymphotoxin beta (TNF superfamily, member 3) |
| 9 | TCRB T cell receptor beta cluster | LTB lymphotoxin beta (TNF superfamily, member 3) * | CD79A CD79a molecule, immunoglobulin-associated alpha |
| 10 | S100A9 S100 calcium binding protein A9 | CD79A CD79a molecule, immunoglobulin-associated alpha | HBB hemoglobin, beta |
The 10 top-ranked informative genes found by the three sparse logistic regression methods from the Prostate dataset

| Rank | L_{1/2} | L_{EN} | L_{1} |
|---|---|---|---|
| 1 | SLC43A3 solute carrier family 43, member 3 * | AMOTL2 angiomotin like 2 * | USP4 ubiquitin specific peptidase 4 (proto-oncogene) * |
| 2 | CD22 CD22 molecule * | USP4 ubiquitin specific peptidase 4 (proto-oncogene) * | CD22 CD22 molecule * |
| 3 | KHDRBS1 KH domain containing, RNA binding, signal transduction associated 1 * | EIF4EBP2 eukaryotic translation initiation factor 4E binding protein 2 * | EIF4EBP2 eukaryotic translation initiation factor 4E binding protein 2 * |
| 4 | ZNF787 zinc finger protein 787 * | PRAF2 PRA1 domain family, member 2 * | Gene symbol: AA683055, probe set: 34711_at * |
| 5 | GMPR guanosine monophosphate reductase * | CACYBP calcyclin binding protein * | AMOTL2 angiomotin like 2 * |
| 6 | AMOTL2 angiomotin like 2 | Gene symbol: AA683055, probe set: 34711_at * | VSNL1 visinin-like 1 * |
| 7 | EIF4EBP2 eukaryotic translation initiation factor 4E binding protein 2 | VSNL1 visinin-like 1 * | FLNC filamin C, gamma * |
| 8 | USP2 ubiquitin specific peptidase 2 | SLC43A3 solute carrier family 43, member 3 * | PRAF2 PRA1 domain family, member 2 * |
| 9 | USP4 ubiquitin specific peptidase 4 (proto-oncogene) | CD22 CD22 molecule * | CACYBP calcyclin binding protein * |
| 10 | ACTN4 actinin, alpha 4 | TMCO1 transmembrane and coiled-coil domains 1 * | SLC43A3 solute carrier family 43, member 3 * |
The 10 top-ranked informative genes found by the three sparse logistic regression methods from the Colon dataset

| Rank | L_{1/2} | L_{EN} | L_{1} |
|---|---|---|---|
| 1 | GUCA2B guanylate cyclase activator 2B (uroguanylin) * | GUCA2B guanylate cyclase activator 2B (uroguanylin) * | GUCA2B guanylate cyclase activator 2B (uroguanylin) * |
| 2 | MYL6 myosin, light chain 6, alkali, smooth muscle and non-muscle * | MYH9 myosin, heavy chain 9, non-muscle * | ATPsyn-Cf6 ATP synthase-coupling factor 6, mitochondrial * |
| 3 | DES desmin * | DES desmin * | MYH9 myosin, heavy chain 9, non-muscle * |
| 4 | CHRND cholinergic receptor, nicotinic, delta polypeptide * | MYL6 myosin, light chain 6, alkali, smooth muscle and non-muscle * | GSN gelsolin * |
| 5 | PECAM1 platelet/endothelial cell adhesion molecule-1 * | GSN gelsolin * | MYL6 myosin, light chain 6, alkali, smooth muscle and non-muscle * |
| 6 | ATPsyn-Cf6 ATP synthase-coupling factor 6, mitochondrial | COL11A2 collagen, type XI, alpha 2 * | COL11A2 collagen, type XI, alpha 2 * |
| 7 | ATF7 activating transcription factor 7 | ATPsyn-Cf6 ATP synthase-coupling factor 6, mitochondrial * | MXI1 MAX interactor 1, dimerization protein * |
| 8 | PROBABLE NUCLEAR ANTIGEN (Pseudorabies virus) [accession number: T86444] | ssb single-strand binding protein * | UQCRC1 ubiquinol-cytochrome c reductase core protein I * |
| 9 | MYH9 myosin, heavy chain 9, non-muscle | Sept2 septin 2 * | DES desmin * |
| 10 | MYH10 myosin, heavy chain 10, non-muscle | MXI1 MAX interactor 1, dimerization protein * | ZEB1 zinc finger E-box binding homeobox 1 * |
The 10 top-ranked informative genes found by the three sparse logistic regression methods from the DLBCL dataset

| Rank | L_{1/2} | L_{EN} | L_{1} |
|---|---|---|---|
| 1 | CCL21 chemokine (C-C motif) ligand 21 * | MTH1 metallothionein 1H * | MTH1 metallothionein 1H * |
| 2 | HLA-DQB1 major histocompatibility complex, class II, DQ beta 1 * | MT2A metallothionein 2A * | MT2A metallothionein 2A * |
| 3 | MT2A metallothionein 2A * | SFTPA1 surfactant protein A1 * | CCL21 chemokine (C-C motif) ligand 2 * |
| 4 | THRSP thyroid hormone responsive * | TCL1A T-cell leukemia/lymphoma 1A * | SFTPA1 surfactant protein A1 * |
| 5 | lgj immunoglobulin joining chain * | ZFP36L2 ZFP36 ring finger protein-like 2 * | POLD2 polymerase (DNA directed), delta 2, accessory subunit * |
| 6 | TCL1A T-cell leukemia/lymphoma 1A * | FCGR1A Fc fragment of IgG, high affinity Ia, receptor (CD64) * | lgj immunoglobulin joining chain * |
| 7 | GOT2 glutamic-oxaloacetic transaminase 2, mitochondrial (aspartate aminotransferase 2) * | lgj immunoglobulin joining chain * | MELK maternal embryonic leucine zipper kinase * |
| 8 | Plod procollagen lysyl hydroxylase * | TRB2 Homeodomain-like/winged-helix DNA-binding family protein * | CKS2 CDC28 protein kinase regulatory subunit 2 * |
| 9 | STXBP2 syntaxin binding protein 2 * | MELK maternal embryonic leucine zipper kinase * | EIF2A eukaryotic translation initiation factor 2A, 65kDa * |
| 10 | SFTPA1 surfactant protein A1 * | CCL21 chemokine (C-C motif) ligand 2 * | AQP4 aquaporin 4 * |
In Tables 7, 8, 9 and 10, some genes are frequently selected only by the L_{1/2} method and are not discovered by the L_{EN} and L_{1} methods. Evidence from the literature shows that they are cancer-related genes. For example, for the colon dataset, the genes cholinergic receptor, nicotinic, delta polypeptide (CHRND) and platelet/endothelial cell adhesion molecule-1 (PECAM1) were also selected by Maglietta R. et al. [30], Wiese A.H. et al. [31], Wang S. L. et al. [32], and Dai J. H. and Xu Q. [33]. These genes can significantly discriminate between non-dissected tumors and micro-dissected invasive tumor cells. Remarkably, to our knowledge, some of the discovered genes have not been reported in any past studies.
On the other hand, from Tables 7, 8, 9 and 10, we found that the most frequently selected genes and their ranking orders for the L_{EN} and L_{1} methods are very similar to each other compared with those of the L_{1/2} method. The main reasons are that the classification hypothesis need not be unique, as the samples in gene expression data lie in a high-dimensional space, and that both the L_{EN} and L_{1} methods are based on the L_{1}-type penalty.
Construct KNN classifier with the most frequently selected relevant genes
In this section, to further evaluate the performance and prediction generality of sparse logistic regression with the L_{1/2} penalty, we constructed KNN (k = 3, 5) classifiers using the relevant genes most frequently selected by the L_{1/2} penalized logistic regression method. In this experiment, we use random leave-one-out cross-validation (LOOCV) to evaluate the predictive ability and repeat 50 runs.
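The evaluation loop can be sketched as a plain LOOCV k-NN classifier (Euclidean distance, majority vote); a minimal illustration of the procedure rather than the exact configuration used in the experiments.

```python
import numpy as np

def knn_loocv_accuracy(X, y, k=3):
    """Leave-one-out accuracy of a k-NN classifier on the selected genes.
    X is (n_samples, n_selected_genes); y holds integer class labels."""
    n = len(y)
    # all pairwise Euclidean distances between samples
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # leave each sample out of its own vote
    correct = 0
    for i in range(n):
        nbrs = np.argsort(d[i])[:k]          # k nearest other samples
        vote = np.bincount(y[nbrs]).argmax()  # majority class among neighbors
        correct += vote == y[i]
    return correct / n
```

Restricting X to the handful of genes selected by the L_{1/2} method keeps both the distance computation and the resulting classifier very cheap, which is the practical appeal of the small gene subsets reported above.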
Summary of the results of KNN classifiers using the genes most frequently selected by our proposed L_{1/2} penalized logistic regression method

| Dataset | K-NN (k=3) | K-NN (k=5) |
|---|---|---|
| Leukaemia | 98.3% | 94.4% |
| Prostate | 95.1% | 94.2% |
| Colon | 95.1% | 90.6% |
| DLBCL | 94.8% | 91.2% |
Conclusions
In cancer classification application based on microarray data, only a small subset of genes is strongly indicative of a targeted disease. Thus, feature selection methods play an important role in cancer classification. In this paper, we propose and model sparse logistic regression with the L_{1/2} penalty, and develop the corresponding coordinate descent algorithm as a novel gene selection approach. The proposed method utilizes a novel univariate half thresholding to update the estimated coefficients.
Both the simulation and microarray data studies show that sparse logistic regression with the L_{1/2} penalty achieves higher classification accuracy than the ordinary L_{1} and elastic net regularization approaches, while fewer but informative genes are selected. Therefore, sparse logistic regression with the L_{1/2} penalty is an effective technique for gene selection in real classification problems.
In this paper, we use the proposed method to solve binary cancer classification problems. However, many cancer classification problems involve multi-category microarray data. We plan to extend our proposed method to multinomial penalized logistic regression for multiclass cancer classification in future work.
Declarations
Acknowledgements
This research was supported by the Macau Science and Technology Development Fund (Grant No. 017/2010/A2) of the Macau SAR of China and the National Natural Science Foundations of China (Grant No. 2013CB329404, 11131006, 61075054, and 11171272).
Authors’ Affiliations
References
- Dudoit S, Fridlyand S, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002, 97 (457): 77-87. 10.1198/016214502753479248.View ArticleGoogle Scholar
- Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004, 20: 2429-2437. 10.1093/bioinformatics/bth267.View ArticlePubMedGoogle Scholar
- Lee JW, Lee JB, Park M, Song SH: An extensive evaluation of recent classification tools applied to microarray data. Com Stat Data Anal. 2005, 48: 869-885. 10.1016/j.csda.2004.03.017.View ArticleGoogle Scholar
- Ding C, Peng H: Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. 2005, 3 (2): 185-205. 10.1142/S0219720005001004.View ArticleGoogle Scholar
- Monari G, Dreyfus G: Withdrawing an example from the training set: an analytic estimation of its effect on a nonlinear parameterized model. Neurocomputing Letters. 2000, 35: 195-201. 10.1016/S0925-2312(00)00325-8.View ArticleGoogle Scholar
- Rivals I, Personnaz L: MLPs (mono-layer polynomials and multi-layer perceptrons) for nonlinear modeling. J Mach Learning Res. 2003, 3: 1383-1398.Google Scholar
- Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield C, Lander E: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.View ArticlePubMedGoogle Scholar
- Guyon I, Elisseff A: An Introduction to variable and feature selection. J Mach Learning Res. 2003, 3: 1157-1182.Google Scholar
- Shevade SK, Keerthi SS: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics. 2003, 19: 2246-2253. 10.1093/bioinformatics/btg308.View ArticlePubMedGoogle Scholar
- Tibshirani R: Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B. 1996, 58: 267-288.Google Scholar
- Fiedman J, Hastie T, Hofling H, Tibshirani R: Path wise coordinate optimization. Ann. Appl. Statist. 2007, 1: 302-332. 10.1214/07-AOAS131.View ArticleGoogle Scholar
- Fiedman J, Hastie T, Hofling H, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J. Statist. Softw. 2010, 33: 1-22.Google Scholar
- Gavin CC, Talbot LC: Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics. 2006, 22: 2348-2355. 10.1093/bioinformatics/btl386.View ArticleGoogle Scholar
- Xu ZB, Zhang H, Wang Y, Chang XY, Liang Y: L_{1/2} regularization. Sci China Series F. 2010, 40 (3): 1-11.
- Fan J, Li R: Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001, 96: 1348-1361. 10.1198/016214501753382273.
- Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Statist Soc B. 2005, 67 (2): 301-320. 10.1111/j.1467-9868.2005.00503.x.
- Zhang CH: Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010, 38: 894-942. 10.1214/09-AOS729.
- Xu ZB, Chang XY, Xu FM, Zhang H: L_{1/2} regularization: a thresholding representation theory and a fast solver. IEEE Transact Neural Networks Learn Syst. 2012, 23 (7): 1013-1027.
- Sohn I, Kim J, Jung SH, Park C: Gradient lasso for Cox proportional hazards model. Bioinformatics. 2009, 25 (14): 1775-1781. 10.1093/bioinformatics/btp322.
- Yang K, Cai ZP, Li JZ, Lin GH: A stable gene selection in microarray data analysis. BMC Bioinformatics. 2006, 7: 228. 10.1186/1471-2105-7-228.
- Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Nat Acad Sci USA. 1999, 96 (12): 6745-6750. 10.1073/pnas.96.12.6745.
- Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat Med. 2002, 8: 68-74. 10.1038/nm0102-68.
- Nagai A, Terashima M, Harada T, Shimode K, Takeuchi H, Murakawa Y, et al: Cathepsin B and H activities and cystatin C concentrations in cerebrospinal fluid from patients with leptomeningeal metastasis. Clin Chim Acta. 2003, 329: 53-60. 10.1016/S0009-8981(03)00023-8.
- Moroz C, Traub L, Maymon R, Zahalka MA: A novel human ferritin subunit from placenta with immunosuppressive activity. J Biol Chem. 2002, 277: 12901-12905. 10.1074/jbc.M200956200.
- Ben-Dor A, et al: Tissue classification with gene expression profiles. J Comput Biol. 2000, 7: 559-583. 10.1089/106652700750050943.
- Yang AJ, Song XY: Bayesian variable selection for disease classification using gene expression data. Bioinformatics. 2010, 26: 215-222. 10.1093/bioinformatics/btp638.
- Li HD, Xu QS, Liang YZ: Random frog: an efficient reversible jump Markov chain Monte Carlo-like approach for variable selection with applications to gene selection and disease classification. Anal Chim Acta. 2012, 740: 20-26.
- Notterman DA, Alon U, Sierk AJ, Levine AJ: Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. Cancer Res. 2001, 61: 3124-3130.
- Shailubhai K, Yu H, Karunanandaa K, Wang J, Eber S, Wang Y, Joo N, Kim H, Miedema B, Abbas S, Boddupalli S, Currie M, Forte L: Uroguanylin treatment suppresses polyp formation in the Apc(Min/+) mouse and induces apoptosis in human colon adenocarcinoma cells via cyclic GMP. Cancer Res. 2000, 60: 5151-5157.
- Maglietta R, Addabbo A, Piepoli A, Perri F, Liuni S, Pesole G, Ancona N: Selection of relevant genes in cancer diagnosis based on their prediction accuracy. Artif Intell Med. 2007, 40: 29-44. 10.1016/j.artmed.2006.06.002.
- Wiese AH, Auer J, Lassmann S, Nahrig J, Rosenberg R, Hofler H, Ruger R, Werner M: Identification of gene signatures for invasive colorectal tumor cells. Cancer Detect Prev. 2007, 31: 282-295. 10.1016/j.cdp.2007.07.003.
- Wang SL, Li XL, Zhang SW, Gui J, Huang DS: Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction. Comp Biol Med. 2010, 40: 179-189. 10.1016/j.compbiomed.2009.11.014.
- Dai JH, Xu Q: Attribute selection based on information gain ratio in fuzzy rough set theory with application to tumor classification. Appl Soft Comput. 2013, 13: 211-221. 10.1016/j.asoc.2012.07.029.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.