Development of sparse Bayesian multinomial generalized linear model for multi-class prediction
© Madahian et al; licensee BioMed Central Ltd. 2014
Published: 29 September 2014
Background
Gene expression profiling has been used for many years to classify samples and to gain insights into the molecular mechanisms of phenotypes and diseases. A major challenge in expression analysis is the large number of variables assessed relative to the small number of samples. In addition, identifying markers that accurately predict multiple classes of samples, such as those involved in the progression of cancer or other diseases, remains difficult.
Materials and methods
In this study, we developed a multinomial probit Bayesian model that uses a double exponential (Laplace) prior to induce shrinkage and reduce the number of covariates in the model [1, 2]. The model is fully Bayesian and hierarchical, which facilitates Gibbs sampling, and it takes into account the progressive nature of the response variable. Gibbs sampling was performed in R for 100,000 iterations, and the first 20,000 were discarded as burn-in. The method was applied to a published dataset on prostate cancer progression downloaded from the Gene Expression Omnibus at NCBI (GSE6099) [3]. The dataset contained 99 prostate cancer cell types in four progressive stages. It was randomly divided into training (N=50) and test (N=49) groups so that each cell type was equally represented in both groups. Before applying our model, we performed ordinal logistic regression for each gene and ranked the genes by the p-value of association. A cutoff of 0.05 after Benjamini-Hochberg FDR correction yielded a final set of 398 genes.
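The sampling scheme described above can be illustrated with a much simplified sketch. The Python code below is not the authors' R implementation. It assumes a binary (two-class) probit model rather than the multinomial model, fixes the shrinkage parameter `lam` instead of sampling it, omits the intercept, and updates one coefficient at a time. It combines the latent-variable augmentation of [2] with the scale-mixture form of the double exponential prior from [1]. All function names, default values, and the synthetic data are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def sample_truncated_normal(mean, positive, rng):
    """Draw z ~ N(mean, 1) restricted to z > 0 (positive) or z <= 0."""
    p_nonpos = STD_NORMAL.cdf(-mean)  # P(z <= 0)
    u = rng.uniform(p_nonpos, 1.0) if positive else rng.uniform(0.0, p_nonpos)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside (0, 1)
    return mean + STD_NORMAL.inv_cdf(u)


def sample_inverse_gaussian(mu, shape, rng):
    """Inverse Gaussian draw (Michael, Schucany and Haas transformation)."""
    y = rng.gauss(0.0, 1.0) ** 2
    root = math.sqrt(4.0 * mu * shape * y + (mu * y) ** 2)
    x = mu + mu * mu * y / (2.0 * shape) - mu * root / (2.0 * shape)
    x = max(x, 1e-12)  # guard against round-off when mu is very large
    return x if rng.random() <= mu / (mu + x) else mu * mu / x


def gibbs_lasso_probit(X, y, lam=1.0, n_iter=2000, burn_in=500, seed=0):
    """Posterior mean of the coefficients of a binary probit model with a
    double exponential prior (assumed simplification; lam is held fixed)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    inv_tau2 = [1.0] * p  # 1 / tau_j^2, the local prior precisions
    col_ss = [sum(row[j] ** 2 for row in X) for j in range(p)]
    totals = [0.0] * p
    for it in range(n_iter):
        eta = [sum(xij * bj for xij, bj in zip(row, beta)) for row in X]
        # Step 1: latent Gaussian scores, truncated by the observed class.
        z = [sample_truncated_normal(eta[i], y[i] == 1, rng) for i in range(n)]
        # Step 2: each coefficient from its Gaussian full conditional.
        for j in range(p):
            precision = col_ss[j] + inv_tau2[j]
            partial = sum(X[i][j] * (z[i] - eta[i] + X[i][j] * beta[j])
                          for i in range(n))
            new_bj = rng.gauss(partial / precision, 1.0 / math.sqrt(precision))
            for i in range(n):  # keep the linear predictor in sync
                eta[i] += X[i][j] * (new_bj - beta[j])
            beta[j] = new_bj
        # Step 3: 1 / tau_j^2 is inverse Gaussian given beta_j.
        for j in range(p):
            mu = lam / max(abs(beta[j]), 1e-6)
            inv_tau2[j] = sample_inverse_gaussian(mu, lam * lam, rng)
        if it >= burn_in:  # discard the burn-in draws
            totals = [t + b for t, b in zip(totals, beta)]
    return [t / (n_iter - burn_in) for t in totals]


if __name__ == "__main__":
    data_rng = random.Random(1)
    X = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(80)]
    # Synthetic labels: only the first covariate carries signal.
    y = [1 if 2.0 * row[0] + data_rng.gauss(0.0, 1.0) > 0 else 0 for row in X]
    print(gibbs_lasso_probit(X, y, n_iter=600, burn_in=200))
```

In this sketch the double exponential prior enters only through Step 3: a coefficient that stays near zero receives a large prior precision `1 / tau_j^2` and is shrunk further, while a large coefficient is left comparatively free. Extending the idea to several ordered classes would require additional latent structure that is not shown here.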
Table: Classification accuracy of prostate cancer subtypes in the train and test groups. (Recovered labels: train group, test group, switched; the table values are not present in this copy.)
We plan to resample the assignment of samples to the training and test groups in order to obtain more robust results, and to compare the performance of the model with other popular classifiers such as support vector machines and random forests.
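One way such resampling could be organized is sketched below in Python. This is an illustration under stated assumptions, not the authors' procedure: the class labels are made up, and `fit_and_score` is a hypothetical callback that trains any classifier on the training indices and returns its accuracy on the test indices. Each repeat draws a fresh split that keeps the class proportions the same in both groups.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_fraction, rng):
    """Return (train_idx, test_idx), splitting every class by train_fraction."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for members in by_class.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction + 0.5)  # round half up
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return sorted(train), sorted(test)


def repeated_stratified_splits(labels, n_repeats, train_fraction=0.5, seed=0):
    """Yield n_repeats independent stratified train/test splits."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        yield stratified_split(labels, train_fraction, rng)


def mean_accuracy(labels, fit_and_score, n_repeats=20, seed=0):
    """Average the score returned by fit_and_score over repeated splits."""
    scores = [fit_and_score(train, test)
              for train, test in repeated_stratified_splits(
                  labels, n_repeats, seed=seed)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical stage labels; the real class sizes are not given here.
    labels = ["stage1"] * 10 + ["stage2"] * 6 + ["stage3"] * 4
    for train, test in repeated_stratified_splits(labels, n_repeats=3):
        print(Counter(labels[i] for i in train),
              Counter(labels[i] for i in test))
```

Averaging the score over many splits, and reporting its spread, reduces the dependence of the result on any single random partition. The same splits can be reused for every classifier being compared so that differences are not caused by the partition itself.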
This work was supported by the University of Memphis Center for Translational Informatics and the Assisi Foundation of Memphis.
References
1. Park T, Casella G: The Bayesian lasso. J Am Stat Assoc. 2008, 103: 681-686. doi:10.1198/016214508000000337.
2. Albert J, Chib S: Bayesian analysis of binary and polychotomous response data. J Am Stat Assoc. 1993, 88: 669-679. doi:10.1080/01621459.1993.10476321.
3. Tomlins SA, Mehra R, Rhodes DR, Cao X, Wang L, Dhanasekaran SM, Kalyana-Sundaram S, Wei JT, Rubin MA, Pienta KJ, Shah RB, Chinnaiyan AM: Integrative molecular concept modeling of prostate cancer progression. Nat Genet. 2007, 39 (1): 41-51. doi:10.1038/ng1935.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.