- Poster presentation
- Open Access
Development of a literature informed Bayesian machine learning method for feature extraction and classification
BMC Bioinformatics volume 16, Article number: P9 (2015)
Background
Gene expression profiling is a powerful approach to identify markers for classification of samples; however, it has major limitations that hinder performance. Typically, a large number of variables are assessed compared to relatively small sample sizes. In addition, it is difficult to identify biologically informative markers which have high predictive power [1–3]. Thus, the goal of this study was to develop a machine learning approach that is able to bridge classification accuracy and biological function.
Materials and methods
We developed a Literature-aided Sparse Bayesian Generalized linear model with a Generalized double Pareto prior (LSBGG), in which the prior induces shrinkage and thereby reduces the number of covariates retained. Importantly, instead of using uninformed hyperparameters for the prior distributions, we adjusted the hyperparameters based on the ranking of the genes by GeneIndexer (Quire Inc., Memphis, TN) with respect to a ‘cancer’ keyword query. This approach controls the shrinkage imposed on each gene according to its biological function as extracted from the literature. The model was applied to the leukemia data set from Golub et al. [4]. The data set was split into training and test groups, and classification performance was evaluated on the test group. The 500 most differentially expressed genes were used for the modeling step.
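To make the idea of literature-controlled shrinkage concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard generalized double Pareto density, f(θ | ξ, α) = (1 / 2ξ)(1 + |θ| / (αξ))^−(α+1), and a purely hypothetical linear mapping from a gene's literature rank to its scale hyperparameter ξ. The abstract does not state how the ranks were converted into hyperparameters, so the function `scale_from_rank` and the values `xi_min`, `xi_max`, and `alpha` are illustrative assumptions.

```python
import numpy as np


def gdp_log_density(theta, xi, alpha):
    """Log density of the generalized double Pareto distribution GDP(xi, alpha).

    A smaller scale xi concentrates prior mass near zero (stronger shrinkage).
    """
    theta = np.asarray(theta, dtype=float)
    return -np.log(2.0 * xi) - (alpha + 1.0) * np.log1p(np.abs(theta) / (alpha * xi))


def scale_from_rank(rank, n_genes, xi_min=0.1, xi_max=1.0):
    """Hypothetical mapping from literature rank to prior scale.

    Rank 1 (most strongly associated with the keyword) receives xi_max,
    rank n_genes receives xi_min, with linear interpolation in between.
    This rule is an assumption made for illustration only.
    """
    rank = np.asarray(rank, dtype=float)
    weight = (n_genes - rank) / (n_genes - 1)
    return xi_min + weight * (xi_max - xi_min)


# Three genes out of 500: best-ranked, mid-ranked, worst-ranked.
ranks = np.array([1, 250, 500])
xi = scale_from_rank(ranks, n_genes=500)

# Log prior density at the same effect size (theta = 1) for each gene.
log_prior_at_one = gdp_log_density(1.0, xi, alpha=1.0)
```

Under these assumptions, a gene ranked highly for the keyword query receives a larger scale, so a non-zero coefficient of a given size is penalized less by the prior than the same coefficient for a poorly ranked gene.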
Results
Using the top 10 genes obtained from LSBGG, we were able to achieve 91% classification accuracy in the test group. When the training and test datasets were switched, we obtained 92% classification accuracy. In contrast, the model without biological information achieved 91% and 86% classification accuracies in the two test scenarios (Table 1). Consistent with these results, Receiver Operating Characteristic (ROC) analysis showed better performance when shrinkage was imposed using the literature (Figure 1). Notably, we found that the posterior mean of θ was higher for genes which were functionally related to cancer in the biomedical literature (Figure 2).
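The two evaluation measures mentioned here, classification accuracy and the area under the ROC curve, can be sketched as follows. The labels and scores below are toy values invented for illustration and are unrelated to the leukemia data or to the reported results; the 0.5 decision threshold is likewise an assumption.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def roc_auc(y_true, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen positive sample scores
    higher than a randomly chosen negative sample; ties count one half.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Toy test set: three samples per class with predicted class-1 probabilities.
y_true = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
y_pred = (scores >= 0.5).astype(int)  # assumed decision threshold

test_accuracy = accuracy(y_true, y_pred)
test_auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen threshold, whereas the area under the ROC curve summarizes performance across all thresholds, which is why the two measures can rank models differently.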
Conclusions
This demonstrates that while LSBGG performs only slightly better in classification of samples, it uses more biologically informative genes and hence may simultaneously provide insight into the mechanisms underlying the phenotype of interest.
References
Dupuy A, Simon RM: Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting. J Natl Cancer Inst. 2007, 99 (2): 147-157. 10.1093/jnci/djk018.
Madahian B, Deng L, Homayouni R: Application of Sparse Bayesian Generalized Linear Model to Gene Expression Data for Classification of Prostate Cancer Subtypes. Open Journal of Statistics. 2014, 2: 518-526. 10.4236/ojs.2014.47049.
Novianti PW, Roes KCB, Eijkemans MJC: Evaluation of Gene Expression Classification Studies: Factors Associated with Classification Performance. PLOS ONE. 2014, 9 (4): e96063-10.1371/journal.pone.0096063.
Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.
Acknowledgements
This work was supported by the University of Memphis Center for Translational Informatics and the Assisi Foundation of Memphis.
Cite this article
Madahian, B., Deng, L.Y. & Homayouni, R. Development of a literature informed Bayesian machine learning method for feature extraction and classification. BMC Bioinformatics 16 (Suppl 15), P9 (2015). https://doi.org/10.1186/1471-2105-16-S15-P9
- Receiver Operating Characteristic
- Machine Learning
- Classification Accuracy
- Prior Distribution
- Test Group