Development of a literature informed Bayesian machine learning method for feature extraction and classification


Background

Gene expression profiling is a powerful approach for identifying markers that classify samples; however, it has major limitations that hinder performance. Typically, the number of variables assessed is large relative to the sample size. In addition, it is difficult to identify biologically informative markers that also have high predictive power[1-3]. Thus, the goal of this study was to develop a machine learning approach that bridges classification accuracy and biological function.

Materials and methods

We developed a Literature-aided Sparse Bayesian Generalized linear model that uses a Generalized double Pareto prior (LSBGG) to induce shrinkage and thereby reduce the number of covariates. Importantly, instead of using uninformed hyperparameters for the prior distributions, we adjusted the hyperparameters based on the ranking of the genes by GeneIndexer (Quire Inc., Memphis, TN) with respect to the keyword query ‘cancer’. This approach controls the shrinkage imposed on each gene according to its biological function as extracted from the literature. The model was applied to the leukemia data set from Golub et al.[4]. The data set was split into training and test groups, and classification performance was evaluated on the test group. The top 500 most differentially expressed genes were used for the modeling step.
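To make the idea of literature-controlled shrinkage concrete, the sketch below evaluates the generalized double Pareto log-density and gives each gene its own prior scale derived from a literature relevance score. The density uses the standard parameterization with scale ξ and shape α. The linear score-to-scale mapping, the function names, the example scores, and the bounds `xi_min` and `xi_max` are illustrative assumptions, not the hyperparameter adjustment used in this study.

```python
import numpy as np

def gdp_log_density(theta, xi, alpha):
    """Log-density of the generalized double Pareto distribution:
    f(theta) = 1 / (2 * xi) * (1 + |theta| / (alpha * xi)) ** -(alpha + 1)
    """
    theta = np.asarray(theta, dtype=float)
    return -np.log(2.0 * xi) - (alpha + 1.0) * np.log1p(np.abs(theta) / (alpha * xi))

def literature_scales(scores, xi_min=0.1, xi_max=1.0):
    """Hypothetical mapping from literature relevance scores to prior scales.

    Higher relevance -> larger scale -> weaker shrinkage toward zero.
    """
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0.0:
        # All genes equally relevant: use one common mid-range scale.
        return np.full(scores.shape, 0.5 * (xi_min + xi_max))
    unit = (scores - scores.min()) / span
    return xi_min + unit * (xi_max - xi_min)

# Made-up relevance scores for three genes (illustration only).
scores = [0.05, 0.40, 0.90]
scales = literature_scales(scores)

# Log prior density of a moderately large coefficient under each gene's prior.
log_prior_at_2 = gdp_log_density(2.0, scales, alpha=1.0)
```

Under this mapping, a gene with strong literature support receives a wider prior, so a large coefficient for that gene is penalized less than the same coefficient for a gene with little support.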


Results

Using the top 10 genes obtained from LSBGG, we achieved 91% classification accuracy in the test group. When the training and test data sets were switched, we obtained 92% classification accuracy. In contrast, the model without biological information achieved 91% and 86% classification accuracy in the two test scenarios (Table 1). Consistent with these results, Receiver Operating Characteristic (ROC) analysis showed better performance when shrinkage was imposed using the literature (Figure 1). Notably, the posterior mean of θ was higher for genes that were functionally related to cancer in the biomedical literature (Figure 2).
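The two-scenario evaluation described above (train on one group and test on the other, then switch) can be sketched as follows. This is a minimal illustration under stated assumptions: the nearest-centroid classifier is a simple stand-in and is not LSBGG, the data are synthetic, and the names `top_k_genes`, `fit_centroids`, `predict_centroids`, and `swap_accuracy` are hypothetical.

```python
import numpy as np

def top_k_genes(theta_mean, k=10):
    """Indices of the k genes with the largest |posterior mean|, largest first."""
    theta_mean = np.asarray(theta_mean, dtype=float)
    return np.argsort(-np.abs(theta_mean), kind="stable")[:k]

def fit_centroids(X, y):
    """Stand-in classifier (NOT LSBGG): store the mean profile of each class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_centroids(centroids, X):
    """Assign each sample to the class with the nearest centroid."""
    labels = np.array(list(centroids))
    dists = np.stack(
        [np.linalg.norm(X - centroids[label], axis=1) for label in labels], axis=1
    )
    return labels[dists.argmin(axis=1)]

def swap_accuracy(X_a, y_a, X_b, y_b, genes):
    """Accuracy when training on A and testing on B, then with roles switched."""
    accuracies = []
    for X_tr, y_tr, X_te, y_te in ((X_a, y_a, X_b, y_b), (X_b, y_b, X_a, y_a)):
        model = fit_centroids(X_tr[:, genes], y_tr)
        predicted = predict_centroids(model, X_te[:, genes])
        accuracies.append(float(np.mean(predicted == y_te)))
    return tuple(accuracies)

# Synthetic data: 50 genes, of which the first 5 separate the two classes.
rng = np.random.default_rng(0)
labels = np.array([0] * 10 + [1] * 10)
X_a = rng.normal(size=(20, 50))
X_b = rng.normal(size=(20, 50))
X_a[labels == 1, :5] += 6.0
X_b[labels == 1, :5] += 6.0

# Made-up posterior means: only the informative genes have non-zero weight.
theta_mean = np.zeros(50)
theta_mean[:5] = [2.0, -1.5, 1.2, 0.9, -0.8]

genes = top_k_genes(theta_mean, k=5)
acc_a_to_b, acc_b_to_a = swap_accuracy(X_a, labels, X_b, labels, genes)
```

Reporting both directions, as the abstract does, guards against a result that depends on one particular assignment of samples to the training group.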

Table 1 Classification accuracy, sensitivity and specificity of the model including (LSBGG) or excluding literature.
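For readers who want to reproduce the quantities reported in Table 1, the sketch below computes accuracy, sensitivity, and specificity from true and predicted labels. The function name, the label coding (1 as the positive class), and the example labels are assumptions made for illustration; they are not data from this study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, and specificity for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(numerator, denominator):
        # Undefined when a class is absent from the evaluated samples.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }

# Made-up labels for eight test samples (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Sensitivity and specificity are worth reporting alongside accuracy because, with unbalanced classes, a high accuracy can hide poor performance on the smaller class.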
Figure 1 ROC curves for the models with or without literature-imposed shrinkage. The area under the curve (AUC) for the model incorporating literature (Tests 1 & 2) is higher than that for the model without literature (Tests 3 & 4).
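The AUC compared in Figure 1 can be computed directly from predicted scores without drawing the curve, using its interpretation as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below is one such calculation; the function name and the example scores are assumptions for illustration, not output from either model.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    sample has the higher score; ties count as half.
    """
    is_positive = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[is_positive], scores[~is_positive]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one sample from each class")
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Made-up predicted probabilities for four test samples (illustration only).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of samples, which is acceptable for test groups of the size typical in gene expression studies.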

Figure 2 Relationship between the posterior mean of θ and biological relevance. The posterior mean of θ (Y-axis) is shown for the top 500 genes (X-axis) for a model with (A) or without (B) incorporation of literature to control shrinkage. When literature is incorporated, the estimated θ of a gene is clearly associated with the strength of its link to cancer in the literature.


Conclusions

These results demonstrate that, while LSBGG performs only slightly better in classifying samples, it uses more biologically informative genes and hence may simultaneously provide insight into the mechanisms underlying the phenotype of interest.


References

  1. Dupuy A, Simon RM: Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting. JNCI J Natl Cancer Inst. 2007, 99 (2): 147-157. 10.1093/jnci/djk018.

  2. Madahian B, Deng L, Homayouni R: Application of Sparse Bayesian Generalized Linear Model to Gene Expression Data for Classification of Prostate Cancer Subtypes. Open Journal of Statistics. 2014, 4 (7): 518-526. 10.4236/ojs.2014.47049.

  3. Novianti PW, Roes KCB, Eijkemans MJC: Evaluation of Gene Expression Classification Studies: Factors Associated with Classification Performance. PLOS ONE. 2014, 9 (4): e96063. 10.1371/journal.pone.0096063.

  4. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.


This work was supported by the University of Memphis Center for Translational Informatics and the Assisi Foundation of Memphis.

Corresponding author

Correspondence to Ramin Homayouni.


Cite this article

Madahian, B., Deng, L.Y. & Homayouni, R. Development of a literature informed Bayesian machine learning method for feature extraction and classification. BMC Bioinformatics 16 (Suppl 15), P9 (2015).


Keywords

  • Receiver Operating Characteristic
  • Machine Learning
  • Classification Accuracy
  • Prior Distribution
  • Test Group