Development of a literature informed Bayesian machine learning method for feature extraction and classification

Madahian, Behrouz; Deng, Lih Yuan; Homayouni, Ramin

doi:10.1186/1471-2105-16-S15-P9

Volume 16 Supplement 15

Proceedings of the 14th Annual UT-KBRIN Bioinformatics Summit 2015

Poster presentation
Open access
Published: 23 October 2015

Behrouz Madahian¹,
Lih Yuan Deng¹ &
Ramin Homayouni²,³

BMC Bioinformatics volume 16, Article number: P9 (2015)

Background

Gene expression profiling is a powerful approach to identify markers for classification of samples; however, it has major limitations that hinder performance. Typically, the number of variables (genes) measured far exceeds the number of samples. In addition, it is difficult to identify biologically informative markers that also have high predictive power [1–3]. Thus, the goal of this study was to develop a machine learning approach that bridges classification accuracy and biological function.

Materials and methods

We developed a Literature-aided Sparse Bayesian Generalized linear model with a Generalized double Pareto prior (LSBGG), in which the prior induces shrinkage in the number of covariates. Importantly, instead of using uninformed hyperparameters for the prior distributions, we adjusted the hyperparameters based on the ranking of the genes by GeneIndexer (Quire Inc., Memphis, TN) with respect to a ‘cancer’ keyword query. This approach controls the shrinkage imposed on each gene according to biological function extracted from the literature. The model was applied to a leukemia data set from Golub et al. [4]. The data set was split into training and test groups, and classification performance was evaluated on the test group. The top 500 most differentially expressed genes were used for the modeling step.
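The idea of letting a literature ranking control shrinkage can be illustrated with a minimal sketch. It assumes the standard generalized double Pareto density, f(β | ξ, α) = (1 / 2ξ)(1 + |β| / (αξ))^−(α+1), with scale ξ and shape α. The abstract does not state how ranks were translated into hyperparameters, so the linear mapping `rank_to_scale`, its bounds `scale_min` and `scale_max`, and the default shape are hypothetical choices made only for illustration; this is not the authors' implementation.

```python
import math


def gdp_log_density(beta, scale, shape):
    """Log density of a generalized double Pareto distribution at beta.

    Uses f(beta) = 1/(2*scale) * (1 + |beta|/(shape*scale))**-(shape+1).
    """
    if scale <= 0 or shape <= 0:
        raise ValueError("scale and shape must be positive")
    return -math.log(2.0 * scale) - (shape + 1.0) * math.log1p(
        abs(beta) / (shape * scale)
    )


def rank_to_scale(rank, n_genes, scale_min=0.1, scale_max=1.0):
    """Hypothetical mapping from a literature rank to a prior scale.

    Rank 1 (most related to the query) gets scale_max, i.e. the weakest
    shrinkage; rank n_genes gets scale_min, i.e. the strongest shrinkage.
    """
    if n_genes == 1:
        return scale_max
    frac = (rank - 1) / (n_genes - 1)
    return scale_max - frac * (scale_max - scale_min)


def log_prior(betas, ranks, shape=1.0):
    """Sum of per-gene log prior densities with rank-dependent scales."""
    n_genes = len(betas)
    return sum(
        gdp_log_density(b, rank_to_scale(r, n_genes), shape)
        for b, r in zip(betas, ranks)
    )
```

Under this assumed mapping, a coefficient of the same size is penalized less for a gene ranked near the top of the literature query than for one ranked near the bottom, which is the qualitative behavior the method describes.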

Results

Using the top 10 genes obtained from LSBGG, we were able to achieve 91% classification accuracy in the test group. When the training and test datasets were switched, we obtained 92% classification accuracy. In contrast, the model without biological information achieved 91% and 86% classification accuracies in the two test scenarios (Table 1). Consistent with these results, Receiver Operating Characteristic (ROC) analysis showed better performance when shrinkage was imposed using the literature (Figure 1). Notably, we found that the posterior mean of θ was higher for genes which were functionally related to cancer in the biomedical literature (Figure 2).

Table 1 Classification accuracy, sensitivity and specificity of the model including (LSBGG) or excluding literature.
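For readers unfamiliar with the three metrics named in Table 1, the sketch below shows how accuracy, sensitivity and specificity are conventionally computed from true and predicted class labels. The function name, the choice of which class counts as positive, and the toy labels in the usage note are illustrative assumptions; they are not data or code from this study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    def ratio(num, den):
        # A rate is undefined when its denominator is zero.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }
```

With made-up labels `y_true = [1, 1, 1, 0, 0, 0, 0, 1]` and `y_pred = [1, 1, 0, 0, 0, 1, 0, 1]`, there are three true positives, one false negative, three true negatives and one false positive, so all three metrics equal 0.75.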

Conclusions

These results demonstrate that LSBGG performs slightly better in classification of samples while using more biologically informative genes, and hence may simultaneously provide insights into the mechanisms underlying the phenotype of interest.

References

1. Dupuy A, Simon RM: Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting. JNCI J Natl Cancer Inst. 2007, 2: 147-157. 10.1093/jnci/djk018.
2. Madahian B, Deng L, Homayouni R: Application of Sparse Bayesian Generalized Linear Model to Gene Expression Data for Classification of Prostate Cancer Subtypes. Open Journal of Statistics. 2014, 2: 518-526. 10.4236/ojs.2014.47049.
3. Novianti PW, Roes KCB, Eijkemans MJC: Evaluation of Gene Expression Classification Studies: Factors Associated with Classification Performance. PLOS ONE. 2014, 9 (4): e96063. 10.1371/journal.pone.0096063.
4. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield CE: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.

Acknowledgements

This work was supported by the University of Memphis Center for Translational Informatics and the Assisi Foundation of Memphis.

Author information

Authors and Affiliations

Department of Mathematical Sciences, University of Memphis, Memphis, TN, 38152, USA
Behrouz Madahian & Lih Yuan Deng
Bioinformatics Program, University of Memphis, Memphis, TN, 38152, USA
Ramin Homayouni
Department of Biological Sciences, University of Memphis, Memphis, TN, 38152, USA
Ramin Homayouni

Corresponding author

Correspondence to Ramin Homayouni.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Madahian, B., Deng, L.Y. & Homayouni, R. Development of a literature informed Bayesian machine learning method for feature extraction and classification. BMC Bioinformatics 16 (Suppl 15), P9 (2015). https://doi.org/10.1186/1471-2105-16-S15-P9
