Heuristic algorithms for feature selection under Bayesian models with block-diagonal covariance structure

Background Many bioinformatics studies aim to identify markers, or features, that can be used to discriminate between distinct groups. In problems where strong individual markers are not available, or where interactions between gene products are of primary interest, it may be necessary to consider combinations of features as a marker family. To this end, recent work proposes a hierarchical Bayesian framework for feature selection that places a prior on the set of features we wish to select and on the label-conditioned feature distribution. While an analytical posterior under Gaussian models with block covariance structures is available, the optimal feature selection algorithm for this model remains intractable since it requires evaluating the posterior over the space of all possible covariance block structures and feature-block assignments. To address this computational barrier, in prior work we proposed a simple suboptimal algorithm, 2MNC-Robust, with robust performance across the space of block structures. Here, we present three new heuristic feature selection algorithms. Results The proposed algorithms outperform 2MNC-Robust and many other popular feature selection algorithms on synthetic data. In addition, enrichment analysis on real breast cancer, colon cancer, and leukemia data indicates they also output many of the genes and pathways linked to the cancers under study. Conclusions Bayesian feature selection is a promising framework for small-sample high-dimensional data, in particular biomarker discovery applications. When applied to cancer data, these algorithms output many genes already shown to be involved in cancer as well as potentially new biomarkers. Furthermore, one of the proposed algorithms, SPM, outputs blocks of heavily correlated genes, which is particularly useful for studying gene interactions and gene networks. 
Electronic supplementary material The online version of this article (10.1186/s12859-018-2059-8) contains supplementary material, which is available to authorized users.


Classification on Synthetic Microarray Data
Here we use the synthetic microarray model with block size k = 5, and consider all mean types and correlations of section 4.2. Given the features selected by a feature selection algorithm, we train the following classifiers: Regularized Quadratic Discriminant Analysis (RQDA), Diagonal Quadratic Discriminant Analysis (DQDA), Regularized Linear Discriminant Analysis (RLDA), Diagonal Linear Discriminant Analysis (DLDA), a linear Support Vector Machine (SVM), a quadratic SVM, an SVM with Radial Basis Function (RBF) kernel, a Generalized Linear Model (GLM) with logit link, and a GLM with probit link. We generate 1000 points from each class as test data, and test the trained classifiers on them. We iterate 500 times to find the classification error of each classifier. For each feature selection algorithm, we assign the minimum error over all constructed classifiers as the error of that algorithm. Figures 10 to 12 plot the error of classifiers built using the feature selection algorithms of section 4.2. As the results indicate, methods that perform better on feature selection typically yield lower classification error. However, there are a few exceptions. For instance, when markers are marginal and ρ_0 = ρ_1 = 0.1, POFAC slightly outperforms 2MNC-Robust in feature selection, but has higher classification error. Furthermore, whenever the sample size is larger than 30, the proposed algorithms yield lower classification error than the other feature selection algorithms.
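The rule for assigning one error to each feature selection algorithm can be sketched as follows. This is only an illustration under stated assumptions: we assume the per-iteration test errors of each classifier are averaged before the minimum is taken, and the function name algorithm_error and the example error rates are hypothetical, not values from the experiments.

```python
from statistics import mean


def algorithm_error(errors_by_classifier):
    """Return (best_classifier, error) for one feature selection algorithm.

    errors_by_classifier maps a classifier name to the list of test errors
    observed over the repeated iterations. Averaging over iterations is an
    assumption of this sketch.
    """
    avg = {name: mean(errs) for name, errs in errors_by_classifier.items()}
    # The algorithm's error is the smallest average error among classifiers.
    best = min(avg, key=avg.get)
    return best, avg[best]


# Hypothetical error rates for three of the nine classifiers, 3 iterations each.
example = {
    "RLDA": [0.20, 0.22, 0.21],
    "DLDA": [0.25, 0.24, 0.26],
    "SVM-RBF": [0.19, 0.23, 0.24],
}
best_name, best_err = algorithm_error(example)
print(best_name, round(best_err, 3))
```

Taking the minimum over classifiers compares feature sets by the best classifier each one can support, so a feature set is not penalized for pairing badly with any single classification rule.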

Real Datasets
Here we present the top 100 genes of CMNC-OBF, REMAIN, and POFAC on the cancer datasets. We also present the top 20 pathways of CMNC-OBF, REMAIN, POFAC, and SPM, and examine whether the top genes and pathways have already been shown or suggested to be involved in the cancer under study. We also identify potential biomarkers whose role in cancer requires further investigation.
We also report the results of linear models for microarray and RNA-Seq data (limma) on these real datasets. To find the differentially expressed genes, we fit a linear model with a design matrix X consisting of two columns, where each column is the indicator of the sample points belonging to one class. We take the contrast matrix C to be [1, −1], which captures the difference in means between the two classes. We then find the adjusted p-values using the moderated t-statistic obtained by the empirical Bayes approach described in [1]. We pick the top 2000 genes, as well as the significant genes obtained by bounding the FDR by 0.05 using the Benjamini-Hochberg procedure [2]. We report the top 100 genes, and the top enriched PANTHER pathways for both the top 2000 genes and the significant genes.
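For readers unfamiliar with the FDR step, the Benjamini-Hochberg adjustment can be sketched in a few lines. This is a generic illustration of the procedure, not the limma implementation; the function names benjamini_hochberg and significant are hypothetical, the input p-values are made up, and the moderated t-statistic that produces the p-values is not reproduced here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(pvalues, fdr=0.05):
    """Indices of the tests declared significant at the given FDR bound."""
    return [i for i, q in enumerate(benjamini_hochberg(pvalues)) if q <= fdr]


# Made-up p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(pvals))
print(significant(pvals))
```

A gene is reported as significant when its adjusted p-value is at most the FDR bound, which is 0.05 in the analysis above.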
For REMAIN with T_1 = 0.005, the top pathway, ubiquitin proteasome pathway, is suggested to be involved in breast cancer [10], and the second top gene, ZNF192, is suggested to be involved in metastatic progression of breast cancer [11].
For enrichment analysis with the POFAC gene set, the top pathway, CCKR signaling map [12], is suggested to be involved in breast cancer, as are the second and third top genes, PHTF1 [6,7,8,9] and MUC5AC [13,14].
Finally, we report the results of limma on this dataset. Tab. 4 lists the top 100 genes, and Tabs. 9 and 10 report the enriched pathways for the top 2000 genes and the significant genes, respectively. Note that, using the top 2000 genes and the 782 unique significant genes, PANTHER pathways recognize 350 and 157 genes, respectively.
Many of the top genes selected by the proposed algorithms are not among the top 2000 genes of limma. For instance, DCT, PHTF1, MUC5AC, ZNF227, ZP2, and CEACAM7 are not selected by limma. On the other hand, many of the top genes of limma are among the top 100 genes of the proposed algorithms. For instance, the top gene of limma, CENPF, is among the top 100 genes of CMNC-OBF, POFAC, and REMAIN, and the second and fourth top genes of limma, KIAA0101 and H2AFZ, are selected by CMNC-OBF and REMAIN. Furthermore, the third top gene of limma, PDGFD, is selected by CMNC-OBF.

Colon Cancer Dataset
This dataset is curated on GEO with accession number GSE41850, and contains gene expression levels of 238 patients in stages 1-4 of colon cancer. The 28 stage 1 patients comprise class 0 and the remaining patients comprise class 1. The top 100 genes of REMAIN, CMNC-OBF, and POFAC are provided in Tabs. 11, 12 and 13, respectively. The top 20 pathways of CMNC-OBF, REMAIN, POFAC, and SPM are listed in Tabs. 15, 16, 17, and 18, respectively. We again start with CMNC-OBF. The top pathway, cadherin signaling pathway [41,42], and the top gene, CPNE4 [43], are suggested to be involved in colon cancer. Next, we look at the enrichment analysis of REMAIN with T_1 = 0.01. The top pathway, cadherin signaling pathway [41,42], and the top gene, EPHA7 [44,45], are suggested to be involved in colon cancer. We then use the POFAC gene set for enrichment analysis. The top pathway, ionotropic glutamate receptor pathway [46,47], is suggested to be involved in colon cancer, as are the top two genes, EPHA7 [44,45] and CPNE4 [43].
Using genes detected by SPM with T 4 = 10 7 n 4 , t 1 = 10 6 and t 2 = 4, we performed enrichment analysis. The top pathway, ionotropic glutamate receptor pathway, is shown to be involved in colon cancer [46,47].
We also find many top genes and pathways selected by several algorithms among their top 20 genes and pathways.  [58,59], are all suggested to be involved in colon cancer.
Finally, we report the results of limma on this dataset. Tab. 14 lists the top 100 genes, and Tabs. 19 and 20 report the enriched pathways for the top 2000 genes and the significant genes, respectively. Note that, using the top 2000 genes and the 194 unique significant genes, PANTHER pathways recognize 333 and 44 genes, respectively.
While some top genes, such as SCN7A, are shared between limma and the Bayesian methods, many of the top genes of the Bayesian methods, such as EPHA7 and CPNE4, are not among the top 2000 genes of limma. On the other hand, many top genes of limma are among the top 100 genes of the proposed Bayesian algorithms.