
# Integrating gene set analysis and nonlinear predictive modeling of disease phenotypes using a Bayesian multitask formulation

- Mehmet Gönen

*BMC Bioinformatics* **17** (Suppl 16), 2016

https://doi.org/10.1186/s12859-016-1311-3

© The Author(s) 2016

**Published:** 13 December 2016

## Abstract

### Background

Identifying molecular signatures of disease phenotypes is studied using two mainstream approaches: (i) Predictive modeling methods such as linear classification and regression algorithms are used to find signatures predictive of phenotypes from genomic data, which may not be robust due to limited sample sizes or the highly correlated nature of genomic data. (ii) Gene set analysis methods are used to find gene sets on which phenotypes are linearly dependent by bringing prior biological knowledge into the analysis, which may fail to capture more complex nonlinear dependencies. Thus, formulating an integrated model of gene set analysis and nonlinear predictive modeling is of great practical importance.

### Results

In this study, we propose a Bayesian binary classification framework to integrate gene set analysis and nonlinear predictive modeling. We then generalize this formulation to a multitask learning setting to model multiple related datasets conjointly. Our main novelty is the probabilistic nonlinear formulation, which enables us to robustly capture nonlinear dependencies between genomic data and phenotype even with small sample sizes. We demonstrate the performance of our algorithms using repeated random subsampling validation experiments on two cancer and two tuberculosis datasets by predicting important disease phenotypes from genome-wide gene expression data.

### Conclusions

We are able to obtain comparable or even better predictive performance than a baseline Bayesian nonlinear algorithm and to identify sparse sets of relevant genes and gene sets on all datasets. We also show that our multitask learning formulation enables us to further improve the generalization performance and to better understand biological processes behind disease phenotypes.

## Keywords

- Gene set analysis
- Nonlinear predictive modeling
- Disease phenotypes
- Multiple kernel learning
- Cancer
- Tuberculosis

## Background

Predictive modeling is frequently used to find molecular signatures of disease phenotypes from genomic data, which helps us better understand the biological processes underlying phenotypes and reduces the data acquisition cost for future clinical samples by doing targeted profiling instead of genome-wide screens. To this aim, supervised machine learning methods such as linear classification and regression algorithms are trained to predict disease phenotypes, and features with relatively high importance values (e.g. features with larger-magnitude weights) in these parametric models are included in the signature. However, as illustrated by existing studies [1, 2], molecular signatures identified by such algorithms may not be robust due to small sample sizes or the highly correlated nature of genomic data.

Gene set analysis methods try to identify gene sets on which disease phenotypes depend by calculating an enrichment score for each gene and transforming these scores into gene set scores using a summarization procedure [3]. The main advantage of these approaches is their ability to bring prior biological knowledge into the analysis in the form of biological pathways or sets of genes with similar biological functions [4], leading to more robust and clinically interpretable results than predictive modeling approaches. However, they usually assume linear dependencies between genomic data and phenotype, which may not reflect the underlying biology of the disease, and they have difficulty using very small or very large gene sets in the analysis.

To benefit from the best of both worlds, integrating gene set analysis and predictive modeling has already been considered in many existing studies [5–7], which modify linear classification and regression algorithms to include gene set information while doing feature selection for molecular signature extraction. Even though this family of methods captures dependencies between genes, it still fails to capture nonlinear dependencies between genomic data and phenotype.

We suggest integrating these two components within a nonlinear framework by extending our earlier Bayesian formulation [8]. Here, we develop a novel Bayesian multiple kernel learning algorithm, which trains a binary classifier with a sparse set of active gene sets using a sparsity-inducing prior, i.e. the spike and slab prior [9]. Using gene sets within a probabilistic formulation helps us identify more robust signatures even with small sample sizes. Using a kernel-based formulation enables us to capture nonlinear dependencies between genomic data and phenotype, and to use overlapping gene sets and gene sets with different sizes without any major concern. We also generalize our proposed formulation to a multitask learning setting to model multiple related datasets (e.g. different patient cohorts profiled against the same phenotype) conjointly, leading to better predictive performance and more robust molecular signatures. To the best of our knowledge, [10] provides the first joint formulation of gene set analysis and nonlinear predictive modeling, which performs a survival analysis on breast cancer patients using both clinical and genomic data, using an existing discriminative multiple kernel learning algorithm. However, our approach has important advantages over their method: (i) more robustness on clinical datasets with small sample sizes due to its probabilistic nature, (ii) its ability to perform automatic model selection (e.g. determining the sparsity level of kernel weights) due to its fully Bayesian inference mechanism and (iii) its ability to model multiple related datasets conjointly due to its multitask learning variant.

We perform repeated random subsampling validation experiments on two cancer and two tuberculosis datasets to demonstrate the better predictive performance of our two algorithms over a baseline Bayesian nonlinear algorithm and to show the biological relevance of the genes and gene sets selected to disease phenotypes modeled.

## Materials

In this study, we use two cancer and two tuberculosis datasets, where we solve binary classification problems to predict phenotype values from genomic data and to extract molecular signatures of disease phenotypes.

### Diagnosis of micro-satellite instability in colorectal and endometrial carcinomas

Micro-satellite instability is a hypermutable phenotype caused by the loss of DNA mismatch repair activity. It is frequently observed in several tumor types such as colorectal, endometrial, gastric, ovarian and sebaceous carcinomas [11]. Tumors with micro-satellite instability do not respond to chemotherapeutic strategies developed for micro-satellite stable tumors, leading to its clinical importance. That is why we address the problem of predicting micro-satellite instability status of cancer patients from their gene expression data. We use two publicly available datasets provided by ‘the Cancer Genome Atlas’ (TCGA) consortium: (i) ‘colon and rectum adenocarcinoma’ (COADREAD) patients [12] and (ii) ‘uterine corpus endometrial carcinoma’ (UCEC) patients [13].

Summary of the two cancer datasets (number of patients)

| Dataset | MSI-H | MSI-L | MSS | Total |
|---|---|---|---|---|
| COADREAD | 37 | 43 | 181 | 261 |
| UCEC | 108 | 27 | 195 | 330 |

### Diagnosis of tuberculosis in adult and pediatric individuals

Tuberculosis was responsible for 1.5 million deaths in 2013 according to the World Health Organization, which makes it the second greatest killer due to a single infectious agent after HIV. It is also the leading cause of death for HIV-infected people. Its diagnosis is currently based on clinical and radiological features, sputum microscopy and tuberculin skin testing, which usually give false results in HIV-infected individuals [14]. New clinical diagnostic tests, especially for resource-poor settings such as low-income countries with high rates of HIV, are needed to identify tuberculosis cases correctly for proper treatment. That is why we address the problem of predicting tuberculosis status of individuals from genome-wide RNA expression in host blood. We use two publicly available datasets of HIV-infected and -uninfected individuals from South Africa and Malawi: (i) adult individuals (ADULT) [14] and (ii) pediatric individuals (PEDIATRIC) [15].

Summary of the two tuberculosis datasets (number of individuals)

| Dataset | ATB | LTBI | OD | Total |
|---|---|---|---|---|
| ADULT | 195 | 167 | 175 | 537 |
| PEDIATRIC | 111 | 54 | 169 | 334 |

## Methods

We consider the problem of predicting phenotype values from genomic data using classification algorithms. Instead of training classifiers that use all available features, we want to develop classifiers that use very few but biologically relevant input features to identify a molecular signature of the phenotype and to reduce the data acquisition cost for test samples. However, the molecular signatures identified from, for example, gene expression data are not robust when we have limited training data [1, 2]. In such cases, we obtain different molecular signatures from different subsets of the same training set due to the highly correlated nature of the data, which makes knowledge extraction quite difficult. Instead, we can use our prior biological knowledge to group the input features and pick the relevant groups that are predictive of the phenotype while training the classification algorithm. We first discuss our proposed method, which learns a classifier and identifies predictive gene sets simultaneously on a single dataset. We then explain how we extend our method to model multiple related datasets by identifying a common set of predictive gene sets across them.

### Sparse Bayesian multiple kernel learning

We formulate the prediction task as a binary classification problem defined on the genomic data, denoted as domain \(\mathcal {X}\), and the phenotype, denoted as domain \(\mathcal {Y}\). We are given an independent and identically distributed sample \(\{\boldsymbol {x}_{i} \in \mathcal {X}\}_{i = 1}^{N}\) and a class label vector \(\boldsymbol {y} = \{y_{i} \in \mathcal {Y}\}_{i = 1}^{N}\), where *N* is the number of data points, and \(\mathcal {Y} = \{-1, +1\}\). We are also given a list of gene sets \(\{\mathcal {I}_{m}\}_{m = 1}^{P}\), which encode our prior biological knowledge in terms of gene names, where \(\mathcal {I}_{m}\) lists the names of genes in the gene set *m*, which may be a set of genes from a biological pathway or a set of genes with similar biological functions, and *P* is the number of gene sets.

We choose a kernel-based formulation [16] for three main reasons: (i) We can easily handle high-dimensional feature representations through kernel functions (i.e. the large *p*, small *n* setting). (ii) We can learn better classifiers using nonlinear kernels such as the Gaussian kernel (i.e. kernel trick). (iii) We can use domain-specific kernels (e.g. graph and tree kernels for structured objects) to better capture the underlying biological processes [17]. To calculate similarities between the data points, we have multiple kernel functions defined over gene sets, namely, \(\{k_{m}\colon \mathcal {X} \times \mathcal {X} \to \mathbb {R}\}_{m = 1}^{P}\), which are used to calculate the kernel matrices \(\{\mathbf {K}_{m}\}_{m = 1}^{P}\). For each gene set, the corresponding kernel \(k_{m}(\boldsymbol {x}_{i}, \boldsymbol {x}_{j} | \mathcal {I}_{m})\) considers only the features extracted from or related to the genes in \(\mathcal {I}_{m}\). We choose to learn a weighted combination of the input kernels \(\{\mathbf {K}_{m}\}_{m = 1}^{P}\) while training a binary classifier, which is known as multiple kernel learning [18], by extending our earlier Bayesian formulation [8] with a sparsity-inducing prior on the kernel weights. Figure 1 gives a schematic description of the proposed model.
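To make this setup concrete, one kernel matrix can be built per gene set by restricting the expression matrix to that set's genes. The sketch below uses hypothetical gene and set names and a placeholder Gaussian width; it illustrates the idea rather than reproducing the paper's implementation.

```python
import numpy as np

def gene_set_kernels(X, gene_names, gene_sets, s=1.0):
    """Compute one Gaussian kernel matrix per gene set.

    X          : (N, D) expression matrix, one row per sample.
    gene_names : length-D list mapping columns of X to gene names.
    gene_sets  : dict mapping gene set name -> iterable of gene names.
    s          : Gaussian kernel width (placeholder value).
    """
    name_to_col = {g: j for j, g in enumerate(gene_names)}
    kernels = {}
    for set_name, genes in gene_sets.items():
        # Keep only the genes that are actually measured in X.
        cols = [name_to_col[g] for g in genes if g in name_to_col]
        Xm = X[:, cols]
        # Squared Euclidean distances between all sample pairs.
        sq = ((Xm[:, None, :] - Xm[None, :, :]) ** 2).sum(axis=2)
        kernels[set_name] = np.exp(-sq / (2.0 * s ** 2))
    return kernels

# Toy example: 3 samples, 4 genes, 2 overlapping gene sets.
X = np.array([[0.0, 1.0, 2.0, 0.5],
              [0.1, 0.9, 1.8, 0.4],
              [2.0, -1.0, 0.0, 1.5]])
names = ["TP53", "MLH1", "MSH2", "KRAS"]
sets = {"MMR": ["MLH1", "MSH2"], "P53": ["TP53", "KRAS", "MLH1"]}
K = gene_set_kernels(X, names, sets)
```

Because each kernel sees only its own gene set's columns, overlapping sets and sets of different sizes pose no difficulty, which is exactly the property exploited by the multiple kernel learning formulation.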

#### Probabilistic model

Our proposed probabilistic model, called ‘sparse Bayesian multiple kernel learning’ (SBMKL), has three main parts: (i) finding kernel-specific latent variables using the same set of sample weights over the input kernels, (ii) assigning sparse weights to these latent variables using the spike and slab prior [9] and (iii) generating predicted outputs using the latent variables and these sparse weights together with a bias parameter.

The latent variables of the first part are generated as

\(\lambda_{i} \sim \mathrm{Gamma}(\lambda_{i}; \alpha_{\lambda}, \beta_{\lambda}) \quad \forall i\)

\(a_{i} | \lambda_{i} \sim \mathrm{Normal}(a_{i}; 0, \lambda_{i}^{-1}) \quad \forall i\)

\(g_{i}^{m} | \boldsymbol{a}, \boldsymbol{k}_{m, i} \sim \mathrm{Normal}(g_{i}^{m}; \boldsymbol{a}^{\top} \boldsymbol{k}_{m, i}, \sigma_{g}^{2}) \quad \forall (m, i),\)

where Normal(·; **μ**, **Σ**) represents the normal distribution with the mean vector **μ** and the covariance matrix **Σ**, and Gamma(·; *α*, *β*) denotes the gamma distribution with the shape parameter *α* and the scale parameter *β*. We generate the latent variables **g**^{m} for each input kernel **K**_{m} using the same set of sample weights a. Note that we need to use a small noise parameter *σ*_{g} while generating the latent variables to better generalize to test data points.

The sparse kernel weights of the second part are generated as

\(\kappa \sim \mathrm{Beta}(\kappa; \zeta_{\kappa}, \eta_{\kappa})\)

\(s_{m} | \kappa \sim \mathrm{Bernoulli}(s_{m}; \kappa) \quad \forall m\)

\(\omega_{m} \sim \mathrm{Gamma}(\omega_{m}; \alpha_{\omega}, \beta_{\omega}) \quad \forall m\)

\(e_{m} | \omega_{m} \sim \mathrm{Normal}(e_{m}; 0, \omega_{m}^{-1}) \quad \forall m,\)

where Beta(·; *ζ*, *η*) denotes the beta distribution with the shape parameters *ζ* and *η*, and Bernoulli(·; *π*) represents the Bernoulli distribution with the success probability parameter *π*. We generate a binary indicator variable *s*_{m} and a normally distributed weight *e*_{m} for each input kernel. The product of these two variables *s*_{m}*e*_{m} is a simple parameterization of the spike and slab prior, which is more amenable to approximate inference.

The predicted outputs of the third part are generated, following the auxiliary variable formulation of [19], as

\(\gamma \sim \mathrm{Gamma}(\gamma; \alpha_{\gamma}, \beta_{\gamma})\)

\(b | \gamma \sim \mathrm{Normal}(b; 0, \gamma^{-1})\)

\(f_{i} | b, \boldsymbol{e}, \boldsymbol{s}, \boldsymbol{g}_{i} \sim \mathrm{Normal}(f_{i}; (\boldsymbol{s} \odot \boldsymbol{e})^{\top} \boldsymbol{g}_{i} + b, 1) \quad \forall i\)

\(y_{i} f_{i} > \nu \quad \forall i,\)

where the margin parameter *ν* is introduced to resolve the scaling ambiguity and to place a low-density region between two classes, similar to the margin idea in support vector machines, which is generally used for semi-supervised learning [20].
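As a sanity check on the three-part generative process described above, it can be simulated directly. The sketch below is a toy simulation under the stated priors; all sizes and hyper-parameter values are illustrative, and the stand-in kernel matrices replace the per-gene-set kernels used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 20                      # samples, gene-set kernels (toy sizes)
sigma_g, nu = 0.1, 1.0             # latent noise and margin parameters

# Stand-in kernel matrices; in practice K[m] is computed per gene set.
K = rng.standard_normal((P, N, N))

# Part (i): sample weights with gamma-precision priors, shared across kernels.
lam = rng.gamma(shape=1.0, scale=1.0, size=N)
a = rng.normal(0.0, 1.0 / np.sqrt(lam))
G = np.stack([Km.T @ a + sigma_g * rng.standard_normal(N) for Km in K])

# Part (ii): spike and slab kernel weights s_m * e_m.
kappa = rng.beta(1.0, 999.0)       # prior strongly favors sparsity
s = rng.binomial(1, kappa, size=P)
omega = rng.gamma(shape=1.0, scale=1.0, size=P)
e = rng.normal(0.0, 1.0 / np.sqrt(omega))

# Part (iii): predicted outputs from the sparse combination plus a bias.
b = rng.normal(0.0, 1.0)
f = (s * e) @ G + b + rng.standard_normal(N)
y = np.where(f > 0, 1, -1)         # the model's truncation enforces y_i * f_i > nu
```

Note how the spike and slab draw typically switches off almost every kernel, mirroring the very sparse gene set selection the model is designed to produce.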

#### Inference using variational Bayes

We need to infer the posterior distribution over the model parameters and the latent variables, which we denote as **Θ**={λ,a,**G**,*κ*,s,*ω*,e,*γ*,*b*,f}, given the input kernel matrices \(\{\mathbf {K}_{m}\}_{m = 1}^{P}\) and the class labels y to find the predictive distribution for test data points. Unfortunately, exact inference for our proposed probabilistic model is intractable. Instead of using a computationally expensive Gibbs sampling approach [21], we choose to perform variational inference, which maximizes a lower bound on the marginal likelihood using an ensemble of factored posteriors to infer the joint parameter distribution [22].

The variational approximation measures the distance between *q* and the true posterior *p* using ‘the Kullback–Leibler divergence’ denoted as \(\mathcal {KL}(q||p)\). We can decompose the log evidence as

\(\log p(\boldsymbol{y} | \{\mathbf{K}_{m}\}_{m = 1}^{P}) = \mathcal{L}(q) + \mathcal{KL}(q||p),\)

where \(\mathcal{L}(q)\) is a lower bound that becomes tight when *q* matches the true posterior. We define *q*(**Θ**) as a factorized approximation:

\(q(\boldsymbol{\Theta}) = q(\boldsymbol{\lambda})\, q(\boldsymbol{a})\, q(\mathbf{G})\, q(\kappa)\, q(\boldsymbol{\omega})\, q(\boldsymbol{e} | \boldsymbol{s})\, q(\boldsymbol{s})\, q(\gamma)\, q(b)\, q(\boldsymbol{f}).\)

We do not use the standard mean-field factorization *q*(e)*q*(s) because it gives a unimodal distribution, but the true posterior distribution may have exponentially many modes. To capture this multimodal structure, we choose to formulate the factorization as *q*(e|s)*q*(s), which can be approximated efficiently [23]. We then write \(\mathcal {L}(q)\) in the form of expectations and maximize it with respect to each factor in turn; the optimal form of each factor *q*(*τ*) can be found as

\(q(\tau) \propto \exp\left(\langle \log p(\boldsymbol{y}, \boldsymbol{\Theta} | \{\mathbf{K}_{m}\}_{m = 1}^{P}) \rangle_{q(\boldsymbol{\Theta} \setminus \tau)}\right).\)

#### Inference details

The approximate posterior distributions of the precision parameters and the weight parameters are of gamma and normal form, where *α*(·), *β*(·), *μ*(·) and *Σ*(·) denote the shape parameter, the scale parameter, the mean vector and the covariance matrix of their arguments, respectively. Their closed-form updates follow from the generic factor update, where 〈*h*(·)〉 denotes the posterior expectation as usual, i.e. E_{q(·)}[*h*(·)].

The approximate posterior distributions of the sparsity parameters are of beta and Bernoulli form, where *ζ*(·), *η*(·) and *π*(·) denote the shape parameters and the success probability parameter of their arguments. Their updates are again available in closed form and depend on an auxiliary quantity *r*_{m}, calculated for each kernel from the current expectations of the remaining factors, which governs how likely that kernel is to be switched on.

The approximate posterior distribution of the predicted outputs is a truncated normal, where TruncatedNormal(·; **μ**, **Σ**, *ρ*(·)) denotes the truncated normal distribution with the mean vector **μ**, the covariance matrix **Σ** and the truncation rule *ρ*(·) such that TruncatedNormal(·; **μ**, **Σ**, *ρ*(·)) ∝ Normal(·; **μ**, **Σ**) if *ρ*(·) is true, and TruncatedNormal(·; **μ**, **Σ**, *ρ*(·)) = 0 otherwise. Its update enforces the margin constraint *y*_{i}*f*_{i} > *ν* on each training point.

#### Prediction scenario

In the prediction step, we replace \(p(\boldsymbol {a} | \{\mathbf {K}_{m}\}_{m = 1}^{P}, \boldsymbol {y})\) with its approximate posterior distribution *q*(a) and obtain the posterior predictive mean of the latent variables g_{⋆} for a new data point x_{⋆} as

\(\langle g_{\star}^{m} \rangle = \langle \boldsymbol{a} \rangle^{\top} \boldsymbol{k}_{m, \star} \quad \forall m.\)

The predictive distribution of the predicted output *f*_{⋆} can also be found by replacing \(p(b, \boldsymbol {e}, \boldsymbol {s} | \{\mathbf {K}_{m}\}_{m = 1}^{P}, \boldsymbol {y})\) with its approximate posterior distribution *q*(*b*)*q*(e|s)*q*(s):

\(\langle f_{\star} \rangle = \langle \boldsymbol{s} \odot \boldsymbol{e} \rangle^{\top} \langle \boldsymbol{g}_{\star} \rangle + \langle b \rangle.\)

We use 〈*f*_{⋆}〉 to predict the class label by looking at its sign.
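Prediction therefore needs only the stored posterior means. The following minimal sketch makes this explicit; the posterior summaries below are made up for illustration and would in practice come from the variational inference procedure.

```python
import numpy as np

def predict(a_mean, se_mean, b_mean, k_star):
    """Posterior predictive mean <f_*> and label for one test point.

    a_mean  : (N,) posterior mean of the sample weights <a>.
    se_mean : (P,) posterior mean of the sparse kernel weights <s * e>.
    b_mean  : posterior mean of the bias <b>.
    k_star  : (P, N) kernel values k_m(x_i, x_*) for each gene set m.
    """
    g_star = k_star @ a_mean                  # <g_*^m> = <a>^T k_{m,*}
    f_star = float(se_mean @ g_star + b_mean)
    return f_star, (1 if f_star > 0 else -1)  # label from the sign of <f_*>

# Toy posterior summaries: N = 4 training points, P = 3 gene-set kernels.
a_mean = np.array([0.2, -0.1, 0.4, 0.0])
se_mean = np.array([0.0, 1.5, 0.0])           # only one gene set is active
f, label = predict(a_mean, se_mean, 0.3, np.ones((3, 4)))
```

Because `se_mean` is sparse, only the kernels of the selected gene sets contribute to the prediction, so test samples only need profiling for the genes in those sets.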

### Sparse Bayesian multitask multiple kernel learning

We formulate the joint modeling of prediction tasks on multiple datasets using a multitask learning approach, which models distinct but related tasks conjointly to improve overall generalization performance. We are given *T* datasets, and, for each dataset, we have an independent and identically distributed sample \(\{\boldsymbol {x}_{t,i} \in \mathcal {X}\}_{i = 1}^{N_{t}}\) and a class label vector \(\boldsymbol {y}_{t} = \{y_{t,i} \in \mathcal {Y}\}_{i = 1}^{N_{t}}\), where \(N_{t}\) is the number of data points in the dataset *t*. We also have a list of gene sets \(\{\mathcal {I}_{m}\}_{m = 1}^{P}\), which are shared across the tasks, and the corresponding kernel functions \(\{k_{t, m}(\cdot, \cdot | \mathcal {I}_{m})\}_{m = 1}^{P}\) for each task.

#### Probabilistic model

Our single-task learning model SBMKL is extended towards multitask learning to obtain ‘sparse Bayesian multitask multiple kernel learning’ (SBMTMKL), where the binary indicator variables s are shared across the tasks to select a common set of gene sets, whereas the sample weights \(\boldsymbol{a}_{t}\), the kernel weights \(\boldsymbol{e}_{t}\) and the bias parameters \(b_{t}\) remain task-specific.

#### Inference using variational Bayes

As in the single-task case, exact inference is intractable, and we define *q*(**Θ**) as a factorized approximation in which the factors of the shared sparsity parameters *κ* and s are common to all tasks, whereas the remaining factors are replicated for each task.

#### Inference details

The closed-form updates mirror those of SBMKL; the auxiliary quantity *r*_{m} is now calculated by accumulating the contributions of all tasks that share the binary indicator variable *s*_{m}.

#### Prediction scenario

In the prediction step, we use *q*(a_{t}) instead of \(p(\boldsymbol {a}_{t}| \{\{\mathbf {K}_{t, m}\}_{m = 1}^{P}, \boldsymbol {y}_{t}\}_{t = 1}^{T})\) and obtain the posterior predictive mean of the latent variables g_{t,⋆} for a new data point x_{t,⋆} in the task *t* as

\(\langle g_{t, \star}^{m} \rangle = \langle \boldsymbol{a}_{t} \rangle^{\top} \boldsymbol{k}_{t, m, \star} \quad \forall m.\)

The predictive distribution of the predicted output *f*_{t,⋆} can also be found by replacing \(p\left (b_{t}, \boldsymbol {e}_{t}, \boldsymbol {s} | \left \{ \{\mathbf {K}_{t, m}\}_{m = 1}^{P}, \boldsymbol {y}_{t}\right \}_{t = 1}^{T}\right)\) with its approximate posterior distribution *q*(*b*_{t})*q*(e_{t}|s)*q*(s):

\(\langle f_{t, \star} \rangle = \langle \boldsymbol{s} \odot \boldsymbol{e}_{t} \rangle^{\top} \langle \boldsymbol{g}_{t, \star} \rangle + \langle b_{t} \rangle.\)

We use 〈*f*_{t,⋆}〉 to predict the class label by looking at its sign.

### Baseline algorithm

As the baseline algorithm, we train a Bayesian variant of the relevance vector machine [24], which calculates a single kernel matrix over all input features and generates each predicted output from a model of the form \(f_{i} = \boldsymbol{a}^{\top} \boldsymbol{k}_{i} + b\). We again learn the posterior distribution over the model parameters and the latent variables using a deterministic variational approximation as we do for our methods. We call this algorithm ‘Bayesian relevance vector machine’ (BRVM). We have three main reasons for choosing this particular baseline algorithm: (i) BRVM can make use of kernel functions to obtain nonlinear models like our methods. (ii) We can see the effect of using gene set information by comparing our methods to BRVM. (iii) BRVM uses the same type of inference mechanism as our methods.

## Results and discussion

To illustrate the effectiveness of our proposed methods SBMKL and SBMTMKL, we report their results on four datasets (i.e. two cancer and two tuberculosis datasets) and compare them to the baseline algorithm BRVM, which does not make use of gene set information, using repeated random subsampling validation experiments.

### Experimental settings

For each dataset, we create 100 random train/test splits to obtain robust results. For each replication, the training set is defined by randomly selecting 75 % of the data points with stratification on the phenotype, and the remaining 25 % of the samples are used as the test set. The training set is normalized to have zero mean and unit standard deviation, and the test set is then normalized using the mean and the standard deviation of the original training set.
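This split-and-normalize protocol can be sketched as follows. The code is a dependency-free illustration of the stated procedure, not the paper's implementation; scikit-learn's `StratifiedShuffleSplit` offers equivalent splitting behavior.

```python
import numpy as np

def subsample_split(y, train_frac=0.75, rng=None):
    """One stratified random train/test split over the label vector y."""
    if rng is None:
        rng = np.random.default_rng()
    train_idx = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        # Keep train_frac of each class in the training set (stratification).
        train_idx.extend(idx[:int(round(train_frac * idx.size))])
    train = np.array(sorted(train_idx))
    test = np.setdiff1d(np.arange(y.size), train)
    return train, test

def normalize(X_train, X_test):
    """Z-score both sets using the training statistics only."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)    # guard against constant features
    return (X_train - mu) / sd, (X_test - mu) / sd
```

Normalizing the test set with the training statistics, rather than its own, avoids leaking information from the test samples into the trained model.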

We extract gene sets from ‘the Molecular Signatures Database’ (MSigDB) [3], which contains curated pathway gene sets from online databases such as ‘the Kyoto Encyclopedia of Genes and Genomes’ (KEGG) [25] and ‘the Pathway Interaction Database’ (PID) [26]. In our experiments, we use 196 PID pathways reported in MSigDB as our gene set collection.

We use the Gaussian kernel in our experiments:

\(k(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}) = \exp\left(-\frac{\|\boldsymbol{x}_{i} - \boldsymbol{x}_{j}\|_{2}^{2}}{2 s^{2}}\right),\)

where \(\|\cdot\|_{2}\) denotes the *ℓ*_{2}-norm, and we set the kernel width *s* to the mean of pairwise Euclidean distances between the data points.
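The width heuristic amounts to a single pass over the pairwise distances, as in this minimal sketch:

```python
import numpy as np

def gaussian_kernel(X, s=None):
    """Gaussian kernel matrix over the rows of X; the width s defaults
    to the mean pairwise Euclidean distance between the data points."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))        # pairwise distances
    if s is None:
        # Mean over distinct pairs (the zero diagonal is excluded).
        s = dist[np.triu_indices(X.shape[0], k=1)].mean()
    return np.exp(-dist ** 2 / (2.0 * s ** 2))
```

Choosing *s* at the scale of typical distances keeps the kernel values well spread between 0 and 1 instead of saturating at either extreme.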

For BRVM, we calculate a single kernel over all input features. For SBMKL and SBMTMKL, we calculate a separate kernel function for each gene set over the corresponding features. Note that the Gaussian kernels calculated on the gene sets take values between 0 and 1 by definition, and there is no need for eliminating small/large gene sets or performing additional normalization steps to remove the effect of gene set size.

The hyper-parameter values of BRVM are selected as \((\alpha_{\lambda}, \beta_{\lambda}) = (1, 1)\), \((\alpha_{\gamma}, \beta_{\gamma}) = (1, 1)\) and \(\nu = 1\). The hyper-parameter values of SBMKL and SBMTMKL are selected as \((\alpha_{\lambda}, \beta_{\lambda}) = (1, 1)\), \(\sigma_{g} = 0.1\), \((\zeta_{\kappa}, \eta_{\kappa}) = (1, 999)\), \((\alpha_{\omega}, \beta_{\omega}) = (1, 1)\), \((\alpha_{\gamma}, \beta_{\gamma}) = (1, 1)\) and \(\nu = 1\). Note that \((\zeta_{\kappa}, \eta_{\kappa})\) are set to these particular values to produce very sparse binary indicator variables, leading to classifiers that use very few gene sets for prediction. For BRVM, we perform 200 iterations during variational inference, whereas we perform 50 iterations for SBMKL and SBMTMKL.
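To see why setting \((\zeta_{\kappa}, \eta_{\kappa}) = (1, 999)\) produces sparsity, consider the prior mean of the kernel inclusion probability *κ* under the beta prior:

\(\mathrm{E}[\kappa] = \frac{\zeta_{\kappa}}{\zeta_{\kappa} + \eta_{\kappa}} = \frac{1}{1000},\)

so each of the *P* = 196 kernels is a priori active with probability 0.001 (roughly 0.2 active kernels in expectation), and a gene set enters the classifier only when it is strongly supported by the data.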

We use ‘area under the receiver operating characteristic curve’ (AUROC) to compare classification results. AUROC summarizes the receiver operating characteristic curve, which plots the true positive rate as a function of the false positive rate as the classification threshold varies. Larger AUROC values correspond to better performance.
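AUROC can also be computed directly from the rank statistics of the scores via the Wilcoxon–Mann–Whitney identity, as in this minimal sketch:

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the Wilcoxon-Mann-Whitney identity: the fraction of
    (positive, negative) pairs ranked correctly, counting ties as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == -1]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

A perfect ranking yields 1.0, a random one 0.5, which matches the interpretation that larger AUROC values correspond to better performance.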

### Classification results on the cancer datasets

### Classification results on the tuberculosis datasets

### Biological results on the cancer datasets

Gene set selection results on the cancer datasets for MSI-H versus others classification (selection frequencies across replications)

| Gene set name (SBMKL on COADREAD) | Frequency | Gene set name (SBMKL on UCEC) | Frequency | Gene set name (SBMTMKL on COADREAD and UCEC) | Frequency |
|---|---|---|---|---|---|
| WNT_NONCANONICAL_PATHWAY | 0.10 | P53DOWNSTREAMPATHWAY | 0.92 | P53DOWNSTREAMPATHWAY | 0.99 |
| TGFBRPATHWAY | 0.09 | NOTCH_PATHWAY | 0.83 | NOTCH_PATHWAY | 0.92 |
| DELTANP63PATHWAY | 0.07 | NFAT_TFPATHWAY | 0.26 | WNT_NONCANONICAL_PATHWAY | 0.61 |
| TAP63PATHWAY | 0.07 | IL5_PATHWAY | 0.24 | NFAT_TFPATHWAY | 0.41 |
| RB_1PATHWAY | 0.07 | P53REGULATIONPATHWAY | 0.24 | AR_PATHWAY | 0.34 |
| NFAT_3PATHWAY | 0.06 | CDC42_REG_PATHWAY | 0.20 | RHOA_PATHWAY | 0.21 |
| ATF2_PATHWAY | 0.06 | AVB3_OPN_PATHWAY | 0.15 | REG_GR_PATHWAY | 0.21 |
| SMAD2_3NUCLEARPATHWAY | 0.05 | WNT_NONCANONICAL_PATHWAY | 0.14 | UPA_UPAR_PATHWAY | 0.17 |
| P73PATHWAY | 0.05 | REG_GR_PATHWAY | 0.13 | BMPPATHWAY | 0.17 |
| MYC_ACTIVPATHWAY | 0.05 | UPA_UPAR_PATHWAY | 0.11 | RAC1_PATHWAY | 0.14 |

### Biological results on the tuberculosis datasets

Gene set selection results on the tuberculosis datasets for ATB versus others classification (selection frequencies across replications)

| Pathway name (SBMKL on ADULT) | Frequency | Pathway name (SBMKL on PEDIATRIC) | Frequency | Pathway name (SBMTMKL on ADULT and PEDIATRIC) | Frequency |
|---|---|---|---|---|---|
| ERBB_NETWORK_PATHWAY | 0.73 | A6B1_A6B4_INTEGRIN_PATHWAY | 0.31 | RHODOPSIN_PATHWAY | 0.67 |
| AP1_PATHWAY | 0.55 | RAS_PATHWAY | 0.27 | ERBB_NETWORK_PATHWAY | 0.63 |
| CONE_PATHWAY | 0.44 | INTEGRIN1_PATHWAY | 0.24 | AP1_PATHWAY | 0.60 |
| AR_TF_PATHWAY | 0.42 | RAC1_PATHWAY | 0.21 | SYNDECAN_1_PATHWAY | 0.51 |
| CERAMIDE_PATHWAY | 0.31 | RHODOPSIN_PATHWAY | 0.20 | PLK1_PATHWAY | 0.50 |
| RHODOPSIN_PATHWAY | 0.31 | SYNDECAN_1_PATHWAY | 0.17 | CERAMIDE_PATHWAY | 0.42 |
| SYNDECAN_1_PATHWAY | 0.29 | ATM_PATHWAY | 0.16 | ATM_PATHWAY | 0.41 |
| FANCONI_PATHWAY | 0.25 | ATF2_PATHWAY | 0.15 | AR_TF_PATHWAY | 0.40 |
| RXR_VDR_PATHWAY | 0.24 | THROMBIN_PAR1_PATHWAY | 0.15 | ATF2_PATHWAY | 0.37 |
| HNF3BPATHWAY | 0.23 | IL12_2PATHWAY | 0.13 | HNF3APATHWAY | 0.35 |

## Conclusions

Integrating gene set analysis and predictive modeling is already considered by many existing studies, which fail either to capture nonlinear dependencies between genomic data and phenotype or to model multiple related datasets conjointly.

In this study, we integrate gene set analysis and nonlinear predictive modeling of disease phenotypes by casting this problem into a binary classification framework defined on the gene sets with a sparsity-inducing prior on their weights. To this aim, we propose a Bayesian multiple kernel learning algorithm, which produces a classifier with sparse gene set weights, by extending our earlier Bayesian formulation [8]. We then generalize this new algorithm to a multitask learning setting to be able to model multiple related datasets conjointly, leading to better generalization performance and more robust molecular signatures. The main novelty of our methods is the integration of gene set analysis and nonlinear predictive modeling using a probabilistic formulation, which enables us to robustly capture nonlinear dependencies between genomic data and phenotype even with small sample sizes, and to use overlapping gene sets and gene sets with different sizes without any major concern. Our approach brings two additional advantages: (i) We can identify very few gene sets predictive of the phenotype, which may shed light on the underlying biological processes. (ii) We can reduce the data acquisition cost for test samples in clinical settings by collecting only the features used in our classifier.

To demonstrate the performance of our algorithms SBMKL and SBMTMKL, we perform repeated random subsampling validation experiments on four datasets of two major human diseases, namely, cancer and tuberculosis. On the two cancer datasets [12, 13], we decide whether a colorectal or endometrial tumor displays micro-satellite instability using its mRNA gene expression data. On the two tuberculosis datasets [14, 15], we diagnose whether an adult or pediatric individual has an active tuberculosis infection using his/her whole blood RNA expression data. We compare our two methods to a baseline Bayesian nonlinear algorithm that is trained on all available genomic data without using gene set information. Our methods obtain comparable or even better predictive performance using very few features (i.e. less than 2.5 % of the input features) on all datasets. We also show that we are able to identify biologically relevant genes and gene sets for cancer and tuberculosis phenotypes, which are validated by the existing studies from the literature. The results of our multitask learning algorithm show that modeling multiple related datasets conjointly enables us to further improve the generalization performance and to better understand biological processes behind disease phenotypes.

In the experiments reported, we use real-valued gene expression measurements as genomic data. Our methods can also be applied to discrete data such as mutation profiles of tumors, which are hard to use in classical gene set analysis methods due to their very sparse nature. As a possible extension, we plan to use our kernel-based formulations on cancer datasets to identify driver mutations using kernels for discrete data such as the Jaccard similarity coefficient.
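The proposed extension could start from a kernel like the following sketch for binary mutation profiles; it illustrates the Jaccard similarity coefficient mentioned above and is not part of the reported experiments.

```python
import numpy as np

def jaccard_kernel(X):
    """Jaccard similarity between rows of a binary mutation matrix X:
    |intersection| / |union| of the mutated gene sets of two samples."""
    X = np.asarray(X, dtype=bool)
    inter = (X[:, None, :] & X[None, :, :]).sum(axis=2)
    union = (X[:, None, :] | X[None, :, :]).sum(axis=2)
    K = np.ones((X.shape[0], X.shape[0]))   # convention: empty vs empty -> 1
    np.divide(inter, union, out=K, where=union > 0)
    return K
```

Because the similarity is a ratio of shared to total mutations, very sparse profiles still yield informative kernel values, which is exactly what makes mutation data hard for classical gene set analysis but workable in a kernel-based formulation.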

## Declarations

This article has been published as part of *BMC Bioinformatics* Volume 17 Supplement 16, 2016: Proceedings of the Tenth International Workshop on Machine Learning in Systems Biology (MLSB 2016). The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-16.

### Funding

Publication of this article was funded by Koç University.

### Availability of data and materials

All datasets used in this study were made publicly available previously by the corresponding data providers as mentioned in the manuscript. Matlab and R implementations of our two methods are available at https://github.com/mehmetgonen/sbmkl.

### Authors’ contributions

MG designed the study, implemented the algorithms, carried out the computational experiments, analyzed the results, and wrote the manuscript.

### Competing interests

The author declares that he has no competing interests.

### Consent for publication

Not applicable.

### Ethics approval and consent to participate

Not applicable.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

1. Ein-Dor L, et al. Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics. 2005; 21:171–8.
2. Ein-Dor L, et al. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci USA. 2006; 103:5923–8.
3. Subramanian A, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005; 102:15545–50.
4. Khatri P, et al. Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Comput Biol. 2012; 8:e1002375.
5. Tai F, Pan W. Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms. Bioinformatics. 2007; 23:1775–82.
6. Tai F, Pan W. Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data. Bioinformatics. 2007; 23:3170–7.
7. Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008; 24:1175–82.
8. Gönen M. Bayesian efficient multiple kernel learning. In: Proceedings of the 29th International Conference on Machine Learning. Edinburgh: Omnipress; 2012.
9. Mitchell TJ, Beauchamp JJ. Bayesian variable selection in linear regression. J Amer Statist Assoc. 1988; 83:1023–32.
10. Seoane JA, et al. A pathway-based data integration framework for prediction of disease progression. Bioinformatics. 2014; 30:838–45.
11. Vilar E, Gruber SB. Microsatellite instability in colorectal cancer—the stable evidence. Nat Rev Clin Oncol. 2010; 7:153–62.
12. The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012; 487:330–7.
13. The Cancer Genome Atlas Research Network. Integrated genomic characterization of endometrial carcinoma. Nature. 2013; 497:67–73.
14. Kaforou M, et al. Detection of tuberculosis in HIV-infected and -uninfected African adults using whole blood RNA expression signatures: A case-control study. PLoS Med. 2013; 10:e1001538.
15. Anderson ST, et al. Diagnosis of childhood tuberculosis and host RNA expression in Africa. N Engl J Med. 2014; 370:1712–23.
16. Schölkopf B, Smola AJ. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. 2002.
17. Schölkopf B, et al. Kernel Methods in Computational Biology. 2004.
18. Gönen M, Alpaydın E. Multiple kernel learning algorithms. J Mach Learn Res. 2011; 12:2211–68.
19. Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. J Amer Statist Assoc. 1993; 88:669–79.
20. Lawrence ND, Jordan MI. Semi-supervised learning via Gaussian processes. Adv Neural Inf Process Syst. 2005; 17:753–60.
21. Gelfand AE, Smith AFM. Sampling-based approaches to calculating marginal densities. J Amer Statist Assoc. 1990; 85:398–409.
22. Jordan MI, et al. An introduction to variational methods for graphical models. Mach Learn. 1999; 37:183–233.
23. Titsias MK, Lázaro-Gredilla M. Spike and slab variational inference for multi-task and multiple kernel learning. Adv Neural Inf Process Syst. 2011; 24:2339–47.
24. Tipping ME. Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res. 2001; 1:211–44.
25. Ogata H, et al. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999; 27:29–34.
26. Schaefer CF, et al. PID: The Pathway Interaction Database. Nucleic Acids Res. 2009; 37:D674–9.
27. Timmers C, et al. E2f1, E2f2, and E2f3 control E2F target expression and cellular proliferation via a p53-dependent negative feedback loop. Mol Cell Biol. 2007; 27:65–78.
28. Boggaram V, et al. Early secreted antigenic target of 6 kDa (ESAT-6) protein of Mycobacterium tuberculosis induces interleukin-8 (IL-8) expression in lung epithelial cells via protein kinase signaling and reactive oxygen species. J Biol Chem. 2013; 288:25500–11.
29. Mamishi S, et al. Diagnostic accuracy of IL-2 for the diagnosis of latent tuberculosis: A systematic review and meta-analysis. Eur J Clin Microbiol Infect Dis. 2014; 33:2111–9.
30. Martinez AN, et al. Role of interleukin 6 in innate immunity to Mycobacterium tuberculosis infection. J Infect Dis. 2013; 207:1253–61.