# MCMC implementation of the optimal Bayesian classifier for non-Gaussian models: model-based RNA-Seq classification

- Jason M Knight
^{1}Email author, - Ivan Ivanov
^{2}and - Edward R Dougherty
^{1, 3}

**15**:401

https://doi.org/10.1186/s12859-014-0401-3

© Knight et al.; licensee BioMed Central Ltd. 2014

**Received: **3 April 2014

**Accepted: **27 November 2014

**Published: **10 December 2014

## Abstract

### Background

Sequencing datasets consist of a finite number of reads which map to specific regions of a reference genome. Most effort in modeling these datasets focuses on the detection of univariate differentially expressed genes. However, for classification, we must consider multiple genes and their interactions.

### Results

Thus, we introduce a hierarchical multivariate Poisson model (MP) and the associated optimal Bayesian classifier (OBC) for classifying samples using sequencing data. Lacking closed-form solutions, we employ a Monte Carlo Markov Chain (MCMC) approach to perform classification. We demonstrate superior or equivalent classification performance compared to typical classifiers for two synthetic datasets and over a range of classification problem difficulties. We also introduce the Bayesian minimum mean squared error (MMSE) conditional error estimator and demonstrate its computation over the feature space. In addition, we demonstrate superior or leading class performance over an RNA-Seq dataset containing two lung cancer tumor types from The Cancer Genome Atlas (TCGA).

### Conclusions

Through model-based, optimal Bayesian classification, we demonstrate superior classification performance for both synthetic and real RNA-Seq datasets. A tutorial video and Python source code is available under an open source license at http://bit.ly/1gimnss.

## Keywords

## Background

The possibility of genomic phenotype classification arose with the inception of gene-expression microarrays. From the outset, two fundamental problems have frustrated the endeavor: (1) the inaccuracy of microarray measurements, and (2) small samples. Our particular application of interest is classification using RNA-Seq data. Modern RNA-Seq technologies sequence small RNA fragments (mRNA) to measure gene expression, where the number of reads mapped to a gene on the reference genome defines the count data. Given that RNA-Seq data has advantages over microarray data, in particular, more accurate measurement, we still confront the second fundamental problem, which is statistical, not technological: small samples cause re-sampling-based classifier error estimators to be very inaccurate due to excessive variance and lack of regression with the true error [1]-[4]. Since the error rate of a classifier quantifies its predictive accuracy, it is the salient epistemological attribute of any classifier. The inability to satisfactorily estimate the error with model-free methods with small samples implies that genomic classifier error estimation is virtually impossible without the use of prior information, so that the whole small-sample classification problem becomes unapproachable in a model-free framework [5].

The situation has been addressed by utilizing prior knowledge via a Bayesian approach that considers a prior distribution on an uncertainty class of feature-label distributions [6],[7]. For expression-based classification, prior distributions have been constructed using expression data not employed in classifier design [8] and known regulatory pathways [9]. Given that a prior model must be assumed to achieve satisfactory error estimation, an obvious course of action is to derive an optimal classifier based on the prior knowledge and the sample data, the result being an optimal Bayesian classifier (OBC) that is guaranteed to have the best average performance of any classifier relative to the posterior distribution derived from the prior distribution and data [10],[11]. While Bayesian classification does not depend on particular distributional forms, closed-form solutions have been derived for the multinomial model and Gaussian models using linear classifiers for the minimum mean squared error (MMSE) error estimate [6],[7], the MSE of the error estimate [12],[13], and an optimal Bayesian classifier (OBC) relative to the prior distribution [10],[11], the latter being expressed in terms of *effective class conditional distributions*, which are expectations relative to the posterior distribution of the class-conditional distributions. The closed-form solutions depend on particular models (multinomial and Gaussian) and the existence of conjugate priors, which can be too constraining for practical applications such as RNA-Seq classification.

Much of the statistical literature concerning classification of RNA-Seq data attempts to address differential expression testing, that is, univariate statistical testing on an individual gene basis. These attempts typically model RNA-Seq data via negative binomial [14],[15] and Poisson distributions [16]. In addition, network inference has been attempted using a hierarchical Poisson log-normal model [17], and clustering of RNA-Seq data points has utilized various approaches [18],[19]. However, in clinical settings one is often interested in sample classification: the problem of classifying the RNA-Seq data from unlabeled patients using a set of labeled training data. One of the few RNA-Seq-specific attempts towards this goal uses a Poisson modeling assumption with independent features [20]. The Poisson model is completely parameterized by its mean and thus is known to exhibit problems in fitting RNA-Seq data due to the overdispersion typically observed in such datasets.

In this paper, we focus on modeling the pipeline that starts with extracting the gene concentrations from the biological samples and their subsequent processing by the sequencing instrument [21]. This is accomplished using a hierarchical, multivariate Poisson model (MP). Specifically, gene concentration levels are modeled by a log-normal distribution and the sequencing instrument sampling of those is modeled via a Poisson process. This allows us to accurately model the RNA-Seq data overdispersion as demonstrated by marginal variance calculations and posterior predictive model diagnostics in Section ‘Overdispersion’. In addition, this hierarchical model allows for inferring any covariance structure observed between the features.

Whereas Dalton and Dougherty have presented a computational method for nonlinear classifiers in the Gaussian model [8], this still depends upon conjugate priors. In this work, we remove the constraints imposed by the requirement of a closed-form solution by developing the optimal Bayesian classifier using a Markov-chain-Monte-Carlo (MCMC) methodology. This provides a computational framework for calculating the OBC for any parameterized class conditional-density and any prior distribution. Most notably, this allows us to use distributions designed to closely model particular datasets and a prior distribution of any form to improve classification performance in small-sample settings, in particular, for RNA-Seq-based classification.

## Methods

### Notation

Throughout, we use capital letters to indicate random variables, lower case letters to indicate individual realizations of random variables or indices, bold latin characters for observed vectors, and Greek letters for latent features and parameters. We write *p*(**X**) as the probability measure over the random variable **X**. *p*(**X**) may be a probability mass function, probability density function, or arbitrary probability measure. *p*(**x**|*y*) denotes the conditional probability *p*(**X**=**x**|*Y*=*y*). Similarly, following Bayesian convention, we write parameterized distributions by conditioning on the parameter, for instance, *p*(**X**|*Y*,*θ*), and posterior expectations by conditioning on the sample, such as *E*[**X**|*Y*,*S* _{
n
}], where *S* _{
n
} and all other values are defined in Section ‘Review of optimal Bayesian classification’. If it is unclear which density an expectation is taken with respect to, then we denote it in subscript notation, such as ${E}_{\theta |{S}_{n}}\left[\xb7\right],$ where the expectation is taken with respect to the density *p*(*θ*|*S* _{
n
}).

### Review of optimal Bayesian classification

Binary classification considers a set of *n* labeled training data points, ${S}_{n}={\left\{(\mathbf{x},y)\right\}}_{1}^{n}$, where *y*∈{0,1} is the class label and $\mathbf{x}\in \mathcal{X}$ is the feature vector over a feature space . An example of binary classification in a clinical setting might include class 0 and 1 being two types of cancers, or normal and cancerous tissues. Available features would then be the gene or genes that will eventually be used in the designed classifier to assign this label. The feature space would be the set of possible gene expression measurements for all genes in the feature vector. The labeled training data *S* _{
n
} would be the set of gene expression measurements from samples which had undergone further testing (possibly observation with the passage of time, cell culturing, or more invasive followup procedures) to identify the type or malignancy of the tissue. Using *S* _{
n
}, we design a classifier *ψ* that hopefully performs well on the unknown joint feature-label distribution *p*(**X**,*Y*). In the same clinical example, the classifier *ψ* could then identify the type of cancer using gene expression measurements alone.

By parameterizing this unknown joint distribution in a model-based Bayesian framework one can derive an optimal Bayesian classifier (OBC) that minimizes the expected error over the space of all classifiers under assumed forms of the class-conditional densities. Specifically, under Gaussian and multinomial class-conditional densities and their corresponding conjugate prior distributions, closed-form solutions for the OBC [10],[11] and the first two moments of the error estimate conditioned on the sample [12],[13] have been obtained.

The parameterization of the feature-label distribution consists of the marginal class probability *c* and the class-conditional densities *p*(**x**|*y*,*θ* _{
y
}), where a particular value *θ* _{
y
}∈*Θ* _{
y
} specifies a single class-conditional density contained in the class of densities defined over the space *Θ* _{
y
}, which will be a Cartesian product as described in Section ‘The multivariate poisson model’. Therefore, for a two-class problem, we specify a parameterized joint feature-label distribution as *θ*=(*c*,*θ* _{0},*θ* _{1})∈*Θ*=[0,1]×*Θ* _{0}×*Θ* _{1}. In the Bayesian classification framework, these values are then treated as random variables, so that we may consider quantities such as the expectation of *c*, or another random variable conditioned on the value of the parameter vector *θ*.

*c*and the class-conditional parameters

*θ*

_{ y }. In addition, the posterior of

*c*is assumed known throughout the tree. Figure 1 demonstrates a primary benefit of the Bayesian approach to classification. Once we obtain the posterior distribution of the class-conditional parameters, it is straightforward to calculate many relevant quantities through appropriately crafted conditional expectations. In this paper we demonstrate how to approximate any quantity in the tree for arbitrary class conditional densities and arbitrary prior distributions.

We now examine the tree in more detail. Starting at the far left of the tree, *p*(*θ*|*S* _{
n
}) is the posterior distribution of the parameterized feature-label distribution – posterior to the labeled samples in *S* _{
n
}. Typically, error estimates and the optimal classifier are our primary interest, so that this posterior distribution is traditionally used as a means to compute other quantities and is not of interest by itself.

*effective class-conditional density*is the marginal predictive posterior of the feature vector

**X**conditioned

*S*

_{ n }and the class variable

*Y*,

It gives the distribution of the feature vector using a weighted average over all the parameterized class-conditional densities in *Θ* _{
y
} given a class *y*. The weights in this expectation are the posterior, *p*(*θ* _{
y
}|*S* _{
n
}), evaluated at each *θ* _{
y
}.

*ψ*is

*ε*=

*p*(

*ψ*(

**X**)≠

*Y*). Given the sample data

*S*

_{ n },

*ε*is a random unknown quantity in the Bayesian framework. The MMSE estimate given in [12] can be written as

where **I** _{
A
} is the indicator function for event *A*, $\u0109=E\left[c|{S}_{n}\right]$ is the posterior expectation of *c*, *R* _{
y
} is the region of the feature space the classifier predicts to be class *y*, is the feature space, and *ε* _{
y
}(*θ* _{
y
},*ψ*) is the error of classifier *ψ* contributed by class *y* on the fixed distribution *θ* _{
y
}.

where *p*(*ε*|*θ*) is the true error for a fixed feature-label distribution and fixed classifier. We denote this deterministic function by *ε*(*θ*,*ψ*). As shown in Figure 1, the MMSE estimate and the sample conditioned MSE for this error can also be calculated using the first two moments of the error distribution.

### Conditional error estimator

*ε*as a random variable, one can similarly derive its posterior distribution by conditioning on the feature vector:

which is different than the derivation of the same quantity in (3).

*conditional error estimator,*which we define as the MMSE estimate of the classification error conditioned on the feature vector

**x**,

*Z*is a normalizing constant given by

In addition to being useful in the above alternative derivation of the classifier’s error posterior, the conditional error estimate has other practical applications. When classifying an unlabeled data point, we would like to estimate the error of the classifier output for that particular data point, as opposed to the overall error estimate for the classifier.

In sum, using the effective class-conditional densities and the posterior marginal probabilities one can calculate conditional error estimates for points in the feature space in addition to the earlier quantities described.

### The multivariate poisson model

With the widespread use of next-generation sequencing techniques, classification approaches must be developed to account for the discrete nature of the mapped sequence data and to accommodate the various types of prior information available regarding these experiments.

*X*

_{i,j}reads for sample point

*i*and gene

*j*. We model this as

where *λ* _{i,j} is the location parameter of the log-normal distribution for sample *i* and gene *j*, and *d* _{
i
} is a variable accounting for the sequencing depth as determined by the sequencing process [21]. For each *i*, we model the location parameter vector *λ* _{
i
} with a multivariate Gaussian distribution, *λ* _{
i
}∼Normal(*μ*,*Σ*). We then consider the mean *μ* and covariance *Σ* of the gene concentrations as independent quantities for each class *y*.

*y*is parameterized by

*θ*

_{ y }=(

*μ*,

*Σ*,

**d**,

**λ**), where

**d**=(

*d*

_{1},…,

*d*

_{ n }) and

**λ**=(

*λ*

_{i,j}),

*i*=1,2,…

*n*,

*j*=1,2,…,

*D*, for

*n*sample points and

*D*total genes. Therefore, ${\theta}_{y}\in {\Theta}_{y}={\mathbb{R}}^{D}\times {\mathbb{R}}^{D\times D}\times {\mathbb{R}}^{n}\times {\mathbb{R}}^{D\times n}$. The feature-label distribution parameterization for the two-class problem is then given by

*θ*=(

*c*,

*θ*

_{0},

*θ*

_{1}), where

*c*=

*p*(

*Y*=0), the prior probability for class 0.

where each element of *μ* _{
y
} is distributed according to a univariate Gaussian. Unless otherwise stated, *η* is the *D* dimensional zero vector, *ν* ^{2}=25, *κ*=10, and *S*=(*κ*−1−*D*)*I* _{
D
}. For computational and identifiability reasons, **d** is fixed to be a vector of normalization constants in order to match the different sequencing depths across all the samples. In practice, **d** can be approximated by an upper quartile normalization, which has been shown to be effective [27].

In any Bayesian approach the choice of prior affects the results, especially when only a few data points are given. In the case of MMSE classifier error estimation in the Bayesian framework, robustness to incorrect modeling assumptions has been extensively studied in [7] and in those studies performance held up well for various kinds of incorrect modeling assumptions. Robustness of optimal Bayesian classifiers to false modeling assumptions was extensively studied in [11]. Again, good robustness was exhibited. Of course, one can get bad small-sample results by intentionally selecting an inaccurate prior. In general, if one is confident in his knowledge, then a tight prior is called for because tighter priors require less data for good performance; on the other hand, when one is not confident, then prudence calls for a less informative prior. As proven in [11], OBC classification is consistent under very general conditions; however, a prior whose mass is concentrated far away from the true parameters will perform worse than one that is non-informative. These issues have been extensively discussed in the Bayesian literature [9],[28]-[30]. In the end, performance is the measure of worth and our results with synthetic and real data indicate solid performance for the modeling approach used herein.

### Overdispersion

*conditionally*Poisson in equation 8, the observed read counts are not

*marginally*Poisson distributed. To demonstrate this, consider a one-dimensional simplification of the MP model in which

*X*is the number of reads observed,

*λ*is the log of the RNA concentration, and

*X*,

where *μ* and *σ* ^{2} are the mean and variance of the log of the concentration. Therefore, when *σ* ^{2}>0, the marginal variance of *X* is always greater than that of a Poisson random variable with the same effective rate.

*T*, we compute the p-value by comparing the test statistic on the true data

*T*(

*S*

_{ n }) and the value of the statistic averaged across the posterior predictive distribution

*T*(

*x*

^{ r e p }), where

*x*

^{ r e p }∼

*p*(

*x*|

*S*

_{ n }):

where *x* ^{r e p(s)} are Monte Carlo samples taken from the posterior predictive distribution *p*(*x*|*S* _{
n
}) using the *M* Monte Carlo samples from the posterior distribution of *θ* as described in Section ‘Computation’. The term (0.5)Pr(*T*(*x* ^{
r
e
p
})=*T*(*S* _{
n
})|*S* _{
n
}) is necessary due to the discrete nature of RNA-Seq data. P-values away from 0 and 1 indicate that the model posterior produces test statistics both above and below that measured on the real data.

We also consider where the real test statistic falls in relation to credible intervals of the test statistic to consider the magnitude of any differences. We apply the inter-quartile distance test statistic to provide a measure of the MP model’s ability to fit the dispersion of RNA-Seq data. We also consider several other test quantities in the Additional file 1: Table S1-S5.

### Prior calibration using discarded features

Since designed classifiers typically use very few of the totality of observed genes, only a small fraction of the data is used for classifier design. Similarly to [8], we can use the discarded features to calibrate the inverse-Wishart prior for our MP OBC. Our goal is to obtain hyperparameters *S*,**m**,*κ*, and *ν* ^{2} for each class from our training data *S* _{
n
}. In general, we do not expect the discarded features to give us information about any particular genes and the specific covariances between genes, so we make the simplifying assumptions that we learn information from the discarded genes in an aggregate sense. Thus, we consider the following structure on the hyperparameters: **m**=*m*[1,1,…,1]^{
T
} and

where $m\in \mathbb{R}$, *σ* ^{2}>0, and −1≤*ρ*≤1. For each class, we need to determine values for five scalar quantities: *m*,*ν* ^{2},*σ* ^{2},*ρ*, and *κ*.

*μ*and

*Σ*. Typically, the number of discarded features

*F*is much larger than the dimensionality

*D*of the classification problem. Therefore, due to computation time, we uniformly sample

*F*

_{ s }pairs of features from

*F*and average the resulting runs rather than using all or large groups of discarded features in a single MCMC run. We use the following procedure (for the complete algorithm, see Additional file 1):

- 1.
For each randomly chosen discarded feature pair (

*s*in total): - (a)
Obtain MCMC samples using the feature pair as data and flat priors.

- (b)
Record posterior averages of

*μ*and*Σ*.

- 2.
- 3.
Using the resulting five hyperparameter estimates, run the final MCMC for classification.

*κ*via

*μ*directly from our posterior, we obtain

*F*

_{ s }discarded feature pairs. For the

*i*-th feature pair we obtain the posterior means ${\hat{\mu}}_{1}^{\left(i\right)},{\hat{\Sigma}}_{11}^{\left(i\right)},$ and ${\hat{\Sigma}}_{12}^{\left(i\right)}$ and then average:

We substitute the estimates from Equations (15)-(19) back into Equations (9)-(14) to obtain the final hyperparameter estimates.

One must keep in mind that the calibration procedure explicitly assumes the MP model. Hence, one can only expect an improvement in the classification performance if the data follow the MP model.

### Computation

where ${\theta}_{y}^{\left(i\right)}$ are *M* samples of *θ* _{
y
} from the model posterior distributions.

For clarity of presentation, we do not consider the class variable *y*, and we assume a single class. We do this because the computation can be performed per-class due to the assumed independence between the classes and the marginal probability, *p*(*c*,*θ* _{0},*θ* _{1})=*p*(*c*)*p*(*θ* _{0})*p*(*θ* _{1}).

*θ*using the Metropolis Hastings MCMC algorithm we define a proposal distribution

*p*(

*θ*

^{′}|

*θ*) to obtain a new value for the class parameters

*θ*

^{′}from the old values

*θ*. We then calculate the acceptance ratio

under the assumption of a symmetric proposal distribution (*p*(*θ* ^{′}|*θ*)=*p*(*θ*|*θ* ^{′})). The process of proposing and accepting samples from this distribution with the probability *R* induces a Markov chain. Positivity of the proposal distribution (*p*(*θ* ^{′}|*θ*)>0 for any *θ*) is a sufficient condition for ergodicity of this Markov chain. Furthermore, this Markov chain admits a steady-state distribution equal to our desired posterior distribution *p*(*θ*|*S* _{
n
}) [34].

*p*(

**x**

_{ i }|

*θ*)=

*p*(

**x**

_{ i }|

*λ*

_{ i }) owing to conditional independence. From the definition of the prior,

where, *p*(**x** _{
k
}|*λ* _{
k
})∼Poisson(*d* _{
k
} exp(*λ* _{
k
})), *λ*∼ Normal (*μ*,*Σ*), $\Lambda ={\mathbb{R}}^{n\times D}$, and the *λ* ^{(g)} are *T* vector-valued samples drawn from the appropriate class’s posterior distribution used to approximate the inner intractable integral. In addition, we use this approximation of the effective class-conditional density to calculate the conditional error estimates of (7) in a pointwise fashion.

*c*, the posterior expectation takes the closed form

where the *n* _{
y
} are the number of training samples obtained from class *y* and the *α* _{
y
} are hyperparameters set to 1 for an uninformative prior. Conjugacy was used for this one parameter because the increased flexibility of the full sampling approach was deemed not necessary due to the constrained, univariate nature of the parameter. If more complex relationships between *c* and other parameters were desired, then a sampling approach using non-conjugate priors would be straightforward to implement.

### Synthetic data

With these parameters, ten global, twenty heterogeneous, and ten non-marker features are generated. Then four features are randomly chosen to represent a mixture of features of various classification quality. Following [21], the features in the data are zero mean and unit standard deviation normalized except for the MP OBC. The exception occurs because the MP model expects features to be positive integers and normalization is not necessary. The discarded features are used for calibration of the MP OBC priors, and 3000 samples are generated from each class to estimate the true classification rate for each classifier.

We use four features in this synthetic data classification study owing to limited computational resources as discussed in Section ‘Computational limitations’.

The synthetic data generation method proposed in [35] imposes the strong assumption of a homogeneous covariance (HC) structure between the two classes of data. This assumption does not hold for biological situations where interactions between features are not necessarily preserved between classes, and this occurs frequently in biology when considering the possible effects of canalizing genes, nonlinear gene regulation, and mutations in the case of cancer [36],[37]. Specifically, if the canalizing gene is not observed, and differs in activity between the two classes, then the measured correlation between two downstream genes could potentially be negligible in one class while strong in the other class. Similarly, for highly nonlinear gene regulation, if a gene in one class is in the saturation region of its response curve from a master gene, then the correlation will be low, while a lower expression level in the other class would allow for a large measured correlation with the same canalizing gene. And finally, if one class represents normal gene expression and the other tumor-related expression, then a correlation might exist from a functioning pathway in the normal tissue, but a mutation could result in a lack of correlation effects in the tumor.

Hence, we modify the synthetic data generation procedure in an attempt to produce synthetic datasets more representative of such nonlinear phenomena in biology. In this modified procedure, we allow independent covariance (IC) matrices for the features of the two classes. To generate these covariance matrices, *Σ* _{
y
}, we utilize independent draws from inverse-Wishart distributions with parameters *κ* _{
y
}=22,*D*=20, and scale matrix $S={\sigma}_{y}^{2}$ (*κ*−1−*D*)*I* _{
D
}. To examine the effects of feature correlation in IC datasets, we can also generate low-correlation covariance matrices by zeroing the off-diagonal terms. Once the covariance matrix for class *y* is obtained, location parameters for gene-expression values for each sample point are drawn from the respective multivariate normal distribution *λ* _{
y
}∼*N*(*μ* _{
y
},*Σ* _{
y
}). Each sample point is then assumed to be normalized through an upper quartile or other suitable method, but in practice any sample-based normalization is imperfect. We reflect this variation by drawing the sequencing depth *d* _{
i
} from a Uniform(*d* _{low},*d* _{high}) distribution, giving the rate of the Poisson process as *d* _{
i
} exp*λ* _{
i
}. The number of reads for a single gene from a single sample is then drawn from this Poisson distribution. See Additional file 1 for more detail.

The OBC is optimal on average across the space of distributions determined by its prior distributions. To avoid biasing the performance comparison, we draw the classification problem datasets using different distributions than those of the OBC priors. See Additional file 1 for more detail.

### Real data

*c*and therefore

*c*must be known in advance. Based upon records from 2006-2010, we have a very accurate estimate, 48,600/141,300≈0.34 [38]. Whereas we can use the value of

*c*directly, along with all of the data, in designing the OBC, for classification rules that do not use

*c*explicitly, the separately sampled data must be maximally subsampled to the proper sampling ratio

*c*before applying the classification rule [39]. This means that for

*N*

_{ trn }desired samples, the sample subsets will contain round((1−

*c*)

*N*

_{ trn }) and round(

*c*

*N*

_{ trn }) for class 0 and 1, respectively. Moreover, holdout error estimation, which we use here, must be properly adapted for separate sampling for all design methods, including the OBC. The holdout estimate is given by

where ${\widehat{\epsilon}}_{0}$ and ${\widehat{\epsilon}}_{1}$ are the ordinary holdout estimators (performed on all remaining data samples not used for training) for the class-0 and class-1 errors, respectively [39]. We note that many studies have made the mistake of using classification rules designed for random sampling when sampling is separate. This can have devastating effects on classifier performance [39].

While averaging over sample subsets for holdout error estimation, we also average over uniformly, randomly selected gene subsets of size 4. This sampling occurs from low (1-10 average reads per gene) expression genes. We sample from these lower expression genes because we are ultimately interested in classification problems where the delineation between phenotypes is determined by genes with low expression. We used 10,000 for averaging in order to obtain a large enough sample over this feature and sample subset space to achieve repeatable results (data not shown). Computational runtime for each sample and gene subset was similar to the synthetic data.

## Results and discussion

The Additional file 1 contains a simple two-class, two-feature demonstration of the overall procedure to allow for easy visualization and interpretation. Here we discuss the results for the synthetic and real data.

### Synthetic data

In the case of independent-covariance data with highly correlated features, Figure 3b shows superior classification performance of the MP OBC at nearly all sample sizes considered. In addition, for calibrated prior distributions, the performance of the MP OBC improves. This improvement is greater when the sample sizes are small, which demonstrates the importance of additional knowledge (through discarded features) when data are expensive to obtain or not readily available.

The superior performance of the OBC relative to LDA, 3NN, and SVM in Figure 3b is on account of classification optimization relative to the model, which characterizes prior information. To further investigate OBC improvement, we again considered heterogeneous covariance matrices but with independent features to determine if there is any difference in the relative performance between the classifiers. In fact, the results provided in Figure 3c show identical relative performance to the error curves in Figure 3b, thereby indicating that both the standard classifiers and the OBC, relative performance (at least in the case considered) is not affected by whether or not the features are correlated. Indeed, comparing Figure 3a with Figures 3b and 3c, we see that the relative performance of SVM, MP OBC, and calibrated MP OBC is the same in both the homogeneous and heterogeneous models. The switch in relative performance between LDA and 3NN between Figure 3a and Figures 3b and 3c is not surprising because LDA is optimal for a fixed (known) homogeneous Gaussian model but not for a heterogeneous Gaussian model.

The larger overall classification errors in Figure 3a as compared to Figures 3b and 3c are due to the different covariance matrices generated by the HC and IC models. Each model required different generating distributions for {*σ* _{
y
},*ρ*} and {*S*,*κ*} for the HC and IC cases, respectively, and the particular choices made in Section ‘Synthetic data’ resulted in larger dispersions and higher errors in the HC models than the IC models. To demonstrate this, we tested LDA with 1000 training and testing samples across 1000 random generating distributions and found the average HC classification error to be 0.41 and the IC error to be 0.32. This is despite LDA being optimal for homogeneous, fixed, known Gaussian cases and sub-optimal for heterogeneous, fixed, known Gaussian cases, where the former is similar to the HC case.

Still using independent-covariance data, we fixed the mean of class 0 at *μ* _{0}=0.0 in Figure 3d, and varied *μ* _{1} from 0.0 to 1.0 to make the classification problem harder and easier, respectively. Across this range of classification problems, the MP OBC had better classification performance than the other classification methods. In addition, calibrated priors improved performance further, especially for harder classification problems.

### Real data

**Posterior predictive model diagnostics are given for 10 randomly selected genes from adenocarcinoma TCGA samples**

Gene ID | IQR ( S | 95% int. for IQR ( x | p-value |
---|---|---|---|

UPK1A|11045 | 2.12 | [1.0, 3.0] | 0.09 |

OR4P4|81300 | 0.00 | [0.0, 0.0] | 0.50 |

PCDHA12|56137 | 139.22 | [107.8, 187.0] | 0.54 |

MDS2|259283 | 1.85 | [2.0, 5.0] | 1.00 |

AXIN2|8313 | 347.69 | [331.5, 439.3] | 0.85 |

DYNLT1|6993 | 848.41 | [830.0, 1043.3] | 0.90 |

RARA|5914 | 786.43 | [706.8, 881.3] | 0.62 |

TMEM194A|23306 | 396.06 | [367.0, 471.3] | 0.76 |

AGPS|8540 | 496.45 | [505.8, 636.5] | 0.97 |

NLRP2|55655 | 854.47 | [381.3, 677.5] | 0.00 |

### Computational limitations

The results in Figure 3 and Figure 4 required tens of thousands of MCMC runs. Owing to limited available computational resources, we could only allocate around 30 seconds on a single CPU core for each MCMC run. This necessitated using only four genes for these classification results as each iteration of the MCMC procedure has time complexity of *O*(*D* ^{3}), where *D* is the number of features. In practice, one would have a small number of data sets and could use parallel computing to devote more time and computing effort for the classification. For example, in timescales on the order of hours on a typical workstation, we have successfully performed classification using 50 genes.

The other classification methods compared in this study have smaller computational requirements and can correspondingly handle larger numbers of features given the same available resources. However, for the small sample sizes often available in biology, 50 genes is typically beyond the “peaking” point where most classifiers decrease in classification performance as more features are added (for a fixed number of training samples) [40]. Incidentally, the OBC does not suffer this “peaking phenomenon” as shown in [10].

In addition, the computational time requirements of classification is typically not a bottleneck in translational medicine given the timescales used in collecting biological data. In these settings, the accuracy of classification is much more valuable than rapid runtimes, and this is the primary advantage of the computational OBC framework proposed in this paper.

## Conclusions

We have demonstrated that Bayesian classification can be applied to specific problem domains such as RNA-Seq through statistical modeling and MCMC computation. The resulting classifier provides superior classification performance compared to state-of-the-art classifiers such as SVM with a radial basis kernel. Although we have not discussed error estimation – our interest in the present paper being classification, *ipso facto*, the MCMC approach to optimal Bayesian classification can be applied, via [6],[7] and [12],[13], to obtain optimal MMSE error estimators for any classification rule and sample-conditioned evaluation of the MSE for error estimation.

Future work includes examining the normalization parameter *d* and determining if additional performance improvements can be made by considering the distribution over *d* rather than transforming the original data through the process of data normalization. Additionally, more efficient computational techniques could be used to allow for larger feature sizes, including program optimization and utilizing structure in the feature covariance to reduce the size of the parameter space.

## Funding

This work was supported in part by the National Institute of Health grant U01CA162077, the NIEHS Center for Translational Environmental Health Research (CTEHR) P30ES023512, and the Center for Nonlinear Studies at the Los Alamos National Laboratory.

## Additional file

## Declarations

### Acknowledgements

The authors acknowledge the Whole Systems Genomics Initiative (WSGI) for providing computational resources and systems administration support for the WSGI HPC Cluster. In addition, code in this paper utilizes the Python Scikit-learn library [41].

## Authors’ Affiliations

## References

- Braga-Neto UM, Dougherty ER: Is cross-validation valid for small-sample microarray classification? . Bioinformatics. 2004, 20 (3): 374-380.View ArticlePubMedGoogle Scholar
- Hanczar B, Hua J, Dougherty ER: Decorrelation of the true and estimated classifier errors in high-dimensional settings . EURASIP J Bioinformatics Syst Biol. 2007, 2007: 2-Google Scholar
- Hanczar B, Dougherty ER: On the comparison of classifiers for microarray data . Curr Bioinformatics. 2010, 5 (1): 29-39.View ArticleGoogle Scholar
- Hanczar B, Dougherty ER: The reliability of estimated confidence intervals for classification error rates when only a single sample is available . Pattern Recognit. 2013, 46 (3): 1067-1077. doi:10.1016/j.patcog.2012.09.019,View ArticleGoogle Scholar
- Dougherty ER, Zollanvari A, Braga-Neto UM: The illusion of distribution-free small-sample classification in genomics . Curr Genomics. 2011, 12 (5): 333-View ArticlePubMed CentralPubMedGoogle Scholar
- Dalton LA, Dougherty ER: Bayesian minimum mean-square error estimation for classification error – part i: definition and the bayesian mmse error estimator for discrete classification . Signal Process IEEE Trans. 2011, 59 (1): 115-129.View ArticleGoogle Scholar
- Dalton LA, Dougherty ER: Bayesian minimum mean-square error estimation for classification error – part ii: the bayesian mmse error estimator for linear classification of gaussian distributions . IEEE Trans Signal Process. 2011, 59: 130-144.View ArticleGoogle Scholar
- Dalton LA, Dougherty ER: Application of the bayesian mmse estimator for classification error to gene expression microarray data . Bioinformatics. 2011, 27 (13): 1822-1831.View ArticlePubMedGoogle Scholar
- Esfahani MS, Dougherty ER: Incorporation of biological pathway knowledge in the construction of priors for optimal bayesian classification . Comput Biol Bioinformatics IEEE/ACM Trans. 2014, 11: 202-218. doi:10.1109/TCBB.2013.143,View ArticleGoogle Scholar
- Dalton LA, Dougherty ER: Optimal classifiers with minimum expected error within a bayesian framework – part i: Discrete and gaussian models . Pattern Recognit. 2013, 46 (5): 1301-1314. doi:10.1016/j.patcog.2012.10.018,View ArticleGoogle Scholar
- Dalton LA, Dougherty ER: Optimal classifiers with minimum expected error within a bayesian framework – part ii: Properties and performance analysis . Pattern Recognit. 2013, 46 (5): 1288-1300. doi:10.1016/j.patcog.2012.10.019,View ArticleGoogle Scholar
- Dalton LA, Dougherty ER: Exact sample conditioned mse performance of the bayesian mmse estimator for classification error – part i: representation . Signal Process IEEE Trans. 2012, 60 (5): 2575-2587.View ArticleGoogle Scholar
- Dalton LA, Dougherty ER: Exact sample conditioned mse performance of the bayesian mmse estimator for classification error – part ii: consistency and performance analysis . Signal Process IEEE Trans. 2012, 60 (5): 2588-2603.View ArticleGoogle Scholar
- Anders S, Huber W: Differential expression analysis for sequence count data . Genome Biol. 2010, 11 (10): 106-View ArticleGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK: edger: a bioconductor package for differential expression analysis of digital gene expression data . Bioinformatics. 2010, 26 (1): 139-140.View ArticlePubMed CentralPubMedGoogle Scholar
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: Rna-seq: an assessment of technical reproducibility and comparison with gene expression arrays . Genome Res. 2008, 18 (9): 1509-1517.View ArticlePubMed CentralPubMedGoogle Scholar
- Gallopin M, Rau A, Jaffrézic F: A hierarchical poisson log-normal model for network inference from rna sequencing data . PloS one. 2013, 8 (10): 77503-View ArticleGoogle Scholar
- Si Y, Liu P, Li P, Brutnell TP: Model-based clustering for rna-seq data . Bioinformatics. 2014, 30 (2): 197-205.View ArticlePubMedGoogle Scholar
- Rau A, Celeux G, Martin-Magniette M-L, Maugis-Rabusseau C: Clustering high-throughput sequencing data with poisson mixture models. [Research Report] RR-7786. 2011, pp.36. <inria-00638082>.Google Scholar
- Witten DM: Classification and clustering of sequencing data using a poisson model . Ann Appl Stat. 2011, 5 (4): 2493-2518.View ArticleGoogle Scholar
- Ghaffari N, Youse MR, Johnson CD, Ivanov I, Dougherty ER: Modeling the next generation sequencing sample processing pipeline for the purposes of classification . BMC Bioinformatics. 2013, 14 (1): 307-View ArticlePubMed CentralPubMedGoogle Scholar
- Duda RO, Hart PE, Stork DG: Pattern Classification, Hoboken, NJ: John Wiley & Sons; 2012.Google Scholar
- Bengtsson M, Ståhlberg A, Rorsman P, Kubista M: Gene expression profiling in single cells from the pancreatic islets of langerhans reveals lognormal distribution of mrna levels . Genome Res. 2005, 15 (10): 1388-1392.View ArticlePubMed CentralPubMedGoogle Scholar
- Attoor S, Dougherty ER, Chen Y, Bittner ML, Trent JM: Which is better for cdna-microarray-based classification: ratios or direct intensities . Bioinformatics. 2004, 20 (16): 2513-2520.View ArticlePubMedGoogle Scholar
- Lindley DV: A statistical paradox . Biometrika. 1957, 44 (1/2): 187-192.View ArticleGoogle Scholar
- Shafer G: Lindley’s paradox . J Am Stat Assoc. 1982, 77 (378): 325-334.View ArticleGoogle Scholar
- Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaëffer B, Le Crom S, Guedj M, Jaffrézic F: A comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis . Brief Bioinform. 2013, 14 (6): 671-683. doi:10.1093/bib/bbs046,View ArticlePubMedGoogle Scholar
- Jaynes ET: Prior probabilities . Syst Sci Cybernet IEEE Trans. 1968, 4 (3): 227-241.View ArticleGoogle Scholar
- Jeffreys H: An invariant form for the prior probability in estimation problems . Proc R Soc Lond A Math Phys Sci. 1946, 186 (1007): 453-461.View ArticlePubMedGoogle Scholar
- Berger JO, Bernardo JM: On the development of reference priors . Bayesian Stat. 1992, 4 (4): 35-60.Google Scholar
- Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB: Bayesian Data Analysis, Boca Raton, FL: CRC Press; 2013.Google Scholar
- Carrasco M, Florens J-P: Simulation-based method of moments and efficiency . J Bus Econ Stat. 2002, 20 (4): 482-492.View ArticleGoogle Scholar
- Hansen LP, Singleton KJ: Generalized instrumental variables estimation of nonlinear rational expectations models . Econometrica: J Econometric Soc. 1982, 50 (2): 1269-1286.View ArticleGoogle Scholar
- Gilks WR, Richardson S, Spiegelhalter DJ: Markov Chain Monte Carlo in Practice, vol.2, Boca Raton, FL: CRC Press; 1996.Google Scholar
- Hua J, Tembe WD, Dougherty ER: Performance of feature-selection methods in the classification of high-dimension data . Pattern Recognit. 2009, 42 (3): 409-424.View ArticleGoogle Scholar
- Martins DC, Braga-Neto UM, Hashimoto RF, Bittner ML, Dougherty ER: Intrinsically multivariate predictive genes . Selected Topics Signal Process IEEE J. 2008, 2 (3): 424-439.View ArticleGoogle Scholar
- Dougherty ER, Brun M, Trent JM, Bittner ML: Conditioning-based modeling of contextual genomic regulation . Comput Biol Bioinformatics IEEE/ACM Trans. 2009, 6 (2): 310-320.View ArticleGoogle Scholar
- Ries LAG, Melbert D, Krapcho M, Stinchcomb DG, Howlader N, Horner MJ, Mariotto A, Miller BA, Feuer EJ, Altekruse SF, Lewis DR, Clegg L, Eisner MP, Reichman M, Edwards BK: Seer cancer statistics review, 1975-2005, Bethesda, MD: National Cancer Institute; 2008.Google Scholar
- Esfahani MS, Dougherty ER: Effect of separate sampling on classification accuracy . Bioinformatics. 2014, 30 (2): 242-250.View ArticleGoogle Scholar
- Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER: Optimal number of features as a function of sample size for various classification rules . Bioinformatics. 2005, 21 (8): 1509-1515.View ArticlePubMedGoogle Scholar
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E: Scikit-learn: machine learning in Python . J Mach Learn Res. 2011, 12: 2825-2830.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.