 Research Article
 Open Access
 Published:
MCMC implementation of the optimal Bayesian classifier for nonGaussian models: modelbased RNASeq classification
BMC Bioinformatics volume 15, Article number: 401 (2014)
Abstract
Background
Sequencing datasets consist of a finite number of reads which map to specific regions of a reference genome. Most effort in modeling these datasets focuses on the detection of univariate differentially expressed genes. However, for classification, we must consider multiple genes and their interactions.
Results
Thus, we introduce a hierarchical multivariate Poisson model (MP) and the associated optimal Bayesian classifier (OBC) for classifying samples using sequencing data. Lacking closedform solutions, we employ a Monte Carlo Markov Chain (MCMC) approach to perform classification. We demonstrate superior or equivalent classification performance compared to typical classifiers for two synthetic datasets and over a range of classification problem difficulties. We also introduce the Bayesian minimum mean squared error (MMSE) conditional error estimator and demonstrate its computation over the feature space. In addition, we demonstrate superior or leading class performance over an RNASeq dataset containing two lung cancer tumor types from The Cancer Genome Atlas (TCGA).
Conclusions
Through modelbased, optimal Bayesian classification, we demonstrate superior classification performance for both synthetic and real RNASeq datasets. A tutorial video and Python source code is available under an open source license at http://bit.ly/1gimnss.
Background
The possibility of genomic phenotype classification arose with the inception of geneexpression microarrays. From the outset, two fundamental problems have frustrated the endeavor: (1) the inaccuracy of microarray measurements, and (2) small samples. Our particular application of interest is classification using RNASeq data. Modern RNASeq technologies sequence small RNA fragments (mRNA) to measure gene expression, where the number of reads mapped to a gene on the reference genome defines the count data. Given that RNASeq data has advantages over microarray data, in particular, more accurate measurement, we still confront the second fundamental problem, which is statistical, not technological: small samples cause resamplingbased classifier error estimators to be very inaccurate due to excessive variance and lack of regression with the true error [1][4]. Since the error rate of a classifier quantifies its predictive accuracy, it is the salient epistemological attribute of any classifier. The inability to satisfactorily estimate the error with modelfree methods with small samples implies that genomic classifier error estimation is virtually impossible without the use of prior information, so that the whole smallsample classification problem becomes unapproachable in a modelfree framework [5].
The situation has been addressed by utilizing prior knowledge via a Bayesian approach that considers a prior distribution on an uncertainty class of featurelabel distributions [6],[7]. For expressionbased classification, prior distributions have been constructed using expression data not employed in classifier design [8] and known regulatory pathways [9]. Given that a prior model must be assumed to achieve satisfactory error estimation, an obvious course of action is to derive an optimal classifier based on the prior knowledge and the sample data, the result being an optimal Bayesian classifier (OBC) that is guaranteed to have the best average performance of any classifier relative to the posterior distribution derived from the prior distribution and data [10],[11]. While Bayesian classification does not depend on particular distributional forms, closedform solutions have been derived for the multinomial model and Gaussian models using linear classifiers for the minimum mean squared error (MMSE) error estimate [6],[7], the MSE of the error estimate [12],[13], and an optimal Bayesian classifier (OBC) relative to the prior distribution [10],[11], the latter being expressed in terms of effective class conditional distributions, which are expectations relative to the posterior distribution of the classconditional distributions. The closedform solutions depend on particular models (multinomial and Gaussian) and the existence of conjugate priors, which can be too constraining for practical applications such as RNASeq classification.
Much of the statistical literature concerning classification of RNASeq data attempts to address differential expression testing, that is, univariate statistical testing on an individual gene basis. These attempts typically model RNASeq data via negative binomial [14],[15] and Poisson distributions [16]. In addition, network inference has been attempted using a hierarchical Poisson lognormal model [17], and clustering of RNASeq data points has utilized various approaches [18],[19]. However, in clinical settings one is often interested in sample classification: the problem of classifying the RNASeq data from unlabeled patients using a set of labeled training data. One of the few RNASeqspecific attempts towards this goal uses a Poisson modeling assumption with independent features [20]. The Poisson model is completely parameterized by its mean and thus is known to exhibit problems in fitting RNASeq data due to the overdispersion typically observed in such datasets.
In this paper, we focus on modeling the pipeline that starts with extracting the gene concentrations from the biological samples and their subsequent processing by the sequencing instrument [21]. This is accomplished using a hierarchical, multivariate Poisson model (MP). Specifically, gene concentration levels are modeled by a lognormal distribution and the sequencing instrument sampling of those is modeled via a Poisson process. This allows us to accurately model the RNASeq data overdispersion as demonstrated by marginal variance calculations and posterior predictive model diagnostics in Section ‘Overdispersion’. In addition, this hierarchical model allows for inferring any covariance structure observed between the features.
Whereas Dalton and Dougherty have presented a computational method for nonlinear classifiers in the Gaussian model [8], this still depends upon conjugate priors. In this work, we remove the constraints imposed by the requirement of a closedform solution by developing the optimal Bayesian classifier using a MarkovchainMonteCarlo (MCMC) methodology. This provides a computational framework for calculating the OBC for any parameterized class conditionaldensity and any prior distribution. Most notably, this allows us to use distributions designed to closely model particular datasets and a prior distribution of any form to improve classification performance in smallsample settings, in particular, for RNASeqbased classification.
Methods
Notation
Throughout, we use capital letters to indicate random variables, lower case letters to indicate individual realizations of random variables or indices, bold latin characters for observed vectors, and Greek letters for latent features and parameters. We write p(X) as the probability measure over the random variable X. p(X) may be a probability mass function, probability density function, or arbitrary probability measure. p(xy) denotes the conditional probability p(X=xY=y). Similarly, following Bayesian convention, we write parameterized distributions by conditioning on the parameter, for instance, p(XY,θ), and posterior expectations by conditioning on the sample, such as E[XY,S _{ n }], where S _{ n } and all other values are defined in Section ‘Review of optimal Bayesian classification’. If it is unclear which density an expectation is taken with respect to, then we denote it in subscript notation, such as ${E}_{\theta {S}_{n}}\left[\xb7\right],$ where the expectation is taken with respect to the density p(θS _{ n }).
Review of optimal Bayesian classification
Binary classification considers a set of n labeled training data points, ${S}_{n}={\left\{(\mathbf{x},y)\right\}}_{1}^{n}$, where y∈{0,1} is the class label and $\mathbf{x}\in \mathcal{X}$ is the feature vector over a feature space . An example of binary classification in a clinical setting might include class 0 and 1 being two types of cancers, or normal and cancerous tissues. Available features would then be the gene or genes that will eventually be used in the designed classifier to assign this label. The feature space would be the set of possible gene expression measurements for all genes in the feature vector. The labeled training data S _{ n } would be the set of gene expression measurements from samples which had undergone further testing (possibly observation with the passage of time, cell culturing, or more invasive followup procedures) to identify the type or malignancy of the tissue. Using S _{ n }, we design a classifier ψ that hopefully performs well on the unknown joint featurelabel distribution p(X,Y). In the same clinical example, the classifier ψ could then identify the type of cancer using gene expression measurements alone.
By parameterizing this unknown joint distribution in a modelbased Bayesian framework one can derive an optimal Bayesian classifier (OBC) that minimizes the expected error over the space of all classifiers under assumed forms of the classconditional densities. Specifically, under Gaussian and multinomial classconditional densities and their corresponding conjugate prior distributions, closedform solutions for the OBC [10],[11] and the first two moments of the error estimate conditioned on the sample [12],[13] have been obtained.
The parameterization of the featurelabel distribution consists of the marginal class probability c and the classconditional densities p(xy,θ _{ y }), where a particular value θ _{ y }∈Θ _{ y } specifies a single classconditional density contained in the class of densities defined over the space Θ _{ y }, which will be a Cartesian product as described in Section ‘The multivariate poisson model’. Therefore, for a twoclass problem, we specify a parameterized joint featurelabel distribution as θ=(c,θ _{0},θ _{1})∈Θ=[0,1]×Θ _{0}×Θ _{1}. In the Bayesian classification framework, these values are then treated as random variables, so that we may consider quantities such as the expectation of c, or another random variable conditioned on the value of the parameter vector θ.
Figure 1 describes the interrelationships between the quantities of interest in the general theoretic framework of Bayesian classification. The tree shows a subset of the derivations possible from the posterior featurelabel parameter distribution to the OBC classifier and error estimates. Specifically, directed edges indicate that the child can be derived from the parent by performing the operation indicated by the edge label. Closedform solutions of the quantities highlighted in grey have been calculated for the Gaussian and multinomial featurelabel distributions [6],[7]. As in those derivations, the tree assumes independence between the marginal class probability c and the classconditional parameters θ _{ y }. In addition, the posterior of c is assumed known throughout the tree. Figure 1 demonstrates a primary benefit of the Bayesian approach to classification. Once we obtain the posterior distribution of the classconditional parameters, it is straightforward to calculate many relevant quantities through appropriately crafted conditional expectations. In this paper we demonstrate how to approximate any quantity in the tree for arbitrary class conditional densities and arbitrary prior distributions.
We now examine the tree in more detail. Starting at the far left of the tree, p(θS _{ n }) is the posterior distribution of the parameterized featurelabel distribution – posterior to the labeled samples in S _{ n }. Typically, error estimates and the optimal classifier are our primary interest, so that this posterior distribution is traditionally used as a means to compute other quantities and is not of interest by itself.
The effective classconditional density is the marginal predictive posterior of the feature vector X conditioned S _{ n } and the class variable Y,
It gives the distribution of the feature vector using a weighted average over all the parameterized classconditional densities in Θ _{ y } given a class y. The weights in this expectation are the posterior, p(θ _{ y }S _{ n }), evaluated at each θ _{ y }.
The true error of classifier ψ is ε=p(ψ(X)≠Y). Given the sample data S _{ n }, ε is a random unknown quantity in the Bayesian framework. The MMSE estimate given in [12] can be written as
where I _{ A } is the indicator function for event A, $\u0109=E\left[c{S}_{n}\right]$ is the posterior expectation of c, R _{ y } is the region of the feature space the classifier predicts to be class y, is the feature space, and ε _{ y }(θ _{ y },ψ) is the error of classifier ψ contributed by class y on the fixed distribution θ _{ y }.
We can also obtain the full posterior distribution of the error,
where p(εθ) is the true error for a fixed featurelabel distribution and fixed classifier. We denote this deterministic function by ε(θ,ψ). As shown in Figure 1, the MMSE estimate and the sample conditioned MSE for this error can also be calculated using the first two moments of the error distribution.
With the MMSE estimator defined, the optimal Bayesian classifier (OBC) is the classifier minimizing the expected error by pointwise minimization of the integral (2) [11]:
Conditional error estimator
If the true featurelabel distribution were known, then we could compute the true error of a classifier exactly as an expectation over the conditional error [22]:
Treating ε as a random variable, one can similarly derive its posterior distribution by conditioning on the feature vector:
which is different than the derivation of the same quantity in (3).
This introduces the idea of the conditional error estimator, which we define as the MMSE estimate of the classification error conditioned on the feature vector x,
as expanded through application of Bayes’ theorem, where Z is a normalizing constant given by
In addition to being useful in the above alternative derivation of the classifier’s error posterior, the conditional error estimate has other practical applications. When classifying an unlabeled data point, we would like to estimate the error of the classifier output for that particular data point, as opposed to the overall error estimate for the classifier.
For the OBC, from (4) the conditional error estimator can be written as
In sum, using the effective classconditional densities and the posterior marginal probabilities one can calculate conditional error estimates for points in the feature space in addition to the earlier quantities described.
The multivariate poisson model
With the widespread use of nextgeneration sequencing techniques, classification approaches must be developed to account for the discrete nature of the mapped sequence data and to accommodate the various types of prior information available regarding these experiments.
Gene concentration levels can be modeled using a lognormal distribution [23],[24]. As discussed in the introduction, we assume that the sequencing instrument samples this mRNA concentration through a Poisson process and obtains X _{i,j} reads for sample point i and gene j. We model this as
where λ _{i,j} is the location parameter of the lognormal distribution for sample i and gene j, and d _{ i } is a variable accounting for the sequencing depth as determined by the sequencing process [21]. For each i, we model the location parameter vector λ _{ i } with a multivariate Gaussian distribution, λ _{ i }∼Normal(μ,Σ). We then consider the mean μ and covariance Σ of the gene concentrations as independent quantities for each class y.
The entire MP model is represented in Figure 2 as a plate diagram. The distribution of a single class y is parameterized by θ _{ y }=(μ,Σ,d,λ), where d=(d _{1},…,d _{ n }) and λ=(λ _{i,j}),i=1,2,…n,j=1,2,…,D, for n sample points and D total genes. Therefore, ${\theta}_{y}\in {\Theta}_{y}={\mathbb{R}}^{D}\times {\mathbb{R}}^{D\times D}\times {\mathbb{R}}^{n}\times {\mathbb{R}}^{D\times n}$. The featurelabel distribution parameterization for the twoclass problem is then given by θ=(c,θ _{0},θ _{1}), where c=p(Y=0), the prior probability for class 0.
To ensure a proper posterior with unit integral, we place weakly informative priors over the latent variables in the MP model. In choosing these values, we have aimed to avoid the complications that can occur with overly diffuse priors, such as Lindley’s paradox [25],[26]. We choose:
where each element of μ _{ y } is distributed according to a univariate Gaussian. Unless otherwise stated, η is the D dimensional zero vector, ν ^{2}=25, κ=10, and S=(κ−1−D)I _{ D }. For computational and identifiability reasons, d is fixed to be a vector of normalization constants in order to match the different sequencing depths across all the samples. In practice, d can be approximated by an upper quartile normalization, which has been shown to be effective [27].
In any Bayesian approach the choice of prior affects the results, especially when only a few data points are given. In the case of MMSE classifier error estimation in the Bayesian framework, robustness to incorrect modeling assumptions has been extensively studied in [7] and in those studies performance held up well for various kinds of incorrect modeling assumptions. Robustness of optimal Bayesian classifiers to false modeling assumptions was extensively studied in [11]. Again, good robustness was exhibited. Of course, one can get bad smallsample results by intentionally selecting an inaccurate prior. In general, if one is confident in his knowledge, then a tight prior is called for because tighter priors require less data for good performance; on the other hand, when one is not confident, then prudence calls for a less informative prior. As proven in [11], OBC classification is consistent under very general conditions; however, a prior whose mass is concentrated far away from the true parameters will perform worse than one that is noninformative. These issues have been extensively discussed in the Bayesian literature [9],[28][30]. In the end, performance is the measure of worth and our results with synthetic and real data indicate solid performance for the modeling approach used herein.
Overdispersion
The MP model uses the Poisson distribution in a hierarchical scheme. It is important to note that, while the read counts are modeled as conditionally Poisson in equation 8, the observed read counts are not marginally Poisson distributed. To demonstrate this, consider a onedimensional simplification of the MP model in which X is the number of reads observed, λ is the log of the RNA concentration, and
Then for the marginal variance of X,
where μ and σ ^{2} are the mean and variance of the log of the concentration. Therefore, when σ ^{2}>0, the marginal variance of X is always greater than that of a Poisson random variable with the same effective rate.
In addition, by carrying out a posterior predictive model check [31], p. 143, by computing marginal posterior pvalues against real RNASeq data, we can quantitatively assess the ability of the MP model to fit the dispersion of the TCGA data. For a test statistic T, we compute the pvalue by comparing the test statistic on the true data T(S _{ n }) and the value of the statistic averaged across the posterior predictive distribution T(x ^{rep}), where x ^{rep}∼p(xS _{ n }):
where x ^{rep(s)} are Monte Carlo samples taken from the posterior predictive distribution p(xS _{ n }) using the M Monte Carlo samples from the posterior distribution of θ as described in Section ‘Computation’. The term (0.5)Pr(T(x ^{rep})=T(S _{ n })S _{ n }) is necessary due to the discrete nature of RNASeq data. Pvalues away from 0 and 1 indicate that the model posterior produces test statistics both above and below that measured on the real data.
We also consider where the real test statistic falls in relation to credible intervals of the test statistic to consider the magnitude of any differences. We apply the interquartile distance test statistic to provide a measure of the MP model’s ability to fit the dispersion of RNASeq data. We also consider several other test quantities in the Additional file 1: Table S1S5.
Prior calibration using discarded features
Since designed classifiers typically use very few of the totality of observed genes, only a small fraction of the data is used for classifier design. Similarly to [8], we can use the discarded features to calibrate the inverseWishart prior for our MP OBC. Our goal is to obtain hyperparameters S,m,κ, and ν ^{2} for each class from our training data S _{ n }. In general, we do not expect the discarded features to give us information about any particular genes and the specific covariances between genes, so we make the simplifying assumptions that we learn information from the discarded genes in an aggregate sense. Thus, we consider the following structure on the hyperparameters: m=m[1,1,…,1]^{T} and
where $m\in \mathbb{R}$, σ ^{2}>0, and −1≤ρ≤1. For each class, we need to determine values for five scalar quantities: m,ν ^{2},σ ^{2},ρ, and κ.
Due to the hierarchical design of the MP model, we cannot apply the method of moments in a direct fashion, as did [8]. Instead, we utilize a sampling based approach to the method of moments. This MCMC sampling approach has been examined in [32] as an extension to the generalized method of moments [33]. The sampling approach uses the discarded features in an additional MCMC run evaluated prior to the primary classification MCMC procedure as discussed in Section ‘Computation’ – and then proceeds to the method of moments. In this calibration MCMC, we initialize all prior distributions with flat priors and use the discarded features to obtain samples from the posterior distribution of μ and Σ. Typically, the number of discarded features F is much larger than the dimensionality D of the classification problem. Therefore, due to computation time, we uniformly sample F _{ s } pairs of features from F and average the resulting runs rather than using all or large groups of discarded features in a single MCMC run. We use the following procedure (for the complete algorithm, see Additional file 1):

1.
For each randomly chosen discarded feature pair (s in total):

(a)
Obtain MCMC samples using the feature pair as data and flat priors.

(b)
Record posterior averages of μ and Σ.

2.
Average over these posterior averages as given by Equations (15) (19).

3.
Using the resulting five hyperparameter estimates, run the final MCMC for classification.
Following [8], we use the moments of the posterior samples to determine the hyperparameters through the following relations: The mean of an inverseWishart distribution is
which together with our simplified covariance structure implies
The variance of the first diagonal of an inverseWishart matrix can be used to solve for κ via
As we have samples of μ directly from our posterior, we obtain
In order to use Equations (9)(14), we obtain estimates of the moments from MCMC performed over the F _{ s } discarded feature pairs. For the ith feature pair we obtain the posterior means ${\hat{\mu}}_{1}^{\left(i\right)},{\hat{\Sigma}}_{11}^{\left(i\right)},$ and ${\hat{\Sigma}}_{12}^{\left(i\right)}$ and then average:
We substitute the estimates from Equations (15)(19) back into Equations (9)(14) to obtain the final hyperparameter estimates.
One must keep in mind that the calibration procedure explicitly assumes the MP model. Hence, one can only expect an improvement in the classification performance if the data follow the MP model.
Computation
To obtain the MP OBC, we approximate the effective class conditional densities in order to minimize the expected error in a pointwise fashion:
where ${\theta}_{y}^{\left(i\right)}$ are M samples of θ _{ y } from the model posterior distributions.
For clarity of presentation, we do not consider the class variable y, and we assume a single class. We do this because the computation can be performed perclass due to the assumed independence between the classes and the marginal probability, p(c,θ _{0},θ _{1})=p(c)p(θ _{0})p(θ _{1}).
To obtain posterior samples of θ using the Metropolis Hastings MCMC algorithm we define a proposal distribution p(θ ^{′}θ) to obtain a new value for the class parameters θ ^{′} from the old values θ. We then calculate the acceptance ratio
under the assumption of a symmetric proposal distribution (p(θ ^{′}θ)=p(θθ ^{′})). The process of proposing and accepting samples from this distribution with the probability R induces a Markov chain. Positivity of the proposal distribution (p(θ ^{′}θ)>0 for any θ) is a sufficient condition for ergodicity of this Markov chain. Furthermore, this Markov chain admits a steadystate distribution equal to our desired posterior distribution p(θS _{ n }) [34].
From the definition of the likelihood,
where p(x _{ i }θ)=p(x _{ i }λ _{ i }) owing to conditional independence. From the definition of the prior,
The posterior predictive distribution in (20) is approximated by
where, p(x _{ k }λ _{ k })∼Poisson(d _{ k } exp(λ _{ k })), λ∼ Normal (μ,Σ), $\Lambda ={\mathbb{R}}^{n\times D}$, and the λ ^{(g)} are T vectorvalued samples drawn from the appropriate class’s posterior distribution used to approximate the inner intractable integral. In addition, we use this approximation of the effective classconditional density to calculate the conditional error estimates of (7) in a pointwise fashion.
Finally, because we have assumed a conjugate prior distribution for the marginal class probability c, the posterior expectation takes the closed form
where the n _{ y } are the number of training samples obtained from class y and the α _{ y } are hyperparameters set to 1 for an uninformative prior. Conjugacy was used for this one parameter because the increased flexibility of the full sampling approach was deemed not necessary due to the constrained, univariate nature of the parameter. If more complex relationships between c and other parameters were desired, then a sampling approach using nonconjugate priors would be straightforward to implement.
Synthetic data
To evaluate OBC performance in the setting of the MP model, we generate synthetic data using the method proposed in [35] to simulate gene expression/mRNA concentrations (see Additional file 1). These gene expression values are then statistically sampled to emulate modern sequencing machines as described in [21]. Parameter values are drawn from the following distributions to examine a wide variety of classification problems:
With these parameters, ten global, twenty heterogeneous, and ten nonmarker features are generated. Then four features are randomly chosen to represent a mixture of features of various classification quality. Following [21], the features in the data are zero mean and unit standard deviation normalized except for the MP OBC. The exception occurs because the MP model expects features to be positive integers and normalization is not necessary. The discarded features are used for calibration of the MP OBC priors, and 3000 samples are generated from each class to estimate the true classification rate for each classifier.
We use four features in this synthetic data classification study owing to limited computational resources as discussed in Section ‘Computational limitations’.
The synthetic data generation method proposed in [35] imposes the strong assumption of a homogeneous covariance (HC) structure between the two classes of data. This assumption does not hold for biological situations where interactions between features are not necessarily preserved between classes, and this occurs frequently in biology when considering the possible effects of canalizing genes, nonlinear gene regulation, and mutations in the case of cancer [36],[37]. Specifically, if the canalizing gene is not observed, and differs in activity between the two classes, then the measured correlation between two downstream genes could potentially be negligible in one class while strong in the other class. Similarly, for highly nonlinear gene regulation, if a gene in one class is in the saturation region of its response curve from a master gene, then the correlation will be low, while a lower expression level in the other class would allow for a large measured correlation with the same canalizing gene. And finally, if one class represents normal gene expression and the other tumorrelated expression, then a correlation might exist from a functioning pathway in the normal tissue, but a mutation could result in a lack of correlation effects in the tumor.
Hence, we modify the synthetic data generation procedure in an attempt to produce synthetic datasets more representative of such nonlinear phenomena in biology. In this modified procedure, we allow independent covariance (IC) matrices for the features of the two classes. To generate these covariance matrices, Σ _{ y }, we utilize independent draws from inverseWishart distributions with parameters κ _{ y }=22,D=20, and scale matrix $S={\sigma}_{y}^{2}$ (κ−1−D)I _{ D }. To examine the effects of feature correlation in IC datasets, we can also generate lowcorrelation covariance matrices by zeroing the offdiagonal terms. Once the covariance matrix for class y is obtained, location parameters for geneexpression values for each sample point are drawn from the respective multivariate normal distribution λ _{ y }∼N(μ _{ y },Σ _{ y }). Each sample point is then assumed to be normalized through an upper quartile or other suitable method, but in practice any samplebased normalization is imperfect. We reflect this variation by drawing the sequencing depth d _{ i } from a Uniform(d _{low},d _{high}) distribution, giving the rate of the Poisson process as d _{ i } expλ _{ i }. The number of reads for a single gene from a single sample is then drawn from this Poisson distribution. See Additional file 1 for more detail.
The OBC is optimal on average across the space of distributions determined by its prior distributions. To avoid biasing the performance comparison, we draw the classification problem datasets using different distributions than those of the OBC priors. See Additional file 1 for more detail.
Real data
We consider a real RNASeq dataset composed of level 3, RNASeqV2 data from the Cancer Genome Atlas (TCGA) project. It contains 484 and 470 specimens from lung adenocarcinoma and lung squamous cell carcinoma tumor biopsies, respectively. The samples are mapped read counts against 20531 known human RNA transcripts as generated by the University of North Carolina at Chapel Hill, one of the Genome Sequencing Centers for the TCGA. The data for each cancer type is the result of processing approximately 20 billion reads and the read count files for each are one gigabyte apiece. The problem is to classify the tumor types. Because the class0 (lung adenocarcinoma) and class1 (lung squamous cell carcinoma) sample sizes, 484 and 470, are not chosen randomly, we are confronted with the problem of separate sampling. This means that there is no way to obtain a posterior distribution for c and therefore c must be known in advance. Based upon records from 20062010, we have a very accurate estimate, 48,600/141,300≈0.34 [38]. Whereas we can use the value of c directly, along with all of the data, in designing the OBC, for classification rules that do not use c explicitly, the separately sampled data must be maximally subsampled to the proper sampling ratio c before applying the classification rule [39]. This means that for N _{ trn } desired samples, the sample subsets will contain round((1−c)N _{ trn }) and round(c N _{ trn }) for class 0 and 1, respectively. Moreover, holdout error estimation, which we use here, must be properly adapted for separate sampling for all design methods, including the OBC. The holdout estimate is given by
where ${\widehat{\epsilon}}_{0}$ and ${\widehat{\epsilon}}_{1}$ are the ordinary holdout estimators (performed on all remaining data samples not used for training) for the class0 and class1 errors, respectively [39]. We note that many studies have made the mistake of using classification rules designed for random sampling when sampling is separate. This can have devastating effects on classifier performance [39].
While averaging over sample subsets for holdout error estimation, we also average over uniformly, randomly selected gene subsets of size 4. This sampling occurs from low (110 average reads per gene) expression genes. We sample from these lower expression genes because we are ultimately interested in classification problems where the delineation between phenotypes is determined by genes with low expression. We used 10,000 for averaging in order to obtain a large enough sample over this feature and sample subset space to achieve repeatable results (data not shown). Computational runtime for each sample and gene subset was similar to the synthetic data.
Results and discussion
The Additional file 1 contains a simple twoclass, twofeature demonstration of the overall procedure to allow for easy visualization and interpretation. Here we discuss the results for the synthetic and real data.
Synthetic data
To evaluate classification performance, classifiers were trained using 3NN, LDA, and csupport vector machines with a radial basis function kernel [22]. Starting with the homogeneouscovariance model, Figure 3a shows that the performance of the multivariate Poisson OBC is better than nonlinear SVM when more than 10 samples are available and is significantly better than any other classifier when using calibrated features. Equivalently, by using discarded features, we can obtain the same classification performance while requiring fewer training samples.
In the case of independentcovariance data with highly correlated features, Figure 3b shows superior classification performance of the MP OBC at nearly all sample sizes considered. In addition, for calibrated prior distributions, the performance of the MP OBC improves. This improvement is greater when the sample sizes are small, which demonstrates the importance of additional knowledge (through discarded features) when data are expensive to obtain or not readily available.
The superior performance of the OBC relative to LDA, 3NN, and SVM in Figure 3b is on account of classification optimization relative to the model, which characterizes prior information. To further investigate OBC improvement, we again considered heterogeneous covariance matrices but with independent features to determine if there is any difference in the relative performance between the classifiers. In fact, the results provided in Figure 3c show identical relative performance to the error curves in Figure 3b, thereby indicating that both the standard classifiers and the OBC, relative performance (at least in the case considered) is not affected by whether or not the features are correlated. Indeed, comparing Figure 3a with Figures 3b and 3c, we see that the relative performance of SVM, MP OBC, and calibrated MP OBC is the same in both the homogeneous and heterogeneous models. The switch in relative performance between LDA and 3NN between Figure 3a and Figures 3b and 3c is not surprising because LDA is optimal for a fixed (known) homogeneous Gaussian model but not for a heterogeneous Gaussian model.
The larger overall classification errors in Figure 3a as compared to Figures 3b and 3c are due to the different covariance matrices generated by the HC and IC models. Each model required different generating distributions for {σ _{ y },ρ} and {S,κ} for the HC and IC cases, respectively, and the particular choices made in Section ‘Synthetic data’ resulted in larger dispersions and higher errors in the HC models than the IC models. To demonstrate this, we tested LDA with 1000 training and testing samples across 1000 random generating distributions and found the average HC classification error to be 0.41 and the IC error to be 0.32. This is despite LDA being optimal for homogeneous, fixed, known Gaussian cases and suboptimal for heterogeneous, fixed, known Gaussian cases, where the former is similar to the HC case.
Still using independentcovariance data, we fixed the mean of class 0 at μ _{0}=0.0 in Figure 3d, and varied μ _{1} from 0.0 to 1.0 to make the classification problem harder and easier, respectively. Across this range of classification problems, the MP OBC had better classification performance than the other classification methods. In addition, calibrated priors improved performance further, especially for harder classification problems.
Real data
In Table 1, we chose ten genes at random from adenocarcinoma tumor TCGA samples and performed model diagnostics [31], p. 143, by calculating posterior predictive pvalues for interquartile distance (IQR) as a measure of dispersion. In the Additional file 1, we provide additional test statistics and graphical predictive posterior model diagnostics. These results indicate that RNASeq overdispersion is modeled sufficiently with the MP model.
In Figure 4, we see mean holdout errors averaged over 10,000 training sets and testing sets of TCGA data as described in Section ‘Real data’. Here the MP OBC performs better than all other classifiers across most training sample sizes considered, but calibration does not improve performance for this particular dataset. Recall that improvement owing to calibration depends on the extent to which the data satisfy the MP model. If the aim of this paper were to build an operational classifier based on the TCGA data, then we would have to go back and extensively study the data set to examine deviations from the model – for instance, outliers; however, here our aim is to show the functionality of the OBC with nonGaussian data based on MCMC and apply it to the MP model. The fact that the MP OBC performs well on the real data satisfies this aim. Calibration is a tricky business and it would be a major separate study to characterize the manner in which model variation affects calibration, even if we were to perform an intensive study of this particular data set. Performance on the synthetic data demonstrates the effectiveness of the calibration when the model is satisfied.
Computational limitations
The results in Figure 3 and Figure 4 required tens of thousands of MCMC runs. Owing to limited available computational resources, we could only allocate around 30 seconds on a single CPU core for each MCMC run. This necessitated using only four genes for these classification results as each iteration of the MCMC procedure has time complexity of O(D ^{3}), where D is the number of features. In practice, one would have a small number of data sets and could use parallel computing to devote more time and computing effort for the classification. For example, in timescales on the order of hours on a typical workstation, we have successfully performed classification using 50 genes.
The other classification methods compared in this study have smaller computational requirements and can correspondingly handle larger numbers of features given the same available resources. However, for the small sample sizes often available in biology, 50 genes is typically beyond the “peaking” point where most classifiers decrease in classification performance as more features are added (for a fixed number of training samples) [40]. Incidentally, the OBC does not suffer this “peaking phenomenon” as shown in [10].
In addition, the computational time requirements of classification is typically not a bottleneck in translational medicine given the timescales used in collecting biological data. In these settings, the accuracy of classification is much more valuable than rapid runtimes, and this is the primary advantage of the computational OBC framework proposed in this paper.
Conclusions
We have demonstrated that Bayesian classification can be applied to specific problem domains such as RNASeq through statistical modeling and MCMC computation. The resulting classifier provides superior classification performance compared to stateoftheart classifiers such as SVM with a radial basis kernel. Although we have not discussed error estimation – our interest in the present paper being classification, ipso facto, the MCMC approach to optimal Bayesian classification can be applied, via [6],[7] and [12],[13], to obtain optimal MMSE error estimators for any classification rule and sampleconditioned evaluation of the MSE for error estimation.
Future work includes examining the normalization parameter d and determining if additional performance improvements can be made by considering the distribution over d rather than transforming the original data through the process of data normalization. Additionally, more efficient computational techniques could be used to allow for larger feature sizes, including program optimization and utilizing structure in the feature covariance to reduce the size of the parameter space.
Funding
This work was supported in part by the National Institute of Health grant U01CA162077, the NIEHS Center for Translational Environmental Health Research (CTEHR) P30ES023512, and the Center for Nonlinear Studies at the Los Alamos National Laboratory.
Additional file
References
 1.
BragaNeto UM, Dougherty ER: Is crossvalidation valid for smallsample microarray classification? . Bioinformatics. 2004, 20 (3): 374380.
 2.
Hanczar B, Hua J, Dougherty ER: Decorrelation of the true and estimated classifier errors in highdimensional settings . EURASIP J Bioinformatics Syst Biol. 2007, 2007: 2
 3.
Hanczar B, Dougherty ER: On the comparison of classifiers for microarray data . Curr Bioinformatics. 2010, 5 (1): 2939.
 4.
Hanczar B, Dougherty ER: The reliability of estimated confidence intervals for classification error rates when only a single sample is available . Pattern Recognit. 2013, 46 (3): 10671077. doi:10.1016/j.patcog.2012.09.019,
 5.
Dougherty ER, Zollanvari A, BragaNeto UM: The illusion of distributionfree smallsample classification in genomics . Curr Genomics. 2011, 12 (5): 333
 6.
Dalton LA, Dougherty ER: Bayesian minimum meansquare error estimation for classification error – part i: definition and the bayesian mmse error estimator for discrete classification . Signal Process IEEE Trans. 2011, 59 (1): 115129.
 7.
Dalton LA, Dougherty ER: Bayesian minimum meansquare error estimation for classification error – part ii: the bayesian mmse error estimator for linear classification of gaussian distributions . IEEE Trans Signal Process. 2011, 59: 130144.
 8.
Dalton LA, Dougherty ER: Application of the bayesian mmse estimator for classification error to gene expression microarray data . Bioinformatics. 2011, 27 (13): 18221831.
 9.
Esfahani MS, Dougherty ER: Incorporation of biological pathway knowledge in the construction of priors for optimal bayesian classification . Comput Biol Bioinformatics IEEE/ACM Trans. 2014, 11: 202218. doi:10.1109/TCBB.2013.143,
 10.
Dalton LA, Dougherty ER: Optimal classifiers with minimum expected error within a bayesian framework – part i: Discrete and gaussian models . Pattern Recognit. 2013, 46 (5): 13011314. doi:10.1016/j.patcog.2012.10.018,
 11.
Dalton LA, Dougherty ER: Optimal classifiers with minimum expected error within a bayesian framework – part ii: Properties and performance analysis . Pattern Recognit. 2013, 46 (5): 12881300. doi:10.1016/j.patcog.2012.10.019,
 12.
Dalton LA, Dougherty ER: Exact sample conditioned mse performance of the bayesian mmse estimator for classification error – part i: representation . Signal Process IEEE Trans. 2012, 60 (5): 25752587.
 13.
Dalton LA, Dougherty ER: Exact sample conditioned mse performance of the bayesian mmse estimator for classification error – part ii: consistency and performance analysis . Signal Process IEEE Trans. 2012, 60 (5): 25882603.
 14.
Anders S, Huber W: Differential expression analysis for sequence count data . Genome Biol. 2010, 11 (10): 106
 15.
Robinson MD, McCarthy DJ, Smyth GK: edger: a bioconductor package for differential expression analysis of digital gene expression data . Bioinformatics. 2010, 26 (1): 139140.
 16.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: Rnaseq: an assessment of technical reproducibility and comparison with gene expression arrays . Genome Res. 2008, 18 (9): 15091517.
 17.
Gallopin M, Rau A, Jaffrézic F: A hierarchical poisson lognormal model for network inference from rna sequencing data . PloS one. 2013, 8 (10): 77503
 18.
Si Y, Liu P, Li P, Brutnell TP: Modelbased clustering for rnaseq data . Bioinformatics. 2014, 30 (2): 197205.
 19.
Rau A, Celeux G, MartinMagniette ML, MaugisRabusseau C: Clustering highthroughput sequencing data with poisson mixture models. [Research Report] RR7786. 2011, pp.36. <inria00638082>.
 20.
Witten DM: Classification and clustering of sequencing data using a poisson model . Ann Appl Stat. 2011, 5 (4): 24932518.
 21.
Ghaffari N, Youse MR, Johnson CD, Ivanov I, Dougherty ER: Modeling the next generation sequencing sample processing pipeline for the purposes of classification . BMC Bioinformatics. 2013, 14 (1): 307
 22.
Duda RO, Hart PE, Stork DG: Pattern Classification, Hoboken, NJ: John Wiley & Sons; 2012.
 23.
Bengtsson M, Ståhlberg A, Rorsman P, Kubista M: Gene expression profiling in single cells from the pancreatic islets of langerhans reveals lognormal distribution of mrna levels . Genome Res. 2005, 15 (10): 13881392.
 24.
Attoor S, Dougherty ER, Chen Y, Bittner ML, Trent JM: Which is better for cdnamicroarraybased classification: ratios or direct intensities . Bioinformatics. 2004, 20 (16): 25132520.
 25.
Lindley DV: A statistical paradox . Biometrika. 1957, 44 (1/2): 187192.
 26.
Shafer G: Lindley’s paradox . J Am Stat Assoc. 1982, 77 (378): 325334.
 27.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaëffer B, Le Crom S, Guedj M, Jaffrézic F: A comprehensive evaluation of normalization methods for illumina highthroughput rna sequencing data analysis . Brief Bioinform. 2013, 14 (6): 671683. doi:10.1093/bib/bbs046,
 28.
Jaynes ET: Prior probabilities . Syst Sci Cybernet IEEE Trans. 1968, 4 (3): 227241.
 29.
Jeffreys H: An invariant form for the prior probability in estimation problems . Proc R Soc Lond A Math Phys Sci. 1946, 186 (1007): 453461.
 30.
Berger JO, Bernardo JM: On the development of reference priors . Bayesian Stat. 1992, 4 (4): 3560.
 31.
Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB: Bayesian Data Analysis, Boca Raton, FL: CRC Press; 2013.
 32.
Carrasco M, Florens JP: Simulationbased method of moments and efficiency . J Bus Econ Stat. 2002, 20 (4): 482492.
 33.
Hansen LP, Singleton KJ: Generalized instrumental variables estimation of nonlinear rational expectations models . Econometrica: J Econometric Soc. 1982, 50 (2): 12691286.
 34.
Gilks WR, Richardson S, Spiegelhalter DJ: Markov Chain Monte Carlo in Practice, vol.2, Boca Raton, FL: CRC Press; 1996.
 35.
Hua J, Tembe WD, Dougherty ER: Performance of featureselection methods in the classification of highdimension data . Pattern Recognit. 2009, 42 (3): 409424.
 36.
Martins DC, BragaNeto UM, Hashimoto RF, Bittner ML, Dougherty ER: Intrinsically multivariate predictive genes . Selected Topics Signal Process IEEE J. 2008, 2 (3): 424439.
 37.
Dougherty ER, Brun M, Trent JM, Bittner ML: Conditioningbased modeling of contextual genomic regulation . Comput Biol Bioinformatics IEEE/ACM Trans. 2009, 6 (2): 310320.
 38.
Ries LAG, Melbert D, Krapcho M, Stinchcomb DG, Howlader N, Horner MJ, Mariotto A, Miller BA, Feuer EJ, Altekruse SF, Lewis DR, Clegg L, Eisner MP, Reichman M, Edwards BK: Seer cancer statistics review, 19752005, Bethesda, MD: National Cancer Institute; 2008.
 39.
Esfahani MS, Dougherty ER: Effect of separate sampling on classification accuracy . Bioinformatics. 2014, 30 (2): 242250.
 40.
Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER: Optimal number of features as a function of sample size for various classification rules . Bioinformatics. 2005, 21 (8): 15091515.
 41.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E: Scikitlearn: machine learning in Python . J Mach Learn Res. 2011, 12: 28252830.
Acknowledgements
The authors acknowledge the Whole Systems Genomics Initiative (WSGI) for providing computational resources and systems administration support for the WSGI HPC Cluster. In addition, code in this paper utilizes the Python Scikitlearn library [41].
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JK Developed the MCMC implementation of the optimal Bayesian classifier, worked on the MP modeling, and wrote the draft of the manuscript. II Worked on the MP modeling, RNAseq modeling, and finalizing the manuscript. ED Worked on the application of the optimal Bayesian classifier and finalizing the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Classification
 RNASeq
 Modelbased
 Bayesian