Incorporating biological prior knowledge for Bayesian learning via maximal knowledge-driven information priors
 Shahin Boluki^{1},
 Mohammad Shahrokh Esfahani^{2},
 Xiaoning Qian^{1} and
 Edward R Dougherty^{1}
https://doi.org/10.1186/s12859-017-1893-4
© The Author(s) 2017
Published: 28 December 2017
Abstract
Background
Phenotypic classification is problematic because small samples are ubiquitous, and for such samples the use of prior knowledge is critical. If knowledge concerning the feature-label distribution (for instance, genetic pathways) is available, then it can be used in learning. Optimal Bayesian classification provides optimal classification under model uncertainty. It differs from classical Bayesian methods, in which a classification model is assumed and prior distributions are placed on model parameters. With optimal Bayesian classification, uncertainty is treated directly on the feature-label distribution, which assures full utilization of prior knowledge and is guaranteed to outperform classical methods.
Results
The salient problem confronting optimal Bayesian classification is prior construction. In this paper, we propose a new prior construction methodology based on a general framework of constraints in the form of conditional probability statements. We call this prior the maximal knowledge-driven information prior (MKDIP). The new constraint framework is more flexible than our previous methods, as it naturally handles the potential inconsistency in archived regulatory relationships, and conditioning can be augmented by other knowledge, such as population statistics. We also extend the application of prior construction to a multinomial mixture model when labels are unknown, which often occurs in practice. The performance of the proposed methods is examined on two important pathway families, the mammalian cell-cycle and a set of p53-related pathways, and also on a publicly available gene expression dataset of non-small cell lung cancer when combined with the existing prior knowledge on relevant signaling pathways.
Conclusion
The proposed general framework makes prior construction more flexible and yields better inference when proper prior knowledge exists. Moreover, the extension of optimal Bayesian classification to multinomial mixtures, where data sets are both small and unlabeled, enables superior classifier design using small, unstructured data sets. We have demonstrated the effectiveness of our approach using pathway information and available knowledge of gene regulating functions; however, the underlying theory can be applied to a wide variety of knowledge types and to other applications with small samples.
Background
Small samples are commonplace in phenotypic classification and, for these, prior knowledge is critical [1, 2]. If knowledge concerning the feature-label distribution is available, say, genetic pathways, then it can be used to design an optimal Bayesian classifier (OBC) for which uncertainty is treated directly on the feature-label distribution. As typical with Bayesian methods, the salient obstacle confronting OBC is prior construction. In this paper, we propose a new prior construction framework to incorporate gene regulatory knowledge via general types of constraints in the form of probability statements quantifying the probabilities of gene up- and down-regulation conditioned on the regulatory status of other genes. We extend the application of prior construction to a multinomial mixture model when labels are unknown, a key issue confronting the use of data arising from unplanned experiments in practice.
Regarding prior construction, E. T. Jaynes has remarked [3], “…there must exist a general formal theory of determination of priors by logical analysis of prior information – and that to develop it is today the top priority research problem of Bayesian theory”. It is precisely this kind of formal structure that is presented in this paper. The formal structure involves a constrained optimization in which the constraints incorporate existing scientific knowledge augmented by slackness variables. The constraints tighten the prior distribution in accordance with prior knowledge, while at the same time avoiding inadvertent over-restriction of the prior, an important consideration with small samples.
Subsequent to the introduction of Jeffreys’ non-informative prior [4], there was a series of information-theoretic and statistical methods: maximal data information priors (MDIP) [5], non-informative priors for integers [6], entropic priors [7], reference (non-informative) priors obtained through maximization of the missing information [8], and least-informative priors [9] (see also [10–12] and the references therein). The principle of maximum entropy can be seen as a method of constructing least-informative priors [13, 14], though it was first introduced in statistical mechanics for assigning probabilities. Except for Jeffreys’ prior, almost all the methods are based on optimization: maximizing or minimizing an objective function, usually an information-theoretic one. The least-informative prior in [9] is found among a restricted set of distributions, where the feasible region is a set of convex combinations of certain types of distributions. In [15], several non-informative and informative priors for different problems are found. All of these methods emphasize the separation of prior knowledge and observed sample data.
Although the methods above are appropriate tools for generating prior probabilities, they are quite general methodologies without targeting any specific type of prior information. In that regard, the problem of prior selection, in any Bayesian paradigm, is usually treated conventionally (even “subjectively”) and independent of the real available prior knowledge and sample data.
The a priori knowledge in the form of graphical models (e.g., Markov random fields) has been widely utilized in covariance matrix estimation in Gaussian graphical models. In these studies, using a given graphical model illustrating the interactions between variables, different problems have been addressed: e.g., constraints on the matrix structure [16, 17] or known independencies between variables [18, 19]. Nonetheless, these studies rely on a fundamental assumption: the given prior knowledge is complete and hence provides one single solution. However, in many applications including genomics, the given prior knowledge is uncertain, incomplete, and may be inconsistent. Therefore, instead of interpreting the prior knowledge as a single solution, e.g., a single deterministic covariance matrix, we aim at constructing a prior distribution on an uncertainty class.
In a different approach to prior knowledge, gene-gene relationships (pathway-based or protein-protein interaction (PPI) networks) are used to improve classification accuracy [20–26], consistency of biomarker discovery [27, 28], accuracy of identifying differentially expressed genes and regulatory target genes of a transcription factor [29–31], and targeted therapeutic strategies [32, 33]. The majority of these studies utilize gene expressions corresponding to sub-networks in PPI networks, for instance: mean or median of gene expression values in gene ontology network modules [20], probabilistic inference of pathway activity [24], and producing candidate sub-networks via a Markov clustering algorithm applied to high quality PPI networks [26, 34]. To the best of our knowledge, none of these methods incorporates the regulating mechanisms (activating or suppressing) into classification or feature selection.
The fundamental difference of the work presented in this paper is that we develop machinery to transform knowledge contained in biological signaling pathways to prior probabilities. We propose a general framework capable of incorporating any source of prior information by extending our previous prior construction methods [35–37]. We call the final prior distribution constructed via this framework a maximal knowledge-driven information prior (MKDIP). The new MKDIP construction comprises two steps: (1) Pairwise and functional information quantification: information in the biological pathways is quantified by an information-theoretic formulation. (2) Objective-based prior selection: combining sample data and prior knowledge, we build an objective function, in which the expected mean log-likelihood is regularized by the quantified information in step 1. As a special case, where we do not have any sample data, or there is only one data point available for constructing the prior probability, the proposed framework is reduced to a regularized extension of the maximum entropy principle (MaxEnt) [38].
Owing to population heterogeneity we often face a mixture model, for example, when considering tumor sample heterogeneity where the assignment of a sample to any subtype or stage is not necessarily given. Thus, we derive the MKDIP construction and OBC for a mixture model. In this paper, we assume that data are categorical, e.g. binary or ternary gene-expression representations. Such categorical representations have many potential applications, including those wherein we only have access to a coarse set of measurements, e.g. epifluorescent imaging [39], rather than fine-resolution measurements such as microarray or RNA-Seq data. Finally, we emphasize that, in our framework, no single model is selected; instead, we consider all possible models as the uncertainty class that can be representative of the available prior information and assign probabilities to each model via the constructed prior.
Methods
Notation
Boldface lower case letters represent column vectors. Occasionally, concatenation of several vectors is also shown by boldface lower case letters. For a vector a, a _{0} represents the summation of all the elements and a _{ i } denotes its i-th element. Probability sample spaces are shown by calligraphic uppercase letters. Uppercase letters are for sets and random variables (vectors). Probability measure over the random variable (vector) X is denoted by P(X), whether it be a probability density function or a probability mass function. E _{ X }[f(X)] represents the expectation of f(X) with respect to X. P(x|y) denotes the conditional probability P(X=x|Y=y). θ represents generic parameters of a probability measure; for instance, P(X|Y;θ) (or P _{ θ }(X|Y)) is the conditional probability parameterized by θ. γ represents generic hyperparameter vectors. π(θ;γ) is the probability measure over the parameters θ governed by hyperparameters γ, the parameters themselves governing another probability measure over some random variables. Throughout the paper, the terms “pathway” and “network” are used interchangeably. Also, the terms “feature” and “variable” are used interchangeably. \(\mathcal {M}ult(\boldsymbol {p};n)\) and \(\mathcal {D}(\boldsymbol {\alpha })\) represent a multinomial distribution with vector parameter p and n samples, and a Dirichlet distribution with vector α, respectively.
Review of optimal Bayesian classification
where the expectation is relative to the posterior distribution π ^{∗}(θ) over Θ, which is derived from the prior distribution π(θ) using Bayes’ rule [40, 41]. If we let θ _{0} and θ _{1} denote the class 0 and class 1 parameters, then we can write θ as θ=[c,θ _{ 0 },θ _{ 1 }]. If we assume that c,θ _{0},θ _{1} are independent prior to observing the data, i.e. π(θ)=π(c)π(θ _{0})π(θ _{1}), then the independence is preserved in the posterior distribution π ^{∗}(θ)=π ^{∗}(c)π ^{∗}(θ _{0})π ^{∗}(θ _{1}) and the posteriors are given by \(\pi ^{\ast }(\boldsymbol {\theta }_{y})\propto \pi (\boldsymbol {\theta } _{y})\prod _{i=1}^{n_{y}}f_{\boldsymbol {\theta }_{y}}(\mathbf {x}_{i}^{y} \mid y)\) for y=0,1, where \(f_{\boldsymbol {\theta }_{y}}(\mathbf {x}_{i}^{y} \mid y)\) and n _{ y } are the class-conditional density and number of sample points for class y, respectively [42].
Given a classifier ψ _{ n } designed from random sample S _{ n }, from the perspective of mean-square error, the best error estimate minimizes the MSE between its true error (a function of θ and ψ _{ n }) and an error estimate (a function of S _{ n } and ψ _{ n }). This Bayesian minimum-mean-square-error (MMSE) estimate is given by the expected true error, \(\widehat {\varepsilon }(\psi _{n},S_{n})=\mathrm {E}_{\boldsymbol {\theta } }[\varepsilon (\psi _{n},\boldsymbol {\theta }) \mid S_{n}]\), where ε(ψ _{ n },θ) is the error of ψ _{ n } on the feature-label distribution parameterized by θ and the expectation is taken relative to the prior distribution π(θ) [42]. The expectation given the sample is over the posterior probability. Thus, \(\widehat {\varepsilon }(\psi _{n},S_{n})=\mathrm {E}_{\pi ^{\ast }}[\varepsilon ]\).
where \(U_{j}^{y}\) denotes the observed count for class y in bin j [40]. Hereafter, \(\sum _{i=1}^{b}\alpha _{i}^{y}\) is represented by \(\alpha _{0}^{y}\), i.e. \(\alpha _{0}^{y} = \sum _{i=1}^{b}\alpha _{i}^{y}\), and is called the precision factor. In the sequel, the sub(super)script relating to dependency on class y may be dropped; nonetheless, availability of prior knowledge for both classes is assumed.
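The role of the observed counts and the precision factor can be illustrated with the standard conjugate update for a multinomial model with a Dirichlet prior, in which the posterior hyperparameters are the prior hyperparameters plus the bin counts. The following is a minimal sketch for a single class; the function names and the numerical values are hypothetical and are not taken from the paper.

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Conjugate update: Dirichlet(alpha) prior plus multinomial bin counts
    gives a Dirichlet(alpha + counts) posterior."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    if alpha.shape != counts.shape:
        raise ValueError("alpha and counts must have the same length")
    return alpha + counts

def posterior_mean(alpha, counts):
    """Posterior mean of the bin probabilities: (U_j + alpha_j) / (n + alpha_0)."""
    post = dirichlet_posterior(alpha, counts)
    return post / post.sum()

# Hypothetical example with b = 4 bins.
alpha = np.array([1.0, 1.0, 2.0, 4.0])  # prior hyperparameters, alpha_0 = 8
counts = np.array([3, 0, 1, 4])         # observed counts U_j, n = 8
precision_factor = alpha.sum()          # alpha_0
mean = posterior_mean(alpha, counts)    # (U_j + alpha_j) / 16
```

A larger precision factor gives the prior more weight relative to the n observed points, which is why it matters with small samples.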
Multinomial mixture model
Without loss of generality, the summation above can be taken over the iterations of the chain that remain after burn-in and thinning.
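Burn-in and thinning amount to selecting a subset of the chain iterations before averaging. The sketch below shows this selection for a generic array of per-iteration quantities; the function name and the parameters `burn_in` and `thin` are illustrative assumptions, and the quantity being averaged is whatever the summation above ranges over.

```python
import numpy as np

def chain_average(samples, burn_in, thin):
    """Average per-iteration quantities after discarding the first `burn_in`
    iterations and keeping every `thin`-th iteration thereafter."""
    samples = np.asarray(samples, dtype=float)
    kept = samples[burn_in::thin]
    if kept.shape[0] == 0:
        raise ValueError("no iterations left after burn-in")
    return kept.mean(axis=0)

# Hypothetical chain of 10 iterations of a scalar quantity: 0, 1, ..., 9.
chain = np.arange(10, dtype=float).reshape(10, 1)
avg = chain_average(chain, burn_in=4, thin=2)  # keeps iterations 4, 6, 8
```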
Prior construction: general framework
In this section, we propose a general framework for prior construction. We begin with introducing a knowledge-driven prior probability:
Definition 1
where C _{ θ }(ξ,D) is a cost function that depends on (1) θ: the random vector parameterizing the underlying probability distribution, (2) ξ: state of (prior) knowledge, and (3) D: partial observation (part of the sample data).
In contrast to non-informative priors, the MKDIP incorporates available prior knowledge and even part of the data to construct an informative prior.
The MKDIP definition is very general because we want a general framework for prior construction. The next definition specializes it to cost functions of a specific form in a constrained optimization.
Definition 2
In the sequel, we will refer to g ^{(1)}(·) and g ^{(2)}(·) as the cost functions, and \(g_{i}^{(3)}(\cdot)\)’s as the knowledge-driven constraints. We begin with introducing information-theoretic cost functions, and then we propose a general set of mapping rules, denoted by \(\mathcal {T}\) in Definition 2, to convert biological pathway knowledge into mathematical forms. We then consider special cases with information-theoretic cost functions.
Information-theoretic cost functions
 1. (Maximum Entropy) The principle of maximum entropy (MaxEnt) for probability construction [38] leads to the least informative prior given the constraints in order to prevent adding spurious information. Under our general framework MaxEnt can be formulated by setting:$$\beta=0,~g_{\boldsymbol{\theta}}^{(1)} = H[\boldsymbol{\theta}], $$where H[.] denotes the Shannon entropy.
 2. (Maximal Data Information) The maximal data information prior (MDIP) introduced by Zellner [46] as a choice of the objective function is a criterion for the constructed probability distribution to remain maximally committed to the data [47]. To achieve MDIP, we can set our general framework with:$$\begin{aligned} \beta=0,~g_{\boldsymbol{\theta}}^{(1)} &= \ln \pi(\boldsymbol{\theta};\boldsymbol{\gamma}) + H[P(x \mid \boldsymbol{\theta})]\\ &= \ln \pi(\boldsymbol{\theta};\boldsymbol{\gamma}) - E_{x \mid \boldsymbol{\theta}}[\ln P(x \mid \boldsymbol{\theta})]. \end{aligned} $$
 3. (Expected Mean Log-likelihood) The cost function introduced in [35] is the first one that utilizes part of the observed data for prior construction. In that work, we have$$\beta =1,~g^{(2)}_{\boldsymbol{\theta}}=\ell(\boldsymbol{\theta};D), $$where \(\ell (\boldsymbol {\theta };D)=\frac {1}{n_{D}}\sum _{i=1}^{n_{D}}\log f(\boldsymbol {x}_{i} \mid \boldsymbol {\theta })\) is the mean log-likelihood function of the sample points used for prior construction (D), and n _{ D } denotes the number of sample points in D. In [35], it is shown that this cost function is equivalent to the average Kullback-Leibler distance between the unknown distribution (empirically estimated by some part of the samples) and the uncertainty class of distributions.
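For categorical data, the mean log-likelihood in item 3 is straightforward to evaluate once each sample point is mapped to a bin. The following is a minimal sketch under the assumption that θ is a vector of bin probabilities, so that f(x_i | θ) is the probability of the bin containing x_i; the function name and the numbers are hypothetical.

```python
import numpy as np

def mean_log_likelihood(theta, data_bins):
    """Mean log-likelihood (1/n_D) * sum_i log f(x_i | theta) for a categorical
    model, where data_bins[i] is the bin index of sample point x_i."""
    theta = np.asarray(theta, dtype=float)
    data_bins = np.asarray(data_bins, dtype=int)
    if data_bins.size == 0:
        raise ValueError("D must contain at least one sample point")
    return float(np.mean(np.log(theta[data_bins])))

# Hypothetical three-bin model and four sample points reserved for prior construction.
theta = np.array([0.5, 0.25, 0.25])
D = [0, 0, 1, 2]
ell = mean_log_likelihood(theta, D)  # (2*ln 0.5 + 2*ln 0.25) / 4
```

In the prior construction this quantity would be averaged over θ drawn from the candidate prior; the sketch evaluates it for a single θ only.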
As originally proposed, the preceding approaches did not involve expectation over the uncertainty class. They were extended to the general prior construction form in Definition 1, including the expectation, in [36] to produce the regularized maximum entropy prior (RMEP), the regularized maximal data information prior (RMDIP), and the regularized expected mean log-likelihood prior (REMLP). In all cases, optimization was subject to specialized constraints.
For MKDIP, we employ the same information-theoretic cost functions in the prior construction optimization framework. MKDIP-E, MKDIP-D, and MKDIP-R correspond to using the same cost functions as RMEP, RMDIP, and REMLP, respectively, but with the new general types of constraints. To wit, we employ functional information from the signaling pathways and show that by adding these new constraints, which can be readily derived from prior knowledge, we can improve both supervised (classification problem with labelled data) and unsupervised (mixture problem without labels) learning of Bayesian operators.
From prior knowledge to mathematical constraints
In this part, we present a general formulation for mapping the existing knowledge into a set of constraints. In most scientific problems, the prior knowledge is in the form of conditional probabilities. In the following, we consider a hypothetical gene network and show how each component in a given network can be converted into the corresponding inequalities as general constraints in MKDIP optimization.
Before proceeding, we comment on contextual effects on regulation. Because a regulatory model is not independent of cellular activity outside the model, complete control relations such as A→B in the model, meaning that gene B is up-regulated if and only if gene A is up-regulated (after some time delay), do not necessarily translate into conditional probability statements of the form P(X _{ B }=1|X _{ A }=1)=1, where X _{ A } and X _{ B } represent the binary gene values corresponding to genes A and B, respectively. Rather, what may be observed is P(X _{ B }=1|X _{ A }=1)=1−δ, where δ>0. The pathway A→B need not imply P(X _{ B }=1|X _{ A }=1)=1 because A→B is conditioned on the context of the cell, where by context we mean the overall state of the cell, not simply the activity being modeled. δ is called a conditioning parameter. In an analogous fashion, rather than P(X _{ B }=1|X _{ A }=0)=0, what may be observed is P(X _{ B }=1|X _{ A }=0)=η, where η>0, because there may be regulatory relations outside the model that up-regulate B. Such activity is referred to as cross-talk and η is called a cross-talk parameter. Conditioning and cross-talk effects can involve multiple genes and can be characterized analytically via context-dependent conditional probabilities [48].
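The effect of the two parameters on a single edge A→B can be written out as a small conditional probability table. The sketch below is illustrative only: the function name, the numerical values of δ and η, and the marginal probability of A are assumptions, not quantities from the paper.

```python
def conditional_table(delta, eta):
    """Return P(X_B = b | X_A = a), keyed by (b, a), for an activating edge
    A -> B with conditioning parameter delta and cross-talk parameter eta."""
    if not (0.0 <= delta <= 1.0 and 0.0 <= eta <= 1.0):
        raise ValueError("delta and eta must lie in [0, 1]")
    return {
        (1, 1): 1.0 - delta,  # B up-regulated given A up-regulated
        (0, 1): delta,        # context prevents A from activating B
        (1, 0): eta,          # B up-regulated by activity outside the model
        (0, 0): 1.0 - eta,
    }

def marginal_b_up(table, p_a_up):
    """P(X_B = 1) obtained by averaging over the status of A."""
    return table[(1, 1)] * p_a_up + table[(1, 0)] * (1.0 - p_a_up)

# Hypothetical values: delta = 0.1, eta = 0.05, P(X_A = 1) = 0.4.
table = conditional_table(delta=0.1, eta=0.05)
p_b = marginal_b_up(table, p_a_up=0.4)  # 0.4 * 0.9 + 0.6 * 0.05
```

Setting δ = η = 0 recovers the deterministic relation in which B is up-regulated if and only if A is.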
where η _{1}(1,1,k _{4},…,k _{ m }), η _{1}(0,1,k _{4},…,k _{ m }), and η _{1}(0,0,k _{4},…,k _{ m }) are cross-talk parameters. In practice it is unlikely that we would know the conditioning and cross-talk parameters for all combinations of k _{4},…,k _{ m }; rather, we might just know the average, in which case, δ _{1}(1,0,k _{4},…,k _{ m }) reduces to δ _{1}(1,0), η _{1}(1,1,k _{4},…,k _{ m }) reduces to η _{1}(1,1), etc.
The basic setting is very general and the conditional probabilities are what they are, whether or not they can be expressed in the regulatory form of conditioning or cross-talk parameters. The general scheme includes previous constraints and approaches proposed in [35] and [36] for the Gaussian and discrete setups. Moreover, in those we can drop the regulatory-set entropy because it is replaced by the set of conditional probabilities based on the regulatory set, whether forward (master predicting slaves) or backwards (slaves predicting masters) [48].
where λ _{1} and λ _{2} are nonnegative regularization parameters, and ε and \(\mathcal {E}\) represent the vector of all slackness variables and the feasible region for slackness variables, respectively. For each slackness variable, a possible range can be defined (note that all slackness variables are nonnegative). The higher the uncertainty is about a constraint stemming from prior knowledge, the greater the possible range for the corresponding slackness variable can be (more on this in the “Results and discussion” section).
The new general type of constraints discussed here introduces a formal procedure for incorporating prior knowledge. It allows the incorporation of knowledge of the functional regulations in the signaling pathways, any constraints on the conditional probabilities, and also knowledge of the crosstalk and conditioning parameters (if present), unlike the previous work in [36] where only partial information contained in the edges of the pathways is used in an ad hoc way.
An illustrative example and connection with conditional entropy
meaning that the conditioning parameter depends on whether X _{2}=0 or 1.
It should be noted that constraining H[X _{1}|X _{2},X _{3},X _{5}] would not necessarily constrain the conditional probabilities, and may be considered a more relaxed type of constraint. But, for example, in cases where there is no knowledge about the status of a gene given its regulator genes, constraining entropy is the only possible approach.
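A toy two-variable case shows why an entropy constraint is more relaxed than a constraint on conditional probabilities: an activating and a suppressing relationship of the same strength have identical conditional entropy. The sketch below uses a hypothetical pair of joint distributions over a target X and a single regulator Y, not the genes of the illustrative example.

```python
import numpy as np

def conditional_entropy(joint):
    """H[X | Y] in nats, where joint[x, y] = P(X = x, Y = y)."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=0)
    h = 0.0
    for x in range(joint.shape[0]):
        for y in range(joint.shape[1]):
            if joint[x, y] > 0:
                h -= joint[x, y] * np.log(joint[x, y] / p_y[y])
    return h

# Both cases assume P(Y = 0) = P(Y = 1) = 0.5.
# Activating:  P(X=1 | Y=1) = 0.9 and P(X=1 | Y=0) = 0.1.
activating = np.array([[0.45, 0.05],
                       [0.05, 0.45]])
# Suppressing: P(X=1 | Y=1) = 0.1 and P(X=1 | Y=0) = 0.9.
suppressing = np.array([[0.05, 0.45],
                        [0.45, 0.05]])

h_act = conditional_entropy(activating)
h_sup = conditional_entropy(suppressing)
```

Both values equal the binary entropy of 0.1, so a bound on the conditional entropy admits either regulatory direction, whereas a bound on P(X=1 | Y=1) does not.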
Special case of Dirichlet distribution
where the summation in the numerator and the first summation in the denominator of the second equality are over the states (bins) for which (\(X_{i} = k_{i}, \tilde {X}_{i} = \tilde {x}_{i}\)), and the second summation in the denominator is over the states (bins) for which (\(X_{i} = k_{i}^{c}, \tilde {X}_{i} = \tilde {x}_{i}\)).
where \(\phantom {\dot {i}\!}O_{\boldsymbol {B}_{i}}\) is the set of all possible vectors of values for B _{ i }.
For a multinomial model with a Dirichlet prior distribution, a constraint on the conditional probabilities translates into a constraint on the above expectation over the conditional probabilities (as in Eq. (15)). In our illustrative example and from the equations in Eq. (17), there are four of these constraints on the conditional probability for gene g _{1}. For example, in the second constraint from the second line of Eq. (17) (Eq. 17b), X _{ i }=X _{1}, k _{ i }=0, R _{ i }={X _{3}}, r _{ i }=[0], and B _{ i }={X _{2},X _{5}}. One might have several constraints for each gene extracted from its regulatory function (more on extracting general constraints from regulating functions in the “Results and discussion” section).
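For bin probabilities with a Dirichlet distribution, the expectation of a ratio of bin-probability sums of the kind described above has a closed form in the hyperparameters: by the aggregation property of the Dirichlet distribution, the ratio follows a Beta distribution whose mean is the corresponding ratio of hyperparameter sums. The sketch below computes this quantity for binary variables; the function name, the bin ordering, and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np
from itertools import product

def expected_conditional(alpha, n_vars, target, target_value, given):
    """E_p[P(X_target = target_value | given)] for p ~ Dirichlet(alpha).

    Bins are the 2**n_vars binary states in itertools.product order, and
    `given` maps variable indices to their conditioning values.
    """
    alpha = np.asarray(alpha, dtype=float)
    states = list(product([0, 1], repeat=n_vars))
    if alpha.size != len(states):
        raise ValueError("alpha must have one entry per state")
    num = 0.0  # hyperparameters of bins matching the target and the condition
    den = 0.0  # hyperparameters of all bins matching the condition
    for a, s in zip(alpha, states):
        if all(s[i] == v for i, v in given.items()):
            den += a
            if s[target] == target_value:
                num += a
    return num / den

# Hypothetical two-gene example; states are ordered (0,0), (0,1), (1,0), (1,1).
alpha = [1.0, 2.0, 3.0, 4.0]
value = expected_conditional(alpha, n_vars=2, target=0, target_value=1, given={1: 1})
# value = alpha(1,1) / (alpha(0,1) + alpha(1,1)) = 4 / 6
```

A constraint of the form "expected conditional probability at least 1 − ε" then becomes an inequality that is linear in the hyperparameters after multiplying through by the denominator.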
Results and discussion
The performance of the proposed general prior construction framework with different types of objective functions and constraints is examined and compared with other methods on two pathways, a mammalian cell-cycle pathway and a pathway involving the gene TP53. Here we employ Boolean network modeling of genes/proteins (hereafter referred to as entities or nodes) [49] with perturbation (BNp). A Boolean Network with p nodes (genes/proteins) is defined as B=(V,F), where V represents the set of entities (genes/proteins) {v _{1},…,v _{ p }}, and F is the set of Boolean predictor functions {f _{1},…,f _{ p }}. At each step in a BNp, a decision is made by a Bernoulli random variable with the success probability equal to the perturbation probability, p _{ pert }, as to whether a node value is determined by perturbation of randomly flipping its value or by the logic model imposed from the interactions in the signaling pathways. A BNp with a positive perturbation probability can be modeled by an ergodic Markov chain, and possesses a steady-state distribution (SSD) [50]. The performance of different prior construction methods can be compared based on the expected true error of the optimal Bayesian classifiers designed with those priors, and also by comparing these errors with some other well-known classification methods. Another comparison metric of prior construction methods is the expected norm of the difference between the true parameters and the posterior mean of these parameters inferred using the constructed prior distributions. Here, the true parameters are the vectors of the true class-conditional SSDs, i.e. the vectors of the true class-conditional bin probabilities of the BNp.
Moreover, the performance of the proposed framework is compared with other methods on a publicly available gene expression dataset of non-small cell lung cancer when combined with the existing prior knowledge on relevant signaling pathways.
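The BNp model described above can be turned into a transition probability matrix and a steady-state distribution numerically. The sketch below assumes one common perturbation convention, in which every node flips independently with probability p_pert and the Boolean functions are applied only when no node flips; the exact convention used in the experiments may differ. The three-node network and all names are hypothetical.

```python
import numpy as np
from itertools import product

def bnp_transition_matrix(funcs, p_pert):
    """Transition probability matrix of a Boolean network with perturbation.

    Assumed convention: each node flips independently with probability p_pert;
    if no node flips, the next state is given by the Boolean functions."""
    n = len(funcs)
    states = list(product([0, 1], repeat=n))
    index = {s: i for i, s in enumerate(states)}
    tpm = np.zeros((len(states), len(states)))
    p_none = (1.0 - p_pert) ** n
    for s in states:
        i = index[s]
        nxt = tuple(int(f(s)) for f in funcs)  # logic-driven transition
        tpm[i, index[nxt]] += p_none
        for gamma in states:                   # every non-zero flip pattern
            k = sum(gamma)
            if k == 0:
                continue
            flipped = tuple(a ^ b for a, b in zip(s, gamma))
            tpm[i, index[flipped]] += (p_pert ** k) * ((1.0 - p_pert) ** (n - k))
    return tpm, states

def steady_state(tpm):
    """Solve pi P = pi together with sum(pi) = 1 by least squares."""
    n = tpm.shape[0]
    a = np.vstack([tpm.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(a, b, rcond=None)
    return pi

# Hypothetical three-node network (not one of the networks studied here).
toy_funcs = [
    lambda s: 1 - s[2],     # v1 = NOT v3
    lambda s: s[0],         # v2 = v1
    lambda s: s[0] & s[1],  # v3 = v1 AND v2
]
tpm, states = bnp_transition_matrix(toy_funcs, p_pert=0.01)
ssd = steady_state(tpm)
```

Under this convention every state can reach every other state in one step when p_pert is positive, so the stationary distribution is unique.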
Mammalian cell cycle classification
Table 1 Boolean regulating functions of normal mammalian cell cycle [51]. In the Boolean functions, AND, OR, and NOT are denoted by ∧, ∨, and an overline, respectively
| Gene | Node name | Boolean regulating function |
| --- | --- | --- |
| CycD | v _{1} | Extracellular signal |
| Rb | v _{2} | \((\overline {v_{1}}\wedge \overline {v_{4}} \wedge \overline {v_{5}} \wedge \overline {v_{10}})\vee (v_{6}\wedge \overline {v_{1}}\wedge \overline {v_{10}})\) |
| E2F | v _{3} | \((\overline {v_{2}}\wedge \overline {v_{5}} \wedge \overline {v_{10}})\vee ({v_{6}} \wedge \overline {v_{2}}\wedge \overline {v_{10}})\) |
| CycE | v _{4} | \(({v_{3}}\wedge \overline {v_{2}})\) |
| CycA | v _{5} | \(({v_{3}}\wedge \overline {v_{2}} \wedge \overline {v_{7}} \wedge \overline {(v_{8}\wedge v_{9})})\vee (v_{5}\wedge \overline {v_{2}}\wedge \overline {v_{7}}\wedge \overline {(v_{8}\wedge v_{9})})\) |
| p27 | v _{6} | \((\overline {v_{1}}\wedge \overline {v_{4}} \wedge \overline {v_{5}} \wedge \overline {v_{10}})\vee (v_{6}\wedge \overline {(v_{4}\wedge v_{5})}\wedge \overline {v_{10}}\wedge \overline {v_{1}})\) |
| Cdc20 | v _{7} | v _{10} |
| Cdh1 | v _{8} | \((\overline {v_{5}}\wedge \overline {v_{10}}) \vee ({v_{7}}) \vee {(v_{6}\wedge \overline {v_{10}})}\) |
| UbcH10 | v _{9} | \((\overline {v_{8}})\vee ({v_{8}}\wedge {v_{9}} \wedge ({v_{7}\vee {v_{5}}\vee {v_{10}}}))\) |
| CycB | v _{10} | \((\overline {v_{7}}\wedge \overline {v_{8}})\) |
Table 2 Boolean regulating functions of the TP53 network
| Gene | Node name | Boolean regulating function |
| --- | --- | --- |
| dna-dsb | v _{1} | Extracellular signal |
| ATM | v _{2} | \(\overline {v_{4}} \wedge (v_{2}\vee v_{1})\) |
| P53 | v _{3} | \(\overline {v_{5}}\wedge (v_{2}\vee v_{4})\) |
| Wip1 | v _{4} | v _{3} |
| Mdm2 | v _{5} | \(\overline {v_{2}}\wedge (v_{3}\vee v_{4})\) |
The full functional regulations in the cell-cycle Boolean network are shown in Table 1.
Following [36], for the binary classification problem, y=0 corresponds to the normal system functioning based on Table 1, and y=1 corresponds to the mutated (cancerous) system where CycD, p27, and Rb are permanently down-regulated (are stuck at zero), which creates a situation where the cell cycles even in the absence of any growth factor. The perturbation probability is set to 0.01 and 0.05 for the normal and mutated system, respectively. A BNp has a transition probability matrix (TPM), and as mentioned earlier, with positive perturbation probability can be modeled by an ergodic Markov chain, and possesses an SSD [50]. Here, each class has a vector of steady-state bin probabilities, resulting from the regulating functions of its corresponding BNp and the perturbation probability. The constructed SSDs are further marginalized to a subset of seven genes to prevent trivial classification scenarios. The final feature vector is x=[E2F,CycE,CycA,Cdc20,Cdh1,UbcH10,CycB], and the state space size is 2^{7}=128. The true parameters for each class are the final class-conditional steady-state bin probabilities, p ^{0} and p ^{1} for the normal and mutated systems, respectively, which are utilized for taking samples.
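Marginalizing an SSD to a subset of nodes amounts to summing the bin probabilities over the values of the nodes that are dropped. The following is a minimal sketch of that operation; the function name and the state ordering are assumptions, and a uniform distribution stands in for the actual SSD, which is not reproduced here.

```python
import numpy as np
from itertools import product

def marginalize(dist, n_nodes, keep):
    """Marginalize a distribution over the 2**n_nodes binary states (in
    itertools.product order) to the nodes listed in `keep` (zero-based)."""
    dist = np.asarray(dist, dtype=float)
    states = list(product([0, 1], repeat=n_nodes))
    if dist.size != len(states):
        raise ValueError("dist must have one entry per state")
    out = np.zeros(2 ** len(keep))
    for p, s in zip(dist, states):
        idx = 0
        for node in keep:  # build the bin index of the kept nodes
            idx = (idx << 1) | s[node]
        out[idx] += p
    return out

# Zero-based indices of E2F, CycE, CycA, Cdc20, Cdh1, UbcH10, CycB
# (nodes v3, v4, v5, v7, v8, v9, v10 in Table 1).
features = [2, 3, 4, 6, 7, 8, 9]
placeholder_ssd = np.full(2 ** 10, 1.0 / 2 ** 10)  # stand-in, not the true SSD
p_features = marginalize(placeholder_ssd, n_nodes=10, keep=features)
```

The result has 2^7 = 128 entries, matching the state space size stated above.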
Classification problem corresponding to TP53
TP53 is a tumor suppressor gene involved in various biological pathways [36]. Mutated p53 has been observed in almost half of the common human cancers [52], and in more than 90% of patients with severe ovarian cancer [53]. A simplified pathway involving TP53, based on logic in [54], is shown in Fig. 3(b). DNA double-strand break affects the operation of these pathways, and the Boolean network modeling of these pathways under this uncertainty has been studied in [53, 54]. The full functional regulations are shown in Table 2.
Following [36], two scenarios, dna-dsb=0 and dna-dsb=1, weighted by 0.95 and 0.05, are considered and the SSD of the normal system is constructed based on the ergodic Markov chain model of the BNp with the regulating functions in Table 2 by assuming the perturbation probability 0.01. The SSD for the mutated (cancerous) case is constructed by assuming a permanent down-regulation of TP53 in the BNp, and perturbation probability 0.05. Knowing that dna-dsb is not measurable, and to avoid trivial classification situations, the SSDs are marginalized to a subset of three entities x=[ATM,Wip1,Mdm2]. The state space size in this case is 2^{3}=8. The true parameters for each class are the final class-conditional steady-state bin probabilities, p ^{0} and p ^{1} for the normal and mutated systems, respectively, which are used for data generation.
Extracting general constraints from regulating functions
If knowledge of the regulating functions exists, it can be used in the general constraint framework of the MKDIP, i.e. it can be used to constrain the conditional probabilities. In other words, the knowledge about the regulating function of gene i can be used to set ε _{ i }(k _{1},…,k _{ i−1},k _{ i+1},…,k _{ m }), and \(a^{k_{i}}_{i}(k_{1},\dots, k_{i1}, k_{i+1},\dots, k_{m})\) in the general form of constraints in (15). If the true regulating function of gene i is known, and it is not context sensitive, then the conditional probability of its status, \(a^{k_{i}}_{i}(k_{1},\dots, k_{i1}, k_{i+1},\dots, k_{m})\), is known for sure, and δ _{ i }(k _{1},…,k _{ i−1},k _{ i+1},…,k _{ m })=0. But in reality, the true regulating functions are not known, and are also context sensitive. The dependence on the context translates into δ _{ i }(k _{1},…,k _{ i−1},k _{ i+1},…,k _{ m }) being greater than zero. The greater the context effect on the gene status, the larger δ _{ i } is. Moreover, the uncertainty over the regulating function is captured by the slackness variables ε _{ i }(k _{1},…,k _{ i−1},k _{ i+1},…,k _{ m }) in Eq. (15). In other words, the uncertainty is translated to the possible range of the slackness variable values in the prior construction optimization framework. The higher the uncertainty is, the greater the range should be in the optimization framework. In fact, slackness variables make the whole constraint framework consistent, even for cases where the conditional probability constraints imposed by prior knowledge are not completely in line with each other, and guarantee the existence of a solution.
As an example, for the classification problems of the mammalian cellcycle network and the TP53 network, assuming the regulating functions in Tables 1 and 2 are the true regulating functions, the context effect can be observed in the dependence of the output of the Boolean regulating functions in the tables on the extracellular signals, nonmeasurable entities, and the genes that have been marginalized out in our setup. In the absence of quantitative knowledge about the context effect, i.e. \(a^{k_{i}}_{i}(k_{1},\dots, k_{i1}, k_{i+1},\dots, k_{m})\) for all possible setups of the regulator values, one can impose only those with such knowledge. For example, in the mammalian cellcycle network, CycB’s regulating function only depends on the values included in the observed feature set; therefore the conditional probabilities are known for all regulator value setups. But for CycE the regulating function depends on Rb, which is marginalized out in our feature set, and also itself depends on an extracellular signal. Hence, the conditional probability constraints for CycE are known only for the setup of the features that determine the output of the Boolean regulating function independent of the other regulator values.
In our comparison analysis, \(a^{k_{i}}_{i}(k_{1},\dots,k_{i-1},k_{i+1},\dots,k_{m})\) for each gene/protein in Eq. (15) is set to one for the feature value setups that determine the Boolean regulating output regardless of the context. But since the observed data are not fully described by these functions, and the system has uncertainty, we let the possible range for the slackness variables in Eq. (15) be [0,1).
Table 3 The set of constraints extracted from the regulating functions and pathways for the TP53 network. Left panel (a): constraints extracted from the Boolean regulating functions in Table 2, corresponding to the pathway in Fig. 3(b), used in MKDIPE, MKDIPD, and MKDIPR. Right panel (b): constraints extracted based on [36] from the pathway in Fig. 3(b), used in RMEP, RMDIP, and REMLP
(a) MKDIP constraints (left panel)
Node  Constraint
v _{2}  \(E_{p}[P(v_{2}=0 \mid v_{4}=1)] \geq 1-\varepsilon_{1}\)
v _{2}  \(E_{p}[P(v_{2}=1 \mid v_{4}=0)] \geq 1-\varepsilon_{2}\)
v _{5}  \(E_{p}[P(v_{5}=0 \mid v_{2}=1)] \geq 1-\varepsilon_{3}\)
v _{5}  \(E_{p}[P(v_{5}=1 \mid v_{2}=0, v_{4}=1)] \geq 1-\varepsilon_{4}\)
(b) Constraints in the methods of [36] (right panel)
Node  Constraint
v _{2}  \(E_{p}[P(v_{2}=0 \mid v_{4}=1)] \geq 1-\varepsilon_{1}\)
v _{5}  \(E_{p}[P(v_{5}=1 \mid v_{2}=0, v_{4}=1)] \geq 1-\varepsilon_{2}\)
The first and second constraints for MKDIP in the left panel of Table 3 come from the regulating function of v _{2} in Table 2. Although v _{1} is an extracellular signal, the value of v _{4} imposes two constraints on the value of v _{2}. But the regulating function of v _{4} in Table 2 only depends on v _{3}, which is not included in our feature set, so we have no imposed constraints on the conditional probability from its regulating function. The other two constraints for MKDIP in the left panel of Table 3 are extracted from the regulating function of v _{5} in Table 2. Although v _{3} is not included in the observed features, for two setups of its regulators, (v _{2}=1) and (v _{2}=0,v _{4}=1), the value of v _{5} can be determined, so the constraint is imposed on the prior distribution from the regulating function. For comparison, the constraints extracted from the pathway in Fig. 3(b) based on the method of [36] are provided in the right panel of Table 3.
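As a rough illustration of how a constraint of this form can be checked against a candidate prior, the sketch below assumes a Dirichlet prior over the joint states of a few binary features. For a Dirichlet distribution, the expectation of a conditional probability P(A | B) equals the sum of the hyperparameters over the bins in both A and B divided by the sum over the bins in B (a consequence of the aggregation property). The function names, the feature indexing, and the flat `alpha` vector are hypothetical choices for illustration, not values or code from the paper.

```python
from itertools import product

import numpy as np


def joint_states(m):
    """All 2**m joint states of m binary features, in lexicographic order."""
    return list(product((0, 1), repeat=m))


def expected_conditional(alpha, states, target, condition):
    """E[P(target | condition)] under a Dirichlet(alpha) prior on the bins.

    target and condition map a feature index to its required value.
    """
    alpha = np.asarray(alpha, dtype=float)
    in_cond = np.array(
        [all(s[i] == v for i, v in condition.items()) for s in states]
    )
    in_target = np.array(
        [all(s[i] == v for i, v in target.items()) for s in states]
    )
    return alpha[in_cond & in_target].sum() / alpha[in_cond].sum()


def required_slack(expected_prob):
    """Smallest epsilon >= 0 such that expected_prob >= 1 - epsilon."""
    return max(0.0, 1.0 - expected_prob)


# Hypothetical example: feature indices 0, 1, 2 stand in for v2, v4, v5.
states = joint_states(3)
alpha = np.ones(len(states))  # flat prior, for illustration only
e = expected_conditional(alpha, states, target={0: 0}, condition={1: 1})
print(e, required_slack(e))
```

With a flat prior the expected conditional probability is 0.5, so the constraint on v _{2} given v _{4}=1 would need a slack of 0.5; a prior that concentrates mass on the states with v _{2}=0 and v _{4}=1 needs a smaller slack.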
Performance comparison in classification setup

The true bin probabilities p ^{0} and p ^{1} are fixed.

n _{0} and n _{1} are determined using c as n _{0}=⌈c n⌉ and n _{1}=n−n _{0}.

Observations (training data) are randomly sampled from the multinomial distribution for each class, i.e. \((U^{y}_{1},\ldots,U^{y}_{b})\sim \mathcal {M}ult(\boldsymbol {p}^{y};n_{y})\), for y∈{0,1}.

Ten data points are randomly taken from the training data points of each class to be used in the prior construction methods that utilize partial data (REMLP and MKDIPR).

All the classification rules are trained based on their constructed prior (if applicable to that classification rule) and the training data.

The classification errors associated with the classifiers are computed using p ^{0} and p ^{1}. Also for the Bayesian methods, the posterior probability mass (mean) distance from the true parameters (true bin probabilities, p ^{0} and p ^{1}) is calculated.
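The steps above can be sketched for a single Monte Carlo repetition as follows. This is a minimal sketch under stated assumptions: c is taken to be the probability of class 0 and is treated as known, each class has an independent Dirichlet prior, and the classifier compares posterior-mean bin probabilities weighted by the class probabilities. The flat prior `flat` is only a placeholder for a constructed prior, and all function names are hypothetical.

```python
import math

import numpy as np


def class_sizes(n, c):
    """n0 = ceil(c * n) and n1 = n - n0."""
    n0 = math.ceil(c * n)
    return n0, n - n0


def sample_bin_counts(rng, p0, p1, n0, n1):
    """Multinomial bin counts for class 0 and class 1."""
    return rng.multinomial(n0, p0), rng.multinomial(n1, p1)


def posterior_mean(alpha_prior, counts):
    """Mean of the Dirichlet posterior given multinomial counts."""
    post = np.asarray(alpha_prior, dtype=float) + counts
    return post / post.sum()


def classify_bins(c, q0, q1):
    """Label a bin 1 where (1 - c) * q1 > c * q0, and 0 otherwise."""
    return ((1.0 - c) * np.asarray(q1) > c * np.asarray(q0)).astype(int)


def true_error(c, p0, p1, labels):
    """Error of a bin labeling under the true bin probabilities."""
    p0, p1 = np.asarray(p0), np.asarray(p1)
    return float(np.sum(np.where(labels == 1, c * p0, (1.0 - c) * p1)))


def bayes_error(c, p0, p1):
    """Minimum achievable error for known bin probabilities."""
    return float(np.sum(np.minimum(c * np.asarray(p0), (1.0 - c) * np.asarray(p1))))


rng = np.random.default_rng(0)
p0 = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical true bin probabilities
p1 = np.array([0.1, 0.2, 0.3, 0.4])
c, n = 0.5, 30
n0, n1 = class_sizes(n, c)
u0, u1 = sample_bin_counts(rng, p0, p1, n0, n1)
flat = np.ones(p0.size)  # placeholder prior, not a constructed MKDIP prior
labels = classify_bins(c, posterior_mean(flat, u0), posterior_mean(flat, u1))
print(true_error(c, p0, p1, labels), bayes_error(c, p0, p1))
```

Averaging `true_error` over many repetitions gives an estimate of the expected true error for a given prior and sample size, which can never fall below `bayes_error`.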
Table 4 Expected true error of different classification rules for the mammalian cell-cycle network. The constructed priors are considered using two precision factors: optimal precision factor (left) and estimated precision factor (right), with c=0.5 and c=0.6, where the minimum achievable error (Bayes error) is denoted by Err_{Bayes}
(a) c=0.5, optimal precision factor, Err_{Bayes}=0.2648  (b) c=0.5, estimated precision factor, Err_{Bayes}=0.2648
Method/ n  30  60  90  120  150  Method/ n  30  60  90  120  150 
Hist  0.3710  0.3423  0.3255  0.3155  0.3081  Hist  0.3710  0.3423  0.3255  0.3155  0.3081 
CART  0.3326  0.3195  0.3057  0.3031  0.2975  CART  0.3326  0.3195  0.3057  0.3031  0.2975 
RF  0.3359  0.3160  0.3015  0.2991  0.2933  RF  0.3359  0.3160  0.3015  0.2991  0.2933 
SVM  0.3359  0.3112  0.2977  0.2959  0.2940  SVM  0.3359  0.3112  0.2977  0.2959  0.2940 
Jeffreys’  0.3710  0.3423  0.3255  0.3155  0.3081  Jeffreys’  0.3710  0.3423  0.3255  0.3155  0.3081 
RMEP  0.3236  0.3070  0.3010  0.2946  0.2910  RMEP  0.3315  0.3059  0.2985  0.2963  0.2930 
RMDIP  0.3236  0.3070  0.3010  0.2946  0.2910  RMDIP  0.3314  0.3060  0.2986  0.2965  0.2931 
REMLP  0.3425  0.3264  0.3146  0.3067  0.3011  REMLP  0.3488  0.3352  0.3202  0.3101  0.3048 
MKDIPE  0.3221  0.3070  0.3010  0.2949  0.2910  MKDIPE  0.3313  0.3056  0.2982  0.2962  0.2929 
MKDIPD  0.3232  0.3070  0.3010  0.2952  0.2910  MKDIPD  0.3315  0.3061  0.2986  0.2965  0.2931 
MKDIPR  0.3149  0.3028  0.2985  0.2943  0.2907  MKDIPR  0.3205  0.3041  0.2969  0.2947  0.2919 
(c) c=0.6, optimal precision factor, Err_{Bayes}=0.31  (d) c=0.6, estimated precision factor, Err_{Bayes}=0.31
Method/ n  30  60  90  120  150  Method/ n  30  60  90  120  150 
Hist  0.3622  0.3608  0.3624  0.3641  0.3652  Hist  0.3622  0.3608  0.3624  0.3641  0.3652 
CART  0.3554  0.3556  0.3507  0.3510  0.3447  CART  0.3554  0.3556  0.3507  0.3510  0.3447 
RF  0.3524  0.3514  0.3467  0.3476  0.3420  RF  0.3524  0.3514  0.3467  0.3476  0.3420 
SVM  0.3735  0.3684  0.3615  0.3602  0.3544  SVM  0.3735  0.3684  0.3615  0.3602  0.3544 
Jeffreys’  0.3620  0.3559  0.3519  0.3502  0.3472  Jeffreys’  0.3620  0.3559  0.3519  0.3502  0.3472 
RMEP  0.3415  0.3385  0.3394  0.3390  0.3386  RMEP  0.3528  0.3415  0.3407  0.3388  0.3378 
RMDIP  0.3415  0.3383  0.3394  0.3390  0.3386  RMDIP  0.3529  0.3415  0.3408  0.3388  0.3378 
REMLP  0.3666  0.3625  0.3587  0.3558  0.3530  REMLP  0.3700  0.3650  0.3603  0.3578  0.3546 
MKDIPE  0.3415  0.3384  0.3394  0.3390  0.3386  MKDIPE  0.3525  0.3413  0.3405  0.3387  0.3377 
MKDIPD  0.3415  0.3386  0.3394  0.3390  0.3386  MKDIPD  0.3532  0.3418  0.3409  0.3389  0.3379 
MKDIPR  0.3437  0.3409  0.3404  0.3401  0.3389  MKDIPR  0.3486  0.3416  0.3416  0.3402  0.3387 
Table 5 Expected true error of different classification rules for the TP53 network. The constructed priors are considered using two precision factors: optimal precision factor (left) and estimated precision factor (right), with c=0.5 and c=0.6, where the minimum achievable error (Bayes error) is denoted by Err_{Bayes}
(a) c=0.5, optimal precision factor, Err_{Bayes}=0.3146  (b) c=0.5, estimated precision factor, Err_{Bayes}=0.3146
Method/ n  15  30  45  60  75  Method/ n  15  30  45  60  75 
Hist  0.3586  0.3439  0.3337  0.3321  0.3296  Hist  0.3586  0.3439  0.3337  0.3321  0.3296 
CART  0.3633  0.3492  0.3350  0.3314  0.3295  CART  0.3633  0.3492  0.3350  0.3314  0.3295 
RF  0.3791  0.3574  0.3461  0.3400  0.3362  RF  0.3791  0.3574  0.3461  0.3400  0.3362 
SVM  0.3902  0.3481  0.3433  0.3324  0.3322  SVM  0.3902  0.3481  0.3433  0.3324  0.3322 
Jeffreys’  0.3809  0.3439  0.3457  0.3321  0.3334  Jeffreys’  0.3809  0.3439  0.3457  0.3321  0.3334 
RMEP  0.3399  0.3392  0.3360  0.3315  0.3328  RMEP  0.3791  0.3489  0.3377  0.3329  0.3302 
RMDIP  0.3399  0.3392  0.3360  0.3315  0.3328  RMDIP  0.3789  0.3490  0.3378  0.3329  0.3302 
REMLP  0.3405  0.3340  0.3320  0.3292  0.3287  REMLP  0.3417  0.3372  0.3350  0.3318  0.3292 
MKDIPE  0.3397  0.3398  0.3351  0.3306  0.3297  MKDIPE  0.3675  0.3470  0.3373  0.3326  0.3298 
MKDIPD  0.3397  0.3398  0.3347  0.3306  0.3297  MKDIPD  0.3668  0.3472  0.3374  0.3327  0.3298 
MKDIPR  0.3435  0.3354  0.3321  0.3295  0.3283  MKDIPR  0.3471  0.3402  0.3349  0.3316  0.3287 
(c) c=0.6, optimal precision factor, Err_{Bayes}=0.2691  (d) c=0.6, estimated precision factor, Err_{Bayes}=0.2691
Method/ n  15  30  45  60  75  Method/ n  15  30  45  60  75 
Hist  0.3081  0.2965  0.2906  0.2883  0.2846  Hist  0.3081  0.2965  0.2906  0.2883  0.2846 
CART  0.3173  0.2988  0.2882  0.2846  0.2796  CART  0.3173  0.2988  0.2882  0.2846  0.2796 
RF  0.3333  0.3035  0.2946  0.2850  0.2842  RF  0.3333  0.3035  0.2946  0.2850  0.2842 
SVM  0.3322  0.3091  0.2991  0.2926  0.2857  SVM  0.3322  0.3091  0.2991  0.2926  0.2857 
Jeffreys’  0.3105  0.2936  0.2860  0.2828  0.2819  Jeffreys’  0.3105  0.2936  0.2860  0.2828  0.2819 
RMEP  0.2924  0.2922  0.2847  0.2843  0.2835  RMEP  0.3346  0.3024  0.2894  0.2860  0.2823 
RMDIP  0.2924  0.2922  0.2847  0.2843  0.2835  RMDIP  0.3344  0.3023  0.2895  0.2858  0.2823 
REMLP  0.3003  0.2908  0.2869  0.2839  0.2832  REMLP  0.3054  0.2930  0.2910  0.2870  0.2850 
MKDIPE  0.2924  0.2909  0.2837  0.2851  0.2837  MKDIPE  0.3341  0.3025  0.2898  0.2864  0.2822 
MKDIPD  0.2924  0.2909  0.2837  0.2851  0.2837  MKDIPD  0.3347  0.3024  0.2898  0.2862  0.2822 
MKDIPR  0.3032  0.2917  0.2868  0.2843  0.2825  MKDIPR  0.3096  0.2981  0.2910  0.2869  0.2849 
Table 6 Expected difference between the true model (for the mammalian cell-cycle network) and estimated posterior probability masses. Optimal precision factor (left) and estimated precision factor (right), with c=0.5 and c=0.6
(a) c=0.5, optimal precision factor  (b) c=0.5, estimated precision factor  
Method/ n  30  60  90  120  150  Method/ n  30  60  90  120  150 
Jeffreys’  0.2155  0.1578  0.1300  0.1134  0.1010  Jeffreys’  0.2155  0.1578  0.1300  0.1134  0.1010 
RMEP  0.1591  0.1293  0.1126  0.1020  0.0912  RMEP  0.1761  0.1381  0.1177  0.1032  0.0943 
RMDIP  0.1591  0.1294  0.1126  0.1020  0.0912  RMDIP  0.1761  0.1381  0.1177  0.1032  0.0943 
REMLP  0.1863  0.1436  0.1225  0.1088  0.0970  REMLP  0.2060  0.1607  0.1315  0.1120  0.1019 
MKDIPE  0.1589  0.1293  0.1126  0.1019  0.0911  MKDIPE  0.1760  0.1381  0.1177  0.1031  0.0943 
MKDIPD  0.1591  0.1293  0.1126  0.1020  0.0912  MKDIPD  0.1761  0.1381  0.1177  0.1032  0.0943 
MKDIPR  0.1563  0.1283  0.1118  0.1012  0.0907  MKDIPR  0.1742  0.1392  0.1184  0.1036  0.0949 
(c) c=0.6, optimal precision factor  (d) c=0.6, estimated precision factor  
Method/ n  30  60  90  120  150  Method/ n  30  60  90  120  150 
Jeffreys’  0.2183  0.1595  0.1322  0.1146  0.1027  Jeffreys’  0.2183  0.1595  0.1322  0.1146  0.1027 
RMEP  0.1628  0.1332  0.1154  0.1039  0.0946  RMEP  0.1805  0.1408  0.1201  0.1061  0.0961 
RMDIP  0.1628  0.1333  0.1154  0.1039  0.0947  RMDIP  0.1805  0.1408  0.1201  0.1061  0.0961 
REMLP  0.1867  0.1471  0.1247  0.1101  0.0990  REMLP  0.2065  0.1635  0.1346  0.1166  0.1036 
MKDIPE  0.1627  0.1332  0.1154  0.1038  0.0946  MKDIPE  0.1804  0.1408  0.1200  0.1061  0.0961 
MKDIPD  0.1628  0.1332  0.1154  0.1039  0.0946  MKDIPD  0.1805  0.1408  0.1201  0.1061  0.0961 
MKDIPR  0.1598  0.1317  0.1144  0.1032  0.0940  MKDIPR  0.1814  0.1421  0.1207  0.1065  0.0965 
Table 7 Expected difference between the true model (for the TP53 network) and estimated posterior probability masses. Optimal precision factor (left) and estimated precision factor (right), with c=0.5 and c=0.6
(a) c=0.5, optimal precision factor  (b) c=0.5, estimated precision factor  
Method/ n  15  30  45  60  75  Method/ n  15  30  45  60  75 
Jeffreys’  0.2285  0.1716  0.1429  0.1242  0.1114  Jeffreys’  0.2285  0.1716  0.1429  0.1242  0.1114 
RMEP  0.1427  0.1165  0.1051  0.0934  0.0880  RMEP  0.2218  0.1578  0.1280  0.1095  0.0981 
RMDIP  0.1424  0.1163  0.1048  0.0932  0.0878  RMDIP  0.2217  0.1575  0.1281  0.1094  0.0981 
REMLP  0.1698  0.1337  0.1199  0.1091  0.0985  REMLP  0.1845  0.1505  0.1366  0.1235  0.1133 
MKDIPE  0.1412  0.1161  0.1050  0.0933  0.0880  MKDIPE  0.2149  0.1565  0.1282  0.1096  0.0981 
MKDIPD  0.1407  0.1158  0.1047  0.0931  0.0878  MKDIPD  0.2149  0.1564  0.1281  0.1096  0.0981 
MKDIPR  0.1564  0.1247  0.1118  0.1031  0.0930  MKDIPR  0.1733  0.1410  0.1281  0.1171  0.1082 
(c) c=0.6, optimal precision factor  (d) c=0.6, estimated precision factor  
Method/ n  15  30  45  60  75  Method/ n  15  30  45  60  75 
Jeffreys’  0.2319  0.1723  0.1438  0.1262  0.1137  Jeffreys’  0.2319  0.1723  0.1438  0.1262  0.1137 
RMEP  0.1476  0.1222  0.1090  0.0987  0.0923  RMEP  0.2182  0.1599  0.1304  0.1144  0.1032 
RMDIP  0.1474  0.1220  0.1087  0.0985  0.0921  RMDIP  0.2179  0.1597  0.1303  0.1144  0.1031 
REMLP  0.1751  0.1332  0.1192  0.1077  0.0980  REMLP  0.1937  0.1522  0.1363  0.1235  0.1144 
MKDIPE  0.1457  0.1215  0.1086  0.0985  0.0922  MKDIPE  0.2165  0.1586  0.1304  0.1147  0.1036 
MKDIPD  0.1452  0.1211  0.1084  0.0983  0.0920  MKDIPD  0.2164  0.1585  0.1303  0.1147  0.1035 
MKDIPR  0.1574  0.1217  0.1093  0.1010  0.0926  MKDIPR  0.1758  0.1418  0.1274  0.1158  0.1086 
Performance comparison in mixture setup
Table 8 Expected errors of different Bayesian classification rules in the mixture model for the mammalian cell-cycle network. Expected true error (left) and expected error on unlabeled training data (right), with c _{0}=0.6
Method/ n  30  60  90  120  150  Method/ n  30  60  90  120  150 

PDCOTP  0.3216  0.3246  0.3280  0.3309  0.3334  PDCOTP  0.3236  0.3270  0.3314  0.3355  0.3339 
Jeffreys’  0.4709  0.4743  0.4704  0.4675  0.4654  Jeffreys’  0.4751  0.4621  0.4681  0.4700  0.4645 
RMEP  0.3417  0.3340  0.3307  0.3300  0.3299  RMEP  0.3447  0.3409  0.3366  0.3323  0.3316 
RMDIP  0.3408  0.3336  0.3300  0.3305  0.3301  RMDIP  0.3442  0.3404  0.3342  0.3344  0.3343 
REMLP  0.3754  0.3835  0.3882  0.3857  0.3844  REMLP  0.3748  0.3821  0.3908  0.3826  0.3812 
MKDIPE  0.3411  0.3341  0.3297  0.3297  0.3306  MKDIPE  0.3457  0.3386  0.3351  0.3312  0.3320 
MKDIPD  0.3407  0.3330  0.3306  0.3304  0.3303  MKDIPD  0.3482  0.3387  0.3381  0.3342  0.3334 
MKDIPR  0.3457  0.3342  0.3299  0.3286  0.3289  MKDIPR  0.3449  0.3343  0.3330  0.3306  0.3275 
Table 9 Expected errors of different Bayesian classification rules in the mixture model for the TP53 network. Expected true error (left) and expected error on unlabeled training data (right), with c _{0}=0.6
Method/ n  15  30  45  60  75  Method/ n  15  30  45  60  75 

PDCOTP  0.2746  0.2824  0.2829  0.2996  0.2960  PDCOTP  0.2762  0.2818  0.2900  0.3027  0.2900 
Jeffreys’  0.4204  0.4324  0.4335  0.4432  0.4361  Jeffreys’  0.4220  0.4314  0.4381  0.4419  0.4348 
RMEP  0.3274  0.3204  0.3327  0.3402  0.3422  RMEP  0.3471  0.3350  0.3487  0.3543  0.3529 
RMDIP  0.3297  0.3260  0.3327  0.3406  0.3432  RMDIP  0.3504  0.3423  0.3496  0.3551  0.3545 
REMLP  0.3637  0.3687  0.3706  0.3658  0.3653  REMLP  0.3489  0.3579  0.3709  0.3593  0.3556 
MKDIPE  0.3312  0.3246  0.3322  0.3428  0.3386  MKDIPE  0.3502  0.3378  0.3486  0.3585  0.3492 
MKDIPD  0.3321  0.3204  0.3306  0.3436  0.3366  MKDIPD  0.3551  0.3329  0.3473  0.3570  0.3475 
MKDIPR  0.3872  0.3749  0.3667  0.3607  0.3586  MKDIPR  0.3613  0.3583  0.3589  0.3539  0.3462 
For each sample point, the label y is first generated from a Bernoulli distribution with success probability c _{1}, and the bin observation is then generated, given the label, from the corresponding class-conditional SSD (the class-conditional bin probability vector p ^{ y }). That is, the bin observation is a sample from a categorical distribution with parameter vector p ^{ y }, but the label is hidden from the inference chain and from classifier training. For each run, n sample points are generated and fed into the Gibbs inference chain with the priors produced by the different prior construction methods. The OBC is then calculated based on Eq. (9). For each sample size, 400 Monte Carlo repetitions are performed to calculate the expected true error and the error of classifying the unlabeled observed data used for the inference itself.
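The data-generation step and one sweep of a Gibbs chain can be sketched as follows. This is a textbook Gibbs sampler for a two-class multinomial mixture with Dirichlet priors, offered as an assumption about the general shape of the inference; it is not the exact chain used in the paper. The function names, the flat hyperparameters `alpha` and `phi`, and the bin probabilities are hypothetical.

```python
import numpy as np


def sample_mixture(rng, n, c1, p0, p1):
    """Draw n points: y ~ Bernoulli(c1), then bin ~ Categorical(p^y)."""
    p = np.vstack([p0, p1])
    y = (rng.random(n) < c1).astype(int)
    x = np.array([rng.choice(p.shape[1], p=p[label]) for label in y], dtype=int)
    return x, y  # y is kept only for evaluation; inference sees x alone


def gibbs_sweep(rng, x, z, alpha, phi):
    """One Gibbs sweep for a two-class multinomial mixture.

    x: observed bins; z: current label guesses;
    alpha: (2, b) Dirichlet hyperparameters of the class-conditional bins;
    phi: length-2 Dirichlet hyperparameters of the mixture weights.
    """
    b = alpha.shape[1]
    # Mixture weights given the current labels.
    w = rng.dirichlet(phi + np.bincount(z, minlength=2))
    # Class-conditional bin probabilities given the current labels.
    p = np.vstack([
        rng.dirichlet(alpha[y] + np.bincount(x[z == y], minlength=b))
        for y in (0, 1)
    ])
    # Labels given weights and bin probabilities.
    joint = w[:, None] * p[:, x]          # shape (2, n)
    prob1 = joint[1] / joint.sum(axis=0)  # P(z_i = 1 | x_i, w, p)
    z_new = (rng.random(x.size) < prob1).astype(int)
    return z_new, w, p


rng = np.random.default_rng(1)
p0 = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical bin probabilities
p1 = np.array([0.1, 0.2, 0.3, 0.4])
x, y = sample_mixture(rng, 60, 0.4, p0, p1)
z = rng.integers(0, 2, size=x.size)
alpha, phi = np.ones((2, 4)), np.ones(2)  # flat placeholders, not constructed priors
for _ in range(200):
    z, w, p = gibbs_sweep(rng, x, z, alpha, phi)
```

With identical flat priors for both classes the two labels are interchangeable, so the chain can swap them; informative class-conditional priors are one way to break that symmetry.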
To have a fair comparison of the different methods’ class-conditional prior probability construction, we assume that we have a rough idea of the mixture weights (class probabilities). In practice this can come from existing population statistics. Specifically, the parameters ϕ of the Dirichlet prior distribution \(\mathcal {D}(\boldsymbol {\phi })\) over the mixture weights (class probabilities) are sampled in each repetition from a uniform distribution over a ±10% interval centered on the true mixture weight vector, and are held fixed for all the methods in that repetition. For REMLP and MKDIPR, which need labeled data in their prior construction procedure, the labels predicted using the Jeffreys’ prior are used; one fourth of the data points are used for prior construction in these two methods, and all data points are used for inference. A larger number of data points is used for prior construction in the mixture setup than in the classification setup because, in the mixture setup, data points are missing their true class labels and the initial label estimates may be inaccurate. One can use a relatively larger number of data points in prior construction and still avoid overfitting. The regularization parameters λ _{1} and λ _{2} are set as in the classification problem. Optimal precision factors are used for all prior construction methods. The results are shown in Tables 8 and 9 for the mammalian cell-cycle and TP53 models, respectively. The best performance (lowest error) for each sample size and the best performance among practical methods (all other than PDCOTP), if different, is written in bold. As can be seen from the tables, in most cases the MKDIP methods have the best performance among the practical methods. With larger sample sizes, MKDIPR even outperforms PDCOTP in the mammalian cell-cycle system.
Performance comparison on a real data set
Regulating functions corresponding to the signaling pathways in Fig. 4. In the Boolean functions {AND, OR, NOT}={∧,∨,−}
Gene  Node name  Boolean regulating function 

EGFR  v _{1}   
PIK3CA  v _{2}  v _{1}∨v _{4} 
AKT  v _{3}  v _{2} 
KRAS  v _{4}   
RAF1  v _{5}  \(v_{4} \wedge \overline {v_{3}}\) 
BAD  v _{6}  \(\overline {v_{3}}\) 
P53  v _{7}   
BCL2  v _{8}  \(\overline {v_{6}} \vee \overline {v_{7}}\) 
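To illustrate how constraints can be read off regulating functions such as those in the table above, the following sketch encodes the Boolean functions and enumerates, for a given target node, the setups of its observed regulators that fix the output regardless of the unobserved ones. This mirrors the reasoning described earlier for CycE and v _{5}, but the helper `determined_setups` and the choice of which regulators are treated as observed are hypothetical and are not taken from the paper's experiments.

```python
from itertools import product

# Boolean regulating functions from the table; nodes without a function
# (v1, v4, v7) are inputs and are therefore omitted.
REGULATING_FUNCTIONS = {
    "v2": (("v1", "v4"), lambda v1, v4: v1 or v4),
    "v3": (("v2",), lambda v2: v2),
    "v5": (("v4", "v3"), lambda v4, v3: v4 and not v3),
    "v6": (("v3",), lambda v3: not v3),
    "v8": (("v6", "v7"), lambda v6, v7: (not v6) or (not v7)),
}


def determined_setups(target, observed):
    """Setups of the observed regulators that determine the target's value.

    Returns a list of (setup, value) pairs, where setup maps each observed
    regulator to 0 or 1 and value is the forced output of the target.
    """
    regulators, func = REGULATING_FUNCTIONS[target]
    seen = [r for r in regulators if r in observed]
    hidden = [r for r in regulators if r not in observed]
    results = []
    for seen_vals in product((0, 1), repeat=len(seen)):
        fixed = dict(zip(seen, seen_vals))
        outputs = set()
        # Try every assignment of the unobserved regulators.
        for hidden_vals in product((0, 1), repeat=len(hidden)):
            assignment = {**fixed, **dict(zip(hidden, hidden_vals))}
            outputs.add(int(bool(func(*(assignment[r] for r in regulators)))))
        if len(outputs) == 1:
            results.append((fixed, outputs.pop()))
    return results


# Hypothetical scenario: v3 (AKT) is not observed, v4 (KRAS) is.
print(determined_setups("v5", observed={"v4"}))
```

Each returned pair corresponds to one candidate conditional probability constraint; for instance, if only v _{4} were observed among the regulators of v _{5}, the setup v _{4}=0 forces v _{5}=0, while v _{4}=1 leaves v _{5} undetermined and yields no constraint.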
Expected error of different classification rules calculated on a real dataset. The classification is between LUA (class 0) and LUS (class 1), with c=0.57
Method/ n  34  74  114  134  174 

Best Non Bayesian  0.1764  0.1574  0.1473  0.1426  0.1371 
Jeffreys’  0.1766  0.1574  0.1476  0.1425  0.1371 
Best RM  0.1426  0.1289  0.1164  0.1083  0.1000 
Best MKDIP  0.1401  0.1273  0.1162  0.1075  0.0998 
The best performing rule for each sample size is written in bold. As can be seen from the table, OBC with MKDIP prior construction methods has the best performance among the classification rules. It is also clear that the classification performance can be significantly improved when pathway prior knowledge is integrated for constructing prior probabilities, especially when the sample size is small.
Implementation remarks
The results presented in this paper are based on Monte Carlo simulations, in which thousands of optimization problems are solved for each sample size in each problem. For this reason, the regularization parameters and the number of sample points used in prior construction are preselected for each problem; in a specific application, one can use cross-validation to set these parameters. It has been shown in [36] that, when the precision factors are assumed to be greater than 1 (\(\alpha _{0}^{y}>1, y\in \{0,1\}\)), all three objective functions used are convex for the class of Dirichlet prior probabilities with multinomial likelihood functions. Unfortunately, we cannot guarantee the convexity of the feasible space because of the convolved constraints. We have therefore employed algorithms for nonconvex optimization problems, and there is no guarantee of convergence to the global optimum. The optimization framework of the prior construction is solved with the interior-point algorithm for nonlinear constrained optimization [67, 68], as implemented in the fmincon function in MATLAB. Because our interest is in classification problems with small training sample sizes (which is often the case in bioinformatics), and because of the cost of the Monte Carlo simulations, we have only shown performance results on small networks with a few genes. In practice, there would be no problem using the proposed method for larger networks, since only a single one-time analysis would then be required. One should also note that with small sample sizes, feature selection is needed to keep the number of features small. In the experiments in this paper, feature selection is done automatically by focusing on the most relevant network according to biological prior knowledge.
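For readers working outside MATLAB, the flavor of such a slack-regularized, nonlinearly constrained problem can be conveyed with a toy sketch in Python. Everything here is an assumption made for illustration: SciPy's SLSQP solver stands in for the interior-point algorithm of fmincon, the objective is a plain negative entropy of the bin probabilities plus a penalty `lam` on the slack (a stand-in, not one of the paper's three objectives), and the single constraint and bin indices are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def solve_toy_prior(lam, n_bins=4, both=(1,), cond=(1, 3)):
    """Maximum-entropy bin probabilities q with one slack-relaxed constraint.

    The decision vector is [q_0, ..., q_{b-1}, eps]. The constraint
    sum(q[both]) >= (1 - eps) * sum(q[cond]) plays the role of a
    conditional probability bound P(A | B) >= 1 - eps.
    """
    both, cond = list(both), list(cond)

    def objective(v):
        q = np.clip(v[:-1], 1e-12, None)  # guard the logarithm
        return float(np.sum(q * np.log(q)) + lam * v[-1])

    constraints = [
        {"type": "eq", "fun": lambda v: np.sum(v[:-1]) - 1.0},
        {"type": "ineq",
         "fun": lambda v: np.sum(v[both]) - (1.0 - v[-1]) * np.sum(v[cond])},
    ]
    bounds = [(1e-9, 1.0)] * n_bins + [(0.0, 0.999)]   # slack kept below 1
    start = np.append(np.full(n_bins, 1.0 / n_bins), 0.5)  # feasible start
    res = minimize(objective, start, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x[:-1], float(res.x[-1]), res.success


q, eps, ok = solve_toy_prior(lam=5.0)
print(np.round(q, 4), round(eps, 4), ok)
```

Because the constraint is bilinear in the probabilities and the slack, the feasible set is not convex, so a local solver of this kind returns a local solution only, which matches the caveat stated above. Raising `lam` pushes the slack toward zero at the cost of entropy.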
Conclusion
Bayesian methods have shown promising performance in classification problems in the presence of uncertainty and small sample sizes, which often occur in translational genomics problems. The impediment to using these methods is prior construction that integrates existing prior biological knowledge. In this paper we have proposed a knowledge-driven prior construction method with a general framework for mapping prior biological knowledge into a set of constraints. Knowledge can come from biological signaling pathways and other population studies, and can be translated into constraints over conditional probabilities. This general scheme includes the previous approaches to using biological prior knowledge in prior construction. Here, the superior performance of this general scheme is shown on two important pathway families, the mammalian cell-cycle pathway and the pathway centering around TP53. In addition, prior construction and the OBC are extended to a mixture model, where data sets have missing labels. Moreover, comparisons on a publicly available gene expression dataset show that classification performance can be significantly improved for small sample sizes when corresponding pathway prior knowledge is integrated for constructing prior probabilities.
Declarations
Acknowledgements
Not applicable.
Funding
This work was funded in part by Award CCF1553281 from the National Science Foundation, and a DMREF grant from the National Science Foundation, award number 1534534. The publication cost of this article was funded by Award CCF1553281 from the National Science Foundation.
Availability of data and materials
The publicly available real datasets analyzed during the current study have been generated by the TCGA Research Network https://cancergenome.nih.gov/, and have been procured from http://www.cbioportal.org/.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 18 Supplement 14, 2017: Proceedings of the 14th Annual MCBIOS conference. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume18supplement14.
Authors’ contributions
SB developed the mixture-model modeling and the extraction of knowledge from pathways and regulating functions, performed the experiments, and wrote the first draft. MSE structured the prior knowledge by integrating his previous prior methods into this new framework. XQ, in conjunction with ERD, proposed the new general prior structure and proofread and edited the manuscript. ERD oversaw the project, proposed the new general prior structure in conjunction with XQ, wrote the OBC section, and proofread and edited the manuscript. All authors have read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
 Dougherty ER, Zollanvari A, Braga-Neto UM. The illusion of distribution-free small-sample classification in genomics. Current Genomics. 2011; 12(5):333.
 Dougherty ER, Dalton LA. Scientific knowledge is possible with small-sample classification. EURASIP J Bioinforma Syst Biol. 2013; 2013(1):1–12.
 Jaynes ET. What is the question? In: Bernardo JM, DeGroot MH, Lindley DV, Smith AFM, editors. Bayesian Stat. Valencia: Valencia Univ. Press; 1980. p. 618–629.
 Jeffreys H. An invariant form for the prior probability in estimation problems. Proc Royal Soc London Ser A Math Phys Sci. 1946; 186(1007):453–61.
 Zellner A. Past and Recent Results on Maximal Data Information Priors. Working paper series in economics and econometrics. University of Chicago, Graduate School of Business, Department of Economics, Chicago. 1995.
 Rissanen J. A universal prior for integers and estimation by minimum description length. Ann Stat. 1983; 11(2):416–31.
 Rodríguez CC. Entropic priors. Albany: Department of Mathematics and Statistics, State University of New York; 1991.
 Berger JO, Bernardo JM. On the development of reference priors. Bayesian Stat. 1992; 4(4):35–60.
 Spall JC, Hill SD. Least-informative Bayesian prior distributions for finite samples based on information theory. Autom Control IEEE Trans. 1990; 35(5):580–3.
 Bernardo JM. Reference posterior distributions for Bayesian inference. J Royal Stat Soc Ser B Methodol. 1979; 41(2):113–147.
 Kass RE, Wasserman L. The selection of prior distributions by formal rules. J Am Stat Assoc. 1996; 91(435):1343–1370.
 Berger JO, Bernardo JM, Sun D. Objective priors for discrete parameter spaces. J Am Stat Assoc. 2012; 107(498):636–48.
 Jaynes ET. Information theory and statistical mechanics. Physical Rev. 1957; 106(4):620.
 Jaynes ET. Prior probabilities. Syst Sci Cybern IEEE Trans. 1968; 4(3):227–41.
 Zellner A. Models, prior information, and Bayesian analysis. J Econ. 1996; 75(1):51–68.
 Burg JP, Luenberger DG, Wenger DL. Estimation of structured covariance matrices. Proc IEEE. 1982; 70(9):963–74.
 Werner K, Jansson M, Stoica P. On estimation of covariance matrices with Kronecker product structure. Signal Proc IEEE Trans. 2008; 56(2):478–91.
 Wiesel A, Hero AO. Distributed covariance estimation in Gaussian graphical models. Signal Proc IEEE Trans. 2011; 60(1):211–220.
 Wiesel A, Eldar YC, Hero AO. Covariance estimation in decomposable Gaussian graphical models. Signal Process IEEE Trans. 2010; 58(3):1482–1492.
 Breslin T, Krogh M, Peterson C, Troein C. Signal transduction pathway profiling of individual tumor samples. BMC Bioinforma. 2005; 6(1):163.
 Zhu Y, Shen X, Pan W. Network-based support vector machine for classification of microarray samples. BMC Bioinforma. 2009; 10(1):21.
 Svensson JP, Stalpers LJ, Esveldt–van Lange RE, Franken NA, Haveman J, Klein B, Turesson I, Vrieling H, Giphart-Gassler M. Analysis of gene expression using gene sets discriminates cancer patients with and without late radiation toxicity. PLoS Med. 2006; 3(10):422.
 Lee E, Chuang HY, Kim JW, Ideker T, Lee D. Inferring pathway activity toward precise disease classification. PLoS Comput Biol. 2008; 4(11):1000217.
 Su J, Yoon BJ, Dougherty ER. Accurate and reliable cancer classification based on probabilistic inference of pathway activity. PLoS ONE. 2009; 4(12):8161.
 Eo HS, Heo JY, Choi Y, Hwang Y, Choi HS. A pathway-based classification of breast cancer integrating data on differentially expressed genes, copy number variations and microRNA target genes. Mol Cells. 2012; 34(4):393–8.
 Wen Z, Liu ZP, Yan Y, Piao G, Liu Z, Wu J, Chen L. Identifying responsive modules by mathematical programming: An application to budding yeast cell cycle. PLoS ONE. 2012; 7(7):41854.
 Kim S, Kon M, DeLisi C, et al. Pathway-based classification of cancer subtypes. Biology Direct. 2012; 7(1):1–22.
 Khunlertgit N, Yoon BJ. Identification of robust pathway markers for cancer through rank-based pathway activity inference. Advances Bioinforma. 2013; Article ID 618461:8.
 Wei P, Pan W. Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinforma. 2007; 24(3):404–11.
 Wei P, Pan W. Network-based genomic discovery: application and comparison of Markov random-field models. J Royal Stat Soc Ser C Appl Stat. 2010; 59(1):105–25.
 Wei P, Pan W. Bayesian joint modeling of multiple gene networks and diverse genomic data to identify target genes of a transcription factor. Annals Appl Stat. 2012; 6(1):334–55.
 Gatza ML, Lucas JE, Barry WT, Kim JW, Wang Q, Crawford MD, Datto MB, Kelley M, Mathey-Prevot B, Potti A, et al. A pathway-based classification of human breast cancer. Proc Natl Acad Sci. 2010; 107(15):6994–999.
 Nevins JR. Pathway-based classification of lung cancer: a strategy to guide therapeutic selection. Proc Am Thoracic Soc. 2011; 8(2):180.
 Wen Z, Liu ZP, Liu Z, Zhang Y, Chen L. An integrated approach to identify causal network modules of complex diseases with application to colorectal cancer. J Am Med Inform Assoc. 2013; 20(4):659–67.
 Esfahani MS, Dougherty ER. Incorporation of biological pathway knowledge in the construction of priors for optimal Bayesian classification. IEEE/ACM Trans Comput Biol Bioinforma. 2014; 11(1):202–18.
 Esfahani MS, Dougherty ER. An optimization-based framework for the transformation of incomplete biological knowledge into a probabilistic structure and its application to the utilization of gene/protein signaling pathways in discrete phenotype classification. IEEE/ACM Trans Comput Biol Bioinforma. 2015; 12(6):1304–1321.
 Boluki S, Esfahani MS, Qian X, Dougherty ER. Constructing pathway-based priors within a Gaussian mixture model for Bayesian regression and classification. IEEE/ACM Trans Comput Biol Bioinforma. 2017. In press.
 Guiasu S, Shenitzer A. The principle of maximum entropy. Math Intell. 1985; 7(1):42–8.
 Hua J, Sima C, Cypert M, Gooden GC, Shack S, Alla L, Smith EA, Trent JM, Dougherty ER, Bittner ML. Tracking transcriptional activities with high-content epifluorescent imaging. J Biomed Opt. 2012; 17(4):0460081–04600815.
 Dalton LA, Dougherty ER. Optimal classifiers with minimum expected error within a Bayesian framework–part I: Discrete and Gaussian models. Pattern Recog. 2013; 46(5):1301–1314.
 Dalton LA, Dougherty ER. Optimal classifiers with minimum expected error within a Bayesian framework–part II: Properties and performance analysis. Pattern Recog. 2013; 46(5):1288–1300.
 Dalton LA, Dougherty ER. Bayesian minimum mean-square error estimation for classification error–part I: Definition and the Bayesian MMSE error estimator for discrete classification. Signal Process IEEE Trans. 2011; 59(1):115–29.
 MacKay DJC. Introduction to Monte Carlo methods. In: Jordan MI, editor. Learning in Graphical Models. NATO Science Series. Dordrecht: Kluwer Academic Press; 1998. p. 175–204.
 Casella G, George EI. Explaining the Gibbs sampler. Am Stat. 1992; 46(3):167–74.
 Robert CP, Casella G. Monte Carlo Statistical Methods. New York: Springer; 2004.
 Zellner A. Maximal Data Information Prior Distributions, Basic Issues in Econometrics. Chicago: The University of Chicago Press; 1984.
 Ebrahimi N, Maasoumi E, Soofi ES. Measuring Informativeness of Data by Entropy and Variance. In: Slottje DJ, editor. Heidelberg: Physica-Verlag HD; 1999. p. 61–77.
 Dougherty ER, Brun M, Trent JM, Bittner ML. Conditioning-based modeling of contextual genomic regulation. Comput Biol Bioinforma IEEE/ACM Trans. 2009; 6(2):310–20.
 Kauffman SA. Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol. 1969; 22(3):437–67.
 Shmulevich I, Dougherty ER, Kim S, Zhang W. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinforma. 2002; 18(2):261.
 Fauré A, Naldi A, Chaouiya C, Thieffry D. Dynamical analysis of a generic Boolean model for the control of the mammalian cell cycle. Bioinformatics. 2006; 22(14):124.
 Weinberg R. The Biology of Cancer. New York: Garland Science; 2013.
 Esfahani MS, Yoon BJ, Dougherty ER. Probabilistic reconstruction of the tumor progression process in gene regulatory networks in the presence of uncertainty. BMC Bioinformatics. 2011; 12(10):9.
 Layek RK, Datta A, Dougherty ER. From biological pathways to regulatory networks. Mol BioSyst. 2011; 7:843–51.
 Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. Boca Raton: Chapman & Hall/CRC; 1984.
 Breiman L. Random forests. Machine Learning. 2001; 45(1):5–32.
 Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995; 20(3):273–97.
 Kecman V. Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models. Cambridge: MIT Press; 2001.
 American Cancer Society. Cancer Facts and Figures 2017. Atlanta: American Cancer Society; 2017.
 Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, Sun Y, Jacobsen A, Sinha R, Larsson E, Cerami E, Sander C, Schultz N. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling. 2013; 6(269):1–1.
 Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, Jacobsen A, Byrne CJ, Heuer ML, Larsson E, Antipin Y, Reva B, Goldberg AP, Sander C, Schultz N. The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2012; 2(5):401–4.PubMedView ArticleGoogle Scholar
 Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al. Mutational heterogeneity in cancer and the search for new cancerassociated genes. Nature. 2013; 499(7457):214–8.PubMedPubMed CentralView ArticleGoogle Scholar
 West L, Vidwans SJ, Campbell NP, Shrager J, Simon GR, Bueno R, Dennis PA, Otterson GA, Salgia R. A novel classification of lung cancer into molecular subtypes. PLOS ONE. 2012; 7(2):1–11.Google Scholar
 Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000; 28(1):27–30.PubMedPubMed CentralView ArticleGoogle Scholar
 Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016; 44(D1):457–62.View ArticleGoogle Scholar
 LortetTieulent J, Soerjomataram I, Ferlay J, Rutherford M, Weiderpass E, Bray F. International trends in lung cancer incidence by histological subtype: Adenocarcinoma stabilizing in men but still increasing in women. Lung Cancer. 2014; 84(1):13–22.PubMedView ArticleGoogle Scholar
 Waltz RA, Morales JL, Nocedal J, Orban D. An interior algorithm for nonlinear optimization that combines line search and trust region steps. Math Program. 2006; 107(3):391–408.View ArticleGoogle Scholar
 Byrd RH, Hribar ME, Nocedal J. An interior point algorithm for largescale nonlinear programming. SIAM J Optim. 1999; 9(4):877–900.View ArticleGoogle Scholar