Notation
Boldface lower case letters represent column vectors. Occasionally, concatenation of several vectors is also shown by boldface lower case letters. For a vector a, a
_{0} represents the summation of all the elements and a
_{
i
} denotes its ith element. Probability sample spaces are shown by calligraphic uppercase letters. Uppercase letters are for sets and random variables (vectors). Probability measure over the random variable (vector) X is denoted by P(X), whether it be a probability density function or a probability mass function. E
_{
X
}[f(X)] represents the expectation of f(X) with respect to X. P(xy) denotes the conditional probability P(X=xY=y). θ represents generic parameters of a probability measure, for instance P(XY;θ) (or P
_{
θ
}(XY)) is the conditional probability parameterized by θ. γ represents generic hyperparameter vectors. π(θ;γ) is the probability measure over the parameters θ governed by hyperparameters γ, the parameters themselves governing another probability measure over some random variables. Throughout the paper, the terms “pathway” and “network” are used interchangeably. Also, the terms “feature”’ and “variable” are used interchangeably. \(\mathcal {M}ult(\boldsymbol {p};n)\) and \(\mathcal {D}(\boldsymbol {\alpha })\) represent a multinomial distribution with vector parameter p and n samples, and a Dirichlet distribution with vector α, respectively.
Review of optimal Bayesian classification
Binary classification involves a feature vector X=(X
_{1},X
_{2},…,X
_{
d
})^{T}∈ℜ^{d} composed of random variables (features), a binary random variable (label) Y and a classifier
ψ(X) to predict Y. The error is ε[ψ]=P(ψ(X)≠Y). An optimal classifier, ψ
_{bay}, called a Bayes classifier, has minimal error, called the Bayes error, among all possible classifiers. The underlying probability model for classification is the joint featurelabel distribution. It determines the class prior probabilities c
_{0}=c=P(Y=0) and c
_{1}=1−c=P(Y=1), and the classconditional densities f
_{0}(x)=P(xY=0) and f
_{1}(x)=P(xY=1). A Bayes classifier is given by
$$ \psi_{\text{bay}}(\mathbf{x})\,=\, \left\{\begin{array}{ll} 1, & c_{1}f_{1\!}(\mathbf{x})\geq c_{0}f_{0}(\mathbf{x}), \\ 0, & \text{otherwise.} \end{array}\right. {.} $$
(1)
If the featurelabel distribution is unknown but belongs to an uncertainty class of featurelabel distributions parameterized by the vector θ∈Θ, then, given a random sample S
_{
n
}, an optimal Bayeisan classifier (OBC) minimizes the expected error over Θ:
$$ \psi_{\text{OBC}}=\arg \min_{\psi \in \mathcal{C}}E_{\pi^{\ast }(\theta)}[\varepsilon_{\boldsymbol{\theta}}[\psi ]], $$
(2)
where the expectation is relative to the posterior distribution π
^{∗}(θ) over Θ, which is derived from the prior distribution π(θ) using Bayes’ rule [40, 41]. If we let θ
_{0} and θ
_{1} denote the class 0 and class 1 parameters, then we can write θ as θ=[c,θ
_{
0
},θ
_{
1
}]. If we assume that c,θ
_{0},θ
_{1} are independent prior to observing the data, i.e. π(θ)=π(c)π(θ
_{0})π(θ
_{1}), then the independence is preserved in the posterior distribution π
^{∗}(θ)=π
^{∗}(c)π
^{∗}(θ
_{0})π
^{∗}(θ
_{1}) and the posteriors are given by \(\pi ^{\ast }(\boldsymbol {\theta }_{y})\propto \pi (\boldsymbol {\theta } _{y})\prod _{i=1}^{n_{y}}f_{\boldsymbol {\theta }_{y}}(\mathbf {x}_{i}^{y}y)\) for y=0,1, where \(f_{\boldsymbol {\theta }_{y}}(\mathbf {x}_{i}^{y}y)\) and n
_{
y
} are the classconditional density and number of sample points for class y, respectively [42].
Given a classifier ψ
_{
n
} designed from random sample S
_{
n
}, from the perspective of meansquare error, the best error estimate minimizes the MSE between its true error (a function of θ and ψ
_{
n
}) and an error estimate (a function of S
_{
n
} and ψ
_{
n
}). This Bayesian minimummeansquareerror (MMSE) estimate is given by the expected true error, \(\widehat {\varepsilon }(\psi _{n},S_{n})=\mathrm {E}_{\boldsymbol {\theta } }[\varepsilon (\psi _{n},\boldsymbol {\theta })S_{n}]\), where ε(ψ
_{
n
},θ) is the error of ψ
_{
n
} on the featurelabel distribution parameterized by θ and the expectation is taken relative to the prior distribution π(θ) [42]. The expectation given the sample is over the posterior probability. Thus, \(\widehat {\varepsilon }(\psi _{n},S_{n})=\mathrm {E}_{\pi ^{\ast }}[\varepsilon ]\).
The effective classconditional density for class y is defined by
$$ f_{\boldsymbol{\Theta}}\left(\mathbf{x}y\right) =\int_{\boldsymbol{\Theta}_{y}}f_{\boldsymbol{\theta} _{y}}\left(\mathbf{x}y\right) \pi^{\ast }\left(\boldsymbol{\theta}_{y}\right) d\boldsymbol{\theta}_{y}, $$
(3)
Θ
_{
y
} being the space for θ
_{
y
}, and an OBC is given pointwise by [40]
$$ \begin{aligned} &\psi_{\text{OBC}}\left(\mathbf{x}\right) = \\ &\left\{ \begin{array}{ll} 0 & \text{if }\ \mathrm{E}_{\pi^{\ast }}[c]f_{\boldsymbol{\Theta}}\left(\mathbf{x} 0\right) \geq (1\mathrm{E}_{\pi^{\ast }}[c])f_{\boldsymbol{\Theta}}\left(\mathbf{x} 1\right), \\ 1 & \text{otherwise}. \end{array} \right.. \end{aligned} $$
(4)
For discrete classification there is no loss in generality in assuming a single feature X taking values in the set {1,…,b} of “bins”. Classification is determined by the class 0 prior probability c and the classconditional probability mass functions p
_{
i
}=P(X=iY=0) and q
_{
i
}=P(X=iY=1), for i=1,…,b. With uncertainty, we assume beta class priors and define the parameters θ
_{0}={p
_{1},p
_{2},…,p
_{
b−1}} and θ
_{1}={q
_{1},q
_{2},…,q
_{
b−1}}. The bin probabilities must be valid. Thus, {p
_{1},p
_{2},…,p
_{
b−1}}∈Θ
_{0} if and only if 0≤p
_{
i
}≤1 for i=1,…,b−1 and \(\sum _{i=1}^{b1}p_{i} \leq 1\), in which case, \(p_{b}=1\sum _{i=1}^{b1}p_{i}\). We use the Dirichlet priors
$$ \pi (\boldsymbol{\theta}_{0})\propto \prod_{i=1}^{b}p_{i}^{\alpha_{i}^{0}1}\mathrm{\ and\ }\pi (\theta_{1})\propto \prod_{i=1}^{b}q_{i}^{\alpha_{i}^{1}1}, \mathrm{\ } $$
(5)
where \(\alpha _{i}^{y}>0\). These are conjugate priors, leading to the posteriors of the same form. The effective classconditional densities are
$$ f_{\boldsymbol{\Theta}}\left(jy\right) =\frac{U_{j}^{y}+\alpha_{j}^{y}}{n_{y}+\sum_{i=1}^{b} \alpha_{i}^{y}}, $$
(6)
for y=0,1, and the OBC is given by
$$ {\begin{aligned} \psi_{\text{OBC}}(\,j)=\left\{ \begin{array}{ll} 0,\quad\quad~\text{if }\ E_{\pi^{\ast }}[c]f_{\boldsymbol{\Theta}}\left(j0\right)\geq (1E_{\pi^{\ast }}[c])f_{\boldsymbol{\Theta}}\left(j1\right); \\ 1,\quad\quad~\text{otherwise.} \end{array}\right. \end{aligned}} $$
(7)
where \(U_{j}^{y}\) denotes the observed count for class y in bin j [40]. Hereafter, \(\sum _{i=1}^{b}\alpha _{i}^{y}\) is represented by \(\alpha _{0}^{y}\), i.e. \(\alpha _{0}^{y} = \sum _{i=1}^{b}\alpha _{i}^{y}\), and is called the precision factor. In the sequel, the sub(super)script relating to dependency on class y may be dropped; nonetheless, availability of prior knowledge for both classes is assumed.
Multinomial mixture model
In practice, data may not be labeled, due to potential tumortissue sample or stage heterogeneity, but still we want to classify a new sample point. A mixture model is a natural model for this scenario, assuming each sample point x
_{
i
} arises from a mixture of multinomial distributions:
$$ P_{\boldsymbol{\theta}}(\mathbf{x}_{i}) = \sum_{j=0}^{M1}c_{j} P_{\boldsymbol{\theta}_{j}}(\mathbf{x}_{i}), $$
(8)
where M is the number of components. When there exists two components, similar to binary classification, M=2. The conjugate prior distribution family for component probabilities (if unknown) is the Dirichlet distribution. In the mixture model, no closedform analytical posterior distribution for the parameters exists, but Markov chain Monte Carlo (MCMC) methods [43] can be employed to numerically calculate the posterior distributions. Since the conditional distributions can be calculated analytically in the multinomial mixture model, Gibbs sampling [44, 45] can be employed for the Bayesian inference. If the prior probability distribution over the component probability vector (c=[c
_{0},c
_{1},…,c
_{
M
}]) is a Dirichlet distribution \(\mathcal {D}(\boldsymbol {\phi })\) with parameter vector ϕ, the componentconditional probabilities are \(\boldsymbol {\theta }_{j}=[ p^{j}_{1},p^{j}_{2},\ldots,p^{j}_{b}]\), and the prior probability distribution over them is Dirichlet \(\mathcal {D}(\boldsymbol {\alpha ^{j}})\) with parameter vector α
^{j} (as in the classification problem), for j=1,…,M, the Gibbs updates are
$$\begin{aligned} & y_{i}^{(t)} \sim P\left(y_{i}=j\boldsymbol{c}^{(t1)},\boldsymbol{\theta}^{(t1)},\mathbf{x}_{i}\right) \propto c_{j}^{(t1)} {p_{\mathbf{x}_{i}}^{j,(t1)}}\\ & \boldsymbol{c}^{(t)} \sim P\left(\boldsymbol{c}\boldsymbol{\phi},\boldsymbol{y}^{(t)}\right) = \mathcal{D}\left(\boldsymbol{\phi}+ {\sum\nolimits}_{i=1}^{n}\left[I_{y_{i}^{(t)}=1},\ldots,I_{y_{i}^{(t)}=M}\right]\right)\\ & \boldsymbol{\theta_{j}}^{(t)} \sim P\left(\boldsymbol{\theta}_{j}\mathbf{x},\boldsymbol{y}^{(t)},\boldsymbol{\alpha}_{j}\right)\\ &\quad= \mathcal{D}\left(\boldsymbol{\alpha}_{j}+ {\sum\nolimits}_{i=1:y_{i}^{(t)}=j}^{n}\left[I_{\mathbf{x}_{i}=1},\ldots,I_{\mathbf{x}_{i}=b}\right]\right), \end{aligned} $$
where the superscript in parentheses denotes the chain iteration number, I
_{
w
} is one if w is true, and otherwise I
_{
w
} is zero. In this framework, if the inference chain runs for Is iterations, then the numerical approximation of the OBC classification rule is
$$ \psi_{\text{OBC}}(k)\approx \arg \!\max_{y\in { \{1,\dots,M\}}}\sum_{t=1}^{Is}c_{y}^{(t)}p^{y,(t)}_{k}. $$
(9)
Without loss of generality the summation above can be over the iterations of the chain considering burnin and thinning.
Prior construction: general framework
In this section, we propose a general framework for prior construction. We begin with introducing a knowledgedriven prior probability:
Definition 1
(Maximal Knowledgedriven Information Prior) If Π is a family of proper priors, then a maximal knowledgedriven information prior (MKDIP) is a solution to the following optimization problem:
$$ \arg\min_{\pi \in \Pi}E_{\pi}[C_{\boldsymbol{\theta}}(\xi,D)], $$
(10)
where C
_{
θ
}(ξ,D)is a cost function that depends on (1)
θ: the random vector parameterizing the underlying probability distribution, (2)
ξ: state of (prior) knowledge, and (3)
D: partial observation (part of the sample data).
Alternatively, by parameterizing the prior probability as π(θ;γ), with γ∈Γ denoting the hyperparameters, an MKDIP can be found by solving
$$ \arg\min_{\boldsymbol{\gamma}\in \Gamma}E_{\pi(\boldsymbol{\theta};\boldsymbol{\gamma})}[C_{\boldsymbol{\theta}}(\xi,D,\boldsymbol{\gamma})]. $$
(11)
In contrast to noninformative priors, the MKDIP incorporates available prior knowledge and even part of the data to construct an informative prior.
The MKDIP definition is very general because we want a general framework for prior construction. The next definition specializes it to cost functions of a specific form in a constrained optimization.
Definition 2
(MKDIP: Constrained Optimization with Additive Costs) As a special case in which C
_{
θ
} can be decomposed into additive terms, the cost function is of the form:
$$C_{\boldsymbol{\theta}}(\xi,D,\boldsymbol{\gamma}) =(1\beta)g^{(1)}_{\boldsymbol{\theta}}(\xi,\boldsymbol{\gamma}) + \beta g^{(2)}_{\boldsymbol{\theta}}(\xi,D), $$
where β is a nonnegative regularization parameter. In this case, the MKDIP construction with additive costs and constraints involves solving the following optimization problem:
$$ \begin{aligned} &\arg\min_{\boldsymbol{\gamma}\in\Gamma}E_{\pi(\boldsymbol{\theta};\boldsymbol{\gamma})} \Big[(1\beta)g^{(1)}_{\boldsymbol{\theta}}(\xi,\boldsymbol{\gamma}) + \beta g^{(2)}_{\boldsymbol{\theta}}(\xi,D)\Big]\\ & \text{Subject to:} \quad E_{\pi(\boldsymbol{\theta};\boldsymbol{\gamma})}[g^{(3)}_{\boldsymbol{\theta},i}(\xi)]=0;~i\in\{1,\ldots,n_{c}\}, \end{aligned} $$
(12)
where \(g^{(3)}_{\boldsymbol {\theta },i}\), ∀i∈{1,…,n
_{
c
}}, are constraints resulting from the state of knowledge ξvia a mapping:
$$\mathcal{T}: \xi \rightarrow E_{\pi\left(\boldsymbol{\theta};\boldsymbol{\gamma}\right)}\left[g^{(3)}_{\boldsymbol{\theta},i}(\xi)\right], \forall i \in \{1,\dots,n_{c}\}. $$
In the sequel, we will refer to g
^{(1)}(·) and g
^{(2)}(·) as the cost functions, and \(g_{i}^{(3)}(\cdot)\)’s as the knowledgedriven constraints. We begin with introducing informationtheoretic cost functions, and then we propose a general set of mapping rules, denoted by \(\mathcal {T}\) in Definition 2, to convert biological pathway knowledge into mathematical forms. We then consider special cases with informationtheoretic cost functions.
Informationtheoretic cost functions
Instead of having least squares (or meansquared error) as the standard cost functions in classical statistical inference problems, there is no universal cost function in the prior construction literature. That being said, in this paper, we utilize several widely used cost functions in the field:

1.
(Maximum Entropy) The principle of maximumentropy (MaxEnt) for probability construction [38] leads to the least informative prior given the constraints in order to prevent adding spurious information. Under our general framework MaxEnt can be formulated by setting:
$$\beta=0,~g_{\boldsymbol{\theta}}^{(1)} = H[\boldsymbol{\theta}], $$
where H[.] denotes the Shannon entropy.

2.
(Maximal Data Information) The maximal data information prior (MDIP) introduced by Zellner [46] as a choice of the objective function is a criterion for the constructed probability distribution to remain maximally committed to the data [47]. To achieve MDIP, we can set our general framework with:
$$\begin{aligned} \beta=0,~g_{\boldsymbol{\theta}}^{(1)} &= \ln \pi(\boldsymbol{\theta};\boldsymbol{\gamma}) + H[P(x\boldsymbol{\theta})]\\ &= \ln \pi(\boldsymbol{\theta};\boldsymbol{\gamma})E_{x\boldsymbol{\theta}}[\ln P(x\boldsymbol{\theta})]. \end{aligned} $$

3.
(Expected Mean Loglikelihood) The cost function introduced in [35] is the first one that utilizes part of the observed data for prior construction. In that, we have
$$\beta =1,~g^{(2)}_{\boldsymbol{\theta}}=\ell(\boldsymbol{\theta};D), $$
where \(\ell (\boldsymbol {\theta };D)=\frac {1}{n_{D}}\sum _{i=1}^{n_{D}}\log f(\boldsymbol {x}_{i}\boldsymbol {\theta })\) is the mean loglikelihood function of the sample points used for prior construction (D), and n
_{
D
} denotes the number of sample points in D. In [35], it is shown that this cost function is equivalent to the average KullbackLeibler distance between the unknown distribution (empirically estimated by some part of the samples) and the uncertainty class of distributions.
As originally proposed, the preceding approaches did not involve expectation over the uncertainty class. They were extended to the general prior construction form in Definition 1, including the expectation, in [36] to produce the regularized maximum entropy prior (RMEP), the regularized maximal data information prior (RMDIP), and the regularized expected mean loglikelihood prior (REMLP). In all cases, optimization was subject to specialized constraints.
For MKDIP, we employ the same informationtheoretic cost functions in the prior construction optimization framework. MKDIPE, MKDIPD, and MKDIPR correspond to using the same cost functions as REMP, RMDIP, and REMLP, respectively, but with the new general types of constraints. To wit, we employ functional information from the signaling pathways and show that by adding these new constraints that can be readily derived from prior knowledge, we can improve both supervised (classification problem with labelled data) and unsupervised (mixture problem without labels) learning of Bayesian operators.
From prior knowledge to mathematical constraints
In this part, we present a general formulation for mapping the existing knowledge into a set of constraints. In most scientific problems, the prior knowledge is in the form of conditional probabilities. In the following, we consider a hypothetical gene network and show how each component in a given network can be converted into the corresponding inequalities as general constraints in MKDIP optimization.
Before proceeding we would like to say something about contextual effects on regulation. Because a regulatory model is not independent of cellular activity outside the model, complete control relations such as A→B in the model, meaning that gene B is upregulated if and only if gene A is upregulated (after some time delay), do not necessarily translate into conditional probability statements of the form P(X
_{
B
}=1X
_{
A
}=1)=1, where X
_{
A
} and X
_{
B
} represent the binary gene values corresponding to genes A and B, respectively. Rather, what may be observed is P(X
_{
B
}=1X
_{
A
}=1)=1−δ, where δ>0. The pathway A→B need not imply P(X
_{
B
}=1X
_{
A
}=1)=1 because A→B is conditioned on the context of the cell, where by context we mean the overall state of the cell, not simply the activity being modeled. δ is called a conditioning parameter. In an analogous fashion, rather than P(X
_{
B
}=1X
_{
A
}=0)=0, what may be observed is P(X
_{
B
}=1X
_{
A
}=0)=η, where η>0, because there may be regulatory relations outside the model that upregulate B. Such activity is referred to as crosstalk and η is called a crosstalk parameter. Conditioning and crosstalk effects can involve multiple genes and can be characterized analytically via contextdependent conditional probabilities [48].
Consider binary gene values X
_{1},X
_{2},…,X
_{
m
} corresponding to genes g
_{1},g
_{2},…,g
_{
m
}. There are m2^{m−1} conditional probabilities of the form
$$\begin{array}{@{}rcl@{}} &&P(X_{i}=k_{i} X_{1}=k_{1},\dots,X_{i1}=k_{i1},X_{i+1}= \\ &&\quad k_{i+1},\dots,X_{m}=k_{m})\\ &&=a^{k_{i}}_{i}(k_{1},\dots,k_{i1},k_{i+1},\dots,k_{m}) \end{array} $$
(13)
to serve as constraints, the chosen constraints to be the conditional probabilities whose values are known (approximately). For instance, if g
_{2} and g
_{3} regulate g
_{1}, with X
_{1}=1 when X
_{2}=1 and X
_{3}=0, then, ignoring context effects,
$$a^{1}_{1}(1,0,k_{4},\dots,k_{m})=1 $$
for all k
_{4},…,k
_{
m
}. If, however, we take context conditioning into effect, then
$$a_{1}^{1}(1,0,k_{4},\dots,k_{m}) = 1\delta_{1}(1,0,k_{4},\dots,k_{m}), $$
where δ
_{1}(1,0,k
_{4},…,k
_{
m
}) is a conditioning parameter.
Moreover, ignoring context effects,
$$\begin{array}{@{}rcl@{}} a^{1}_{1}(1,1,k_{4},\dots,k_{m})&=& a^{1}_{1}(0,0,k_{4},\dots,k_{m})\\ &=&a^{1}_{1}(0,1,k_{4},\dots,k_{m}) = 0 \end{array} $$
for all k
_{4},…,k
_{
m
}. If, however, we take crosstalk into effect, then
$$\begin{array}{@{}rcl@{}} a^{1}_{1}(1, 1, k_{4},\dots, k_{m})& =& \eta_{1}(1, 1, k_{4},\dots, k_{m})\\ a^{1}_{1}(0, 0, k_{4},\dots, k_{m}) &=& \eta_{1}(0, 0, k_{4},\dots, k_{m})\\ a^{1}_{1}(0, 1, k_{4},\dots, k_{m}) &=& \eta_{1}(0, 1, k_{4},\dots, k_{m}), \end{array} $$
where η
_{1}(1,1,k
_{4},…,k
_{
m
}), η
_{1}(0,0,k
_{4},…,k
_{
m
}), and η
_{1}(0,0,k
_{4},…,k
_{
m
}) are crosstalk parameters. In practice it is unlikely that we would know the conditioning and crosstalk parameters for all combinations of k
_{4},…,k
_{
m
}; rather, we might just know the average, in which case, δ
_{1}(1,0,k
_{4},…,k
_{
m
}) reduces to δ
_{1}(1,0), η
_{1}(1,1,k
_{4},…,k
_{
m
}) reduces to η
_{1}(1,1), etc.
In this paradigm, the constraints resulting from our state of knowledge are of the following form:
$$ \begin{aligned} &g^{(3)}_{\boldsymbol{\theta},i}(\xi)=\\ &P(X_{i}=k_{i} X_{1}=k_{1},\dots,X_{i1}=k_{i1},X_{i+1}=k_{i+1}, \\ & \quad \dots,X_{m}=k_{m}) a^{k_{i}}_{i}(k_{1},\dots,k_{i1},k_{i+1},\dots,k_{m}). \end{aligned} $$
(14)
The basic setting is very general and the conditional probabilities are what they are, whether or not they can be expressed in the regulatory form of conditioning or crosstalk parameters. The general scheme includes previous constraints and approaches proposed in [35] and [36] for the Gaussian and discrete setups. Moreover, in those we can drop the regulatoryset entropy because it is replaced by the set of conditional probabilities based on the regulatory set, whether forward (master predicting slaves) or backwards (slaves predicting masters) [48].
In this paradigm, the optimization constraints take the form
$$\begin{array}{@{}rcl@{}} && a^{k_{i}}_{i}(k_{1},\dots, k_{i1}, k_{i+1},\dots, k_{m}) \\ && \quad \varepsilon_{i}(k_{1},\dots, k_{i1}, k_{i+1},\dots, k_{m}) \\ && \leq E_{\pi(\boldsymbol{\theta};\boldsymbol{\gamma})}[P(X_{i} = k_{i} X_{1} = k_{1},\dots, X_{i1} = k_{i1}, \\ && \quad \quad \quad \quad X_{i+1} = k_{i+1},\dots, X_{m} = k_{m})] \\ && \leq a^{k_{i}}_{i}(k_{1},\dots, k_{i1}, k_{i+1},\dots, k_{m}) + \\ &&\quad \quad \varepsilon_{i}(k_{1},\dots, k_{i1}, k_{i+1},\dots, k_{m}), \end{array} $$
(15)
where the expectation is with respect to the uncertainty in the model parameters, that is, the distribution of the model parameter θ, and ε
_{
i
} is a slackness variable. Not all will be used, depending on our prior knowledge. In fact, the general conditional probabilities will not likely be used because they will likely not be known when there are too many conditioning variables. For instance, we may not know the probability in Eq. (13), but may know the conditioning on part of the variables which can be extracted from some interaction network (e.g. biological pathways). A slackness variable can be considered for each constraint to make the constraint framework more flexible, thereby allowing potential error or uncertainty in prior knowledge (allowing potential inconsistencies in prior knowledge). When using slackness variables, these variables also become optimization parameters, and a linear function (summation of all slackness variables) times a regulatory coefficient is added to the cost function of the optimization in Eq. (12). In other words, when having slackness variables, the optimization in Eq. (12) can be written as
$$ \begin{aligned} & \arg\min_{\boldsymbol{\gamma}\in\Gamma, \boldsymbol{\varepsilon}\in\mathcal{E}}E_{\pi(\boldsymbol{\theta};\boldsymbol{\gamma})}\Big[\lambda_{1}[(1\beta)g^{(1)}_{\boldsymbol{\theta}}(\xi,\boldsymbol{\gamma}) + \beta g^{(2)}_{\boldsymbol{\theta}}(\xi,D)] \\ & \quad \quad \quad \quad \quad \quad \quad \quad \quad + \lambda_{2}\sum_{i=1}^{n_{c}}\varepsilon_{i}\Big]\\ & \text{Subject to:} \varepsilon_{i}\leq E_{\pi(\boldsymbol{\theta};\boldsymbol{\gamma})}[g^{(3)}_{\boldsymbol{\theta},i}(\xi)]\leq\varepsilon_{i};~i\in\{1,\ldots,n_{c}\}, \end{aligned} $$
(16)
where λ
_{1} and λ
_{2} are nonnegative regularization parameters, and ε and \(\mathcal {E}\) represent the vector of all slackness variables and the feasible region for slackness variables, respectively. For each slackness variable, a possible range can be defined (note that all slackness variables are nonnegative). The higher the uncertainty is about a constraint stemming from prior knowledge, the greater the possible range for the corresponding slackness variable can be (more on this in the “Results and discussion” section).
The new general type of constraints discussed here introduces a formal procedure for incorporating prior knowledge. It allows the incorporation of knowledge of the functional regulations in the signaling pathways, any constraints on the conditional probabilities, and also knowledge of the crosstalk and conditioning parameters (if present), unlike the previous work in [36] where only partial information contained in the edges of the pathways is used in an ad hoc way.
An illustrative example and connection with conditional entropy
Now, consider a hypothetical network depicted in Fig. 2. For instance, suppose we know that the expression of gene g
_{1} is regulated by g
_{2}, g
_{3}, and g
_{5}. Then we have
$$P(X_{1} = 1 X_{2} = k_{2}, X_{3} = k_{3}, X_{5} = k_{5}) = a_{1}^{1}(k_{2}, k_{3}, k_{5}).$$
As an example,
$$P(X_{1} = 1 X_{2} = 1, X_{3} = 1, X_{5} = 0) = a_{1}^{1}(1_{2}, 1_{3}, 0_{5}), $$
where the notation 1_{2} denotes 1 for the second gene. Further, we might not know a
_{1}(k
_{2},k
_{3},k
_{5}) for all combinations of k
_{2}, k
_{3}, k
_{5}. Then we use the ones that we know. In the case of conditioning with g
_{2}, g
_{3}, and g
_{5} regulating g
_{1}, with g
_{1} on if the others are on,
$$a^{1}_{1}(1_{2}, 1_{3}, 1_{5}) = 1 \delta_{1}(1_{2}, 1_{3}, 1_{5}). $$
If limiting to 3gene predictors, g
_{3}, and g
_{5} regulate g
_{1}, with g
_{1} on if the other two are on, then
$$a^{1}_{1}(k_{2}, 1_{3}, 1_{5}) = 1  \delta_{1}(k_{2}, 1_{3}, 1_{5}), $$
meaning that the conditioning parameter depends on whether X
_{2}=0 or 1.
Now, considering the conditional entropy, assuming that \(\delta _{1} = \max _{(k_{2},k_{3},k_{5})} \delta _{1}(k_{2},k_{3},k_{5})\) and δ
_{1}<0.5, we may write
$$ \begin{aligned} &H[X_{1}X_{2},X_{3},X_{5}] =\\ &\left[\sum_{\mathcal{X}_{2},\mathcal{X}_{3},\mathcal{X}_{5}}\left[ P\left(X_{1}=0X_{2}=x_{2},X_{3}=x_{3},X_{5}=x_{5}\right)\right.\right.\\ &\quad \quad\times P\left(X_{2}=x_{2},X_{3}=x_{3},X_{5}=x_{5}\right)\\ &\quad \quad\log\left[P\left(X_{1}=0X_{2}=x_{2},X_{3}=x_{3},X_{5}=x_{5}\right)\right]\\ &\quad \quad +P\left(X_{1}=1X_{2}=x_{2},X_{3}=x_{3},X_{5}=x_{5}\right)\\ &\quad \quad\times P\left(X_{2}=x_{2},X_{3}=x_{3},X_{5}=x_{5}\right)\\ &\quad \quad\left.\left.\log\left[P\left(X_{1}=1X_{2}=x_{2}, X_{3}=x_{3},X_{5}=x_{5}\right)\right]\right]{\vphantom{\sum_{\mathcal{X}_{2}}}}\right]\\ &\quad\quad\quad\leq h(\delta_{1}), \end{aligned} $$
where h(δ)=−[δ log(δ)+(1−δ) log(1−δ)]. Hence, bounding the conditional probabilities, the conditional entropy is in turn bounded by h(δ
_{1}):
$${\lim}_{\delta_{1}\rightarrow 0^{+}} H\left[X_{1}X_{2},X_{3},X_{5}\right] = 0. $$
It should be noted that constraining H[X
_{1}X
_{2},X
_{3},X
_{5}] would not necessarily constrain the conditional probabilities, and may be considered as a more relaxed type of constraints. But, for example, in cases where there is no knowledge about the status of a gene given its regulator genes, constraining entropy is the only possible approach.
In our illustrative example, if we assume that the Boolean regulating function of X
_{1} is known as shown in Fig. 2 and context effects exist, then the following knowledge constraints can be extracted from the pathway and regulating function:
$$\begin{aligned} &a^{0}_{1}\left(k_{2}, k_{3}, 0_{5}\right) = 1  \delta_{1}\left(k_{2}, k_{3}, 0_{5}\right)\\ &a^{0}_{1}\left(k_{2}, 1_{3}, k_{5}\right) = 1  \delta_{1}\left(k_{2}, 1_{3}, k_{5}\right)\\ &a^{0}_{1}\left(1_{2}, k_{3}, k_{5}\right) = 1  \delta_{1}\left(1_{2}, k_{3}, k_{5}\right)\\ &a^{1}_{1}\left(0_{2}, 0_{3}, 1_{5}\right) = 1  \delta_{1}\left(0_{2}, 0_{3}, 1_{5}\right). \end{aligned} $$
Now if we assume that the context does not affect the value of X
_{1}, i.e. the value of X
_{1} can be fully determined by knowing the values of X
_{2}, X
_{3}, and X
_{5}, then we have the following equations:
$$\begin{array}{*{20}l} & a^{0}_{1}\left(k_{2}, k_{3}, 0_{5}\right) = P\left(X_{1}=0X_{5}=0\right) = 1 \end{array} $$
(17a)
$$\begin{array}{*{20}l} & a^{0}_{1}\left(k_{2}, 1_{3}, k_{5}\right) = P\left(X_{1}=0X_{3}=1\right) = 1 \end{array} $$
(17b)
$$\begin{array}{*{20}l} & a^{0}_{1}\left(1_{2}, k_{3}, k_{5}\right) = P\left(X_{1}=0X_{2}=1\right)= 1 \end{array} $$
(17c)
$$\begin{array}{*{20}l} & a^{1}_{1}\left(0_{2}, 0_{3}, 1_{5}\right)= P(X_{1}=1X_{2}=0,X_{3}=0, \\ & \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad X_{5}=1) = 1. \end{array} $$
(17d)
It can be seen from the equations above that for some setups of the regulator values, only a subset of them determines the value of X
_{1}, regardless of the other regulator values. If we assume that the value of X
_{5} cannot be observed, for example X
_{5} is an extracellular signal that cannot be measured in gene expression data and thereafter X
_{5} is not in the features of our data, the only constraints relevant to the featurelabel distribution that can be extracted from the regulating function knowledge will be
$$ \begin{aligned} &a^{0}_{1}\left(k_{2}, 1_{3}, k_{5}\right) = P\left(X_{1}=0X_{3}=1\right) = 1\\ & a^{0}_{1}\left(1_{2}, k_{3}, k_{5}\right) = P\left(X_{1}=0X_{2}=1\right)= 1. \end{aligned} $$
(18)
Special case of Dirichlet distribution
Fixing the value of a single gene, being ON or OFF (i.e. X
_{
i
}=0 or X
_{
i
}=1, respectively), corresponds to a partition of the states, \(\mathcal {X }=\{1,\dots,b\}\). Here, the portions of \(\mathcal {X}\) for which (X
_{
i
}=k
_{1},X
_{
j
}=k
_{2}) and (X
_{
i
}≠k
_{1},X
_{
j
}=k
_{2}), for any k
_{1},k
_{2}∈{0,1}, are denoted by \(\mathcal {X}^{i,j}(k_{1},k_{2})\) and \(\mathcal {X}^{i,j}(k_{1}^{c},k_{2})\), respectively. For the Dirichlet distribution, where θ=p and γ=α, the constraints on the expectation over the conditional probability in (15) can be explicitly written as functions of the prior probability parameters (hyperparameters). For the parameter of the Dirichlet distribution, a vector α indexed by \(\mathcal {X}\), we denote the variable indicating the summation of its entities in \(\mathcal { X}^{i,j}(k_{1},k_{2})\) by \(\overline {\alpha }^{i,j}(k_{1},k_{2})=\sum _{k \in \mathcal {X}^{i,j}(k_{1},k_{2})}\alpha _{k}\). The notation can be easily extended for the cases having more than two fixed genes. In this setup, if the set of random variables corresponding to genes other than g
_{
i
} and the vector of their corresponding values are shown by \(\tilde {X}_{i}\) and \(\tilde {X}_{i}\), respectively, the expectation over the conditional probability in (15) is [36]:
$$ \hspace{12pt} \begin{aligned} &E_{\boldsymbol{p}}\left[P\left(X_{i} = k_{i} X_{1} = k_{1},\dots, X_{i1} = k_{i1},\right.\right.\\ &\quad\left. \left.X_{i+1} = k_{i+1},\dots, X_{m} = k_{m}\right)\right]\\ & =E_{\boldsymbol{p}}\left[\frac{\sum_{k\in \mathcal{X}^{i,\tilde{X}_{i}}\left(k_{i},\tilde{x}_{i}\right)}p_{k}}{\sum_{k\in \mathcal{X}^{i,\tilde{X}_{i}}\left(k_{i},\tilde{x}_{i}\right)}p_{k}+\sum_{k\in \mathcal{X}^{i,\tilde{X}_{i}}\left(k_{i}^{c},\tilde{x}_{i}\right)}p_{k}}\right] \\ &=\frac{\overline{\alpha }^{i,\tilde{X}_{i}}\left(k_{i},\tilde{x}_{i}\right)}{\overline{\alpha }^{i,\tilde{X}_{i}}\left(k_{i},\tilde{x}_{i}\right)+\overline{\alpha }^{i,\tilde{X}_{i}}\left(k_{i}^{c},\tilde{x}_{i}\right)}, \end{aligned} $$
(19)
where the summation in the numerator and the first summation in the denominator of the second equality is over the states (bins) for which (\(X_{i} = k_{i}, \tilde {X}_{i} = \tilde {x}_{i}\)), and the second summation in the denominator is over the states (bins) for which (\(X_{i} = k_{i}^{c}, \tilde {X}_{i} = \tilde {x}_{i}\)).
If there exists a set of genes that completely determines the value of gene g
_{
i
} (or only a specific setup of their values that determines the value, as we had in our illustrative example in Eq. (17)), then the constraints on the conditional probability conditioned on all the genes other than g
_{
i
} can be changed to be conditioned on that set only. Specifically, let R
_{
i
} denote the set of random variables corresponding to such a set of genes/proteins and suppose there exists a specific setup of their values r
_{
i
} that completely determines the value of gene g
_{
i
}. If the set of all random variables corresponding to the genes/proteins other than X
_{
i
} and R
_{
i
} is denoted by \(\boldsymbol {B}_{i}=\tilde {X}_{(i,\boldsymbol {R}_{i})}\), and their corresponding values by b
_{
i
}, then the constraints on the conditional probability can be written as
$$ \begin{aligned} &E_{\boldsymbol{p}}\left[P\left(X_{i} = k_{i} \boldsymbol{R}_{i} = \boldsymbol{r}_{i}\right)\right]\\ &=E_{\boldsymbol{p}}\left[\frac{\sum_{\boldsymbol{b}_{i}\in O_{\boldsymbol{B}_{i}}}\sum_{k\in \mathcal{X}^{i,\boldsymbol{R}_{i},\boldsymbol{B}_{i}}\left(k_{i},\boldsymbol{r}_{i},\boldsymbol{b}_{i}\right)}p_{k}}{\sum_{\boldsymbol{b}_{i}\in O_{\boldsymbol{B}_{i}}}\sum_{k\in \mathcal{X}^{i,\boldsymbol{R}_{i},\boldsymbol{B}_{i}}(k_{i},\boldsymbol{r}_{i},\boldsymbol{b}_{i})}p_{k}}\right.\\ & \quad \quad \quad\left.\frac{}{+ \sum_{\boldsymbol{b}_{i}\in O_{\boldsymbol{B}_{i}}}\sum_{k\in \mathcal{X}^{i,\boldsymbol{R}_{i},\boldsymbol{B}_{i}}\left(k_{i}^{c},\boldsymbol{r}_{i},\boldsymbol{b}_{i}\right)}p_{k}}\right] \\ &= \frac{\sum_{\boldsymbol{b}_{i}\in O_{\boldsymbol{B}_{i}}}\overline{\alpha }^{i,\boldsymbol{R}_{i},\boldsymbol{B}_{i}}(k_{i},\boldsymbol{r}_{i},\boldsymbol{b}_{i})}{\sum_{\boldsymbol{b}_{i}\in O_{\boldsymbol{B}_{i}}}\overline{\alpha }^{i,\boldsymbol{R}_{i},\boldsymbol{B}_{i}}(k_{i},\boldsymbol{r}_{i},\boldsymbol{b}_{i})} \\ & \quad \quad \frac{}{+\sum_{\boldsymbol{b}_{i}\in O_{\boldsymbol{B}_{i}}}\overline{\alpha }^{i,\boldsymbol{R}_{i},\boldsymbol{B}_{i}}(k_{i}^{c},\boldsymbol{r}_{i},\boldsymbol{b}_{i})}, \end{aligned} $$
(20)
where \(\phantom {\dot {i}\!}O_{\boldsymbol {B}_{i}}\) is the set of all possible vectors of values for B
_{
i
}.
For a multinomial model with a Dirichlet prior distribution, a constraint on the conditional probabilities translates into a constraint on the above expectation over the conditional probabilities (as in Eq. (15)). In our illustrative example and from the equations in Eq. (17), there are four of these constraints on the conditional probability for gene g
_{1}. For example, in the second constraint from the second line of Eq. (17) (Eq. 17b), X
_{
i
}=X
_{1}, k
_{
i
}=0, R
_{
i
}={X
_{3}}, r
_{
i
}=[0], and B
_{
i
}={X
_{2},X
_{5}}. One might have several constraints for each gene extracted from its regulatory function (more on extracting general constraints from regulating functions in the “Results and discussion” section).