Before describing our algorithm for synthesizing parameters in probabilistic models, we give some background on stochastic modeling in systems biology, the specification of time-varying properties using temporal logic, and the statistical verification of probabilistic systems against behavioral specifications. We also provide definitions for formal concepts that will be used to describe our algorithm.
Remark 1 One of our main objectives in this section is to develop a formal definition of computational probabilistic models: i.e. those stochastic models for which a probability can be assigned to any observed model behavior when the model is executed. The reader familiar with this concept can quickly skim through the next two subsections.
Stochastic biological models
Many biological systems have traditionally been described using deterministic mathematical models, often in terms of ordinary differential equations [[12] §2.1]. However, in recent years, the need to incorporate the uncertainty inherent in biological systems has led to the development of stochastic models: these are harder to analyze but more accurately reflect the behavior of the underlying system [5]. Common stochastic models in biology include Markov jump processes [5], stochastic differential equations [41], discrete-time Markov chains [[42], Chapter 3] and continuous-time Markov chains [43].
Also, researchers are increasingly developing computational models that naturally capture the bottom-up nature of biological phenomena and hence are more amenable to in-silico implementation [44]. Such models are constructed from large sets of time-series observations, combined with expert insight into the system being modeled [45]. Some of these models are based on experimental data of varying veracity, and error propagation into the designed model is an ever-present challenge [46].
We will focus on discrete-time Markov chains, a class of stochastic models that are widely used in the sciences, engineering, economics and other areas. We closely follow Baier and Katoen [8] in formally defining them.
Definition 1 (Discrete-time Markov chain) A discrete-time Markov chain (DTMC) is given by a tuple (S, init, P, AP, L) where:
- S is a countable, nonempty set of states,
- init : S → [0, 1] is the initial distribution such that ∑_{s∈S} init(s) = 1,
- P : S × S → [0, 1] is the transition probability function,
- AP is a set of boolean-valued atomic propositions, and
- L : S → 2^AP is a labeling function for states. ■
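As a concrete illustration (not part of the formal development), the ingredients of Definition 1 can be sketched in a few lines of Python; the two-state model, its probabilities and its atomic proposition are all invented for this example:

```python
import random

# Hypothetical two-state DTMC (S, init, P, AP, L); states, probabilities
# and the atomic proposition "active" are invented for illustration.
S = ["on", "off"]
init = {"on": 0.6, "off": 0.4}                      # sums to 1
P = {"on": {"on": 0.9, "off": 0.1},                 # each row sums to 1
     "off": {"on": 0.5, "off": 0.5}}
L = {"on": {"active"}, "off": set()}                # L : S -> 2^AP

def sample_trace(length, rng=random):
    """Draw one execution path s0, s1, ... of the DTMC."""
    s = rng.choices(list(init), weights=list(init.values()))[0]
    trace = [s]
    for _ in range(length - 1):
        row = P[s]
        s = rng.choices(list(row), weights=list(row.values()))[0]
        trace.append(s)
    return trace

trace = sample_trace(10)
assert len(trace) == 10 and all(s in S for s in trace)
```

Sampling such traces is exactly the primitive that the statistical verification techniques discussed later rely on.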
We next define (a) a parameterized family of DTMCs, given a set of parameters and (b) execution paths over them.
Definition 2 (Parameterized DTMC) A parameterized discrete-time Markov chain (ParDTMC) is given by a tuple (S, Θm, init, P, AP, L) where:
- S is a countable, nonempty set of states,
- Θm is the m-dimensional parameter space,
- init : S → [0, 1] is the initial distribution such that ∑_{s∈S} init(s) = 1,
- P : S × S × Θm → [0, 1] is the transition probability function over the m-dimensional parameter space,
- AP is a set of boolean-valued atomic propositions, and
- L : S → 2^AP is a labeling function for states. ■
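To illustrate how the extra argument changes the transition function, here is a minimal Python sketch of a one-parameter ParDTMC transition function; the two states and the role of the parameter θ (the probability of switching state) are invented for this example:

```python
# Minimal sketch of a ParDTMC transition function P : S x S x Theta -> [0, 1]
# (Definition 2 with m = 1); states and the meaning of theta are hypothetical.
S = ["on", "off"]

def P(s, s_next, theta):
    """Transition probability from s to s_next at parameter value theta."""
    switch = {"on": "off", "off": "on"}
    return theta if s_next == switch[s] else 1.0 - theta

# For every fixed theta, P(s, ., theta) is a probability distribution over S.
for theta in (0.0, 0.25, 1.0):
    for s in S:
        assert abs(sum(P(s, t, theta) for t in S) - 1.0) < 1e-9
```

Fixing a concrete value of θ collapses the ParDTMC into an ordinary DTMC, which is why a ParDTMC describes a family of models rather than a single one.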
Definition 3 (ParDTMC execution paths) An execution path of a ParDTMC is a sequence of states σ = s0, s1, . . .. The suffix of a path σ that starts at state si (i.e., si, si+1, . . .) is denoted σi. ■
As a case study, we have applied our automated parameter estimation technique to an agent-based model of the acute inflammatory response. We now briefly discuss major characteristics of ABMs.
Agent-based modeling in systems biology
Agent-based models (ABMs) form an interesting, well-studied subset of complex probabilistic models. In recent years, agent-based modeling has emerged as a popular method for the representation, analysis and simulation of biological models [10].
ABMs allow the specification of high-level model properties as well as fine-grained component behavior. An ABM is composed of autonomous elements whose individual properties can be specified. At a macro-level, model-wide agent-interaction rules can be defined and enforced. Each agent has a physical location and its state evolves with time based on messages exchanged with other agents, allowing rich spatiotemporal properties to emerge [10].
ABMs can incorporate parallelism, object-oriented behavior and stochasticity, making them conducive to the development of computational models in systems biology. Another advantage of using ABMs is that they are bottom-up models that mirror the natural behavior of biological systems [47]. Agents interact with one another based on fixed rules and also have a spatial location [10].
Most agent-based models do not have compact analytic descriptions, and simulations across the input space must be performed in order to infer general model properties [48]. Also, agent-based models tend to have a large number of parameters, some of them highly sensitive in the sense that a small change can radically alter model behavior [49]. To the best of our knowledge, the parameter estimation problem in ABMs has only been addressed for very simple models [50], and most approaches use standard optimization techniques [31, 51].
As a case study for our parameter estimation technique, we have used an agent-based model of the acute inflammatory response due to endotoxin administration [52] written using the SPARK tool [53]. SPARK is an ABM framework designed for multi-scale modeling of biomedical systems; it is implemented in Java and allows users to develop models in its own programming language (SPARK-PL) [54–57].
Once we have a basic model design available, we need to find the exact model instantiation, in terms of concrete parameter values, that produces the desired behavior (usually matching experimental data). Therefore, we need a way to check whether the model meets all requirements. We now discuss how to formally specify and automatically verify probabilistic computational models.
An ABM consists of a fixed number of agents A1, . . ., An, where the state of agent Ai is determined by the values of variables V1, . . ., Vm. Assuming that for any agent Ai, i ∈ {1, . . ., n}, variable Vj, j ∈ {1, . . ., m} can take values in the set Varj, the ABM can be represented as a DTMC.
Definition 4 (DTMC corresponding to an ABM) Consider an ABM with n agents A1, . . ., An and m variables V1, . . ., Vm that take values over countably finite sets Var1, . . ., Varm respectively. The DTMC corresponding to the ABM is:
- S is the set of states, where each state is a joint valuation of the variables of all n agents,
- init : S → [0, 1], the initial distribution, is imposed by the initial variable values of the ABM,
- P : S × S → [0, 1] is determined by the transition rules over the variables of the ABM,
- AP consists of boolean-valued atomic propositions over agent variables, and
- L : S → 2^AP is the usual function marking each state with the propositions that hold there. ■
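To make the correspondence concrete, the state space S of the induced DTMC can be enumerated as the set of joint valuations of all agents' variables; the agent count and value set below are invented for this example:

```python
from itertools import product

# Hypothetical ABM: n = 2 agents, each with m = 1 variable taking values in
# Var1 = {"resting", "activated"}. A DTMC state is a joint valuation of the
# variables of all agents, so S = Var1 x Var1.
Var1 = ("resting", "activated")
n_agents = 2
S = list(product(Var1, repeat=n_agents))

assert len(S) == len(Var1) ** n_agents   # |S| = |Var1|^n = 4
```

Even for this toy configuration the state space grows exponentially in the number of agents and variables, which is why simulation-based (statistical) analysis is attractive for ABMs.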
We generalize such ABMs by adding the notion of parameters, obtaining a family of ABMs. More formally, we speak of a parametric ABM over n agents, m variables and k parameters, whose dynamic behavior (i.e., its transition function P) now also depends on the parameter values. For a given parametric ABM, we can define an equivalent ParDTMC.
Definition 5 (ParDTMC corresponding to a parametric ABM) Consider a parametric ABM with n agents A1, . . ., An, m variables V1, . . ., Vm that take values over countably finite sets Var1, . . ., Varm, and k parameters θ1, . . ., θk. The ParDTMC corresponding to the parametric ABM is:
- S is the set of states, where each state is a joint valuation of the variables of all n agents,
- init : S → [0, 1], the initial distribution, is imposed by the initial variable values of the ABM,
- P : S × S × Θk → [0, 1], where Θk = θ1 × . . . × θk, is determined by the transition rules over the variables and parameters of the ABM,
- AP consists of boolean-valued atomic propositions over the variable values of all agents, and
- L : S → 2^AP is the usual function marking each state with the propositions that hold there. ■
Remark 2 In order to use statistical model checking to solve the probabilistic model checking problem, we need to associate probabilities with execution paths of the model. For DTMCs, this has been shown by Kwiatkowska et al [9] and Baier and Katoen [[8], Chapter 10]. Although we do not prove it here, the same holds for ParDTMCs. From now on, we assume that our models can be represented as ParDTMCs and thus have a unique underlying probability measure.
Specifying and checking biological properties
Our algorithm discovers parameters of stochastic biological models from experimental data and expert behavioral specifications. While exploring the parameter space, we continually verify whether we have a parameter assignment at which the model meets the given specifications.
Typically, properties of biological systems that need to be formally specified are qualitative, quantitative and time-varying. Since the models are usually stochastic, specifications too should allow reasonable deviations from desired behavior. To capture such rich behavioral properties, we use a probabilistic temporal logic (see Definition 8) to specify expected model behavior.
In order to develop automated techniques for analyzing biological models, we also need the properties to be specified in a way that can be monitored as we perform ABM simulations. Monitoring is the process of determining if an execution trace of the model satisfies given specifications [58, 59]. Therefore, we write specifications using a logic belonging to a class of languages for which monitors can be derived algorithmically [60].
We translate natural language expert insights representing desired model behavior into a logic whose sentences are adapted finitely monitorable (AFM) formulas. The truth value of an AFM formula with respect to a trace of a model execution can be determined by observing a finite prefix of the trace [[61], §2],[[60], §2.2]. Next, we formally define our specification language:
Definition 6 (BLTL grammar) Given a ParDTMC, the grammar of bounded linear temporal logic (BLTL) formulas φ in Backus-Naur form is as follows:

φ ::= a | φ ∧ φ | φ ∨ φ | ¬φ | φ Ud φ

where a ∈ AP, d ∈ ℕ, and ∧, ∨ and ¬ are the usual propositional logic operators. We call Ud the bounded until operator. ■
The semantics of a BLTL formula φ is defined over an execution path of a ParDTMC (see Definition 3).
Definition 7 (BLTL semantics) Given a path σ of a ParDTMC, the satisfaction of a BLTL formula φ on a path suffix σi is denoted σi ⊨ φ, and is determined by the following rules:
- σi ⊨ a, where a ∈ AP, iff a ∈ L(si),
- σi ⊨ φ1 ∧ φ2 iff σi ⊨ φ1 and σi ⊨ φ2,
- σi ⊨ φ1 ∨ φ2 iff σi ⊨ φ1 or σi ⊨ φ2,
- σi ⊨ ¬φ iff σi ⊭ φ,
- σi ⊨ φ1 Ud φ2 iff there exists j ∈ {0, . . ., d} such that σi+j ⊨ φ2 and σi+k ⊨ φ1 for all 0 ≤ k < j. ■
We now state, without proof, that it is possible to check whether any path of a ParDTMC satisfies a given BLTL formula by observing a finite prefix of the path. For a proof of a similar lemma for continuous-time Markov chains, see Legay et al [62].
Lemma 1 (Monitorability of BLTL formulas over ParDTMCs) It is always possible to algorithmically decide whether a simulation path σ = (s1, s2, . . .) of a ParDTMC satisfies a given BLTL formula φ by observing a finite prefix of σ.
Note 1 (Additional bounded operators) We can define bounded versions of the usual linear temporal logic operators G (always) and F (eventually) as follows: Fdψ ≡ true Udψ and Gdψ ≡ ¬Fd¬ψ. Fdψ means that ψ holds at some state within the next d state transitions, and Gdψ means that ψ holds continually for the next d state transitions.
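A simple recursive monitor for these operators over a finite trace prefix can be sketched as follows; the tuple encoding of formulas and the example trace are invented for illustration, and F and G are derived from bounded until exactly as above:

```python
# Sketch of a recursive BLTL monitor over a finite trace prefix. A trace is a
# list of label sets L(s0), L(s1), ...; a formula is an atomic proposition
# (a string) or a nested tuple. Encoding and example trace are hypothetical.

def holds(trace, i, phi):
    """Decide whether the suffix starting at position i satisfies phi.

    phi is an atom (str), ("not", p), ("and", p, q), ("or", p, q),
    or ("U", d, p, q) for the bounded until p U^d q.
    """
    if isinstance(phi, str):                       # atomic proposition
        return phi == "true" or phi in trace[i]
    op = phi[0]
    if op == "not":
        return not holds(trace, i, phi[1])
    if op == "and":
        return holds(trace, i, phi[1]) and holds(trace, i, phi[2])
    if op == "or":
        return holds(trace, i, phi[1]) or holds(trace, i, phi[2])
    if op == "U":                                  # p U^d q
        _, d, p, q = phi
        for j in range(d + 1):
            if i + j >= len(trace):                # ran off the finite prefix
                return False
            if holds(trace, i + j, q):             # q reached within d steps
                return True
            if not holds(trace, i + j, p):         # p violated before q
                return False
        return False
    raise ValueError("unknown operator: %r" % (op,))

def F(d, p):                                       # F^d p == true U^d p
    return ("U", d, "true", p)

def G(d, p):                                       # G^d p == not F^d not p
    return ("not", F(d, ("not", p)))

trace = [{"a"}, {"a"}, {"b"}, set()]
assert holds(trace, 0, ("U", 3, "a", "b"))         # a U^3 b holds
assert holds(trace, 0, F(2, "b"))                  # b within 2 transitions
assert not holds(trace, 0, G(3, "a"))              # a fails at position 2
```

Because every operator is bounded, a prefix of length d + 1 beyond the current position always suffices to decide satisfaction, which is the monitorability property stated in Lemma 1.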
The expressive power of our agent-based models emanates from their ability to capture uncertainty. For such stochastic models, a single execution trace that violates a given property cannot serve as a counterexample. We need a more flexible specification language whose formulas allow reasonable deviations from expected model behavior. For this, we define a probabilistic version of BLTL.
Definition 8 (Probabilistic Bounded Linear Temporal Logic) A specification of the form P≥θ(φ) is a probabilistic bounded linear temporal logic (PBLTL) formula if φ is a bounded linear temporal logic formula and θ is a probability threshold such that 0 ≤ θ ≤ 1. If at least a θ fraction of the traces of a model satisfy φ, we deem the model to satisfy the (probabilistic) AFM specification P≥θ(φ). ■
Automated verification using model checking
Model checking is an automated technique for verifying finite and infinite state transition systems that is widely used for formal assurance of safety-critical systems [8]. Techniques based on model checking have been used for verification in a number of areas such as hybrid dynamical systems [63], computer software [64] and systems biology [43].
In order to use model checking to verify a time-varying system, the model is described using a Kripke structure and the property to be checked is written in a formal temporal logic [32]. We will use PBLTL (Definition 8) to define the probabilistic model checking problem.
Definition 9 (Probabilistic model checking (PMC)) Given a probabilistic model and a PBLTL specification P≥θ(φ), determine whether the model satisfies φ with probability at least θ. ■
There are two approaches for solving the PMC problem:
- Symbolic and numerical techniques [65, 66] that estimate the exact probability with which the model satisfies φ (by exhaustively exploring all possible model behaviors) and then compare it to the specification's threshold probability θ.
- Statistical techniques [67–70] that use a set of sample simulations to determine whether the model satisfies the specification.
Statistical approaches for probabilistic model checking are more scalable since they avoid the expensive calculation associated with accurately estimating exact probabilities. However, a limitation of statistical model checking algorithms is that the reported result is not guaranteed to be correct, i.e., there may be false positives or false negatives [71]. For an excellent overview of major statistical model checking techniques, we refer the reader to Legay et al [61].
Since computational models in systems biology often have large parameter spaces, we use statistical model checking as part of our parameter estimation algorithm. Next, we describe a reformulation of the probabilistic model checking problem in terms of statistical hypothesis testing, an approach first used by Younes [71].
Hypothesis testing based statistical model checking
We are interested in determining an answer to the probabilistic model checking problem, i.e., "Does the model satisfy φ with probability at least θ?" (see Definition 9).
Assuming that the model's actual probability of satisfying the specification is u ∈ [0, 1], we test the (null) hypothesis H : u ≥ θ against the (alternative) hypothesis K : u < θ. If K is rejected, we conclude that the model satisfies the specification; if H is rejected, we conclude that it does not.
Note 2 (Errors in hypothesis testing) For a hypothesis test procedure, the critical region is the part of the sample space where the null hypothesis is rejected. If the test rejects H when it is true, this is a type-1 error; if the test accepts H when it is false, this is a type-2 error. Once the critical region for a test procedure has been decided, it uniquely determines the probabilities of type-1 and type-2 errors. For a given critical region, we denote by α (resp. β) the probability of a type-1 (resp. type-2) error [[72], §1.3]. ■
Naturally, for any statistical hypothesis testing procedure we want a critical region that minimizes the probabilities α and β of type-1 and type-2 errors (see Note 2). However, this implies either tolerating a large value for α or β, or drawing a large number of samples to ensure test accuracy.
Younes [70] suggests that the solution is to use the more relaxed test of H0 : u ≥ ur against H1 : u ≤ ul, where 0 ≤ ul < θ < ur ≤ 1. If H0 (resp. H1) is accepted, we consider H (resp. K) to be true.
Remark 3 (Indifference regions) [ul, ur] is known as the indifference region. If u ∈ [ul, ur] (i.e., when both H0 and H1 are false), we do not care about the result of the test; thus the test procedure is allowed to accept either hypothesis. In practice, we often choose the indifference region [ul, ur] to be of width 2ε (where ε > 0), i.e., [θ - ε, θ + ε].
Our parameter estimation technique uses the Bayesian statistical model checking (BSMC) algorithm developed by Jha et al [16] to algorithmically check whether a probabilistic model satisfies given behavioral specifications. Note that a fully Bayesian statistical model checking algorithm that verifies dynamic Bayesian networks against probabilistic queries was developed by Langmead [73].
Below, we briefly describe the main idea behind BSMC and refer the reader to the literature for details [60, 61, 16].
Bayesian statistical model checking
Recall that we are interested in testing whether a probabilistic model meets a PBLTL specification with a minimum threshold probability, i.e., "Does the model satisfy φ with probability at least θ?"
Assuming that the model's actual probability of satisfying φ is u, we have posed this problem as a hypothesis testing query: test H : u ≥ θ against K : u < θ, and then relaxed it to use indifference regions: H0 : u ≥ ur against H1 : u ≤ ul (where 0 ≤ ul < θ < ur ≤ 1).
If H1 is rejected, we consider that H : u ≥ θ holds, and hence conclude that the model satisfies the specification. If H0 is rejected, we conclude that K : u < θ holds, i.e., the model does not satisfy the specification.
Recall that we do not know the actual probability u with which the model satisfies the specification. For Bayesian testing, we model this unknown probability as a random variable U and assume the availability of a prior density function g(·) that incorporates our existing knowledge of U.
The Bayesian statistical model checking procedure (see Algorithm 1) sequentially draws traces from the model in an i.i.d. fashion until it rejects either H0 or H1. For each sampled trace σi (i ∈ {0, 1, . . .}), it uses a monitoring algorithm to determine whether the path satisfies the BLTL specification φ.
The outcome of each such test of path satisfiability can be represented by a Bernoulli random variable Xi: if σi ⊨ φ then Xi = 1, otherwise (i.e., when σi ⊭ φ) Xi = 0. At each iteration, the algorithm calculates a quantity known as the Bayes factor (BF) that, given an observed sample d = (x1, x2, . . ., xn), xi ∈ {0, 1}, reflects confidence in H0 holding versus H1:

B = P(d | H0) / P(d | H1)    (1)
Algorithm 1 Bayesian statistical model checking (BSMC) algorithm for probabilistic models
Require:
Probabilistic model,
PBLTL specification P≥θ(φ),
Threshold L > 1,
Prior density function g(·) of the unknown probability u,
Indifference region bounds (ε1, ε2), where ε1 > 0, ε2 > 0.
{Note: The indifference region is [θ - ε1, θ + ε2]}
{Note: H0 : u ≥ θ + ε2; H1 : u ≤ θ - ε1}
Ensure:
ans = false if H0 is rejected,
ans = true if H0 is accepted,
n = total number of traces sampled from the model.
1: n ← 0 {Number of traces drawn from the model.}
2: z ← 0 {Number of traces satisfying φ.}
3: repeat
4: Draw sample σ from the model
5: n ← n + 1
6: if σ ⊨ φ then
7: z ← z + 1
8: end if
9: B ← P(d | H0)/P(d | H1) for the observations d drawn so far    (2)
10: until B > L or B < 1/L
11: if B > L then
12: ans ← true
13: else
14: ans ← false
15: end if
16: return (ans, n)
We need to calculate the probability P(d | Hi) of observing the sample d = (x1, x2, . . ., xn) given hypothesis Hi, i ∈ {0, 1}. Therefore, we need to consider all cases where H0 (resp. H1) holds. Assuming the indifference region [ul, ur] to be [θ - ε1, θ + ε2], whether H0 : u ≥ ur or H1 : u ≤ ul holds depends on the actual probability u of satisfying φ. For both cases, we integrate over all possible values of u according to our prior g:

P(d | H0) = (1 / P(H0)) ∫_{ur}^{1} ∏_{i=1}^{n} f(xi | u) g(u) du,

P(d | H1) = (1 / P(H1)) ∫_{0}^{ul} ∏_{i=1}^{n} f(xi | u) g(u) du,

where P(H0) = ∫_{ur}^{1} g(u) du and P(H1) = ∫_{0}^{ul} g(u) du are the prior probabilities of the two hypotheses.
Here, f(xi | u) = u^xi (1 - u)^(1 - xi) is the conditional density of the Bernoulli random variable Xi given the actual probability u, so that ∏_{i=1}^{n} f(xi | u) = u^z (1 - u)^(n - z), where z is the number of satisfying traces. We thus obtain the following expression for the Bayes factor:

B = (P(H1) / P(H0)) · (∫_{ur}^{1} u^z (1 - u)^(n - z) g(u) du) / (∫_{0}^{ul} u^z (1 - u)^(n - z) g(u) du)
In summary, calculating the Bayes factor requires knowing the total number n of traces drawn from the model, the number z of traces that satisfied the specification φ, the indifference region [ul, ur] and the prior density g(·) of the unknown probability u. For more details about the Bayesian model checking algorithm, we refer the reader to the work of Jha [[60], Chapter 4].
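As an end-to-end illustration, the following Python sketch implements the sampling loop of Algorithm 1 under a uniform prior g, computing the Bayes factor by numerically integrating the Bernoulli likelihood over the two hypothesis regions. The stand-in "model" (a biased coin playing the role of trace sampling plus monitoring), the threshold values and the sample budget are all assumptions for the example:

```python
import random

def bayes_factor(n, z, theta, eps1, eps2, grid=2000):
    """Bayes factor for H0: u >= theta + eps2 versus H1: u <= theta - eps1,
    under a uniform prior g on [0, 1]. The Bernoulli likelihood
    u^z (1-u)^(n-z) is integrated over each hypothesis region by the
    trapezoidal rule and normalised by the prior mass of that region."""
    def integral(lo, hi):
        xs = [lo + (hi - lo) * k / grid for k in range(grid + 1)]
        ys = [u ** z * (1.0 - u) ** (n - z) for u in xs]
        return (hi - lo) * (sum(ys) - 0.5 * (ys[0] + ys[-1])) / grid
    u_r, u_l = theta + eps2, theta - eps1
    p_d_h0 = integral(u_r, 1.0) / (1.0 - u_r)   # P(d | H0)
    p_d_h1 = integral(0.0, u_l) / u_l           # P(d | H1)
    return p_d_h0 / p_d_h1

def bsmc(sat_sample, theta, eps1, eps2, L=100.0, max_n=10_000):
    """Algorithm 1: draw i.i.d. satisfaction outcomes x_i from the model
    until the Bayes factor B exceeds L (accept H0) or falls below 1/L."""
    n = z = 0
    while n < max_n:
        n += 1
        z += sat_sample()        # x_i = 1 iff the sampled trace satisfies phi
        B = bayes_factor(n, z, theta, eps1, eps2)
        if B > L:
            return True, n       # model deemed to satisfy P_{>= theta}(phi)
        if B < 1.0 / L:
            return False, n
    raise RuntimeError("no decision within sample budget")

# Hypothetical stand-in for "simulate a trace, then monitor phi": each trace
# satisfies phi with probability u = 0.8, unknown to the algorithm.
rng = random.Random(0)
ans, n = bsmc(lambda: int(rng.random() < 0.8), theta=0.6, eps1=0.05, eps2=0.05)
```

With a Beta prior, the two integrals reduce to regularized incomplete beta functions, so implementations can evaluate the Bayes factor in closed form instead of by numerical integration.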