Skip to main content
  • Methodology article
  • Open access
  • Published:

Accelerated maximum likelihood parameter estimation for stochastic biochemical systems



A prerequisite for the mechanistic simulation of a biochemical system is detailed knowledge of its kinetic parameters. Despite recent experimental advances, the estimation of unknown parameter values from observed data is still a bottleneck for obtaining accurate simulation results. Many methods exist for parameter estimation in deterministic biochemical systems; methods for discrete stochastic systems are less well developed. Given the probabilistic nature of stochastic biochemical models, a natural approach is to choose parameter values that maximize the probability of the observed data with respect to the unknown parameters, a.k.a. the maximum likelihood parameter estimates (MLEs). MLE computation for all but the simplest models requires the simulation of many system trajectories that are consistent with experimental data. For models with unknown parameters, this presents a computational challenge, as the generation of consistent trajectories can be an extremely rare occurrence.


We have developed Monte Carlo Expectation-Maximization with Modified Cross-Entropy Method (MCEM2): an accelerated method for calculating MLEs that combines advances in rare event simulation with a computationally efficient version of the Monte Carlo expectation-maximization (MCEM) algorithm. Our method requires no prior knowledge regarding parameter values, and it automatically provides a multivariate parameter uncertainty estimate. We applied the method to five stochastic systems of increasing complexity, progressing from an analytically tractable pure-birth model to a computationally demanding model of yeast-polarization. Our results demonstrate that MCEM2 substantially accelerates MLE computation on all tested models when compared to a stand-alone version of MCEM. Additionally, we show how our method identifies parameter values for certain classes of models more accurately than two recently proposed computationally efficient methods.


This work provides a novel, accelerated version of a likelihood-based parameter estimation method that can be readily applied to stochastic biochemical systems. In addition, our results suggest opportunities for added efficiency improvements that will further enhance our ability to mechanistically simulate biological processes.


Conducting accurate mechanistic simulations of biochemical systems is a central task in computational systems biology. For systems where a detailed model is available, simulation results can be applied to a wide variety of tasks including sensitivity analysis, in silico experimentation, and efficient design of synthetic systems[1]. Unfortunately, mechanistic models for many biochemical systems are not known; consequently, a prerequisite for the simulation of these systems is the determination of model structure and kinetic parameters from experimental data.

Despite recent advances in experimental methodology, the estimation of unknown kinetic parameters from data is a bottleneck for performing accurate simulations[2]. For deterministic models of biochemical systems, where dynamics are typically described by ordinary differential equations, reliable methods for parameter estimation are relatively abundant[3]. In contrast, parameter estimation for stochastic biochemical systems are less well developed[4]. In recent years it has become increasingly clear that stochasticity plays a crucial role in many biological processes, ranging from bistable genetic switches[57] to robust oscillators[8, 9]. Unlike in the deterministic regime, the dynamics of a stochastic system are described by a probability distribution which cannot usually be obtained analytically (although approximate methods such as finite state projection have been used with some success[10]). Instead, sampling methods like the stochastic simulation algorithm (SSA)[11] are used to generate ensembles of trajectories from the unknown distribution.

Given the probabilistic nature of stochastic biochemical models, a natural approach for parameter estimation is to choose values that maximize the probability of the observed data with respect to the unknown parameters (maximum likelihood estimates or MLEs). In the case of fully observed data, where the number of molecules of each system species is known at all time points, MLEs can be calculated analytically. However, since realistic biochemical systems are discretely and partially observed, computational MLE methods are necessary. One of the earliest examples presented, simulated maximum likelihood (SML), combines a non-parametric density function estimator with Monte Carlo simulation to approximate the likelihood function[12]. To maximize the likelihood, SML uses a genetic algorithm requiring absolute bounds on each of the unknown parameters. Horváth and Manini developed an expectation-maximization (EM) approach (see Methods) which artificially modifies a subset of reactions in simulated trajectories to approximate and maximize the likelihood[13]. However, this method can become increasingly inaccurate as species counts approach zero, and it is not clear how to properly choose the number of reactions to modify at each step. More recently, a histogram-based Monte Carlo simulation procedure was developed to estimate data likelihood[2]. Like the SML method, this approach uses a genetic algorithm to maximize the likelihood, requiring prior parameter bounds. Finally, Wang et al. proposed a method combining stochastic gradient descent (SGD) with a reversible jump Markov chain Monte Carlo sampler to maximize parameter likelihood[4]. The SGD method efficiently and heuristically generates trajectories consistent with observed data, iteratively modifying them via a Metropolis-Hastings step until they closely approximate trajectories from the unknown probability distribution.

Although not strictly an MLE method, Boys et al. developed a Bayesian approach for inferring parameters that employs a Poisson process approximation to efficiently generate trajectories consistent with observed data[14]. Like SGD, this method also incorporates a Metropolis-Hastings sampling step to correct for the approximate nature of the generated trajectories.

All of the above MLE approaches essentially iterate between two steps: (A) approximating a parameter likelihood using Monte Carlo sampling and (B) maximizing that approximation with respect to the unknown parameters using an optimization algorithm. We note that the Bayesian method of Boys et al. also requires extensive Monte Carlo sampling in the manner of step (A). Execution of (A) requires the generation of many system trajectories that are consistent with experimental data. When simulating trajectories of a model with unknown parameters, the generation of even a single trajectory consistent with data can be an extremely rare occurrence. The SML and histogram-based methods[2, 12] mitigate this computational challenge by requiring accurate bounds for each unknown parameter. In contrast, the EM-based, SGD, and Poisson approximation methods[4, 13, 14] reduce simulation cost by generating system trajectories in a heuristic manner. Although these strategies have been successful, parameter bounds are not always available, and it is not clear whether heuristically generated trajectories can be used to accurately and efficiently parameterize all systems. In addition, unlike Bayesian methods, existing MLE approaches only return parameter point estimates without quantifying estimation uncertainty.

In this work, we develop Monte Carlo Expectation-Maximization with Modified Cross-Entropy Method (MCEM2), a novel, accelerated approach for computing MLEs along with uncertainty estimates. MCEM2 combines advances in rare event simulation[1518] with an efficient version of the Monte Carlo EM (MCEM) algorithm[19], and it does not require prior bounds on parameters. Unlike the EM-based, SGD, and Poisson approximation methods above, MCEM2 generates probabilistically coherent system trajectories using the SSA. The remainder of the paper is structured as follows: We first provide derivation and implementation details of MCEM2 (Methods). Next, we apply our method to five stochastic biochemical models of increasing complexity and realism: a pure-birth process, a birth-death process, a decay-dimerization, a prokaryotic auto-regulatory gene network, and a model of yeast-polarization (Results). Through these examples, we demonstrate the superior performance of MCEM2 to an existing implementation of MCEM and the SGD and Poisson approximation methods. Finally, we discuss the distinguishing features of our method and motivate several promising future areas of research (Discussion).


Discrete-state stochastic chemical kinetic system

We focus on stochastic biochemical models that assume a well-stirred chemical system with N species {S1, …, S N }, whose discrete-valued molecular population numbers evolve through the firing of M reactions {R1, …, R M }. We represent the state of the system at time t by the N-dimensional random process X(t) ≡ (X1(t) ,…, X N (t)), where X i (t) corresponds to the number of molecules of S i at time t. Associated with each reaction is its propensity function a j (x)(j = 1, …, M), whose product with an infinitesimal time increment dt gives the probability that reaction R j fires in the interval [t, t + dt) given X(t) = x. The sum of all M propensity functions for a given system state x is denoted a0(x). We restrict our attention to reactions that obey mass action kinetics—i.e. where a j (x) ≡ θ j h j (x) with θ j a positive real kinetic constant and h j (x) a function that quantifies the number of possible ways reaction R j can occur given system state x. Examples of h j (x) include: 1, x1, 1 2 x 1 ( x 1 1), and x1x2 for zeroth-order, unimolecular, homo-bimolecular, and hetero-bimolecular reactions, respectively. Further details on mass action propensity functions can be found in[20].

The "direct method" implementation of Gillespie’s stochastic simulation algorithm (SSA) provides a simple numerical procedure for generating exact system trajectories from their underlying (intractable) probability distribution[11]. The method works by sequentially simulating the time to the next reaction (τ) as an exponential random variable with mean 1/a0(x) and the index of the next reaction (j’) as a categorical random variable with probabilities a j (x) / a0(x)(j = 1, …, M). Given a final time T and initial system state X(0) = x0, application of the direct method yields a reaction trajectory z ( τ 1 , j 1 , , τ r , j r ) , where r is the total number of reactions that happen to fire by time T. Although z is only of length 2r, combining it with x0 allows us to identify the complete system state at any time in the interval [0,T] regardless of how large N and M are. Using the above notation, we can express the likelihood of the complete system trajectory (x0, z) as the following function of the kinetic parameters θ ≡ (θ1, …, θ M ) (see[21] for a detailed derivation):

f θ ( x 0 , z ) = i = 1 r θ j i h j i ( x i 1 ) × exp i = 1 r + 1 τ i j = 1 M θ j h j ( x i 1 ) ,

where τr + 1 is the time interval between the firing of the final reaction and T, and xi−1 is the easily computable system state at the time immediately after the (i−1)st firing event (i.e. whent= l = 1 i 1 τ l for i > 1).

Maximum likelihood parameter estimation

If the true values of the kinetic parameters θ are unknown and we are given a complete system trajectory (x0, z), a natural approach for generating parameter estimates θ ̂ is to choose values of θ that maximize the likelihood with respect to the trajectory (Equation (1)). These maximum likelihood parameter estimates (MLEs) can be analytically computed for each reaction as follows (see[21] for a derivation):

θ ̂ j = r j i = 1 r + 1 h j ( x i 1 ) τ i .

where r j is the total number of times reaction R j fires in z. Although simple, Equation (2) is only useful in the presence of a complete system trajectory. Experimentally observed data are typically much less informative, consisting of the initial system state plus numbers of molecules for a subset of the system species at d discrete time points. We represent these "observed data" with y ( x 0 , x 1 , , x d ) , where x i contains the numbers of molecules of a subset of the N species at some time point t i . Knowledge of any y of finite size is insufficient for reconstructing the complete system trajectory (x0, z) and the corresponding likelihood (Equation (1)); thus, Equation (2) is not a feasible approach for computing MLEs. Instead, we require a method that can accommodate "unobserved data"—i.e., the states of all system species at all times not included in the observed data.

In this work we use the expectation-maximization (EM) algorithm[22] to identify MLEs in the presence of unobserved data. This algorithm suggests the following iterative computation given some θ ̂ ( 0 ) (see[23] for details):

θ ̂ ( n + 1 ) = argmax θ Q ( θ | θ ̂ ( n ) ) argmax θ E log f θ ( x 0 , z ) | y , θ ̂ ( n ) = argmax θ z Z ( y ) g ( z | y , θ ̂ ( n ) ) × log f θ ( x 0 , z ) ,

whereE · | y , θ ̂ ( n ) is the expectation operator taken with respect to the conditional distribution of z given y and θ ̂ ( n ) , Z ( y ) is the set of all valid reaction trajectories that are consistent with y (i.e. trajectories that pass through all observed data points exactly), andg(z|y, θ ̂ ( n ) ) represents the unknown conditional density of z. The theory behind the EM algorithm guarantees that Equation (3) will converge to estimates that locally maximize the observed data likelihood, given n sufficiently large (Section 3 of[22]). Unfortunately, we cannot work with Equation (3) directly, as an explicit evaluation of the summation is intractable. Instead, we use a Monte Carlo extension of EM (MCEM)[24] that samples reaction trajectories using the direct method of the SSA to approximate θ ̂ ( n + 1 ) :

θ ̂ ( n + 1 ) argmax θ k = 1 K I z k ( n ) Z ( y ) × log f θ x 0 , z k ( n )
= argmax θ k = 1 K log f θ x 0 , z k ( n ) ,

where z k ( n ) is the kth SSA trajectory simulated using the parameter vector θ ̂ ( n ) ,I z k ( n ) Z ( y ) is an indicator function taking a value of 1 if z k ( n ) is consistent with y (and 0 otherwise), and K is the total number of simulated trajectories. Equation (4b) presents a simplified expression in which k indexes only the K simulated trajectories that are consistent with the observed data. In practice, we set K to the value that leads to the desired number of consistent trajectories K. We note that Equations (4a) and (4b) describe a rejection sampling approach to generating reaction trajectories conditional to the observed data, in which only those simulated trajectories consistent with data are retained and all others are rejected. In practice, we simulate trajectories incrementally between two data points at a time, further propagating only those trajectories that pass through the second data point exactly. Although this incremental approach is much more efficient than performing rejection sampling across full length trajectories, as we describe below it can still be computationally prohibitive.

By simplifying Equation (4b) with the same procedure used to derive Equation (2)[21], we obtain an iterative, MCEM version of the MLE for each reaction:

θ ̂ j ( n + 1 ) = k = 1 K r j k ( n ) k = 1 K i = 1 r k ( n ) + 1 h j ( x i 1 , k ( n ) ) τ i k ( n ) .

Equation (5) is analogous to Equation (2), with trajectory features having an added subscript k and superscript (n).

An open question in the use of MCEM involves efficient selection of the numbers of consistent trajectories K and iterations n. We adopt the ascent-based MCEM algorithm[19] for this task, which suggests increasing K at each iteration according to an estimate of the current Monte Carlo error and terminating the algorithm when the estimated change in conditional log-likelihood E log f θ ( x 0 , z ) | y , θ ̂ ( n ) passes below a constant threshold. Specifically, we set the initial value of K to 10 and the sample size increment parameters α, β, and k to their respective default values of .25, .25, and 3. We terminate the algorithm when an upper bound of the change in conditional log-likelihood (using γ = .25 ) was less than .005 for three consecutive iterations (see[19] for more details).

Accelerating MLE computation

Equation (5) requires the generation of Ktrajectories that are consistent with observed data. For datasets with closely spaced time points and reasonably accurate initial parameter estimates θ ̂ ( 0 ) , this task may be computationally feasible. For the more realistic case of a sparse dataset and inaccurate values of θ ̂ ( 0 ) , it quickly becomes intractable, as the simulation of even one consistent trajectory is an extremely rare event. In light of this fact, we adapt methods from rare event simulation to substantially accelerate the use of MCEM. Below we describe the incorporation of three techniques: the cross-entropy method, multilevel splitting, and parameter perturbation. Specifically, we employ these three techniques together as a standalone algorithm to quickly compute plausible parameter estimates θ ̂ CE (see below for details). We then use these parameter estimates as input to an otherwise unmodified ascent-based MCEM algorithm, which further refines the estimates until MLEs are obtained. The advantage of this two-step process is that we retain all of the desirable properties of MCEM while dramatically accelerating the time to convergence (due to the use of much more accurate MCEM initial parameter estimates).

The cross-entropy method

The cross-entropy (CE) method was first developed by Rubinstein[15] to accelerate the simulation of stochastic rare events. Since that time, the method has been used in many contexts, including combinatorial optimization[25] and stochastic biochemical modeling[18]. Briefly, the CE method begins by simulating K trajectories using an initial parameter vector θ ̂ ( 0 ) . Next, a subset of ρK trajectories (with ρ [0, 1] and · the ceiling function) that are closest to a given system state (i.e. observed data) is selected and used to compute a better parameter estimate θ ̂ ( 1 ) . This process is then repeated until all ρK subset trajectories reach the given state, upon which the algorithm computes a final parameter vector θ ̂ CE and terminates. Unless otherwise noted in the examples below, we set K = 104 and ρ = .001, which were shown empirically to confer good performance (see Discussion).

When applied to the task of stochastic parameter estimation, the CE method proposes an iterative optimization very similar to Equation (4a):

θ ̂ ( m + 1 ) = argmax θ k = 1 K I ( d ( z k ( m ) , y ) δ ( m ) ) × log f θ ( x 0 , z k ( m ) )

whered( z k ( m ) ,y) is a user-defined function measuring the distance between a simulated trajectory and the observed data, and δ(m) is the (ρ × 100)th quantile of distances achieved by the K simulated trajectories. In this work, we choose d(·,·) to be a normalized L1 distance evaluated at each observed time point for each observed species (i.e. we divide each absolute deviation by the quantity [1 + the value of the corresponding data point]). Upon simplification of Equation (6), we obtain the following expression for each CE reaction parameter:

θ ̂ j ( m + 1 ) = k = 1 K I ( d ( z k ( m ) , y ) δ ( m ) ) × r jk ( m ) k = 1 K I ( d ( z k ( m ) , y ) δ ( m ) ) × i = 1 r k ( m ) + 1 h j ( x i 1 , k ( m ) ) τ ik ( m ) .

Once δ(m) = 0, Equation (7) is used a final time to obtain θ ̂ CE and the algorithm terminates. If we then set θ ̂ ( 0 ) θ ̂ CE for MCEM, we expect that on average only K / ρ total trajectories must be simulated to provide Kconsistent trajectories. Generally speaking, the algorithm is guaranteed to terminate provided ρ and K are sufficiently small and sufficiently large, respectively (see[26] and below for more details). As will be shown below, use of the CE method coupled with MCEM provides enormous computational savings when compared to MCEM initiated with arbitrary parameter values.

Multilevel splitting

If the observed data consist of many time points, simulating a trajectory that passes through all of the data will be extremely unlikely, even when using the true parameter values. Consequently, our CE method will require a very small ρ (with accompanying very large K) in order to converge in a reasonable number of iterations. As a means of reducing this computational expense, we have added a "divide and conquer" approach with respect to the data inspired by multilevel splitting (MS) methods for rare event simulation[16, 17]. MS methods divide rare trajectories leading to a given system state into less rare sub-trajectories passing through intermediate states. Sub-trajectories that reach the intermediate states in a given time are split into multiple copies, while the others are killed with some probability. In this way, an ensemble of simulated trajectories is gradually enriched for those that reach the state of interest.

A natural definition of a sub-trajectory in the context of observed data is the portion of a trajectory from time 0 to a recorded time point t i  ≤ t d . Starting from t = 0 for a given iteration of our CE method, we simulate K trajectories only until the first observed time point, giving rise to the sub-trajectories( z 1 , 1 ( m ) , z 1 , 2 ( m ) ,, z 1 , K ( m ) ), where the first subscript of z i , k ( m ) denotes a sub-trajectory spanning the time interval [0, t i ]. We then compute the distanced( z 1 , k ( m ) , y 1 ) of each sub-trajectory with respect to the first observed data point y 1 ( x 0 , x 1 ) . Sub-trajectories falling in the (ρ × 100)th quantile of distances (where we typically choose ρ = ρ) are "split" by sampling from them with replacement to generate K new trajectories, while the remaining trajectories are killed. The new trajectories are simulated forward to the second observed time point to yield( z 2 , 1 ( m ) , z 2 , 2 ( m ) ,, z 2 , K ( m ) ), and the distancesd( z 2 , k ( m ) , y 2 ) are computed (with y 2 ( x 0 , x 1 , x 2 ) ). As before, sub-trajectories are split according to their distances from the observed data, and the process is continued until trajectories reach the final time point. The resulting K trajectories, enriched for sub-trajectories passing close by observed data, are used as input to Equation (7) to update the parameter estimates, after which the next CE iteration begins. Figure1 illustrates this overall process of splitting combined with the CE method. We note that setting ρ = 1 results in a nearly unmodified CE method as described above, and the amount of trajectory splitting can be easily tuned to the desired level by changing ρaccordingly.

Figure 1
figure 1

Multilevel splitting applied to CE phase of MCEM2. Using θ ̂ ( 0 ) , we first simulate an ensemble of K trajectories from the initial system state (black circle at t = 0) until time t1 (red traces). The ending states of the ρK trajectories closest to the first observed data point (bold red traces) are sampled with replacement to provide starting states for the next simulation interval. We then simulate a second ensemble of K trajectories starting at time t1 until reaching t2. Here, we select the ρK trajectories spanning the interval [0, t2] that are closest to the first and second data points (black circles at times t1 and t2) and use them to initiate the third simulation ensemble. We repeat this process until reaching t4, at which time we compute the first set of parameter estimates θ ̂ ( 1 ) using the ρK trajectories closest to all data points (full length bold red traces). Using θ ̂ ( 1 ) , we begin the process again at t = 0, producing the green traces. Finally, using θ ̂ ( 2 ) to generate the blue traces, we obtain ρK trajectories coinciding exactly with all data points, which we use to compute θ ̂ CE θ ̂ ( 3 ) .

Parameter perturbation

Both the CE method and its MS modification rely on the system’s intrinsic variability to refine parameter estimates. If a system exhibits a low level of variability, each selected subset of ρK trajectories will not lie much closer to the data than the other trajectories. This will result in a slowly progressing algorithm. To overcome this potential problem, we have introduced a parameter λ [0, 1] which we use to independently perturb the components of the current parameter estimate for each simulated trajectory over each of the observed time intervals. We generate θ ~ j , i , k ( m ) (j = 1,…,M; i = 1, …, d; k = 1, …, K) as follows:

θ ~ j , i , k ( m ) U((1λ) θ ̂ j ( m ) ,(1+λ) θ ̂ j ( m ) ),

where U(a, b) is a uniformly distributed random variable with minimum and maximum values a and b, respectively. We simulate each of the d observed time intervals for each of the K trajectories using independently perturbed parameters; thus, Equation (8) is evaluated M × d × K times for each iteration m of our modified CE method. Depending on the magnitude of λ, this procedure generates substantially more variability in each ensemble of sub-trajectories, leading to faster progression of the CE method. Although parameter perturbation is not generally used in rare event simulation, we note that a similar approach is present in iterated filtering versions of sequential Monte Carlo algorithms[27] where the perturbations allow the algorithm to escape from local minima of an objective function. In all examples presented below, we choose λ = .25.

Computing MLE uncertainty estimates

An advantage of using MCEM to identify MLEs is the simplicity with which uncertainty estimates can be computed. In general, MLEs exhibit asymptotic normality; consequently, their covariance matrix can also be estimated using Monte Carlo simulation[23, 28]. In order to insure that parameter confidence bounds derived from the MLE covariance matrix are positive, we introduce the transformed parameters ω j =log θ j (j=1,,M). Due to the functional invariance property of maximum likelihood estimators, ω ̂ j =log θ ̂ j , and by modeling θ ̂ as a log-normally distributed random variable (which is only defined for strictly positive real numbers), ω ̂ becomes multivariate normal with mean vector(log θ 1 ,,log θ M ) and covariance matrix Σ. We can estimate this covariance matrix using the following expression (see[23, 28] for details):

Σ ̂ 1 = 1 K k = 1 K 2 ω 2 log f ω ( x 0 , z k ) + 1 K k = 1 K ω log f ω ( x 0 , z k ) × ω log f ω ( x 0 , z k ) T 1 K k = 1 K ω log f ω ( x 0 , z k ) × 1 K k = 1 K ω log f ω ( x 0 , z k ) T ,

where {·} delimits a matrix, aT represents the transpose of vector a, f ω (·) is equivalent to Equation (1) with exp(ω) substituted for θ, z k is a reaction trajectory simulated using θ ̂ =exp( ω ̂ ), and k indexes only the Ksimulated trajectories that are consistent with the observed data. After some simplification, we arrive at:

Σ ̂ 1 = 1 K k = 1 K i = 1 r k exp ( ω ̂ j ) h j ( x i 1 , k ) τ i k j + 1 K k = 1 K r j k i = 1 r k exp ( ω ̂ j ) h j ( x i 1 , k ) τ i k j × r j k i = 1 r k exp ( ω ̂ j ) h j ( x i 1 , k ) τ i k j T 1 K k = 1 K r j k i = 1 r k exp ( ω ̂ j ) h j ( x i 1 , k ) τ i k j × 1 K k = 1 K r j k i = 1 r k exp ( ω ̂ j ) h j ( x i 1 , k ) τ i k j T

where { · } j is a diagonal matrix with j ranging from 1 to M along the diagonal and ( · ) j is a column vector with j ranging from 1 at the top-most element to M at the bottom. All trajectories in Equation (10) are simulated using parameter values θ ̂ =exp( ω ̂ ).

Upon solving Equation (10) for Σ ̂ , we can compute the coordinates of confidence intervals and ellipses (end points and boundaries, respectively) for ω using the properties of the multivariate normal distribution. We then transform these coordinates by exponentiation to yield (strictly positive) confidence bounds for θ. We note that all of the components of Equation (10) were previously required for computing MLEs using MCEM. In practice, after identifying θ ̂ , we simulate one additional ensemble of trajectories to estimate parameter uncertainties. For all examples described below, we use K =1 04 in this final computation.

To summarize, our proposed method for accelerating MLE identification in stochastic biochemical systems works in three steps: first, it identifies an initial parameter estimate θ ̂ CE using a modified cross-entropy method with multilevel splitting and parameter perturbation; second, it uses this initial estimate as input to ascent-based MCEM, which is run until convergence to yield θ ̂ ; third, it uses this MLE to compute parameter uncertainty estimates via Equation (10). We provide pseudo-code for the complete method below (see Algorithms 1-3), which we refer to as MCEM2: Monte Carlo Expectation-Maximization with Modified Cross-Entropy Method.


We now illustrate the utility of MCEM2 for estimating unknown parameters by applying it to data from five stochastic biochemical model systems: a pure-birth process, a birth-death process, a decay-dimerization, an auto-regulatory gene network, and a model of yeast-polarization. For each model, we first simulate a single system trajectory (with known parameters) using the SSA for a given final time T. Next, we extract data from this trajectory for all species at d equally-spaced time points, where d = T / Δt for a time step Δt. Finally, we run MCEM2 on the dataset and a version of the model where all information about model parameters has been withheld. Unless otherwise noted, we set the initial parameter vector for each system θ ̂ ( 0 ) equal to a vector of all ones. We display point estimates and confidence bounds for each simulation.

Pure-birth process

A system for which MLEs can be computed analytically from discretely observed data is the pure-birth process, also known as a homogeneous Poisson process. The model is given by the single reaction

θ S

with initial conditions x0 = 0. The MLE for a given dataset from this model can be easily computed by dividing the number of molecules of S present at the final time point by the corresponding time: θ ̂ = x d /T. By design, both MCEM2 and standard ascent-based MCEM will also return this MLE (albeit at a greater computational expense), as any version of EM applied to this model ultimately reduces to the exact computation x d / T .

Thus, the only potential difference between MCEM2 and MCEM for this system is the required computing time. To quantify this difference, we generated data for 100 pure-birth models, with θ, the true value, ranging from .01 to 10. For each model, we used T = 1000 and d = 30, givingΔt=33 1 3 . We then applied ascent-based MCEM and MCEM2, both with θ ̂ ( 0 ) =1, to each dataset and ran until convergence. Figure2 displays the computing time for both methods as a function of θ. We see that the time required for MCEM increases dramatically as values of θ depart from θ ̂ ( 0 ) . The rapidly accelerating computational cost for MCEM is due to the rapidly decreasing likelihood of simulating a consistent trajectory as the discrepancy between θ ̂ ( 0 ) and θincreases. As shown in Figure2, MCEM is only feasible to use when θ ̂ ( 0 ) is within a factor of two from θ. In contrast, the computing time for MCEM2 stays approximately constant for values of θ less than 1 and increases relatively slowly for values greater than 1. This cost increase is due to the simulation cost of firing more birth reactions required for larger θ. MCEM2 does not appear to suffer from a cost associated with the discrepancy between θ ̂ ( 0 ) and θ.

Figure 2
figure 2

Computing time of MCEM versus MCEM2for pure-birth process. Red circles and curve fit depict computing time required for MCEM2 to return MLEs for the pure-birth model with θ ̂ ( 0 ) =1 and varying θ values. Blue circles and curve fit depict identical quantities for ascent-based MCEM. Performance of MCEM2 is robust to the discrepancy between initial and true parameter values, while ascent-based MCEM quickly becomes computationally intractable as the discrepancy increases.

We next investigated the accuracy of MCEM2 uncertainty estimates. Figure3 shows the normalized MCEM2 MLEs with 95% confidence intervals (CIs) for all models. Out of 100 CIs, only eight (denoted by blue circles) do not overlap the true values. This figure matches well with the expected number of missed overlaps (100 × (1 − .95) = 5) and suggests that our asymptotic normality assumption for deriving MLE confidence bounds is valid. We note that the relative magnitudes of the CIs decrease with increasing θ; this is due to the diminishing effect of noise on the system as the average number of reaction firings per unit time increases.

Figure 3
figure 3

Pure-birth process MCEM2MLEs and confidence intervals. Colored circles depict MCEM2 MLEs normalized by true parameter values for the pure-birth model with θ ̂ ( 0 ) =1 and varying θ. Error bars denote 95% confidence intervals (CIs) for each model. Out of 100 models tested, only eight (centered at blue circles) do not overlap the true parameter values (green line) whereas the remaining 92 (centered at red circles) enclose the truth. This agrees well with the expected 95/100.

Birth-death process

The second model doubles the number of reactions of the pure-birth process by adding a degradation reaction. The birth-death process takes the form:

θ 1 S S θ 2 .

The presence of a single first order reaction (degradation) renders the analytical calculation of MLEs infeasible. Furthermore, computational parameter identification for the birth-death process is significantly more challenging than for the pure-birth process. This challenge stems from the degeneracy present in a discretely observed dataset: the net increase of a single molecule of S can result from any combination of r + 1 R1 and r R2 reaction firings (where r is a non-negative integer). To evaluate MCEM2 on this system, we first generated single trajectory data for a model with θ = (1, .06) and x0 = 17, where the system starts in stochastic equilibrium. We used T = 200 and d = 40, giving Δt = 5. Figure4 displays the progression of θ ̂ 1 and θ ̂ 2 as a function of MCEM2 iteration. The modified cross-entropy phase of the algorithm required only three iterations (labeled -2, -1, 0), transforming θ ̂ ( 0 ) =(1,1) to θ ̂ ( 3 ) =(4.24,.28). From this point onward, the subset of trajectories given by ρ = .001 were consistent with the data, and the MCEM phase of the algorithm further modified the parameters to their final values θ ̂ =(1.446,.093), which were reached upon satisfying the convergence criterion (marked by black vertical line). Figure4 also includes the results from an additional 100 iterations of MCEM to illustrate the diminishing returns from running the algorithm beyond the convergence criterion. Throughout the MCEM phase, we note that the ratio θ ̂ 2 ( n ) / θ ̂ 1 ( n ) .065, indicating that multiple parameter values satisfying this ratio are sufficient to generate consistent trajectories. Nevertheless, Figure4 demonstrates that substantial parameter refinement is achieved by running MCEM to convergence.

Figure 4
figure 4

Birth-death process MCEM2parameter estimate progression. Green and blue bold lines denote MCEM2 parameter estimates θ ̂ 1 and θ ̂ 2 , respectively, as a function of iteration number. True parameter values are marked by green and blue horizontal dotted lines. The cross-entropy phase completes in three iterations (gray shaded region), followed by 234 iterations of MCEM until convergence (black vertical line). An additional 100 iterations of MCEM are included to illustrate the diminishing returns from running the algorithm beyond convergence. Although the parameter estimates from the first MCEM iteration are far from the true values, their ratio is nearly correct and this ratio is preserved as the estimates are refined toward the true values.

Next, we investigated the effect of appending data at additional time points to the original data set. Figure5 illustrates results from the original and three expanded datasets, all with Δt = 5. We display the MCEM2 MLEs along with 68%, 95%, and 99% confidence ellipses (warped due to exponentiation—see Methods) that represent parameter uncertainty as a function of both parameters. We see that as d increases, θ ̂ approaches θ until at d = 100 they are approximately equal. This trend demonstrates the increasing accuracy of MLEs with increasing d. Furthermore, although the true parameter values are always contained within the 95% confidence ellipses, all of the ellipses shrink in size as d increases. This behavior indicates the reduction in estimate uncertainty resulting from the addition of data points. Finally, all of the ellipses are clearly skewed, with major axes nearly overlapping the line passing through the origin whose slope is the ratio of the true parameter values (.06/1). This geometry shows that most of the uncertainty involves the magnitude of the parameters, whereas their ratio can be determined confidently from relatively few data points. We note that the computational run time of MCEM2 (1 × Intel 3 GHz processor) on each of the four datasets was approximately the same: one hour.

Figure 5
figure 5

Effects of birth-death dataset size on parameter estimates and MCEM2uncertainty. Each panel displays MCEM2 and SGD birth-death MLEs (red and blue circles, respectively) as well as Poisson method point estimates (orange circles) versus the true parameter values (green circles), along with MCEM2 68%, 95%, and 99% confidence ellipses (black curves ranging in size from smallest to largest, respectively). A, B, C, and D display results for datasets of 40, 60, 80, and 100 data points, respectively. The three methods tested identified parameters with comparable accuracy across all datasets. As the numbers of data points increase, the MCEM2 MLEs get closer to the truth and the confidence ellipses shrink in size. The green sloped line plots the ratio θ 2 / θ 1 , highlighting that the uncertainties of the parameter ratio are lower than the uncertainties of the parameter magnitudes. For all datasets, the 95% confidence ellipse encloses the true parameter values.

We also compared MCEM2 performance to that of two recent methods: an MLE method utilizing reversible jump Markov chain Monte Carlo coupled with stochastic gradient descent ("SGD")[4] and a Bayesian method using a Poisson process approximation ("Poisson")[14]. For the former, we used the provided MATLAB package to run SGD with the maximum number of iterations set to 500 and the initial sample size set to 600 (incrementing by 500 every 10 iterations). For the latter, we used the provided C code from the author’s web site implementing the program to run the Poisson method with tuning parameter .05 and total number of iterations 107 (with 105 burn-in iterations and 104 thinning interval). These options were chosen to yield sufficient mixing and convergence properties as evidenced by the diagnostic plots from the R package. We then computed the mean value of each parameter to arrive at point estimates. As with MCEM2, we set θ ̂ ( 0 ) =(1,1) for both methods. Figure5 displays the SGD and Poisson method results for the four birth-death process datasets. When compared to MCEM2, all three methods identified parameters with comparable accuracy, with SGD and Poisson methods performing better when d = 40 and d = 60 and MCEM2 performing better when d = 80 and d = 100. The confidence ellipses generated by the Poisson method were very similar in appearance to those of MCEM2, conveying the same information regarding the ratios of the two parameters (not shown). As noted above, the SGD method did not provide parameter uncertainty estimates. Regarding run time, the Poisson method required between 20 and 60 minutes to identify parameters for the four datasets, while the SGD method needed between 30 minutes and several days (the latter time due to a lack of convergence when using the d = 100 dataset).

We next modified the birth-death process such that the equilibrium value of species S gradually approached zero. Specifically, we created five models with true parameter values θ 1 =.5 and θ 2 taking the increasing values (.1, .5, 1, 2.5, 5). To insure that each system started roughly at stochastic equilibrium, we also set x0 to each of the following values (listed in order): (5, 1, 1, 1, 1). We then generated 20 independent datasets for each of the five models, using T = 25 and d = 25. Figure6 displays boxplots of the mean relative error (calculated as in[4]: 1 M j = 1 M | θ ̂ j θ j |/ θ j ) when applying MCEM2 and the Poisson method to each of these datasets. Although both methods perform equally well for the first three models (when the equilibrium value of S ≥ .5), MCEM2 clearly identifies parameters more accurately than the Poisson method for the last two datasets (when the equilibrium values of S are .2 and .1, respectively). This result illustrates the gradual loss of accuracy of the Poisson approximation for systems in which a species tends stochastically to zero. In contrast, MCEM2, which generates exact system trajectories using the SSA, experiences no such loss of accuracy. Unfortunately, we were unable to evaluate SGD on these modified birth-death process datasets, as the MATLAB package consistently terminated with an error related to the zero molecule count of S.

Figure 6
figure 6

Effects of decreasing birth-death equilibrium on MCEM2and Poisson method performance. Boxplots (displaying median, first and third quartiles, and most extreme data point within 1.5 ×the interquartile range from the box) summarize mean relative errors of MCEM2 and the Poisson method applied to 20 birth-death datasets for each of five models (true parameter values listed on x-axis). Models are sorted in decreasing order of the equilibrium value of S, ranging from 5 to .1. MCEM2 performance does not vary appreciably across the different models, while the Poisson method exhibits increasing error with decreasing equilibrium value.

Decay-dimerization model

The next system contains reactions involving species decay and dimerization. We begin with the following three reactions, where the dimerization step is reversible:

S 1 θ 1 S 1 + S 1 θ 2 S 2 S 2 θ 3 S 1 + S 1

with x0 = (40, 0). We generated ten single-trajectory datasets for a model where θ = (.2, .04, .5), using T = 5 and d = 25. We then modified the model such that the dimerization step is no longer reversible, leading to the following description:

S 1 θ 1 S 1 + S 1 θ 2 S 2 S 2 θ 3

with all other properties unchanged. We again generated ten single-trajectory datasets for this model. Finally, we evaluated MCEM2, the Poisson approximation method, and SGD on each of the 20 datasets. Figure7 displays the results for each of the methods in terms of mean relative error. We see that MCEM2 and the Poisson method perform very similarly in terms of accuracy (as well as run time: between 3 and 10 minutes for both models), with a slightly higher error for the irreversible model. In contrast, use of SGD results in higher errors for both models, with the irreversible model consistently yielding estimates with infinite error. This latter error is due to the estimate of θ1 quickly tending to infinity, regardless of how small we set the initial gradient descent step size. These results highlight a significant limitation of the SGD method: in order to generate a diversity of consistent trajectories, there must exist combinations of reactions that do not alter species counts. The reversible decay-dimerization model contains such a combination (reactions 2 and 3), while the irreversible model does not, leading to a divergent gradient descent.

Figure 7
figure 7

Effects of decay-dimerization model structure on MCEM2, Poisson method, and SGD performance. Boxplots summarize mean relative errors of the three methods applied to 10 decay-dimerization datasets for each of two three-reaction models. The two models differ only in their third reaction (listed on x-axis); the first model contains a reversible dimerization, while the second model does not. MCEM2 and the Poisson method perform similarly across both models, while SGD consistently incurs an infinite mean relative error (due to the estimate of θ1quickly tending to infinity) when applied to the second (irreversible) model.

To further explore the ability of MCEM2 to estimate parameters for a decay-dimerization, we introduced a third model which adds a conversion reaction to the reversible model above. Previously analyzed in[29], the precise system description is as follows:

S 1 θ 1 S 1 + S 1 θ 2 S 2 S 2 θ 3 S 1 + S 1 S 2 θ 4 S 3

with x0 = (1000, 10, 10). We generated single trajectory data for a model where θ = (.2, .04, .5, 1), using T = .1 and d = 5. Figure8 shows the data points for each of the three species. Given that Δt = .02, hundreds of reactions occur before the first observed time point. As the system evolves closer to its steady state, the number of reaction firings decreases, with only dozens of reactions firing between the last two time points. We note that the initial propensity for reaction R2 is nearly 4000 times larger than the propensity of its backwards counterpart R3; consequently, we expect observed data to reflect relatively few R3 firings (and thus contain relatively little information about θ 3 ).

Figure 8
figure 8

Decay-dimerization dataset. Red, green, and blue circles depict initial system states and five data points for species S1, S2, and S3, respectively. This dataset is sparsely observed, as species S1 changes substantially between t = 0 and the first observed time point.

To investigate the impact of parameter perturbation on the performance of MCEM2, we estimated parameters from this decay-dimerization dataset using both λ = .25(default) and λ = 0(no perturbation). Figure9 shows the progression of each parameter during the cross-entropy phase of the algorithm for both default perturbation (solid line) and no perturbation (dotted line). With λ = .25, the CE phase required only 23 iterations before beginning MCEM, whereas setting λ = 0 increased the number of CE iterations to 152. More importantly, the CE phase computing times for perturbation and no perturbation were 59 s and 32 min, respectively, resulting in a 33-fold speedup when perturbing parameters. The reason for this large reduction in computational time is due to the larger parameter values explored by the CE phase without perturbation (see θ ̂ 1 and θ ̂ 3 ), which equates to simulating trajectories with many more reaction firings. By using perturbation, MCEM2 appears to navigate the parameter space more efficiently and hence require much less computational time. We note that three of the four parameters reach approximately the same values at the end of the CE phase in the perturbed and non-perturbed cases, with θ ̂ 3 providing a slight exception. However, as we show below, the large uncertainty associated with θ ̂ 3 prevents us from determining whether this parameter is substantially different between the two cases. We thus conclude that perturbation does not systematically alter the final parameter estimates returned by the CE phase.

Figure 9
figure 9

Effects of parameter perturbation on decay-dimerization cross-entropy phase. Red, blue, green, and orange lines represent MCEM2 parameter estimates θ ̂ 1 , θ ̂ 2 , θ ̂ 3 , and θ ̂ 4 , respectively, as a function of cross-entropy (CE) phase iteration number. Solid lines display parameter values observed using perturbation (λ = .25), while dotted lines depict parameter values obtained without perturbation (λ = 0). Perturbation substantially accelerated completion of the CE phase, both in number of iterations (23 versus 152) and, more strikingly, in simulation time (59 s versus 32 min). Final CE phase parameter estimates were approximately the same whether or not perturbation was used.

Figure10 displays the MLEs and pairwise confidence ellipses computed by MCEM2 when applied to this decay-dimerization dataset. Specifically, MCEM2 returned θ ̂ =(.220,.039,.110,1.006), which represents a 22.8% mean relative error when compared to the truth. For all combinations of parameters, the corresponding 68% confidence ellipses enclose the true parameter values, and apart from θ ̂ 3 these ellipses are relatively compact. As noted above, the uncertainty associated with reaction R3 is much larger than for the other reactions, confirming our hypothesis that the dataset contains substantially less information about the backwards rate of the dimerization.

Figure 10
figure 10

Parameter estimation results for decay-dimerization model. Each panel displays MCEM2 MLEs (red circles) versus the true parameter values θ = (.2, .04, .5, 1)(green circles), along with 68%, 95%, and 99% confidence ellipses. All six pairwise parameter comparisons are shown. The mean relative error for MCEM2 was 22.8%. All MCEM2 confidence ellipses enclose the true parameter values, and uncertainty is relatively low for all estimates except θ ̂ 3 .

Auto-regulatory gene network

To further compare MCEM2 to the Poisson method and SGD, we tested all methods on a system for which SGD was previously shown to perform well: a prokaryotic auto-regulatory gene network[4]. This system contains the following eight reactions, organized as four reversible pairs:

DNA + P 2 θ 1 DNA - P 2 DNA - P 2 θ 2 DNA + P 2 DNA θ 3 DNA + mRNA mRNA θ 4 P + P θ 5 P 2 P 2 θ 6 P + P mRNA θ 7 mRNA + P P θ 8 ,

where DNA, P, P2, and mRNA represent DNA promoters, protein gene products, protein dimers, and messenger RNA molecules, respectively. We set x0 ≡ (DNA, DNA-P2, mRNA, P, P2) = (7, 3, 10, 10, 10) and generated single trajectory data using θ = (.1, .7, .35, .3, .1, .9, .2, .1)with T = 50 and d = 100. Using the same options as before, we applied MCEM2 and SGD to this dataset using θ ̂ ( 0 ) =(1,1,1,1,1,1,1,1). We also applied the Poisson method using total number of iterations 108, with 106 burn-in iterations and 105 thinning interval (these values were increased from before to preserve adequate mixing and convergence). As in previous examples, we initially used ρ = .001 in the CE phase of MCEM2. However, this proportion was not small enough to enable the generation of ρK consistent trajectories for this system (and thus to progress to MCEM). To compensate, we re-ran MCEM2 using ρ = .0001 and K = 105. This time, the CE phase completed easily in five iterations.

Figure11 displays MLEs for all three methods, as well as the MCEM2 pairwise confidence ellipses for the four reversible reaction pairs. We see that all methods estimate most parameters with approximately equal accuracy, although MCEM2 and SGD more accurately determine θ 1 and θ 2 , while the Poisson method and SGD more accurately determine θ 5 and θ 6 . The mean relative errors for MCEM2, SGD, and the Poisson method were 52%, 20%, and 30%, respectively. The MCEM2 95% confidence ellipses enclose all true parameters except θ 5 and θ 6 , and as in the birth-death system, all ellipses attribute most of the uncertainty to knowledge of the magnitudes of parameter pairs rather than their ratios. The ellipses generated by the Poisson method were skewed in the same manner, conveying similar information regarding parameter ratios (not shown). Regarding run times, the Poisson method was by far the fastest, requiring only 1.5 hours to estimate parameters. In contrast, SGD and MCEM2 required 2.3 and 8.7 days on a single processor, respectively, to complete.

Figure 11
figure 11

Parameter estimation results for auto-regulatory gene network. Each panel displays MCEM2 and SGD MLEs and Poisson method point estimates computed using θ ̂ ( 0 ) =(1,1,1,1,1,1,1,1) (red, blue, and orange circles, respectively), true parameter values (green circles), and MCEM2 68%, 95%, and 99% confidence ellipses. A, B, C, and D compare the four reversible pairs of reactions in the system. Mean relative errors for MCEM2, SGD, and the Poisson method were 52%, 20%, and 30%, respectively. MCEM2 95% confidence ellipses enclosed all true parameter values except θ 5 and θ 6 ; like the birth-death system, their skew indicates that the uncertainties of the parameter ratios are lower than the uncertainties of the parameter magnitudes.

In[4], the SGD method was also used to identify parameters from datasets where only a subset of species were observed. We modified our original dataset by removing observed molecule counts for species DNA and DNA − P2 at all time points except t = 0 and re-ran MCEM2. Upon convergence, we obtained θ ̂ =(0.043,0.538,0.302,0.377,0.301,3.103,0.494,0.243) for a 107% mean relative error. This roughly translates to a 2-fold increase in relative error due to a 40% decrease in observed data points. Unfortunately, we were not able to compare to the performances of SGD or the Poisson method, as neither implementation was executable on datasets with missing species.

Yeast-polarization model

The final system we used to evaluate MCEM2 models the pheromone-induced G-protein cycle in Saccharomyces cerevisiae[18, 30]. This model consists of the following eight reactions:

θ 1 R R θ 2 L + R θ 3 RL + L RL θ 4 R RL + G θ 5 G a + G bg G a θ 6 G d G d + G bg θ 7 G θ 8 RL ,

where R, L, and RL represent pheromone receptors, ligands, and receptor-ligand complexes, respectively. Species G corresponds to a G-protein with separate subunits G a , G bg , and G d . We used x0 ≡ (R, L, RL, G, G a , G bg , G d ) = (500, 4, 110, 300, 2, 20, 90) and generated single trajectory data for θ = (.38, .04, .082, .12, .021, .1,.005, 13.21)using T = 5 and d = 15. Figure12 displays the data points for all species. As with the final decay-dimerization model, this dataset is sparsely observed, particularly with respect to species G, G a , and G bg at early time points.

Figure 12
figure 12

Yeast-polarization dataset. Colored circles depict initial system states and 15 data points for all seven species. Like the decay-dimerization dataset, these data are sparsely observed, particularly with respect to species G, G a , and G bg between t = 0 and the first observed time point.

We first tested MCEM2 on this dataset with θ ̂ ( 0 ) =(1,1,1,1,1,1,1,1) and ρ = .001. As with the auto-regulatory gene network, this value of ρ was not small enough to enable the generation of ρK consistent trajectories. Given the greater computational expense of simulating the yeast-polarization model, we decided against reducing ρ and increasing K further until the CE phase converged. Instead, we prematurely terminated the CE phase once the distance from the observed data reached a steady minimum value, and proceeded to MCEM. This occurred at 70 iterations, when δ(m) ≈ .033(see Methods). Although we expected premature entry into MCEM to increase the time required to simulate consistent trajectories in the first few iterations, we did not notice an appreciable trend and MCEM converged (defined here as when the change in conditional log-likelihood was less than .005 for at least one iteration) in 55 iterations. The resulting MLEs and available 68% confidence intervals (CIs) are displayed in Table1. MCEM2 achieved a 34.7% mean relative error, and all determined CIs enclosed the corresponding true parameter values.

Table 1 Yeast-polarization model: parameter estimates and mean relative error (% Error) for MCEM 2 MLEs, SGD MLEs (SGD 2 initialized with values exhibiting 12% mean relative error), and Poisson method point estimates, along with the MCEM 2 and Poisson method 68% confidence intervals (CIs) for each parameter

We next tested the Poisson method on the yeast-polarization dataset, using θ ̂ ( 0 ) =(1,1,1,1,1,1,1,1) and the same options as in the auto-regulatory gene network example. Table1 displays the resulting parameter estimates, along with the 68% CIs. Compared to MCEM2, the Poisson method incurred a 2.7-fold higher mean relative error, and only half of the CIs enclosed the true parameter values. Although less accurate for this example, the Poisson method required substantially less run time than MCEM2: three hours versus 30 days on a single processor. This difference reflects the significant cost of simulating trajectories with the SSA rather than using a Poisson approximation.

Finally, we tested SGD on the yeast-polarization dataset using the same options as in previous examples ("SGD1"). As in the decay-dimerization model, the SGD estimate for one of the parameters (θ5) tended to infinity within nine steps of the algorithm (and thus resulted in an infinite mean relative error), even when using an initial gradient descent step size as small as 10−6(see Table1). We then retested SGD using initial parameter values much closer to the truth (12% mean relative error): θ ̂ ( 0 ) =(.461,.047,.086,.123,.015,.085,.005,12.299)and other options unchanged ("SGD2"). This is in contrast to MCEM2 and SGD1, which were run with initial parameter values set to a vector of ones. As before, the same parameter estimate tended to infinity, although this time 46 steps were required to do so. Although the yeast-polarization system contains combinations of reactions that leave species numbers unchanged, they are evidently not sufficient to allow adequate trajectory generation for a non-divergent gradient descent. Table1 displays both sets of SGD parameter estimates without CIs, as the method does not provide uncertainty estimates.


This work presents MCEM2, a novel enhancement of MCEM that accurately estimates unknown parameters of stochastic biochemical systems from observed data. MCEM2 combines a state of the art, adaptive implementation of MCEM (ascent-based MCEM) with algorithms from rare event simulation (the CE method and multilevel splitting) to substantially accelerate parameter estimation. Unlike a previous application of the EM algorithm to stochastic parameter estimation[13], which performs an error-prone estimation of likelihood via reaction modification, MCEM2 concludes by executing an unmodified MCEM iteration. This places MCEM2 on solid theoretical foundations, with the CE phase of the algorithm serving only to accelerate the eventual MCEM phase. We note that this acceleration is essential for the method to be useful, as the use of unmodified MCEM is computationally tractable only when initial parameter estimates are close to the true values (see Figure2). We demonstrated that the addition of a third technique, parameter perturbation, accelerated execution of MCEM2 even further, without noticeable effects on the resulting parameter estimates. This was true even when using values of λ (denoting the maximum percent perturbation applied to each parameter) other than .25 (results not shown). If we decreased λ toward zero, the CE phase ran progressively slower with the same final results. If instead we increased λ toward one, the CE phase ran faster for some models while requiring larger sample sizes to converge (and thus running slower) for others. This latter effect is due to the increased noise conferred by using larger parameter perturbations. Ultimately, we found that by setting λ = .25, we achieved a useful speedup for all models tested without imposing larger sample size requirements.

MCEM2 requires selection of three additional user-defined quantities to achieve good performance: d(z, y), an observed data distance function; K, the total number of simulated trajectories; and ρ, the proportion of trajectories selected that are closest to observed data. For the former, we chose a normalized L1 distance, intended to provide approximately equal weight to each of the system species. Although this distance function yielded excellent performance, other functions are certainly possible (e.g. sum of squared deviations). However, we note that work performed using the related approximate Bayesian computation (ABC) methods suggests that the resulting parameter estimates are not sensitive to the choice of the distance metric[31]. The latter two parameters dictate the number of trajectories ρK used to refine parameter estimates at each step of the CE phase. Additionally, in order for the CE phase to converge, the proportion of simulated trajectories that are consistent with data in each time interval must be ≥ ρ in the final step. In the first three models tested in this work, we found K = 104 and ρ = .001 to be sufficient for relatively fast completion of the CE phase. For the auto-regulatory gene network model, these values were not adequate to enable the generation of (100 × ρ)% consistent trajectories, and we increased K to 105 and lowered ρ to .0001 to achieve convergence. Similarly, the original values were not sufficient for the yeast-polarization model, although we chose to terminate the CE phase prematurely rather than incur an additional simulation cost by increasing K. This practice did not noticeably impact the time required to execute MCEM iterations, which suggests that the actual proportion of simulated consistent trajectories was only slightly less than .001. In general, we suggest starting with K = 104 and ρ = .001 and increasing K only if computationally favorable. Otherwise, we would recommend terminating the CE phase when the distance from the observed data reaches a steady minimum value. We note that the CE phase of MCEM2 with early termination resembles the ABC method of Toni et al.[31], with two important differences. First, the ABC method requires a user-defined threshold for selecting simulated trajectories based on their distances from observed data, whereas MCEM2 automatically chooses this threshold using the parameter ρ. Second, the method of Toni et al. requires accurate prior bounds on parameter values, whereas MCEM2 needs no prior parameter information. This latter difference also sets our method apart from the SML and histogram-based approaches for identifying MLEs[2, 12], both of which require prior parameter bounds to execute a genetic algorithm.

Another important advantage of MCEM2 over existing MLE methods is the ease with which it can estimate parameter uncertainty. Existing MLE methods return parameter point estimates, but these estimates carry no measures of confidence or interdependency. In contrast, MCEM2 returns a multivariate parameter uncertainty estimate. This estimate indicates correlations between particular parameter estimates (see Figures5 and11), along with measures of the information content of the observed data for each unknown parameter (compare confidence ellipses of θ ̂ 3 to other parameters in Figure10). In order to generate uncertainty estimates, MCEM2 assumes that MLEs are multivariate log-normally distributed, which can be shown to be true as the number of data points increases asymptotically. However, 30 data points appear to be sufficient to satisfy this assumption (Figure3), with possibly as few as five being acceptable (decay-dimerization dataset: Figure10). Of the pairwise confidence ellipses generated in this work (describing estimates of the birth-death process, decay-dimerization, and auto-regulatory gene network), we observed only one instance where the true parameter pair did not reside within the 99% confidence ellipse (parameters θ5 and θ6 of auto-regulatory gene network: Figure11C). Nevertheless, we note that the true parameter values in this case line up with the major axis of the corresponding ellipse, suggesting that MCEM2 was still able to correctly identify the ratio of the parameters. We note that Bayesian approaches like the Poisson approximation method also generate multivariate parameter uncertainty estimates which provide similar information to that given by MCEM2.

We compared MCEM2 to the recently proposed Poisson approximation and SGD approaches by applying all three methods to four examples: birth-death process, decay-dimerization, auto-regulatory gene network, and the yeast-polarization model. Overall, the results demonstrate that MCEM2 performs relatively well for all examples. The first example illustrated that predictions made by the Poisson approximation method increasingly lose accuracy as species molecule counts tend to zero. MCEM2 avoids any such accuracy loss due to its exact simulation of consistent trajectories. The second example illustrated a limitation of the SGD method: to function properly, it requires systems to contain combinations of reactions that do not alter species counts. MCEM2 (as well as the Poisson method) imposes no such requirement. The divergence of the gradient descent in the yeast-polarization model also suggests that the mere presence of these combinations of reactions are not sufficient to lead to good SGD performance.

When functioning correctly on larger systems, an advantage of both SGD and the Poisson approximation method over MCEM2 is their lower required computational time. In particular, SGD ran 3.78-fold faster than MCEM2 for the auto-regulatory gene network, and the Poisson method ran an additional 36.8-fold faster than SGD. On the yeast-polarization model, the Poisson method ran 240-fold faster than MCEM2. These speed-ups are due to both methods’ "simulation free" approaches for generating consistent trajectories, which is advantageous for computationally expensive models. Although the CE phase of MCEM2 typically completes in only a few iterations, the MCEM phase can require ≥ 100 iterations, with each iteration modifying the parameter estimates only slightly. Thus, a modified version of MCEM that takes larger steps in parameter space would further accelerate convergence. Such modifications have previously been described in the literature[28]; consequently, current work focuses on incorporating these modifications into MCEM2. We note that one simple way to reduce the computational time required by MCEM2 is to simulate trajectories in parallel, using either clusters of CPUs (central processing units) or GPUs (graphics processing units). Since each consistent trajectory can be simulated independently of all others, the computation time of each MCEM2 iteration can in principle be reduced to the longest time required to simulate a single consistent trajectory.

One final enhancement that would broaden the applicability of MCEM2 involves accommodating measurement error in the observed data. Implementing this enhancement would be relatively straightforward given probabilistic error with known distribution. In this case, we could simply replace the indicator function in Equation 4b with the corresponding density function of the error, given a simulated trajectory. This modification would substantially improve the efficiency of MCEM2, as any simulated trajectory could now have a nonzero likelihood of generating the observed data (and thus all trajectories could be consistent with observed data). Future work will focus on incorporating this enhancement into MCEM2.


In this work, we developed Monte Carlo Expectation-Maximization with Modified Cross-Entropy Method (MCEM2), a novel method for maximum likelihood parameter estimation of stochastic biochemical systems. Through applying MCEM2 to five example systems, we demonstrated its accurate performance and distinct advantages over existing methods. We expect these advantages to permit analysis of larger and more realistic biochemical models, ultimately providing an improved mechanistic understanding of important biological processes.

Algorithm 1: Pseudo-code for CE phase of MCEM2

1: θ ̂ ( 0 ) (1,1,,1), δ ( 0 ) ,m0

2: while δ(m)> 0 do

3:   m ← m + 1

4:   t0 ← 0

5:   r jk ( m ) 0j,k

6:   for i = 1 to d do

7:     for k = 1 to K do

8:     generate θ ~ i , k ( m 1 ) θ ~ 1 , i , k ( m 1 ) , , θ ~ M , i , k ( m 1 ) byevaluating Equation (8) M times

9:     t ← ti−1

10:    if i = 1 then

11:      x ← x0

12:    else

13:      x ← final state of z i 1 , k ( m )

14:    end if

15:    while t ≤ t i do

16:      compute all h j (x)

17:      generate τ, jusing the SSA with θ ~ i , k ( m 1 ) ,augment z i , k ( m )

18:      t ← t + τ, r j k ( m ) r j k ( m ) +1, update x to reflect the firing of reaction R j

19:    end while

20:      end for

21:      δ ( m ) ( ρ × 100 ) th quantile of d ( z i , 1 ( m ) , y i ) , , d ( z i , K ( m ) , y i )

22:      if i < d then

23:      replace z i , 1 ( m ) , , z i , K ( m ) by sampling withreplacement from the z i , k ( m ) satisfyingd( z i , k ( m ) , y i )δ(m))

24:      end if

25:      end for

26:      compute θ ̂ ( m ) according to Equation (7)

27: end while

28: return θ ̂ CE = θ ̂ ( m )

Algorithm 2: Pseudo-code for MCEM phase of MCEM2

1: θ ̂ ( 0 ) θ ̂ CE , n ← 0

2: while (upper bound of the change in conditionallog-likelihood > .005) do

3:     n ← n + 1

4:     if n > 1 then

5:        increment K as described in[19]

6:     end if

7:     t0 ← 0

8:     r j k ( n ) 0j, k

9:     for i = 1 to d do

10:        for k = 1 to Kdo

11:            t ← ti−1

12:            if i = 1 then

13:              x ← x0

14:            else

15:             x x i 1

16:            end if

17:            while t ≤ t i do

18:     compute all h j (x)

19:     generate τ, jusing the SSA with θ ̂ ( n 1 ) ,augment z i , k ( n )

20:     t ← t + τ, r j k ( n ) r j k ( n ) +1, update x to reflect the firing of reaction R j

21:            end while

22:            ifd( z i , k ( n ) , y i )>0then

23:     reset z i , k ( n ) , r j k ( n ) to values held before step 17

24:     go to step 11

25:            end if

26:        end for

27:     end for

28:     compute θ ̂ ( n ) according to Equation (5)

29: end while

30: return θ ̂ = θ ̂ ( n )

Algorithm 3: Pseudo-code for computing MCEM2 uncertainty estimates

1: t0 ← 0

2: r j k 0j, k

3: for i = 1 to d do

4:     for k = 1 to Kdo

5:        t ← ti−1

6:        if i = 1 then

7:          x ← x0

8:        else

9:         x x i 1

10:        end if

11:        while t ≤ t i do

12:          compute all h j (x)

13:          generate τ, jusing the SSA with θ ̂ , augment z i , k

14:        t ← t + τ, r j k r j k +1, update x to reflect the firing of reaction R j

15:        end while

16:        if d ( z i , k , y i ) > 0 then

17:          reset z i , k , r j k to values held before step 11

18:          go to step 5

19:        end if

20:     end for

21: end for

22: compute Σ ̂ according to Equation (10)

23: return Σ ̂


  1. Tan C, Song H, Niemi J, You L: A synthetic biology challenge: making cells compute. Mol Biosyst 2007, 3(5):343–53. 10.1039/b618473c

    Article  CAS  PubMed  Google Scholar 

  2. Poovathingal SK, Gunawan R: Global parameter estimation methods for stochastic biochemical systems. BMC Bioinf 2010, 11: 414. 10.1186/1471-2105-11-414

    Article  Google Scholar 

  3. Moles CG, Mendes P, Banga JR: Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Res 2003, 13(11):2467–74. 10.1101/gr.1262503

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  4. Wang Y, Christley S, Mjolsness E, Xie X: Parameter inference for discretely observed stochastic kinetic models using stochastic gradient descent. BMC Syst Biol 2010, 4: 99. 10.1186/1752-0509-4-99

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. McAdams HH, Arkin A: Stochastic mechanisms in gene expression. Proc Natl Acad Sci USA 1997, 94(3):814–9. 10.1073/pnas.94.3.814

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  6. McAdams HH, Arkin A: It’s a noisy business! Genetic regulation at the nanomolar scale. Trends Genet 1999, 15(2):65–9. 10.1016/S0168-9525(98)01659-X

    Article  CAS  PubMed  Google Scholar 

  7. Tan C, Marguet P, You L: Emergent bistability by a growth-modulating positive feedback circuit. Nat Chem Biol 2009, 5(11):842–8. 10.1038/nchembio.218

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  8. Vilar JMG, Kueh HY, Barkai N, Leibler S: Mechanisms of noise-resistance in genetic oscillators. Proc Natl Acad Sci U S A 2002, 99(9):5988–92. 10.1073/pnas.092133899

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  9. El-Samad H, Khammash M: Coherence Resonance: A Mechanism for Noise Induced Stable Oscillations in Gene Regulatory Networks. Decision and Control, 2006 45th IEEE Conference on 2006, 2382–2387.

    Google Scholar 

  10. Munsky B, Khammash M: Identification from stochastic cell-to-cell variation: a genetic switch case study. IET Syst Biol 2010, 4(6):356–66. 10.1049/iet-syb.2010.0013

    Article  CAS  PubMed  Google Scholar 

  11. Gillespie DT: Exact stochastic simulation of coupled chemical reactions. J Phys Chem 1977, 81(25):2340–2361. 10.1021/j100540a008

    Article  CAS  Google Scholar 

  12. Tian T, Xu S, Gao J, Burrage K: Simulated maximum likelihood method for estimating kinetic rates in gene expression. Bioinformatics 2007, 23: 84–91. 10.1093/bioinformatics/btl552

    Article  CAS  PubMed  Google Scholar 

  13. Horváth A, Manini D: Parameter Estimation of Kinetic Rates in Stochastic Reaction Networks by the EM Method. BMEI (1) 2008, 713–717.

    Google Scholar 

  14. Boys RJ, Wilkinson DJ, Kirkwood TBL: Bayesian inference for a discretely observed stochastic kinetic model. Stat Comput 2008, 18(2):125–135. 10.1007/s11222-007-9043-x

    Article  Google Scholar 

  15. Rubinstein RY: Optimization of computer simulation models with rare events. Eur J Operational Res 1997, 99: 89–112. 10.1016/S0377-2217(96)00385-2

    Article  Google Scholar 

  16. Glasserman P, Heidelberger P, Shahabuddin P, Zajic T: Multilevel splitting for estimating rare event probabilities. Operations Res 1999, 47(4):585–600. 10.1287/opre.47.4.585

    Article  Google Scholar 

  17. Rubino G, Tuffin B: Rare Event Simulation Using Monte Carlo methods. Wiley, Chichester; 2009.

    Book  Google Scholar 

  18. Daigle Jr BJ, Roh MK, Gillespie DT, Petzold LR: Automated estimation of rare event probabilities in biochemical systems. J Chem Phys 2011, 134(4):044110. 10.1063/1.3522769

    Article  Google Scholar 

  19. Caffo BS, Jank W, Jones GL: Ascent-based Monte Carlo expectation-maximization. J R Stat Soc Ser B 2005, 67(2):235–251. 10.1111/j.1467-9868.2005.00499.x

    Article  Google Scholar 

  20. Gillespie DT: Stochastic simulation of chemical kinetics. Annu Rev Phys Chem 2007, 58: 35–55. 10.1146/annurev.physchem.58.032806.104637

    Article  CAS  PubMed  Google Scholar 

  21. Wilkinson DJ: Stochastic modelling for systems biology. Chapman and Hall/CRC mathematical and computational biology series, Boca Raton: Taylor and Francis; 2006.

    Google Scholar 

  22. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM Algorithm. J R Stat Soc Series B (Methodological) 1977, 39: 1–38.

    Google Scholar 

  23. Robert CP, Casella G: Monte Carlo Statistical Methods. Springer, New York; 2004.

    Book  Google Scholar 

  24. Wei G, Tanner M: A Monte-Carlo implementation of the EM Algorithm and the poor man’s data Augmentation Algorithms. J Am Stat Assoc 1990, 85(411):699–704. 10.1080/01621459.1990.10474930

    Article  Google Scholar 

  25. Rubinstein RY, Kroese DP: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer, New York; 2004.

    Book  Google Scholar 

  26. Homem-de Mello T, Rubinstein RY: Rare event estimation for static models via cross-entropy and importance sampling.. Ohio State University; 2002.

    Google Scholar 

  27. Ionides EL, Bretó C, King AA: Inference for nonlinear dynamical systems. Proc Natl Acad Sci USA 2006, 103(49):18438–43. 10.1073/pnas.0603181103

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  28. Tanner MA: Tools for statistical inference: methods for the exploration of posterior distributions and likelihood functions. Springer, New York; 1996.

    Book  Google Scholar 

  29. Gillespie DT: Approximate accelerated stochastic simulation of chemically reacting systems. J Chem Phys 2001, 115(4):1716–1733. 10.1063/1.1378322

    Article  CAS  Google Scholar 

  30. Drawert B, Lawson MJ, Petzold L, Khammash M: The diffusive finite state projection algorithm for efficient simulation of the stochastic reaction-diffusion master equation. J Chem Phys 2010, 132(7):074101. 10.1063/1.3310809

    Article  PubMed Central  PubMed  Google Scholar 

  31. Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf MPH: Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J R Soc Interface 2009, 6(31):187–202. 10.1098/rsif.2008.0172

    Article  PubMed Central  PubMed  Google Scholar 

Download references


We thank Matthew Wheeler for useful suggestions and comments on this work. We also acknowledge the following financial support: BJDJ and LRP were supported by the Institute for Collaborative Biotechnologies through grant W911NF-09-0001 from the U.S. Army Research Office. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. MKR and LRP were supported by NIH Grant No. 5R01EB007511-03 and DOE Grant No. DE-FG02-04ER25621. LRP was also supported by NSF Grant No. DMS-1001012.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Jarad Niemi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Conceived and designed the experiments: BJDJ MKR LRP JN. Performed the experiments: BJDJ MKR. Wrote the paper: BJDJ LRP JN. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Daigle, B.J., Roh, M.K., Petzold, L.R. et al. Accelerated maximum likelihood parameter estimation for stochastic biochemical systems. BMC Bioinformatics 13, 68 (2012).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: