Skip to main content
  • Methodology Article
  • Open access
  • Published:

pSpatiocyte: a high-performance simulator for intracellular reaction-diffusion systems



Studies using quantitative experimental methods have shown that intracellular spatial distribution of molecules plays a central role in many cellular systems. Spatially resolved computer simulations can integrate quantitative data from these experiments to construct physically accurate models of the systems. Although computationally expensive, microscopic resolution reaction-diffusion simulators, such as Spatiocyte can directly capture intracellular effects comprising diffusion-limited reactions and volume exclusion from crowded molecules by explicitly representing individual diffusing molecules in space. To alleviate the steep computational cost typically associated with the simulation of large or crowded intracellular compartments, we present a parallelized Spatiocyte method called pSpatiocyte.


The new high-performance method employs unique parallelization schemes on hexagonal close-packed (HCP) lattice to efficiently exploit the resources of common workstations and large distributed memory parallel computers. We introduce a coordinate system for fast accesses to HCP lattice voxels, a parallelized event scheduler, a parallelized Gillespie’s direct-method for unimolecular reactions, and a parallelized event for diffusion and bimolecular reaction processes. We verified the correctness of pSpatiocyte reaction and diffusion processes by comparison to theory. To evaluate the performance of pSpatiocyte, we performed a series of parallelized diffusion runs on the RIKEN K computer. In the case of fine lattice discretization with low voxel occupancy, pSpatiocyte exhibited 74% parallel efficiency and achieved a speedup of 7686 times with 663552 cores compared to the runtime with 64 cores. In the weak scaling performance, pSpatiocyte obtained efficiencies of at least 60% with up to 663552 cores. When executing the Michaelis-Menten benchmark model on an eight-core workstation, pSpatiocyte required 45- and 55-fold shorter runtimes than Smoldyn and the parallel version of ReaDDy, respectively. As a high-performance application example, we study the dual phosphorylation-dephosphorylation cycle of the MAPK system, a typical reaction network motif in cell signaling pathways.


pSpatiocyte demonstrates good accuracies, fast runtimes and a significant performance advantage over well-known microscopic particle methods in large-scale simulations of intracellular reaction-diffusion systems. The source code of pSpatiocyte is available at


Intracellular space plays an important role in many biochemical systems operating in the timescales of minutes to hours such as cell signaling [1], division [2], polarization [3], morphogenesis [4] and chemotaxis [5, 6]. These systems are regulated by the spatiotemporal dynamics of molecules. Recent quantitative biology methods can obtain high-resolution spatiotemporal measurements of the molecules. These disparate sources of measurements can be combined and interpreted using spatially resolved simulators to construct models of the systems that are realistic and consistent with physical principles. By simulating the models, detailed analysis of the systems can be conducted in silico and future experiments can be designed [7].

The choice of spatial simulators largely depends on the timescale and spatial resolution of the system of interest [813]. For example, high-performance molecular dynamics (MD) simulators [14] can accurately capture the atomistic behavior of up to several macromolecules in a system but are limited in their timescale, allowing only simulations for up to a few milliseconds [15, 16]. It is thus not feasible to use MD for example, to simulate cell signaling systems such as the mitogen-activated protein kinase (MAPK) cascade, which takes place at the cellular scale with timescales spanning minutes to hours [17, 18]. For these longer spatial and temporal scales, numerical methods that solve partial differential equations (PDEs) or coarse-grained stochastic particle simulation methods can be used. PDE-based tools such as the freely available Virtual Cell [19] and the commercially available COMSOL Multiphysics (COMSOL Inc.) are useful when the system of interest is deterministic. They are especially fast and convenient when simulating molecules with very high copy numbers. Conversely, particle methods are preferable when we need to account for the noise and fluctuations arising from low copy number of reacting molecules in the cell [20].

Lattice-based particle methods based on the reaction-diffusion master equation (RDME) have the advantage to simulate a large number of diffusing molecules for extended spatiotemporal scales [2125]. However, since RDME methods represent molecules as dimensionless point particles in lattice voxels, they do not directly capture the effects of excluded volume brought by intracellular macromolecular crowding [26]. About 20–30% of the total volume inside cells are occupied by macromolecules [27]. This amount of crowding has been shown to affect reaction equilibria both in vivo and in vitro, and alter protein binding and gene expression characteristics [2831]. Moreover, crowded media can also cause non-intuitive effects such as molecules performing directed motion [32], and a change in the statistics of molecular number fluctuations in simple reactions [33]. To capture the effects of volume exclusion in systems that are in equilibrium, Cianci and colleagues [34] recently reported a modified version of the RDME method called vRDME. Despite this enhancement, RDME-based methods are still constrained when simulating diffusion-limited reactions and rebinding events [3537] because they assume molecules to be well-mixed in each voxel.

Off-lattice microscopic particle methods such Smoldyn [38], eGFRD [36], SpringSaLaD [39] and ReaDDy [40] can capture the effects of crowding directly because each molecule is represented individually with sphere-like physical dimensions. These simulators also support different sizes of volume excluding molecules. ReaDDy and SpringSalaD can also account for the coarse shape of molecules. However, because of these additional details, microscopic particle methods are more computationally demanding than RDME methods. In a recent performance benchmark of the microscopic methods [41], Smoldyn required the shortest runtime when simulating the well-known Michaelis-Menten reaction-diffusion kinetics.

Spatiocyte [42] is another microscopic method but molecules diffuse on lattice by hopping from one voxel to another. The current stable version of Spatiocyte accounts for volume exclusion by allowing only a single molecule to occupy a voxel at a time. The maximum size of a molecule is roughly equals to the size of a voxel [43]. The method therefore captures steric interactions most realistically when the size of volume excluding molecules is the same and almost equals to that of a voxel. To better simulate the effects of crowding, there is also a development version of Spatiocyte that allows a single molecule to occupy multiple voxels according to its size (unpublished). In Spatiocyte, fine and fast-diffusing molecules such as messengers, metabolites and ions are simulated at the compartment scale using the Next-Reaction method [42, 44]. The accuracy and consistency of Spatiocyte have been validated in detail recently in both volume [43] and surface [45] compartments. Spatiocyte achieves better execution time scaling behavior compared to other methods [43] because it resolves molecular collisions by looking only at the target voxel for occupancy. At the typical intracellular protein concentration range, the performance of Spatiocyte is comparable to Smoldyn when molecules are represented as point particles. On the other hand, when the molecules have physical dimensions, the runtime of Spatiocyte is at least two times faster than Smoldyn. These recent performance benchmarks [41, 43] imply that at present Spatiocyte is one of the fastest methods for simulating individual molecules in crowded systems.

There have been several efforts in the past to improve the performance of particle simulation methods using parallelization approaches. RDME methods have been accelerated on Graphics Processing Units (GPUs) [4648] and CPU clusters [49]. The GPU-based implementations of RDME can simulate up to two orders of magnitude faster than the serial version on CPU. Chen and De Schutter [49] used Message Passing Interface (MPI) to run a neuron model and achieved 500-fold speedup on a cluster with 1000 processes. Microscopic methods such as ReaDDy [50] and Smoldyn [51, 52] have also been parallelized on GPUs. The performance gain of ReaDDy was up to 115 times over its serial counterpart on CPU. The GPU versions of Smoldyn required between 135- to 200-fold shorter runtimes than the original CPU implementation. Recently, the ReaDDy method was extended to run using multiple threads in parallel on CPU [53]. It showed 6-fold reduction in the simulation time when running with 6 threads compared to the serial implementation.

Here we introduce a parallel implementation of the Spatiocyte method, called pSpatiocyte. The new algorithm was written bottom-up in C++ and MPI to exploit the resources of conventional workstations and massively parallel computers for high-performance simulations of large or crowded cell models. We demonstrate efficient simulations of reaction-diffusion systems at the microscopic scale with volume occupying molecules to recapitulate the crowded nature of intracellular media. We achieve scalability over 500,000 CPU cores on a distributed memory architecture.

In the following section we describe the parallelization schemes and numerical implementation of pSpatiocyte. We then provide computational results that validate parallel diffusion and reaction processes. We also demonstrate the performance of pSpatiocyte on the RIKEN K computer with thousands of cores and on a common workstation with eight cores. We show the applicability of pSpatiocyte in actual biological problems by simulating the dual phosphorylation-dephosphorylation cycle of MAPK. Finally, we conclude by providing a summary of the validation and performance results, and future directions of this work.


In a Spatiocyte model, a molecule of a species si diffuses in space by performing random walk on lattice from one voxel to a nearest neighbor voxel. The diffusion interval, \(\tau _{d}^{i}\) between two successive walks is determined by the diffusion coefficient Di. When the molecule collides with a molecule of a reactant species sj in the target voxel, they perform a bimolecular reaction with an acceptance probability, Wij. Wij captures the intrinsic reaction rate kij of the pair according to the Smoluchowski-Collins-Kimball (SCK) model [54, 55]. The accuracy and consistency of the bimolecular reaction on lattice in both activation-limited and diffusion-limited regimes have been verified [43, 45]. To represent volume occupying molecules, a voxel can be occupied by only a single molecule at any given time. As a result, if the diffusing molecule meets a non-reactive molecule in the target voxel, a collision occurs and it remains in its original voxel. In the following subsections, we describe the parallelization schemes of the Spatiocyte method in detail.

Coordinate system

Spatiocyte adopts the hexagonal close-packed (HCP) lattice arrangement (Fig. 1a) as it supports the highest density of sphere voxels in a given volume [56]. For comparison, the average density of HCP voxels is 74.048%, whereas the more commonly used cubic lattice has a density of 52.359%. The highest density of voxels is preferable because it allows the simulator to represent highly packed and crowded regions in a compartment to its maximum theoretical limit. Moreover, it was recently demonstrated that the voxels in HCP lattice need to be only about 2% larger than the molecule for the simulations to be consistent with the SCK model [43]. This is in contrast to the cubic lattice, which requires the voxel size to be at least 8% larger. The HCP lattice therefore, can more closely represent hard-sphere molecules in space. For two-dimensional (2D) planar simulations, Spatiocyte employs the triangular lattice arrangement, which is a plane of the HCP lattice [45]. Grima and Schnell [57] have previously shown that simulations on a triangular lattice are closer to Brownian dynamics and produce less discretization error than on square lattice. Simulations on square lattice also overestimate macromolecular crowding effects compared to the triangular lattice.

Fig. 1
figure 1

HCP lattice arrangement and a coordinate system for accessing voxels. a A schematic view of HCP lattice. Left: odd and even planes, shown in different colors. Right: the twelve nearest neighbors (green) of a voxel (yellow). b 2D slice of an HCP coordinate system based on a unit cell. A compartment mapped along the axes i and j makes a rhomboid instead of a rectangle. A tilted third axis, k (not shown) further contorts the boundaries of the compartment. c Top row: accessing neighbor voxels in a twisted Cartesian coordinate system for the odd plane in k-axis. Bottom row: another stencil is used for the even plane

Although HCP lattice has a regular grid arrangement of voxels, some considerations are necessary to define the coordinate axes to access the voxels. In Fig. 1b, the axes i and j aligned to the voxels in a plane of HCP lattice make up a rhomboid instead of a rectangle. The third axis k is also tilted when aligned to the voxels across the planes of the lattice. Such unusual axes arrangement requires substantial computational steps to convert the integer coordinates of a voxel into real coordinates. A data structure without a coordinate system can also be used to identify and access neighbor voxels. For example, a one-dimensional array of voxels can be used, wherein each voxel has pointers to its 12 nearest neighbors. The serial version of Spatiocyte adopts this scheme [42]. From our performance profiling results, the additional memory required to store 12 pointers per voxel and indirect memory accesses to load neighbor voxel data adversely impacts performance because of increased memory bandwidth usage and cache misses. In addition, with an unconventional grid it can also be cumbersome to split the computational domain spatially into smaller subdomains (green, red, orange and blue regions in Fig. 1b) for parallel execution by processes.

To overcome these issues, we propose a coordinate system called twisted Cartesian as depicted in Fig. 1c. It comprises a straight and two zigzag axes. The figure illustrates how each of the 12 neighbors is identified with these axes. Depending on whether the voxel plane, k is even- or odd- numbered, one of two procedures is used in i- and j-axes to identify a neighbor voxel. The two procedures are shown by the top and bottom panels of Fig. 1c. This coordinate system can be readily mapped onto a conventional Cartesian coordinate system without almost any modification and enables straightforward programmability. Despite being unconventional, the twisted Cartesian coordinate system works well for identifying and accessing neighbor voxels pointer-free by only using conditional statements. In a preliminary implementation of pSpatiocyte, we used the system successfully for parallelization [58]. Recently, a similar approach was adopted for cellular automata simulations in 2D space [59].

Parallelized Spatiocyte algorithm

The Spatiocyte method advances the simulation time, ts in discrete steps using an event scheduler [42, 60]. The scheduler gets the next event, e to be executed from a priority queue Q, which contains events sorted according to scheduled times, te. The types of events that can be defined in a model include walk (performs diffusion and bimolecular reactions), unimolecular reaction, species numbers logger and molecule coordinates logger. Upon execution, an event returns the interval, τe for its next execution. The next execution time, te=ts+τe is then passed to the priority queue to reschedule the event. The scheduler executes all events in a loop until the simulation end time, tend is reached.

The pSpatiocyte method is a parallelized version of Spatiocyte and is completely written in C++. The method is parallelized using the domain decomposition approach illustrated in Fig. 2a. With this approach, the complete lattice space is divided equally into N subdomains to be executed concurrently by N processes. To minimize synchronization overheads between processes, the scheduler and events are duplicated across all processes at initialization, which is described in Algorithm 1. The scheduler of each process then executes the main loop of the simulation in parallel according to Algorithm 2. The simulation proceeds synchronously over all processes by ensuring that (1) the scheduled execution time of each event is identical across processes; and (2) each process synchronizes with adjacent processes when molecules from local subdomain walk or react across adjacent subdomains. We describe how these two conditions are satisfied by each event in the following subsections.

Fig. 2
figure 2

Schemes for large-scale parallel simulation of particles. a Domain decomposition of an HCP lattice plane. A 122 plane with reflective boundary is equally divided into four subdomains. Each subdomain measuring 62 voxels is allocated to one of four available processes, P0, P1, P2 and P3. Subdomain voxels adjoining other subdomains are defined as out voxels. Ghost voxels are added locally to each subdomain to encapsulate out voxels. The ghost voxels serve to reflect the state of out voxels residing in adjacent subdomains. b Subdomain division into subvolumes and three-stage inter-process communication. Each subdomain in (a) is divided into four equal subvolumes (eight subvolumes, if 3D subdomain). To avoid biased walk events, one of the four subvolumes is randomly chosen before the corresponding local subvolume is executed simultaneously by the four processes. In the above example, subvolume 3 was selected randomly and it is currently being executed in parallel by the four processes. Ghost voxels will be updated using the three-stage communication scheme before they are accessed. The scheme updates the voxels consecutively in x- and y-directions (and z-direction, if 3D subvolume). After performing the walk and reaction events in the subvolume, the out voxels in adjacent subdomains will be updated to reflect the state of local subvolume ghost voxels. The updates will be performed successively in (z-,) y- and x-directions. In the example above, the state of an out voxel of P3 subvolume 0 is first transferred to a ghost voxel of P2 subvolume 1 in x-direction before it is communicated to the ghost voxel in P0 subvolume 3 in y-direction. Conversely, the state of the out voxel is updated in reverse, first in y-direction, and then in x-direction

Parallelized walk event

Einstein [61] and von Smoluchowski [62] have independently shown that small particles in one-dimensional (1D) system perform Brownian walk with a root-mean-squared displacement of \(\sqrt {2Dt}\), where t is the interval between walks and D is the diffusion coefficient of the particles. The 1D relation can be expressed in three-dimensional (3D) space with the mean-squared displacement (MSD) given as 6Dt. Similarly, in the 3D HCP lattice space, the displacement of a molecule of species si within a diffusion interval \(\tau _{d}^{i}\) must be consistent with its diffusion coefficient Di. Since the molecule displacement over the interval is equivalent to the voxel diameter, we can use the MSD relation in 3D to obtain the interval, \(\tau _{d}^{i}=2r_{v}^{2}/(3D_{i})\), where rv is the voxel radius. We have previously shown that this approach is accurate for modeling bimolecular reactions on HCP lattice when rv≈1.0209R, where R is the molecule radius [43].

Bimolecular reactions are handled by the walk event because they take place upon the collision of two reactant molecules on lattice during diffusion. Since the reaction acceptance probability, \(W_{ij}=k_{ij}/(6\sqrt {2}(D_{i}+D_{j})r_{v})\) is inversely proportional to the diffusion coefficients of reactants, highly diffusion-limited reactions can cause Wij>1 [42, 43]. To address this inaccurate condition, we first determine the maximum Wij of all bimolecular reactions involving si and assign it as ρi. Then, we obtain the walk probability,

$$ \alpha_{i} = \left\{\begin{array}{ll} P_{i}/\rho_{i}, & \text{if \ }\rho_{i}>P_{i}\\ 1, & \text{otherwise} \end{array}\right\} $$

where Pi is a user-defined upper limit of reaction acceptance probabilities and 0<Pi≤1 (by default, Pi=1). Next, we rescale the reaction probabilities to Wij=Wijαi. Finally, we obtain the walk interval as \(\tau _{e}^{i}=\tau _{d}^{i}\alpha _{i}\). This walk interval is fixed and identical across all processes throughout the simulation procedure. With the above bimolecular reaction scheme, we have previously shown that the rebinding-time probability distribution of a reactive molecule pair on HCP lattice agrees well with continuum theory (see Section III.A of [43]). We have also verified the accuracy of the reaction rate coefficient and its time-dependent behavior by comparison to SCK theory. Consequently, reactions that are not highly diffusion-limited (Wij1, αi=1) will have a smaller impact on the simulation performance since they will be far fewer than diffusion steps in the microscopic lattice space.

One of the difficult problems in parallelizing stochastic diffusion and reaction events is maintaining consistency at subdomain boundaries during simulation time steps. Processes should take careful consideration when accessing or writing to voxels residing in adjacent subdomains since they are also simultaneously accessible to the adjacent processes. Figures 2a and b illustrate our scheme to achieve consistency during walk and reaction events. We define the voxels at the edge of a subdomain and adjoining other subdomains as out voxels. We add a virtual set of voxels called ghost voxels locally in each subdomain to represent the current state of out voxels residing in adjacent subdomains. With updated ghost voxels, molecules in a subdomain can walk and react across subdomains seamlessly in a time step without requiring many inter-process synchronization requests. At the end of a walk event, the state of out voxels in adjacent subdomains will be updated to reflect the state of local ghost voxels. Since in a walk event a molecule can at most hop to or react with a molecule in one of its immediate neighbor voxels, only a single layer of ghost voxels is necessary to encapsulate local out voxels (Fig. 2b).

To ensure the updated state of ghost voxels remains valid until the end of the walk event, we further divide the subdomain equally into eight subvolumes and execute each subvolume synchronously with all processes. In Fig. 2b, four of the subvolumes are shown for each subdomain. Only the ghost voxels belonging to the selected subvolume is updated before executing the molecules in the subvolume. This scheme ensures out voxels in adjacent subdomains are isolated and free from modification when their corresponding ghost voxels in the local subvolume are accessed.

Algorithm 3 provides the complete pseudocode of the walk and bimolecular reaction procedure in a subdomain. For the walk event, pSpatiocyte uses two random number generators. The first generator is initialized with a seed that is unique to each process, whereas the second generator is initialized with a global seed. With the globally seeded generator, a random number that is drawn locally will be the same for all processes. This scheme reduces communication cost when we need a common random number for all processes. Unless stated otherwise, all random numbers are drawn using the locally seeded generator from a uniform distribution with the interval [0,1). Both generators use the Mersenne Twister algorithm [63] to generate random numbers.

The walk event executes the random walk of all molecules of a species when it is called. For each molecule m of species si, a random target voxel, v1 out of 12 neighbor voxels is first selected. If v1 is a ghost voxel, m is appended to a list \(M_{i}^{g}\), containing molecules targeting ghost voxel, while v1 is added to Vg, a list of the targeted ghost voxels. However, if v1 is not a ghost voxel and is vacant, a random number r is drawn. If r is less than or equals to the species walk probability αi, the walk is successful and m is moved to v1. Otherwise, if v1 contains a reactant pair of species sj, then a random number r is drawn. If r is less than or equals to the bimolecular reaction probability Wij, then the reaction is performed. If v1 is instead occupied by a non-reactive molecule, a collision occurs and m stays in its current voxel.

After completing the above procedure for all molecules of si, a list containing the execution order of eight subvolumes is randomly shuffled using the globally seeded random number generator. For each subvolume Vh in the ordered list, we first update its ghost voxels by loading the state of corresponding out voxels from adjacent subdomains. The update is performed using the MPI Sendrecv function. Next, for each molecule m in \(M_{i}^{g}\) that is in the current subvolume Vh, we get its target ghost voxel v1 from Vg. If v1 is vacant, a random number r is drawn. If r is less than or equals to the species walk probability, αi then the molecule is moved to the target voxel v1. Otherwise, the walk fails and the molecule stays in its voxel. If v1 is instead occupied by a reactive pair of species sj, then a random number r is drawn. If r is less than or equals to reaction acceptance probability Wij, the reaction is executed. Otherwise, the reaction fails and the molecule stays in its voxel. After executing all the molecules in the subvolume, we update the out voxels in adjacent subdomains with the state of their corresponding ghost voxels using MPI Sendrecv function. The process then repeats the above procedure with the next subvolume in the ordered list and continues until all subvolumes have been executed.

Note that the walk event in each subvolume is performed locally at all times by each process. Although mutual exclusion is naturally realized by this scheme, inter-process communication is performed once at the beginning and again at the end of each subvolume execution. This scheme of mutual exclusion however does not necessarily require eightfold communication requests because the number of voxels to be sent or received is also reduced in proportion to the subvolume boundary surface area. We have confirmed the effectiveness of this method with at least a few thousand processes [58].

Another problem common in lattice-based parallel computations is the communication at the subdomain vertices. Out voxels located at the vertices are accessed by many more processes than at other locations. Generally, latency has the most impact during the short communications required at these voxels. In addition, contention between requests tends to occur because of the limited bandwidth or the number of available communication channels. From the viewpoint of strong scaling, the communications will show poor performance especially when they involve such a large number of processes. To overcome these constraints, we employed the three-stage communication scheme, which is well-established and described previously [64]. With this scheme, it is sufficient for each process to communicate with six directly adjacent processes in three consecutive stages. In our implementation, we adapted the scheme to update ghost and out voxels when executing each of the eight subvolumes. An example of the communication scheme is illustrated in Fig. 2b.

Parallelized unimolecular reaction event

The sequential Spatiocyte method employs the Next-Reaction method for unimolecular reaction events [42, 44]. However, in the pSpatiocyte method, we have adopted the Gillespie’s direct-method [65] to execute the events in parallel because of its simplicity. In the direct-method, the propensity for a unimolecular reaction, \(\mathrm {A} \overset {k_{j}}{\rightarrow } \mathrm {B}\) is aj=kjNA, where NA is the number of A molecules in the volume. If there are m unimolecular reaction channels, the total propensity is given by

$$ a_g = \sum^{m}_{j=1} a_j. $$

The interval, τe for the next reaction event is expressed as

$$ \tau_{e} = -\frac{\ln r_{1}}{a_{g}}, $$

where r1 is a random number drawn from a uniform distribution between 0 and 1. At the end of the interval, the reaction channel u, out of the total m channels is selected to be executed such that

$$ \sum^{u-1}_{j=1} a_j < a_g r_{2}\leq\sum^{u}_{j=1} a_j, $$

where r2 is another random number from the uniform distribution.

We have parallelized the direct-method using two simple schemes. First, in each process, we use MPI Allgather to get the local propensities, al from all subdomains and sum it locally to get the global propensity, ag. Second, we use the globally seeded random number generator to draw r1 and r2 to determine τe and u, respectively. With these two schemes, τe and u will be identical for all processes without additional synchronization requests. The pseudocode of the parallelized direct-method is given in Algorithm 4.

At initialization, the scheduler gets the next time to execute the unimolecular reaction event, te=τe by calling the get_direct_method_new_interval function and schedules it in the priority queue, Q. The new interval is calculated from Eq. (3). In the main loop of the simulation, if a walk event of a unimolecular reactant species has been called, the number of reactant molecules may have changed. Therefore, at the end of the walk event, the next time of the reaction event is updated by calling the get_direct_method_next_interval function. It gets the updated interval from the scaling expression, τe=a0(tets)/ag, where a0 and ag are the old and new global propensities, respectively, te is the old scheduled time and ts is the the current simulation time. Finally, the scheduler executes the reaction event at the scheduled time by calling the react_direct_method function and reschedules it using a new interval from get_direct_method_new_interval. The react_direct_method function selects the reaction channel to be executed according to (4).

When the reaction channel is executed, if there are two product molecules, one of them will replace the reactant in its current voxel. Another random vacant voxel from the 12 nearest neighbors of the reactant will be selected to occupy the second product. When the compartment or the region near the reactant is highly crowded, no vacant voxel may be found for the product. In the original Spatiocyte method, this will result in a failed unimolecular reaction. We have also adopted this approach for the pSpatiocyte method. In addition, for such a highly crowded scenario, we have added an option in the model to randomly vacate one of the neighbor voxels of the reactant and place the second product in it. The voxel can only be vacated if the molecule occupying it is a mobile (diffusing) species. The voxel is first vacated by moving the molecule to a vacant neighbor voxel. If there is no vacant voxel available for the molecule, the procedure is repeated with another randomly selected neighbor molecule of the reactant. The reaction fails if none of the nearest neighbors of the reactant can be vacated for the second product.

Parallelized logger events

Two types of logger events are available to save the snapshots of pSpatiocyte simulation. The species numbers logger event saves the number of molecules of each species in a comma-separated values (CSV) file when called by the scheduler. At initialization, the log interval is fixed and duplicated across all processes, ensuring that the scheduled execution time of the logger event is always identical across processes. During simulation, each process writes the molecule numbers available in its subdomain into a local file, thus avoiding inter-process communication. A Python script is provided to gather and sum all the numbers from the process files into a single conventional CSV file after the simulation.

The coordinates logger is implemented the same way as the numbers logger but it saves the unique identity number (ID) and integer (voxel) coordinates of each molecule within its subdomain. The ID of a molecule is persistent across subdomains, hence it is possible to track the trajectory of each molecule throughout the simulation. The logger can also save the coordinates of out and ghost voxels for debugging purposes. A Python script gathers and converts the integer coordinates into real 3D coordinates before saving them into a CSV file.

Results and Discussion

To confirm the consistency of pSpatiocyte in terms of physical accuracy, we validated diffusion and reaction processes when running on all eight cores of a workstation with Intel Core i9-9900K CPU (8 cores, 5 GHz maximum processor frequency), 64 GB memory and Ubuntu 19.10 operating system. Parallel performances of diffusion were examined across thousands of computational cores of the K computer [66]. pSpatiocyte performance was also compared with Spatiocyte, Smoldyn and ReaDDy when simulating the benchmark enzymatic reaction model on the workstation. Finally, the parallelized simulation outcomes of the well known MAPK model are provided as an application example. All simulations were performed on lattices with reflective boundaries.

Validation of diffusion

We initially observed the trajectories of molecules diffusing across subdomains to verify the coordinates logger, inter-process communications and the overall simulation algorithm. Figure 3a displays the trajectories of five molecules diffusing with D=0.06 μm2s−1 for 10 s. The simulation was executed with eight processes, employing all cores of the workstation. At the beginning of the simulation, the molecules were placed randomly in a compartment volume of 10 μm3 with lattice dimensions 4763, divided into eight subdomains. All trajectories in Fig. 3a appear consistent with molecules performing Brownian motion.

Fig. 3
figure 3

Parallelized 3D diffusion in dilute and crowded conditions. a Trajectories of five molecules diffusing across lattice subdomains. b Mean-squared displacements (MSDs) of molecules diffusing in a sparsely populated compartment. c MSDs of molecules diffusing in crowded conditions. D0 is the diffusion coefficient specified in the model, whereas D is effective rate fitted to the resulting MSD. The fraction of voxels occupied by immobile crowder molecules are indicated by ϕ

We then validated the consistency of pSpatiocyte in reproducing the correct diffusion behavior in a dilute volume that is equally distributed to eight processes. The MSD of a molecule performing random walk in 3D space is given as 6Dt. We monitored the diffusion of a single molecule placed at the center of a 9603 lattice with voxels measuring 2.5 nm in radius. No other molecules were present on the lattice. We then performed random walks repeatedly with the same initial conditions aside from the random number generator seed. 100 random walks were performed for 40 ms and the average MSDs were computed from the ensembles. Figure 3b shows the log-log plots of the results for three different diffusion coefficients. The slopes, the vertical distances, and the absolute values coincide well with the expected theoretical lines.

When a compartment is crowded with obstacles as in the cell [67], the effective rate of diffusion is expected to decrease. To evaluate if pSpatiocyte is able to replicate the rate reduction when running with eight processes, we obtained the MSD of a diffusing molecule in a compartment occupied by immobile crowder molecules. The fraction of compartment voxels occupied by the crowder molecules is given by ϕ. We evaluated three different ϕ conditions, with 1000 independent simulation runs for each condition. Each run adopted a unique seed for drawing random numbers. Hence, the random placement of immobile crowders at initialization was different in each run. The averaged MSDs and the fitted effective diffusion rates are displayed in Fig. 3c. As expected, a significant decrease in the diffusion rate corresponding to an increase in ϕ is clearly observed. Further detailed analysis is required to compare the effective diffusion rates in crowded condition on HCP lattice with the rates in continuous space as reported by Novak et al. [68].

Validation of reactions

We validated parallelized irreversible and reversible reactions separately because they have distinct underlying physics.

Irreversible reaction

A unimolecular reaction given by

$$ \mathrm{A} \overset{k}{\rightarrow} \mathrm{B} + \mathrm{C}, $$

is irreversible. In pSpatiocyte, the reaction is executed according to the parallelized Gillespie’s direct-method. We applied three different reaction rates, k in an uncrowded volume and compared the results with that of an ordinary differential equation (ODE) solver. The simulation was executed on eight CPU cores with parameters D=10 μm2s−1, voxel radius rv=5 nm and 9603 lattice size. The initial number of reactant molecules was 64,000. Figures 4a and b show the simulation results of the reactant and products, respectively. In all cases, the outcomes of simulation agree very well with the ODE solver.

Fig. 4
figure 4

Parallelized unimolecular and bimolecular reactions. a Time profiles of A in the dissociation reaction \(\mathrm {A} \overset {k}{\rightarrow } \mathrm {B} + \mathrm {C}\). b The corresponding time profiles of B and C. c Effects of volume occupancy on the rate of the dissociation reaction on HCP lattice. ϕ denotes the fraction of voxels occupied by crowder molecules. In the original method, the reaction fails if there are no vacant voxels among the 12 nearest neighbors of the reactant molecule to place the second product molecule. In the vacated voxels approach, a diffusing molecule from a nearest neighbor is selected randomly and moved to one of its nearest neighbors to allocate a vacant voxel for the product. Simulation model parameters: total volume was 90 μm3 with 643 lattice, initial number of A molecules was 50,000, crowder molecules were added to achieve ϕ as shown, the diffusion coefficient of A, B, C and crowder molecules was 10 μm3s−1, and 100 runs for each ϕ. d Time profiles of reactants and products in the reversible reaction \(\mathrm {B} + \mathrm {C} \overset {k_{f}}{\underset {k_{r}}{\rightleftarrows }} \mathrm {A}\)

We investigated the effects of crowding on the dissociation rate of (5), with the two different ways of finding a vacant voxel for the second product molecule. In the original approach, the reaction fails if all neighbor voxels of the reactant are occupied. In the second approach, if they are all occupied, the simulator attempts to vacate one of them. The reaction also fails if no voxels can be vacated for the product. Figure 4c shows the log-log plots of the reactant concentration using the two approaches when ϕ is between 0.5 and 1.0. At ϕ=1.0, all voxels of the HCP lattice are occupied, giving about 74% volume occupancy. With the original approach, the dissociation rates agree well with the ODE result up to ϕ=0.7, which translates to about 52% volume occupancy. In the second approach, the rates are comparable up to ϕ=0.9. In vitro results of crowding experiments showed that the dissociation rate of molecules are unaffected even when the volume occupancy reaches 30% [69]. However, it is still unknown if occupancies above 50% would affect the dissociation rate as we have found with our original approach. The results show that the original approach is sufficient for simulating the estimated 30% volume exclusion in the cytoplasm [70]. In cases where the dissociating molecules are much smaller that the voxel size, such as messengers, metabolites and ions, the sequential version of Spatiocyte simulates them at the compartment scale using the Next-Reaction method. Since this feature is not yet supported by pSpatiocyte, we leave it for future work

Reversible reaction

A bimolecular reaction is dependent on the diffusion of reactant molecules because the reaction takes place as the molecules meet and collide in space. For simplicity, we considered a forward bimolecular reaction with a single product and a reverse unimolecular reaction, as an example of reversible reaction,

$$ \mathrm{B} + \mathrm{C} \overset{k_f}{\underset{k_r}{\rightleftarrows}} \mathrm{A}, $$

where kf and kr denote the effective forward and reverse reaction rates, respectively. The effective forward rate can be converted to the intrinsic rate that is used by pSpatiocyte,

$$ k_f' = \frac{k_f k_D}{k_D-k_f}, $$

where kD=8πrv(DB+DC) [43]. DA and DB are the diffusion coefficients of B and C, respectively. The intrinsic reverse reaction rate is given as

$$ k_r' = \frac{k_f'k_r}{k_f}. $$

The forward reaction is executed by the walk event when molecules of the reactant species collide on lattice, whereas the reverse reaction is performed by the unimolecular reaction event. The parameters of the simulation include kf=2 μm3s−1, kr=1.35 s−1, D=10 μm2s−1 and rv=5 nm. The initial number of A and B molecules was 64,000 each and the lattice size was 9603. The simulation volume was distributed to eight processes. As comparison, an ODE solver was used to generate the output of the reversible reaction with the effective rates, kf and kr. The results of the simulation are provided in Fig. 4d. The closely matching curves of pSpatiocyte and the ODE solver verify the parallel simulation accuracy of reversible bimolecular reactions.

Performance of parallelized 3D diffusion

In Spatiocyte simulations, diffusion of molecules typically takes place at step intervals that are several orders of magnitude smaller than that of reactions if they are not highly diffusion-limited. The fine intervals are needed for the very short displacements between voxels. Since diffusion computations at these fine intervals dominate the total computation cost of most simulations, we used a diffusion model to evaluate the parallel simulation performance of pSpatiocyte. The simulation parameters and conditions are the same as in the diffusion model in the previous section, except for the lattice resolution and occupancy.

To estimate the parallel performance of pSpatiocyte, we measured its strong and weak scaling efficiencies. Strong scaling measures how fast a program is able to process a fixed workload by splitting it into smaller sizes and distributing them to an increasing number of CPUs or cores. On the other hand, weak scaling measures how large of a problem a program can handle without the loss of speed. To measure the efficiency of weak scaling, the amount of workload given to each CPU or core is fixed and the total workload processed by the program is increased by adding CPUs or cores with the corresponding amount of workload.

For strong scaling, we used three different voxel radii to measure performance. Given that the physical dimensions of a compartment remain the same, smaller voxels would result in higher resolution and finer lattice. We denote voxels having 10, 5, and 2.5 nm radii as coarse, intermediate, and fine lattices, respectively. The molecule occupancy, ϕ was fixed at 0.3, whereas the voxel sizes determined the spatial resolution of the compartment. The compartment resolution was 5123 (coarse), 10243 (intermediate) or 20483 (fine). For computations using the maximum 663552 cores of the K computer, the resolution was set to 512×480×540 (coarse), 1024×960×1080 (intermediate) or 2048×1920×2160 (fine) to ensure that the model conforms to the physical configuration of the processes.

Note that we examined the relative speedups instead of the floating point operations per second (FLOPS) because on lattice, integer or logical instructions dominate the overall computational cost. The speedups measured from the elapsed times are shown in Fig. 5a. Here, we used the results of the coarse lattice with 64 cores as the baseline reference to calculate the speedups since simulations with fewer cores were not possible due to memory size limitation. For the intermediate and fine lattices, the results from 64 cores were extrapolated using the time consumed per voxel on the coarse lattice. The parallel efficiencies of strong scaling are summarized in Fig. 5b. For the fine lattice with 663552 cores, we obtained a speedup of 7686. The corresponding strong scaling efficiency was at 74.1%. In contrast to the coarse and intermediate lattices, the results from the fine lattice were the closest to the ideal curve. By extrapolating the results from Fig. 5a, we can predict that the speedup and efficiency on the fine lattice would be about 13000 times and at 40%, respectively if two million cores were utilized.

Fig. 5
figure 5

Parallelized 3D diffusion performance of pSpatiocyte. Diffusion model is the same as in Fig. 3 but with 30% of the total voxels occupied by diffusing molecules. a Speedup ratios relative to 64 cores. Red, green, and blue lines represent lattices with 5123 (coarse), 10243 (intermediate), and 20483 (fine) voxels, respectively. b Efficiency of strong scaling on increasing number of CPU cores. The resolutions of lattice are the same as in (a). c Components of simulation elapsed times. The average times taken for initialization and computation are given by init and calc, respectively. For inter-process communications, the pack and unpack times are the durations required to manage data before sending and after receiving it, respectively. The time spent on the MPI Sendrecv function is denoted by mpi. The duration to duplicate communicator objects is indicated by dup. d Average elapsed times per voxel per simulation step. Voxel sizes from smallest to largest denote 323, 643, 1283, 2563, and 5123 voxels per process, respectively. e Efficiency of weak scaling on increasing number of CPU cores. Colors have the same representation as in (d)

To identify the cause of the performance deterioration on the coarse lattice, we determined the major components of elapsed time as shown in Fig. 5c. We found that the times taken for initialization, computation, pack and unpack events were decreasing at least with up to 262144 cores, whereas the MPI time saturated and exceeded these times when it was over 32768 cores. To improve the performance, the constant duplication time due to redundant communicator objects should be eliminated by sophisticated programming. The saturation of MPI time is likely the most significant factor that needs to be addressed to improve the scaling performance further. However, such saturated timings generally originate from the latency of inter-process communication, which is dependent on the hardware and firmware. Therefore, the immediate approach for improving strong scaling is to reduce the computation time and ensure that it is as close to the latency as possible.

Figure 5d displays the weak scaling performance of pSpatiocyte in elapsed time per voxel per step. Here, the labels on the horizontal axis, smallest, smaller, medium, larger and largest, denote the subdomain lattice dimensions, 163, 323, 643, 1283 or 2563 on each process, respectively. In an ideal weak scaling performance setting, these times would be identical for all lattice sizes since they would be independent of the number of cores and voxel sizes. In spite of the variation in the absolute values, all lattice dimensions provided good scaling properties. We scrutinized the elapsed times of the smallest and smaller lattices and it revealed that the times are dominated by the constant communication latency. This explains the source of the larger absolute times in the smaller lattices. The efficiencies of parallel computation in terms of weak scaling are summarized in Fig. 5e. For each lattice size, the elapsed time with 64 cores was used as a reference to calculate the parallel efficiency. Although the efficiencies tend to deteriorate on higher number of cores, we were still able to achieve 60% efficiency with more than half a million cores.

Performance benchmark

We compared the runtime of pSpatiocyte with Spatiocyte (git 5e88f40), Smoldyn (v2.61) and ReaDDy (v2.0.2-py37_55_g78bd07) when executing the benchmark Michaelis-Menten enzymatic reaction model [41] on a common workstation as it is more accessible to general users than a supercomputer. ReaDDy has both serial and parallel versions. All simulators were evaluated on a single core. Additionally, the parallel simulators pSpatiocyte and ReaDDy were executed on two, four and eight cores. We used the same model parameters from [41, 71] but increased the size of the reaction volume tenfold to 909 μm3. The larger volume raises the computational cost to an adequate level when running the model in parallel on all eight cores of the workstation. For Smoldyn and ReaDDy, we set the simulation interval, Δt to 1 ms. In pSpatiocyte and Spatiocyte models, the event with the smallest interval is the walk event and it was set to 0.5 ms. We also evaluated pSpatiocyte with a smaller walk event interval of 0.2 ms to see how the runtime scales with the interval. Smaller simulation interval results in an overall higher computational cost and up to a certain limit, better accuracy.

Figure 6a displays the results of the performance benchmark. The runtimes shown are the averages of three independent runs of each simulator to execute the model for 10 s. The resulting concentration profiles from each simulator are plotted in Fig. 6b. On a single core, pSpatiocyte with Δt=0.5 ms is about two times faster than with Δt=0.2 ms. It is also about 2.4 times faster than the sequential version of Spatiocyte. On a single core, the main difference between pSpatiocyte and its sequential counterpart is the new pointer-free voxel accessing scheme adopted by the former. This scheme has likely lowered the memory bandwidth usage and cache misses, contributing to the significant reduction in simulation runtime. The execution time of pSpatiocyte is about 7.7 times shorter than Smoldyn on a single core. It is also roughly 30- and 38-fold times faster than the serial and parallel versions of ReaDDy, respectively.

Fig. 6
figure 6

Performance benchmark of the Michaelis-Menten reaction on a workstation. Model parameters [41, 71]: total volume 909 μm3, diffusion coefficient 10 μm2s−1, k1 = 0.01 μm3s−1, k2 = k3 = 1 s−1. The initial numbers of E and S molecules are 9090 and 90910, respectively. Duration of simulation 10 s. Simulation or walk event interval (Δt) and runtime (T) are as indicated. a Comparison of simulator runtimes on different number of CPU cores. b Concentration profile outcomes from the simulators

The runtime of pSpatiocyte (Δt=0.5 ms) also scales favorably with the number of additional cores used in the simulation. When the number of cores was increased from one to two, the runtime was reduced by 1.87 times. Similary, the runtime was shorter by about 1.82 times when the number of cores increased from two to four. On eight cores, it is about 1.7 times faster than on four cores. Similar scaling behavior was also observed with Δt=0.2 ms. On eight cores, pSpatiocyte (Δt=0.5 ms) is roughly 55 times faster than the parallel version of ReaDDy. It also required about 45- and 14-fold shorter runtimes than Smoldyn and Spatiocyte, respectively to complete the simulation on the workstation. Overall, the benchmark results show that pSpatiocyte has a significant performance advantage over other well-known microscopic particle simulators.

Parallelized simulation of MAPK model

In the dual phosphorylation-dephosphorylation cycle of the MAPK cascade, shown in Fig. 7a, molecular rebinding effects at the microscopic scale can alter the macroscopic dynamics of the system [36]. MAPK kinase (KK) phosphorylates MAPK (K) in a two-step process to generate a doubly phosphorylated MAPK (Kpp), whereas the phosphatase, P dephosphorylates Kpp twice to recover K. Upon unbinding from their products, the enzymes go through an inactive state (denoted by ). The time required to reactivate the enzymes is given by τrel1/ka, where

$$\begin{array}{@{}rcl@{}} \mathrm{P^{*}} &\overset{k_{a}}{\rightarrow}& \mathrm{P}, \end{array} $$
Fig. 7
figure 7

Parallelized simulation results of MAPK model. a Dual phosphorylation-dephosphorylation cycle of the MAPK cascade. b Concentration profiles when KK0/P0=1 and D=4 μm2s−1. Simulation was executed with all eight-cores of the workstation. c Scalability of pSpatiocyte at low molecule number. Shown are the runtimes of pSpatiocyte when simulating the MAPK model (KK0/P0=1) for 10 s with varying number of cores and diffusion coefficients. The model consists of 1800 total molecules. d Response curves for different diffusion coefficients. All simulations were executed with all eight cores of the workstation

$$\begin{array}{@{}rcl@{}} \mathrm{KK^{*}} &\overset{k_{a}}{\rightarrow}& \text{KK}. \end{array} $$

If the enzyme-substrate reactions are diffusion-limited and τrel is short, a newly dissociated enzyme can rebind to its product to catalyze it again, before escaping into the bulk. Takahashi et al. [36] have previously shown with eGFRD particle simulations that these rebinding events can change the response sensitivity of the phosphorylation state, which could result in the loss of bistability. We have also recently replicated the results with Spatiocyte [43]. These spatiotemporal correlations between enzyme and substrate molecules, and fluctuations at the molecular scale are difficult to be captured by RDME and PDE-based methods.

As an example of pSpatiocyte biological application and to further verify the method, we have simulated the MAPK model with the same parameters from [36] but increased the volume tenfold (10 μm3) to raise the computational cost. The initial molecule numbers of K, KK and P were 1200, 300 and 300, respectively. There were no initial molecules for the remaining species. The model was simulated for 300 s with τrel=1 μs. For all species, D=4 μm2s−1. The lattice size was 4763 with rv=2.5 nm. Figure 7b shows the concentration profiles in the first 100 s of the simulation. The total runtime of the simulation was 5466 s using all eight cores of the workstation. In contrast, it took 76650 s to complete the simulation with Spatiocyte. Thus, pSpatiocyte is about 14 times faster than its sequential counterpart. As comparison, in Fig. 7b we have also plotted the predictions of the corresponding mean-field (ODE) MAPK model [36]. The model describes the diffusion of molecules implicitly by renormalizing the reaction rates. pSpatiocyte concentration profiles coincide well with the ODE outcomes, although in the former, there were fluctuations from the small number of reacting molecules.

The total number of molecules in the MAPK model is 1800, which is a small number for high-performance simulations. To evaluate how well pSpatiocyte scales with such low number of molecules, we compared the runtimes with varying number of cores and diffusion coefficients. Figure 7c shows the runtimes when the model was executed for 10 s. With D=0.06 μm2s−1, pSpatiocyte is about twofold faster on four cores than on a single core. Increasing the number of cores from four to eight did not noticeably improve the runtime further. However, with D=4 μm2s−1, the speedup achieved with four cores is 2.6 times. With the addition of another four cores, the speedup increased to 3.4 times. This is notable because pSpatiocyte is still able to achieve a favorable speedup although each core only executed on average 225 molecules. Compared to the slower diffusion model, the shorter diffusion interval (\(\tau _{d}^{i}\)) when D=4 μm2s−1 increases the computational cost more than the communication overhead, resulting in the improved speedups.

We also simulated the MAPK model with different initial ratios of KK0/P0 and diffusion coefficients, and obtained the steady-state Kpp/K0 curves as shown in Fig. 7d. The outcomes of the corresponding mean-field models of a distributive (with D=4 μm2s−1) and a processive (with D=0.06 μm2s−1) system are also shown. In the distributive scheme, the enzyme needs to detach from the substrate before it can catalyze it the second time. The double encounters between enzyme and substrate molecules can lead to ultrasensitive switchlike response. Conversely, in the processive scheme, a single encounter between them is sufficient to generate the dual modifications of the substrate. At fast diffusion (D=4 μm2s−1), pSpatiocyte response curve agrees well with that of the distributive mean-field model. However, at much smaller diffusion coefficient (D=0.06 μm2s−1), it instead reproduces the graded response curve of the processive model.

How slower diffusion in the pSpatiocyte model weakens the switchlike response curve can be explained as follows. At D=4 μm2s−1, the resulting diffusion interval, \(\tau _{d}^{i}\) and walk event interval, \(\tau _{e}^{i}\) for the enzymes are the same (1 μs) because the walk probability, αi=1. Since \(\tau _{d}^{i}\) and τrel have the same intervals, after catalyzing the substrates the first time, the enzymes KK and P can escape into the bulk before they can reactivate and rebind with the substrates. In constrast, at D=0.06 μm2s−1, it gives \(\tau _{d}^{i}=72\,\mu \mathrm {s}\) and \(\tau _{e}^{i}=2\,\mu \mathrm {s}\) for the enzymes because αi=0.028. Since \(\tau _{d}^{i} \gg \tau _{rel}\), the enzymes have enough time to rebind with their substrates upon reactivation, before they can escape. This processivelike mechanism leads to the graded response curve as shown in Fig. 7d. Overall, we note that pSpatiocyte correctly reproduces the expected ultrasensitivity dynamics of MAPK [36].


We have developed a high spatiotemporal resolution parallel stochastic method to simulate intracellular reaction-diffusion systems on HCP lattice. To realize large-scale parallel computations, we have introduced several advanced simulation schemes, including a twisted Cartesian coordinate system with pointer-free voxel access, a parallelized event scheduler with priority queue, synchronized random subvolume executions and parallelized Gillespie’s direct-method using globally seeded random number generators, and three-stage inter-process data transfers.

We have also validated the physical correctness of the simulator. The simulated diffusion rates in dilute conditions showed very good agreement with theory. In crowded conditions, the diffusion rates decreased, as expected. Further work is required to compare the crowded diffusion behavior on lattice with the relation between diffusion and excluded volume fraction obtained in continuous space [68]. Both irreversible and reversible reaction curves coincided very well with predicted ODE results. Parallel performance of diffusion on the K computer was sufficiently high for large-scale computations. From the viewpoint of strong scaling, pSpatiocyte achieved a 7686-fold speedup with 663552 cores compared to the runtime with 64 cores on a 20483 lattice. The efficiency was equivalent to 74.1%. In terms of weak scaling, efficiencies of at least 60% were obtained.

In the Michaelis-Menten enzymatic reaction benchmark, pSpatiocyte performed significantly better than other well-known microscopic particle simulators. On a workstation with eight CPU cores, pSpatiocyte is about 55 times faster than the parallel version of ReaDDy and 45 times faster than Smoldyn. In addition, the parallelized simulation of the MAPK model revealed that the program can correctly capture the weakening of ultrasensitive response by enzyme-substrate rebindings at very short timescales. The accurate simulation of the model also demonstrated that pSpatiocyte is applicable in real biological problems. On the same workstation with eight cores, pSpatiocyte required 14-fold faster execution times than the sequential version of Spatiocyte to simulate the MAPK model. Notably, pSpatiocyte is able to achieve 3.4 times speedup with all cores on the workstation although the average number of molecules executed per core is only 225.

In recent papers by Smith and Grima [72] and Lawson et al. [73], RDME reactions with non-mass-action propensities such as Hill-type and Michaelis-Menten were shown not converging to the chemical master equation. At present, both sequential and parallel versions of Spatiocyte only support elementary reactions, namely unimolecular and bimolecular reactions. Since all complex reactions, including Hill-type and Michaelis-Menten, can be broken down to elementary (mass-action) reactions, the currently supported reactions should be sufficient for most modeling purposes. Nonetheless, to conveniently model complex reactions, it would be beneficial to support them in the future. Further work would also be needed to solve the convergence problem.

Recently, several GPU-based high-performance simulators of reaction-diffusion systems have been reported [46, 48, 5052, 74]. The ability to run fast simulations on common workstations equipped with GPUs supports wider application of the simulators. Implementing the Spatiocyte method to run on GPUs should be straightforward since most of the parallelization schemes presented in this work can also be applied on a GPU.

Availability of data and materials

The pSpatiocyte software and the complete scripts to generate the data in this study are available at









Central processing unit


Floating point operations per second


Graphics processing unit


Hexagonal close-packed


Mitogen-activated protein kinase


Molecular dynamics


Message passing interface


mean-squared displacement


Partial differential equations


Reaction-diffusion master equation




  1. Li X, Holmes WR. Biophysical attributes that affect CaMKII activation deduced with a novel spatial stochastic simulation approach. PLoS Comput Biol. 2018; 14(2):e1005946.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  2. Denk J, Kretschmer S, Halatek J, Hartl C, Schwille P, Frey E. MinE conformational switching confers robustness on self-organized Min protein patterns. Proc Natl Acad Sci. 2018; 115(18):201719801.

    Article  CAS  Google Scholar 

  3. Trogdon M, Drawert B, Gomez C, Banavar SP, Yi T-M, Campàs O, Petzold LR. The effect of cell geometry on polarization in budding yeast. PLoS Comput Biol. 2018; 14(6):e1006241.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  4. Du H, Wang Y, Haensel D, Lee B, Dai X, Nie Q. Multiscale modeling of layer formation in epidermis. PLoS Comput Biol. 2018; 14(2):1–25.

    Article  CAS  Google Scholar 

  5. Miao Y, Bhattacharya S, Edwards M, Cai H, Inoue T, Iglesias PA, Devreotes PN. Altering the threshold of an excitable signal transduction network changes cell migratory modes. Nat Cell Biol. 2017; 19(4):329–40.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Tan RZ, Chiam KH. A computational model for how cells choose temporal or spatial sensing during chemotaxis. PLoS Comput Biol. 2018; 14(3):1–21.

    Article  CAS  Google Scholar 

  7. Berro J. Essentially, all models are wrong, but some are useful - a cross-disciplinary agenda for building useful models in cell biology and biophysics. Biophys Rev. 2018; 10(6):1637–47.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Takahashi K, Arjunan SNV, Tomita M. Space in systems biology of signaling pathways–towards intracellular molecular crowding in silico. FEBS Lett. 2005; 579(8):1783–8.

    Article  CAS  PubMed  Google Scholar 

  9. Burrage K, Burrage PM, Marquez-lago TM, Nicolau DV. Stochastic Simulation for Spatial Modelling of Dynamic Processes in a Living Cell In: Koeppl H, Setti G, di Bernardo M, Densmore D, editors. Design and Analysis of Biomolecular Circuits: Engineering Approaches to Systems and Synthetic Biology chapter 2. New York: Springer: 2011. p. 43–62.

    Google Scholar 

  10. Klann M, Koeppl H. Spatial simulations in systems biology: from molecules to cells. Int J Mol Sci. 2012; 13(6):7798–827.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Schöneberg J, Ullrich A, Noé F. Simulation tools for particle-based reaction-diffusion dynamics in continuous space. BMC Biophys. 2014; 7(1):11.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  12. Earnest T, Cole JA, Luthey-Schulten ZL. Simulating biological processes: Stochastic physics from whole cells to colonies. Rep Prog Phys. 2018; 81(5):052601.

    Article  PubMed  CAS  Google Scholar 

  13. Smith S, Grima R. Spatial Stochastic Intracellular Kinetics: A Review of Modelling Approaches. Bull Math Biol. 2019; 81(8):2960–3009.

    Article  PubMed  Google Scholar 

  14. Bottaro S, Lindorff-Larsen K. Biophysical experiments and biomolecular simulations: A perfect match?. Science. 2018; 361(6400):355–60.

    Article  CAS  PubMed  Google Scholar 

  15. Shaw DE, Dror OR, Salmon JK, Grossman JP, Mackenzie KM, Bank JA, Young C, Deneroff MM, Batson B, Bowers KJ, Chow E, Eastwood MP, Ierardi DP, Klepeis JL, Kuskin JS, Larson RH, L-Larsen K, Maragakis P, Moraes MA, Piana S, Shan Y, Towles B. Millisecond-scale molecular dynamics simulations on Anton. In: International Conference for High Performance Computing, Networking, Storage and Analysis. Portland: ACM/IEEE: 2009.

    Google Scholar 

  16. Kohlhoff KJ, Shukla D, Lawrenz M, Bowman GR, Konerding DE, Belov D, Altman RB, Pande VS. Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways. Nat Chem. 2014; 6(1):15–21.

    Article  CAS  PubMed  Google Scholar 

  17. Simon J, Arthur C, Ley SC. Mitogen-activated protein kinases in innate immunity. Nat Rev Immunol. 2013; 13(9):679–92.

    Article  CAS  Google Scholar 

  18. Burotto M, Chiou VL, Lee JM, Kohn EC. The MAPK pathway across different malignancies: A new perspective. Cancer. 2014; 120(22):3446–56.

    Article  CAS  PubMed  Google Scholar 

  19. Blinov ML, Schaff JC, Vasilescu D, Moraru II, Bloom JE, Loew LM. Compartmental and Spatial Rule-Based Modeling with Virtual Cell. Biophys J. 2017; 113(7):1365–72.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Mahmutovic A, Fange D, Berg OG, Elf J. Lost in presumption: stochastic reactions in spatial models. Nat Methods. 2012; 9(12):1163–6.

    Article  CAS  PubMed  Google Scholar 

  21. Baras F, Mansour MM. Reaction-diffusion master equation: A comparison with microscopic simulations. Phys Rev E. 1996; 54(6):6139.

    Article  CAS  Google Scholar 

  22. Hattne J, Fange D, Elf J. Stochastic reaction-diffusion simulation with MesoRD. Bioinformatics. 2005; 21(12):2923–24.

    Article  CAS  PubMed  Google Scholar 

  23. Isaacson SA. The Reaction-Diffusion Master Equation as an Asymptotic Approximation of Diffusion to a Small Target. SIAM J Appl Math. 2009; 70(1):77–111.

    Article  Google Scholar 

  24. Drawert B, Engblom S, Hellander A. URDME: a modular framework for stochastic simulation of reaction-transport processes in complex geometries. BMC Syst Biol. 2012; 6(1):76.

    Article  PubMed  PubMed Central  Google Scholar 

  25. Hepburn I, Chen W, Wils S, De Schutter E. STEPS: efficient simulation of stochastic reaction–diffusion models in realistic morphologies. BMC Syst Biol. 2012; 6(1):36.

    Article  PubMed  PubMed Central  Google Scholar 

  26. Mourão MA, Hakim JB, Schnell S. Connecting the dots: The effects of macromolecular crowding on cell physiology. Biophys J. 2014; 107(12):2761–6.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  27. Ellis RJ. Macromolecular crowding: obvious but underappreciated. Trends Biochem Sci. 2001; 26(10):597–604.

    Article  CAS  PubMed  Google Scholar 

  28. Zimmerman SB, Minton AP. Macromolecular Crowding: Biochemical, Biophysical, and Physiological Consequences. Annu Rev Biophys Biomol Struct. 1993; 22(1):27–65.

    Article  CAS  PubMed  Google Scholar 

  29. Zhou H-X, Rivas G, Minton AP. Macromolecular Crowding and Confinement: Biochemical, Biophysical, and Potential Physiological Consequences. Annu Rev Biophys. 2008; 37(1):375–97.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Rivas G, Minton AP. Macromolecular Crowding In Vitro, In Vivo, and In Between. Trends Biochem Sci. 2016; 41(11):970–81.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Tan C, Saurabh S, Bruchez MP, Schwartz R, LeDuc P. Molecular crowding shapes gene expression in synthetic cellular nanosystems. Nat Nanotechnol. 2013; 8(8):602–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Smith S, Cianci C, Grima R. Macromolecular crowding directs the motion of small molecules inside cells. J R Soc Interface. 2017; 14(131):20170047.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  33. Grima R. Intrinsic biochemical noise in crowded intracellular conditions. J Chem Phys. 2010; 132(18):185102.

    Article  CAS  Google Scholar 

  34. Cianci C, Smith S, Grima R. Molecular finite-size effects in stochastic models of equilibrium chemical systems. J Chem Phys. 2016; 144(8):084101.

    Article  PubMed  CAS  Google Scholar 

  35. Lagerholm BC, Thompson NL. Theory for ligand rebinding at cell membrane surfaces. Biophys J. 1998; 74(3):1215–28.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Takahashi K, T-Nicola S, ten Wolde PR. Spatio-temporal correlations can drastically change the response of a MAPK pathway. Proc Natl Acad Sci USA. 2010; 106(6):2473–8.

    Article  Google Scholar 

  37. Mugler A, Bailey AG, Takahashi K, ten Wolde PR. Membrane clustering and the role of rebinding in biochemical signaling. Biophys J. 2012; 102(5):1069–78.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Andrews SS. Smoldyn: Particle-based simulation with rule-based modeling, improved molecular interaction and a library interface. Bioinformatics. 2017; 33(5):710–17.

    Article  CAS  PubMed  Google Scholar 

  39. Michalski PJ, Loew LM. SpringSaLaD: A spatial, particle-based biochemical simulation platform with excluded volume. Biophys J. 2016; 110(3):523–29.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Schöneberg J, Noé F. ReaDDy-a software for particle-based reaction-diffusion dynamics in crowded cellular environments. PLoS One. 2013; 8(9):e74261.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  41. Andrews SS. Particle-Based Stochastic Simulators In: Jaeger D, Jung R, editors. Encyclopedia of Computational Neuroscience. New York: Springer: 2018. p. 1–5.

    Google Scholar 

  42. Arjunan SNV, Tomita M. A new multicompartmental reaction-diffusion modeling method links transient membrane attachment of E. coli MinE to E-ring formation. Syst Synth Biol. 2010; 4(1):35–53.

    Article  PubMed  Google Scholar 

  43. Chew W-X, Kaizu K, Watabe M, Muniandy SV, Takahashi K, Arjunan SNV. Reaction-diffusion kinetics on lattice at the microscopic scale. Phys Rev E. 2018; 98:032418.

    Article  CAS  Google Scholar 

  44. Gibson MA, Bruck J. Efficient exact stochastic simulation of chemical systems with many species and many channels. J Phys Chem A. 2000; 104(9):1876–89.

    Article  CAS  Google Scholar 

  45. Chew W-X, Kaizu K, Watabe M, Muniandy SV, Takahashi K, Arjunan SNV. Surface reaction-diffusion kinetics on lattice at the microscopic scale. Phys Rev E. 2019; 99:042411.

    Article  CAS  PubMed  Google Scholar 

  46. Vigelius M, Lane A, Meyer B. Accelerating reaction-diffusion simulations with general-purpose graphics processing units. Bioinformatics. 2011; 27(2):288–90.

    Article  CAS  PubMed  Google Scholar 

  47. Roberts E, Stone JE, Luthey-Schulten Z. Lattice microbes: High-performance stochastic simulation method for the reaction-diffusion master equation. J Comput Chem. 2013; 34(3):245–55.

    Article  CAS  PubMed  Google Scholar 

  48. Hallock MJ, Stone JE, Roberts E, Fry C, Luthey-Schulten Z. Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Parallel Comput. 2014; 40(5-6):86–99.

    Article  PubMed  PubMed Central  Google Scholar 

  49. Chen W, De Schutter E. Parallel STEPS: Large Scale Stochastic Spatial Reaction-Diffusion Simulation with High Performance Computers. Front Neuroinformatics. 2017; 11(February):1–15.

    CAS  Google Scholar 

  50. Biedermann J, Ullrich A, Schöneberg J, Noe F. ReaDDyMM: Fast interacting particle reaction-diffusion simulations using graphical processing units. Biophys J. 2015; 108(3):457–61.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Gladkov DV, Alberts S, D’Souza RM, Andrews S. Accelerating the Smoldyn Spatial Stochastic Biochemical Reaction Network Simulator Using GPUs. In: Proceedings of the 19th High Performance Computing Symposia. San Diego: Society for Computer Simulation International: 2011. p. 151–8.

    Google Scholar 

  52. Dematte L. Smoldyn on graphics processing units: Massively parallel brownian dynamics simulations. IEEE/ACM Trans Comput Biol Bioinformatics. 2012; 9(3):655–67.

    Article  Google Scholar 

  53. Hoffmann M, Fröhner C, Noé F. Readdy 2: Fast and flexible software framework for interacting-particle reaction dynamics. PLOS Comput Biol. 2019; 15(2):1–26.

    Article  CAS  Google Scholar 

  54. Smoluchowski MV. Mathematical theory of the kinetics of the coagulation of colloidal solutions. Z Phys Chem. 1917; 92:129–168.

    Google Scholar 

  55. Collins FC, Kimball GE. Diffusion-controlled reaction rates. J Colloid Sci. 1949; 4(4):425–437.

    Article  CAS  Google Scholar 

  56. Szpiro GG. Kepler’s Conjecture: How Some of the Greatest Minds in History Helped Solve One of the Oldest Math Problems in the World. New York: Wiley; 2003.

    Google Scholar 

  57. Grima R, Schnell S. A systematic investigation of the rate laws valid in intracellular environments. Biophys Chem. 2006; 124(1):1–10.

    Article  CAS  PubMed  Google Scholar 

  58. Arjunan SNV, Miyauchi A, Takahashi K. A high-performance microscopic lattice reaction-diffusion method for biochemical network simulation. In: The Second Bio-supercomputing Symposium. Tokyo: RIKEN: 2010.

    Google Scholar 

  59. Szkoda S, Koza Z, Tykierko M. Accelerating cellular automata simulations using AVX and CUDA. arXiv:1208.2428. 2012.

  60. Takahashi K, Kaizu K, Hu B, Tomita M. A multi-algorithm, multi-timescale method for cell simulation. Bioinformatics. 2004; 20(4):538–46.

    Article  CAS  PubMed  Google Scholar 

  61. Einstein A. Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. Ann Phys. 1905; 322(8):549–60.

    Article  Google Scholar 

  62. von Smoluchowski M. Zur kinetischen Theorie der Brownschen Molekularbewegung und der Suspensionen. Ann Phys. 1906; 326(14):756–80.

    Article  Google Scholar 

  63. Matsumoto M, Nishimura T. Mersenne twister: A 632-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans Model Comp Sim. 1998; 8:3–30.

    Article  Google Scholar 

  64. Golub GH, Ortega JM. Scientific Computing: An Introduction with Parallel Computing. 1993.

    Chapter  Google Scholar 

  65. Gillespie DT. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J Comput Phys. 1977; 22:403–34.

    Article  Google Scholar 

  66. Yonezawa A, Watanabe T, Yokokawa M, Sato M, Hirao K. Advanced institute for computational science (AICS): Japanese national high-performance computing research institute and its 10-petaflops supercomputer K. Seattle: ACM/IEEE; 2011.

  67. Minton A. The influence of macromolecular crowding and macromolecular confinement on biochemical reactions in physiological media. J Biol Chem. 2001; 276(14):10577–80.

    Article  CAS  PubMed  Google Scholar 

  68. Novak IL, Kraikivski P, Slepchenko BM. Diffusion in Cytoplasm: Effects of Excluded Volume Due to Internal Membranes and Cytoskeletal Structures. Biophys J. 2009; 97(3):758–67.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  69. Phillip Y, Sherman E, Haran G, Schreiber G. Common Crowding Agents Have Only a Small Effect on Protein-Protein Interactions. Biophys J. 2009; 97(3):875–85.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  70. Zimmerman SB, Trach SO. Estimation of macromolecule concentrations and excluded volume effects for the cytoplasm of Escherichia coli. J Mol Biol. 1991; 222(3):599–620.

    Article  CAS  PubMed  Google Scholar 

  71. Andrews SS, Addy NJ, Brent R, Arkin AP. Detailed Simulations of Cell Biology with Smoldyn 2.1. PLoS Comput Biol. 2010; 6(3):e1000705.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  72. Smith S, Grima R. Breakdown of the reaction-diffusion master equation with nonelementary rates. Phys Rev E. 2016; 93(5):052135.

    Article  PubMed  CAS  Google Scholar 

  73. Lawson MJ, Petzold L, Hellander A. Accuracy of the Michaelis–Menten approximation when analysing effects of molecular noise. J R Soc Interface. 2015; 12(106):20150054.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  74. Holmen JK, Foster DL. Accelerating single iteration performance of CUDA-based 3D reaction-diffusion simulations. Int J Parallel Prog. 2014; 42(2):343–63.

    Article  Google Scholar 

Download references


We thank Steven Andrews and Frank Noé for their help with the benchmark models of Smoldyn and ReaDDy, respectively. We are grateful to the four anonymous reviewers for their constructive comments. We thank Wei-Xiang Chew for helpful discussions. We also thank Peter Karagiannis and Kylius Wilkins for reading the initial version of the manuscript and providing valuable suggestions. AM wishes to thank Yukihiro Eguchi for continuous encouragement and Hisashi Nakamura of RIST for giving him the opportunity to start this study.


This research is part of the HPCI Strategic Program for Innovative Research in Supercomputational Life Science (SPIRE Field 1), which is funded by the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan. Part of the results were obtained by using the K computer at RIKEN Advanced Institute for Computational Science (Proposal number hp120309). The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and Affiliations



AM, KT and SNVA conceived the project. AM and SNVA conceived the parallelization schemes and simulation algorithm. AM and KI wrote the initial version of the simulator. SNVA wrote, tested and optimized the simulation algorithm and software, and created the build system. AM and SNVA generated data and wrote the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Satya N.V. Arjunan.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Arjunan, S.N., Miyauchi, A., Iwamoto, K. et al. pSpatiocyte: a high-performance simulator for intracellular reaction-diffusion systems. BMC Bioinformatics 21, 33 (2020).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: