 Research Article
 Open Access
 Published:
LASSIE: simulating largescale models of biochemical systems on GPUs
BMC Bioinformatics volume 18, Article number: 246 (2017)
Abstract
Background
Mathematical modeling and in silico analysis are widely acknowledged as complementary tools to biological laboratory methods, to achieve a thorough understanding of emergent behaviors of cellular processes in both physiological and perturbed conditions. Though, the simulation of largescale models—consisting in hundreds or thousands of reactions and molecular species—can rapidly overtake the capabilities of Central Processing Units (CPUs). The purpose of this work is to exploit alternative highperformance computing solutions, such as Graphics Processing Units (GPUs), to allow the investigation of these models at reduced computational costs.
Results
LASSIE is a “blackbox” GPUaccelerated deterministic simulator, specifically designed for largescale models and not requiring any expertise in mathematical modeling, simulation algorithms or GPU programming. Given a reactionbased model of a cellular process, LASSIE automatically generates the corresponding system of Ordinary Differential Equations (ODEs), assuming massaction kinetics. The numerical solution of the ODEs is obtained by automatically switching between the RungeKuttaFehlberg method in the absence of stiffness, and the Backward Differentiation Formulae of first order in presence of stiffness. The computational performance of LASSIE are assessed using a set of randomly generated synthetic reactionbased models of increasing size, ranging from 64 to 8192 reactions and species, and compared to a CPUimplementation of the LSODA numerical integration algorithm.
Conclusions
LASSIE adopts a novel finegrained parallelization strategy to distribute on the GPU cores all the calculations required to solve the system of ODEs. By virtue of this implementation, LASSIE achieves up to 92× speedup with respect to LSODA, therefore reducing the running time from approximately 1 month down to 8 h to simulate models consisting in, for instance, four thousands of reactions and species. Notably, thanks to its smaller memory footprint, LASSIE is able to perform fast simulations of even larger models, whereby the tested CPUimplementation of LSODA failed to reach termination. LASSIE is therefore expected to make an important breakthrough in Systems Biology applications, for the execution of faster and indepth computational analyses of largescale models of complex biological systems.
Background
Systems Biology is a multidisciplinary research field relying on the crosstalk between mathematical, computational and experimental tools to investigate the functioning of complex biological systems, and to predict how they might behave in both physiological and perturbed conditions. To this aim, different computational methods—e.g., parameter estimation, sensitivity analysis or reverse engineering [1, 2]—are usually exploited to define or calibrate the mathematical model that describes the system of interest. These methods require the execution of a large number of simulations, each one generally corresponding to a distinct model structure or parameterization, that is, to a different set of molecular interactions or to different initializations of the species amounts and/or reaction constants. As a result, the computational burden required by these computational analyses can rapidly overtake the capabilities of Central Processing Units (CPUs), therefore limiting indepth computational investigations to smallscale models consisting in a few tens of reactions and molecular species at most. Generalpurpose Graphics Processing Units (GPUs) can be exploited to overcome these drawbacks. Indeed, they are parallel multicore coprocessors that are drawing an evergrowing attention by the scientific community, since they give access to terascale performances on common workstations (and petascale performances on GPUequipped supercomputers [3]). As such, they can markedly decrease the running times required by traditional CPUbased software, still maintaining lowcosts and energetic efficiency. As a matter of fact, in the latter years GPUs have been widely adopted as an alternative approach to classic parallel architectures for the parallelization of computational methods in Systems Biology, Computational Biology and Bioinformatics [4].
In this work we propose LASSIE (LArgeScale SImulator), a novel GPUaccelerated software designed to simulate largescale reactionbased models of cellular processes, consisting in hundreds or thousands of reactions and molecular species. An example of killerapplication of LASSIE would consist in the simulation of rulebased models according to the socalled indirect methods (see [5, 6] for more information), especially when some proteins are characterized by multiple phosphorylation sites or binding domains, a condition that yields a combinatorial explosion of intermediate chemical complexes and chemical reactions [7]. We designed LASSIE as a general “blackbox” tool able to simulate, in principle, any largescale reactionbased biochemical system based on massaction kinetics (e.g., the ErbB signaling pathways modeled by Chen et al. [8]), given that the available GPU memory is sufficient to accommodate the necessary data structures. However, considering the difficulty in the manual definition of such massive models, LASSIE may be adopted as an efficient simulation engine for rulebased modeling tools (e.g., BioNetGen [9], PySB [10], Kappa [11]). As a matter of fact, rulebased modeling can generate extremely largescale systems characterized by very long simulation times: LASSIE may represent an enabling tool to prevent the application of advanced computational investigations of such biological models.
In silico simulations allow to determine the quantitative variation of molecular species amount in time and/or in space, by exploiting either deterministic, stochastic or hybrid algorithms [12–14]. In particular, when the concentrations of molecular species is high and the effect of biological noise can be neglected [15], Ordinary Differential Equations (ODEs) represent the typical modeling approach for cellular processes. Given a model parameterization (i.e., the initial state of the system and the set of kinetic parameters), the temporal dynamics of the system can be simulated by solving the ODEs using some numerical integrator, such as Euler or RungeKutta methods [16]. Unfortunately, ODEs can be affected by a wellknown phenomenon named stiffness [17], which occurs when the system of biochemical reactions is characterized by two wellseparated dynamical modes, determined by fast and slow reactions, respectively [18]. Stiffness can cause the stepsize of integration algorithms to reach extremely small values, thus increasing the overall running time. To solve this issue, advanced integration methods like LSODA [19] can be exploited, thanks to their capability of efficiently solving stiff systems. LSODA is able to recognize when a system is stiff and to dynamically select between the most appropriate integration algorithm: the Adams methods [16] in the absence of stiffness, and the Backward Differentiation Formulae (BDF) [20] otherwise. Despite the improvement of efficiency granted by LSODA, the numerical integration of the system of ODEs can become excessively burdensome when the numbers of reactions and molecular species increase. LASSIE overcomes this limitation by distributing over thousands of GPU cores all the calculations required by the numerical integration methods it embeds, therefore paving the way for fast simulations of largescale and stiff models of cellular processes. One interesting feature of GPUs is that they can have different characteristics, both in terms of resources (e.g., amount of high performance memories, number of cores) and computing power (e.g., clock rate). Kernels’ performances transparently scale on different GPUs, since they automatically leverage the additional resources offered by the latest architectures, a characteristic known as transparent scalability.
Notably, LASSIE was designed to be a “blackbox” deterministic simulator, not requiring any expertise in mathematical modeling nor any GPU programming skill. More precisely, given the formalization of a cellular process as a reactionbased model [21, 22] and assuming massaction kinetics [23, 24], LASSIE proceeds according to the following workflow: (1) it automatically generates the system of ODEs—one ODE for each molecular species occurring in the biochemical system—according to the biochemical reactions included in the model; (2) it automatically derives the Jacobian matrix, taking advantage of the symbolic derivation, to apply the BDF; (3) it executes the numerical integration of the ODEs by automatically switching between the RungeKuttaFehlberg (RKF) [25] method in the absence of stiffness and firstorder BDF (also known as Backward Euler method) [20] in presence of stiffness. We point out that LASSIE is a fully automatic simulator: the user does not need to enter the ODEs directly. On the contrary, the input consists in a set of (parameterized) chemical reactions, specified by means of text files. The corresponding system of ODEs is automatically determined according to the massaction kinetics, making LASSIE usable without any prior knowledge about ODEs modeling and integration. In order to further simplify the execution of simulations, LASSIE is provided with a userfriendly Graphical User Interface (Fig. 1), whose functioning is described in Additional file 1. A comprehensive description of the input files is provided in Additional file 2.
The computational performances of LASSIE are assessed by measuring the running time required to simulate a set of randomly generated synthetic reactionbased models of increasing size—ranging from 64 to 8192 reactions and species—which is compared to the running time required by a CPUimplementation of LSODA. Moreover, we show the accuracy of LASSIE by comparing its outcome with LSODA outcome for the simulation of a model of the Ras/cAMP/PKA signal transduction pathway in S. cerevisiae [26], which is characterized by stiffness.
We highlight that, in general, the implementation of computational methods able to fully exploit the peculiar architecture of GPUs is challenging, since specific programming skills are required and a complete algorithm redesign is often necessary. For instance, the parallelization on the GPU cores can rely either on a coarsegrained or a finegrained strategy. The first strategy allows to simultaneously run a massive number of independent simulations (each one characterized by, e.g., a different model parameterization); on the contrary, the second strategy consists in the parallelization of all the calculations required by a single simulation, an approach that is more suitable for largescale models. By virtue of the novel finegrained parallelization strategy used to implement LASSIE, our GPUpowered simulator achieves up to 92× speedup with respect to LSODA.
Coarsegrained parallelizations of deterministic simulations were presented in [27–29]. The simulators proposed in these works allow to reach a speedup ranging from 28× to 86× with respect to the corresponding CPUbased simulators. Finegrained parallelizations of stochastic simulations were presented in [30, 31]. Komarov and D’Souza proposed GPUODM [30], a finegrained simulator of largescale models based on the Stochastic Simulation Algorithm (SSA) [32]. This tool uses special data structures and functionalities to efficiently distribute all calculations over the multiple cores of GPUs. These optimizations allow GPUODM to outperform the most advanced CPUbased implementations of SSA. Komarov et al. also proposed a GPUpowered finegrained implementation of τleaping [31], an approximate but accurate stochastic algorithm that is, in general, faster than SSA [33]. This tool was shown to be more efficient than its sequential counterpart in the case of extremely large biochemical networks (i.e., characterized by more than 10^{5} reactions). Notably, to the best of our knowledge, no examples of finegrained deterministic simulators, such as LASSIE, have been proposed so far.
LASSIE was developed using the most widespread GPU computing library, namely, Nvidia Compute Unified Device Architecture (CUDA). CUDA allows programmers to exploit the GPUs for generalpurpose computational tasks (GPGPU computing). Nevertheless, the direct porting of an application to the GPU is usually unfeasible, so that the full exploitation of the computational power and of the massive parallelism of GPUs still represent the main challenges of GPGPU computing. To exploit the CUDA architecture, the programmer implements C/C++ functions (called kernels), which are loaded from the CPU (the host) to one or more GPUs (the devices), and replicated in many copies named threads. CUDA organizes threads in threedimensional structures called blocks, which belong to threedimensional structures named grids, as shown in Fig. 2 (left side). CUDA combines the Single Instruction Multiple Data (SIMD) architecture and a flexible multithreading in order to handle any conditional divergence between threads. Figure 2 (right side) also shows a schematic representation of CUDA’s memory hierarchy: the global memory (accessible from all threads), the shared memory (accessible from threads belonging to the same block), the local memory (each thread has its registers and arrays) and the constant memory (cached and read only). The global memory is large (a few GBs) but suffers from high access latencies; this problem, anyway, was mitigated thanks to the use of L1 cache since the introduction of the Fermi architecture. On the contrary, the constant memory is much smaller (i.e, up to 10 KB for each multiprocessor) but faster than the global memory, as well as the shared memory (i.e., up to 112 KB for each multiprocessor limited to 48 KB for each block); in particular, the latter should be exploited as much as possible in order to obtain the best performances. Though, the size of the shared memory and its scope restrict the possibility to use it, as only threads belonging to the same block can communicate through the shared memory.
Given the peculiar features of CUDA architecture, GPU programmers should be able to optimize both threads partitioning and memory usage, as well as to redesign the algorithm with appropriate kernels, in order to fully leverage the computational power of these multicore devices. For instance, in the implementation of LASSIE, the shared memory is not used because several blocks are exploited to solve the system of ODEs, and threads do not communicate data with each other. Moreover, the data structures employed by LASSIE are larger than the total size of the shared memory, thus preventing the possibility to exploit it. In what follows, we show how GPU programming and CUDA features—including builtin support for vector types, which extend the standard C data types to vector—have been exploited to optimize the execution workflow of LASSIE.
The paper is structured as follows. In the next section we briefly introduce the formalism of reactionbased models and provide a general description of LASSIE’s implementation. Then, we discuss the computational performance of LASSIE, showing the speedup it achieves with respect to LSODA for the simulation of reactionbased models of different sizes. We also analyze how the number of reactions and the number of species affect the performances of LASSIE. We conclude the work with some final remarks about CUDA’s architecture and LASSIE, proposing future improvements of the simulator. LASSIE is available on the GITHUB repository https://github.com/aresio/LASSIE.
Methods
LASSIE is designed to be a “blackbox” deterministic simulator, created to be easily used without any GPU programming or ODEs modeling skills. In this section we describe how LASSIE allows to perform deterministic simulations of largescale biochemical models, distributing all required calculations on the cores of the GPU. It is worth noting that the parallelization strategy exploited by LASSIE represents one of the novelties of this work, and allowed to achieve the remarkable performance results presented in the next sections. In particular, LASSIE has been developed to solve systems of coupled ODEs specified in the form \(\frac {d\mathbf {X}}{dt} = f(t, \mathbf {X})\), where X≡X(t) represents the vector of concentration values at time t of all chemical species occurring in the system.
Reactionbased models and ODEs generation
Reactionbased modeling is a mechanistic, quantitative and parametric formalism to describe and simulate networks of biochemical reactions [22], which was exploited to analyze different signal transduction pathways (see, e.g., [34–39]). A reactionbased model is defined by specifying the set of N molecular species {S _{1},…,S _{ N }} and the set of M biochemical reactions {R _{1},…,R _{ M }} which appear in the cellular process under investigation [22]. A generic reaction is described as follows:
where a _{ ij }, \(b_{ij} \in \mathbb {N}\) are the stoichiometric coefficients and \(k_{i} \in \mathbb {R}^{+}\) is the kinetic constant associated with R _{ i }.
The set of reactions {R _{1},…,R _{ M }} can be written compactly in the matrixvector form \(\mathbf A \mathbf S \xrightarrow {\mathbf K} \mathbf B \mathbf S\), where S=[S _{1}⋯S _{ N }]^{T} is the Ndimensional column vector of molecular species, K=[k _{1}⋯k _{ M }]^{T} is the Mdimensional column vector of kinetic constants, and A,B∈N ^{M×N} are the socalled stoichiometric matrices whose (nonnegative) elements [A]_{ i,j } and [B]_{ i,j } correspond to the stoichiometric coefficients a _{ ij } and b _{ ij } of the reactants and the products of all reactions, respectively. Since a reaction simultaneously involving more than two reactants has a probability to take place almost equal to zero, here we consider only first and secondorder reactions (i.e., at most two reactant molecules of the same or different species can appear in the left hand side of Eq. 1). For this reason, the matrices A and B are sparse.
Given an arbitrary reactionbased model and assuming the law of massaction [24, 40], it is possible to derive the corresponding system of coupled ODEs that describes the variation in time of the species concentrations. Specifically, by denoting the concentration of species S _{ j } at time t as X _{ j }, where X _{ j }∈R ^{≥0} for j=1,…,N, the system of coupled ODEs can be obtained as follows:
where X is the Ndimensional vector of concentration values at time t (representing the state of the system at time t), the symbol ∘ denotes the entrybyentry matrix multiplication, and X ^{A} denotes the vectormatrix exponentiation form [40]. Formally, X ^{A} is a Mdimensional vector whose ith component is given by \(X_{1}^{Ai1} \cdots X_{N}^{AiN}\), for i=1,…,M.
We highlight that each ODE appearing in Eq. 2 is a polynomial function, consisting in at least one monomial that is associated with a specific kinetic constant.
Data structures and CUDA memory usage
Given a reactionbased model as input, LASSIE automatically generates the systems of ODEs according to Eq. 2 and encodes the matrices A and H=(B−A)^{T} as two arrays of short4 CUDA vector types, named V A and V H, respectively.
CUDA vector types are multidimensional data ranging from 1 to 4 components, addressed by .x, .y, .z, and .w. Since the matrices A and H are sparse, LASSIE uses compressed data structures created by removing all zero elements from A and H, in order to save memory and avoid unnecessary readings from the global memory. Namely, let h _{ ji } be the element of H at row j and column i, and a _{ ij } the element of A at row i and column j, for i=1,…,M and j=1,…,N. For each nonzero element of H, we store into the .x and .y components of V H the values j and i, respectively; the .z component of V H is used to store the element h _{ ji }, while the .w component stores the index of the kinetic constant associated with that monomial. Similarly, for each nonzero element of A, the .x and .y components of V A contain the values i and j, respectively. The value a _{ ij } is stored into the .z component of V A, while the .w component is left unused. Note that we exploited the short4 CUDA vector type rather than the short3 CUDA vector type, because the former is 8aligned and requires a single instruction to fetch a whole entry, while the latter is 2aligned and thus takes three memory operations to read each entry. In order to parse these arrays inside the GPU, we use two additional arrays of short2 CUDA vector types, named O _{ H } and O _{ A }, which store the offsets used to correctly read the entries of the V H and V A structures, respectively. The .x and .y components of each row of O _{ H } contain, respectively, the first index and the last index to access the V H structure. Each thread uses its own pair of indexes to read the rows of the V H structure between the first index and the last one. Similarly, O _{ A } stores the indexes that allow to correctly access the V A structure. Finally, the values of the kinetic constants are stored into an array of type double, named K. Figure 3 shows an example of the matrix encoding used in LASSIE.
Thanks to these CUDA structures, we obtain a twofold performance improvement: (i) at the instruction level, a single instruction is enough to either load or store a multiword vector. So doing, the total instruction latency for a particular memory transaction is lower and also the bytes per instruction ratio is higher; (ii) at the memory controller level, by using vector types a transfer request from a warp has a larger net memory throughput per transaction, yielding a higher bytes per transaction ratio. With a fewer number of transfer requests, the memory controller is able to reduce contentions producing a higher overall memory bandwidth utilization. The only limitation due to short data type is that indices are limited to 2^{2×8}−1, which means that LASSIE cannot simulate systems larger than 65 536 chemical species and reactions.
Execution workflow and CUDA kernels
Once that the system of ODEs is generated by reading the input files (see Additional file 2) and appropriately stored according to the CUDA vector types, LASSIE solves it by automatically switching between the RungeKuttaFehlberg (RKF) method [25] in the absence of stiffness, and the Backward Differentiation Formulae (BDF) methods [20] in presence of stiffness. The integration of the systems of ODEs is carried out from an initial time instant t _{0}, up to a given maximum simulation time t _{ max }. In order to reproduce the dynamics of the cellular process described by the ODEs, the concentration values of the molecular species appearing in the reactionbased model are saved at specified time steps within the interval [t _{0},t _{ max }] (such time steps might correspond, e.g., to the sampling times of laboratory experiments).
LASSIE’s workflow consists in 6 distinct phases, as represented in Fig. 4. Note that phases P _{1}, P _{4} and P _{6} are executed by the host (yellow boxes in Fig. 4), while P _{2}, P _{3} and P _{5} are executed by the device (green boxes in Fig. 4). Overall, phases P _{2}, P _{3} and P _{5} rely on 25 different lightweight kernels, which were specifically developed to fully leverage the parallel architecture of the GPU for the implementation of the aforementioned numerical integration methods. We describe hereafter the main design and implementation choices of each phase and their related CUDA kernels, which result in a novel parallelization strategy with respect to stateoftheart methodologies (see, e.g., [41]).
Phase P _{ 1 } . It implements the generation of all data structures used to encode the ODEs, as described in the previous section. This phase is executed on the host.
Phase P _{ 2 } . It is used to sample and save the system dynamics and it is implemented by means of a single CUDA kernel (kernel K _{1}). In particular, if the current simulation time t corresponds to one of the specified sampling time instants, LASSIE saves the concentration values of (possibly, a subset of) all molecular species into an array defined on the GPU. Otherwise, the execution proceeds to the next phase.
Phase P _{ 3 } . It implements the RKF method [42], an explicit integration algorithm with variable stepsize used by each thread j to solve the jth ODE, for j=0,…,N−1. This phase is implemented as 9 CUDA kernels.
During this phase, two different approximated states u(t+d t) and w(t+d t) of the state X(t+d t) of the system are generated at each step, thanks to the evaluation of six supplementary values l _{1},…,l _{6} (see details in Additional file 3). To evaluate the accuracy of u and w at the current stepsize dt, LASSIE exploits a userdefined vector tolerance ε∈R ^{N} (with ε _{ j }>0 for all j=1,…,N), and two additional arrays, E R,δ∈R ^{N}, defined as follows:
If E R _{ j }≤ε _{ j } for all j=1,…,N, then u is accepted as new state of the system, that is, X(t+d t)=u(t+d t); otherwise, the solutions u and w are rejected and recalculated by using a new stepsize. The new stepsize is computed as d t=d t· min{δ _{1},…,δ _{ N }}, being δ _{1},…,δ _{ N } the components of vector δ (note that the new value of dt has to be chosen in order to satisfy the requested error tolerance for all ODEs).
Overall, phase P _{3} is implemented by means of the following kernels:

kernel K _{2}: used to evaluate each ODE at the current state X of the system;

kernels K _{3} – K _{8}: each thread j, for j=0,…,N−1, computes the components l _{1j },…,l _{6j } of l _{1},…,l _{6}, by invoking kernel K _{2};

kernel K _{9}: each thread j, for j=0,…,N−1, computes the components w _{ j } and u _{ j } of the approximated states u and w, respectively;

kernel K _{10}: each thread j, for j=0,…,N−1, calculates the components E R _{ j } and δ _{ j } of E R and δ, respectively.
Phase P _{ 4 } . It is used to verify the RKF solutions calculated during phase P _{3} and, accordingly, to choose the next phase to be executed: (i) if the solutions are rejected and the new stepsize dt is acceptable (that is, d t≥ε _{ s }, for some ε _{ s }>0, e.g., ε _{ s }=10^{−6}), phase P _{3} is executed again exploiting a smaller stepsize dt; (ii) if the solutions are rejected and the new stepsize dt becomes too small (that is, d t<ε _{ s }), LASSIE executes phase P _{5}; (iii) if all solutions do not violate the specified RKFtolerance vector ε, then LASSIE executes phase P _{6}.
Note that point (ii) implicitly states that the system of ODEs is considered to be stiff, so that LASSIE automatically switches to phase P _{5}, where BDF methods are used for the numerical integration. Phase P _{4} is executed on the host.
Phase P _{ 5 } . It implements the BDF methods, the most widely used implicit multistep numerical integration algorithms [43]. LASSIE switches to this phase if and only if the RKF solutions u and w evaluated during phase P _{4} are rejected, and the RKF stepsize dt becomes smaller than ε _{ s }.
The general formula for a BDF can be written as
where the coefficients α _{ i } (with α _{0}=1) and β _{0} are chosen according to the order q of BDF [43], and dt is userdefined. Note that, for q>6, the absolute stability region of the resulting BDF methods is too small and such BDFs are numerical unstable [44]. Therefore, BDFs with an order q greater than 6 are not used. Since each BDF is an implicit method, at each time step it requires the solution of a nonlinear system of equations, which can be solved by using the iterative Newton—Raphson method [45]. This algorithm allows to find successively better approximations z of the zeros of a realvalued function f(z)=0 by using the derivative of f(z), and it is repeated until a sufficiently accurate value is reached. This idea can be extended to a system of nonlinear equations, by using the Jacobian matrix J(t,X(t)) of f(t,X(t)), which is the matrix of all firstorder partial derivatives. Since the evaluation of the Jacobian matrix at each iteration is computationally expensive, LASSIE actually exploits: (i) a modified Newton—Raphson method [46]; (i i) the LU factorization method [47] (we refer the interested reader to Additional file 3 for technical details). During phase P _{5}, the NewtonRaphson method is iterated until a userdefined maximum number of iterations m a x _{ it } is reached, or a sufficiently accurate value is achieved (i.e., smaller than a userdefined tolerance value ε _{ NR }).
Overall, phase P _{5} is implemented by means of the following kernels:

kernel K _{11}: each thread j, for j=0,…,N−1, derives the jth row of the Jacobian matrix and evaluates it on the current state of the system X;

kernel K _{12}: the Jacobian matrix is transposed in order to exploit the LU factorization method (accelerated on GPU by the cuBLAS library [48]);

kernels K _{13} – K _{18}: based on the order q of the BDF, LASSIE invokes one of these kernels (i.e., kernel K _{13} for q=1, kernel K _{14} for q=2,..., kernel K _{18} for q=6) to calculate the known terms of the linear system;

kernels K _{19} – K _{24}: each kernel K _{(18+q)}, q=1,…,6, performs the calculations of the qth order BDF;

kernel K _{25}: it updates the iteration vector needed to execute the NewtonRaphson method (see Additional file 3).
Phase P _{ 6 } . It is used to verify the termination criterion: if the maximum time t _{ max } is reached, then the simulation ends. On the contrary, the execution iterates from phase P _{2}. This phase is executed on the host.
All the temporary results computed by LASSIE are stored on the GPU, since data transfers between the host and the device are very time consuming. For the same reason, the output data (i.e., the concentration values of molecular species sampled at fixed time instants) are transferred to the host as soon as the whole simulation is completed.
Results and discussion
In this section we compare the computational performance of LASSIE against LSODA [19], which is generally considered one of the best numerical integration algorithms for deterministic simulations of biological systems, thanks to its capability of dealing with stiff and nonstiff systems. In particular, we exploited the LSODA implementation provided by SciPy library [49] (version 0.15.1), written in C language. LASSIE was run on a machine with a GPU Nvidia GeForce Titan GTX, based on the Kepler architecture and equipped with 2×15 streaming multiprocessors for a total of 5760 cores (clock 837 MHz) and a theoretical peak processing power of 1.3 TFLOPS in double precision. Instead, LSODA was run on GALILEO, a supercomputer created by the Italian consortium CINECA. GALILEO consists of 516 compute nodes, each one equipped with 2 CPUs octacore Intel Xeon Haswell E52630 v3 (clock 2.40 GHz) for a total of 8256 cores, and 128 GB of RAM. Each CPU is capable of about 300 GFLOPS in double precision. In our tests, we exploited one node with 120 GB of RAM distributed over 5 cores.
The computational performance was evaluated by simulating a set of synthetic reactionbased models of increasing size, that is, having a number of reactions and species M×N arbitrarily chosen in the range from 64×64 to 8192×8192. The models were generated considering the methodology used in [30, 50], which was modified in order to randomly sample the initial concentration of each species with a uniform distribution in the range [0,1), and the kinetic constant of each reaction with a logarithmic distribution in the range [10^{−8},1).
For each model size M×N, we generated and simulated 30 different synthetic reactionbased models to the aim of measuring the average running time of both LASSIE and LSODA. The simulation of each reactionbased model was performed multiple times, using different settings for the sampling of the timeseries. Specifically, in each repetition, we saved either 10,50,100,500 or 1000 samples of the system dynamics of all chemical species, at regular intervals. All simulations were halted at time t _{ max }=50 (arbitrary units).
All simulations were executed—independently from the size of the model and the number of samples saved—by setting the following parameters of LASSIE:

tolerance of RKF method ε _{ j } = 10^{−12}, j=1,…,N;

first–order BDF method (q=1);

BDF integration step d t=0.1;

tolerance of NewtonRaphson method ε _{ NR } = 10^{−6};

maximum number of iterations allowed during each call of the NewtonRaphson method m a x _{ it }=10^{4};

initial integration step of RKF method equal to 10^{−3};

tolerance value to switch between RKF and Backward Euler methods ε _{ s }=10^{−6}.
The following parameters of LSODA were used to run the simulations:

relative tolerance equal to 10^{−6};

absolute tolerance equal to 10^{−12};

maximum number of internal steps equal to 10^{4}.
Table 1 reports the values of the average running times (given in seconds) of LSODA and LASSIE, required for the execution of each set of 30 different synthetic reactionbased models of size M×N, each time considering 10,50,100,500,1000 samples of the system dynamics of all chemical species. The speedup values achieved by LASSIE with respect to LSODA are given in Table 1 and graphically represented in Fig. 5, for each tested case; note that when the speedup value is greater than one, LASSIE is faster than LSODA, and vice versa. The breakeven (blue line in Fig. 5) between the performances of LASSIE and LSODA is observed when the number of reactions and chemical species is between 128 and 256. Specifically, in the case of 256×256 model size and 10 samples, the running time of LSODA is almost twice with respect to LASSIE: 1.28 s vs. 0.67 s. In particular, we emphasize that the execution of the simulations for models characterized by 4096 reactions and 4096 species with 10 samples takes, on average, 249.8 s with LSODA and just 2.71 s with LASSIE, resulting in around 92× speedup. Furthermore, LASSIE allows the simulation of largescale models (e.g., 8192×8192) thanks to its smaller memory footprint with respect to LSODA, taking just 14.13 s to simulate the model characterized by 8192 reactions and 8192 species with 10 samples. Conversely, the version of LSODA implemented in SciPy library has a high memory footprint that does not allow to simulate models of this size on GALILEO, the supercomputer employed to perform the simulations.
Figure 5 also points out how the number of samples of the dynamics affects the performances of LASSIE, due to the different number of accesses to the highlatency global memory. For instance, the speedup achieved with the model characterized by 4096 reactions and 4096 species decreases to 14.6× with 1000 samples, meaning that the simulations with 1000 samples are around 6× slower than the simulations with 10 samples. In models characterized by 2048 reactions and 2048 species, the speedup obtained with 10 samples (37.2×) is around 3× larger compared to the one achieved with 1000 samples (11.4×), while in models characterized by 1024 reactions and 1024 species, the speedup obtained with 10 samples (16.5×) is around 2× larger compared to the one achieved with 1000 samples (6.9×). Finally, Fig. 6 shows that the running time of LASSIE increases with the number of samples, while LSODA is characterized by an almost constant running time, irrespective of the number of samples. It is worth noting that CPUbound integration methods like LSODA can be more efficient in the case of smallscale models. This is due to two concomitant circumstances. On the one hand, the clock frequency of CPUs is higher than the clock frequency of GPU (2.4 GHz with respect to 837 MHz, in the case of the hardware used to execute our tests). On the other hand, the communication and synchronization between threads can introduce a significant overhead, which is mitigated only when the calculations are distributed over a relevant number of threads; therefore, LASSIE becomes profitable for medium/largescale models characterized by hundreds of species. Notably, the bigger the model, the greater the speedup.
As an additional test, we investigated whether the relationship between the number of reactions and the number of species could affect the overall performances of LASSIE. As the number of chemical species corresponds to the number of ODEs, the length of each ODE is roughly proportional to the number of reactions. Since GPUs have a lower clock frequency than CPUs (e.g., in the case of the hardware used for the tests, 837 MHz with respect to 2.4 GHz, respectively), each GPU core is slower than the CPU core to perform a single instruction^{1}. For this reason, in order to obtain the highest performances, the calculations on the GPU should be spread across threads as much as possible, while the number of operations performed by each thread should be reduced.
Indeed, as reported in Table 2 and shown in Fig. 7, when the number of chemical species involved in a model is greater than the number of reactions, LASSIE achieves better performances than those obtained in the case of models with a number of chemical species smaller than the number of reactions. For instance, considering the models with M×N equal to 171×512, the running time of LASSIE is smaller than in the case of the models with size 512×171, irrespective of the number of samples of the system dynamics, thanks to the higher number of threads that are concurrently launched on the GPU in the first case. This is in general valid in all cases with the exception of the models characterized by 2048 chemical species with 500 and 1000 samples of the system dynamics. Here, the average running time of LASSIE is greater than in the case of models with 2048 reactions, since the required number of accesses to the highlatency global memory of the GPU impairs the performances of the simulations.
In order to assess the scalability of LASSIE, and of CUDA applications in general, we executed additional tests on different GPUs. Figure 8 shows a comparison of LASSIE’s performance using three different GPU models (Table 3): a notebook video card (Nvidia GeForce 960M, red bars), the Nvidia GeForce Titan Z used throughout the paper (green bars), and a Teslaclass GPU (the Nvidia K20c, blue bars). To compare the speedup provided by these GPUs we generated 30 different synthetic models (characterized by size M×N equal to 1024×1024, 2048×2048 and 4096×4096) and calculated the average running time.
Our results highlight the importance of two distinct factors on LASSIE’s performances: the GPU’s clock frequency and the amount of available resources (in this case, the cores). As a matter of fact, despite the lower amount of CUDA cores, the GeForce 960M turns out to be competitive on models of moderately large size thanks to its higher clock rate, with respect to the Titan Z and the K20c. When the ODEs largely outnumber the available cores (e.g., for 4096 reactions and chemical species), the GeForce 960M is no longer competitive. This is an example of transparent scalability of CUDA applications: the threads are automatically distributed over the available cores, improving the overall performances, without any user intervention. Moreover, as described in the Background section, threads are organized in blocks that are scheduled on the available multiprocessors. Thanks to this characteristic, when the overall number of threads outnumbers the available cores, CUDA automatically creates a queue of blocks that are scheduled on the streaming multiprocessors as soon as they become available for computation. Thus, LASSIE can, in principle, simulate any model on any GPU, as long as there is enough memory to store the data structures.
The Tesla K20c is characterized by a large amount of cores that, in the case of 1024×1024 models, are fully exploited only during the simulation of the stiff parts of the dynamics. For the remaining parts of the simulation, half of its cores are actually used for computation with a slower clock rate with respect to the clock rate of the GeForce GPUs. Moreover, Tesla cards exploit Error Correcting Codes (ECC) on memories, ensuring additional checks of correctness to the data against potential corruption from electrical or magnetic interference, at the price of a significant overhead [51]. The ECC was enabled during all tests, partly explaining the reduced performance of the Tesla K20c on very largescale models with respect to the Titan Z.
We assessed the accuracy of LASSIE by simulating the dynamics of the model of the Ras/cAMP/PKA signaling pathway in yeast presented in [26], and comparing the outcome of LASSIE with the result of the simulation performed with LSODA. We also investigated the influence of LASSIE parameters (e.g., tolerance values) on the running times and quality of the simulated solutions, by exploiting a model representing a chain of isomerizations. The accuracy results—which show an identical dynamics with respect to LSODA using default settings—are presented in the Additional file 4.
As a final remark, we highlight that a fair comparison of GPUs and CPUs is a difficult task, in general, due to their deep architectural differences. The theoretical peak performances of both architectures are difficult to achieve: indeed, developers must implement code to the aim of maximizing the parallelism and the occupancy of the multiprocessors, adhering as much as possible to the underlying SIMD computational model in the case of the GPU and exploiting vector instructions in the case of the CPU. However, GPUs allow the temporary divergence of the execution flow of threads, that is, a part of the threads can execute different portions of the code (e.g., the branches of an IF/THEN/ELSE statement). When this situation occurs, some threads get stalled waiting for reconvergence. This mechanism provides the programmer with a certain degree of freedom to abandon the SIMD paradigm, but at the same time it can potentially lead to the complete serialization of the execution affecting the overall performances. Hence, conditional branches should be avoided as much as possible. We also highlight that the usage of registers and shared memory influences the occupancy of the GPU, as these resources are scarce on each streaming multiprocessor. All these circumstances can prevent the achievement of the peak computational power of a GPU.
To this aim, we developed kernels that maximize the parallelism and the occupancy of the multiprocessors avoiding threads divergence as much as possible. Moreover, we optimized data structures to store the matrices A and H that encode the system of ODEs, and CUDA vector types that allow to increase the memory throughput and to reduce the number of memory accesses, all precautions that explain the performance boost achieved with LASSIE.
Conclusions
In this work we presented LASSIE, a GPUpowered simulator of largescale biochemical systems based on massaction kinetics. LASSIE is a “blackbox” simulator able to automatically convert reactionbased models of biological systems into the corresponding systems of ODEs. Reactionbased models defined according to the law of massaction do not hinge upon the use of any approximate kinetics functions (e.g., MichaelisMenten rate law for enzymatic processes [23], Hill functions for cooperative binding [52], etc.), which are frequently used in Systems Biology for the definition of mathematical models based on differential equations. Although MichaelisMenten kinetics or Hill functions can be useful in biological modeling, they rely on chemical assumptions that are valid only in certain conditions [53]. Therefore, the reason why we rely on massaction based models is manifold. On the one hand, since the biological function and biochemical kinetics of all molecular species and all reactions appearing in the model are not approximated nor lumped together in any way, they can be analyzed independently from each other. As a consequence, this allows to determine the influence of every single species and reaction on the overall functioning of the system. On the other hand, the law of massaction allows to derive a first order ODE for each species appearing in the model: it is worth noting that such ODE is a polynomial function that describes how the concentration of that species changes in time, according to all the reactions where it appears either as reactant or product [24]. The presence of polynomial functions simplifies the symbolic derivation that is needed to calculate the Jacobian matrix associated with the ODEs and exploited by the BDF. In addition, as described in the Methods section, polynomials can be efficiently encoded in the memory and parsed GPUside. As a result, all GPU threads can perform the same task (i.e., polynomial decoding and evaluation), strongly reducing warps’ divergence and the consequent stalling of threads due to serialization, a circumstance that would instead happen if each thread calculated an ODE characterized by an arbitrary kinetics. In order to solve systems of ODEs characterized by stiffness, LASSIE automatically switches between the RKF and the BDF integration methods. LASSIE’s execution flow is partitioned into 25 CUDA kernels, overall distributing the calculations over the available cores in order to fully exploit the massive parallel capabilities of modern GPUs, therefore achieving a relevant reduction of the running time in case of largescale models.
In order to assess the computational performance of LASSIE, we performed a set of simulation tests using synthetic reactionbased models of increasing size, and we compared LASSIE’s running time with respect to the LSODA numerical integration algorithm implemented in the SciPy library. The breakeven between the performances of LASSIE and LSODA was observed when both the numbers of reactions and chemical species is in between 128 and 256. This result indicates that, for biological systems consisting in more than 256 reactions and 256 species, the GPUpowered simulator becomes more convenient than the LSODA algorithm running on CPU. Indeed, in the case of largescale models, characterized by 4096 reactions and 4096 species, we obtained a considerable 92× speedup. Moreover, thanks to its smaller memory footprint with respect to LSODA, LASSIE allows the simulation of even larger models, taking just an average of 14.13 s to simulate models characterized by 8192 reactions and 8192 species. On the contrary, LSODA did not allow the simulation of models of this size on the computer we used for the tests, as it crashed because of its very high memory footprint. We also highlight that COPASI [54], one of the most used software in Systems Biology, requires in general longer execution times with respect to the SciPy implementation of LSODA exploited in this work. In addition, COPASI fails when trying to simulate largescale models. We provide an example of such model—characterized by 4096 reactions and 4096 species—as SBML file in the GITHUB repository.
BDFs are the most used integration algorithms to solve systems of ODEs in case of stiffness. The first–order BDF is a singlestep implicit integration method, meaning that the next state of the system depends only on the current state of the system. Higher–order BDFs are multistep methods, so that the next state of the system relies on multiple previous states of the system (i.e., the number of previous states is equal to the BDF order). This implies that the integration stepsize should be the same for all previous states to ensure the correctness of the solution. For this reason, LSODA uses the multistep Adams methods as explicit methods in addition to BDFs. Conversely, LASSIE uses the RKF method, which is a singlestep explicit algorithm with variable stepsize, and the Backward Euler method. Other singlestep implicit methods belonging to the family of RungeKutta methods exist [55], the most known being the families of Lobatto and Radau methods [56]. These methods have been proven to be suitable for stiff systems, thanks to their accuracy and stability [56]. As a future development of LASSIE, we will investigate the feasibility and efficacy of replacing the Backward Euler and, more generally, the BDFs with implicit RungeKutta methods [57], in particular Lobatto and Radau methods.
In order to fully exploit the CUDA architecture, the memory hierarchy must be exploited as much as possible. Because of the peculiar sequential structure of both explicit and implicit integration algorithms, LASSIE’s kernel are lightweight and rarely reuse any variables. For this reason, the current implementation only leverages the global memory (characterized by high latencies) and registers to manipulate the mutable data. The shared memory has not been exploited in any way, leaving room for potential future improvements of performances. However, on the GPUs where the L1 cache and the shared memory share the same resources, CUDA allows to express a preference to assign a larger amount of memory to the caching mechanisms. This functionality is enabled by default in LASSIE using the CUDA cudaFuncSetCacheConfig primitive, executed with the cudaFuncCachePreferL1 argument. Also the constant memory has not been used, since all data structures are larger than the total size of this memory. In a future release, we plan to leverage these memories, for instance to store the array of the kinetic constants and the structures containing the offsets used to correctly decode the ODEs. LASSIE currently exploits only a single GPU, even on multiGPU systems; as a future improvement of this work we plan to extend it in order to support multiGPU systems, to further increase the size of the models to be simulated.
Finally, an additional goal in the development of LASSIE is to integrate and accelerate the investigation of rulebased models. In the next future, we plan to develop a set of tools that will leverage the functionalities offered by rulebased modeling frameworks [9–11], to convert a rulebased model into a set of reactions. Although rulebased tools already provide internal simulation methods (e.g., PySB allows to perform deterministic simulations using advanced integrators like LSODA), LASSIE can represent a valuable alternative for largescale models, enabling the investigation of more detailed biological systems, paving the way to potential new discoveries in Systems Biology. LASSIE will be also integrated in COSYS, a free webbased platform for Systems Biology investigation available at http://www.sysbio.it/cosys [58].
Endnote
^{1} The advances in GPU’s technology will progressively reduce this gap. In middle 2016—with the introduction of the novel Pascal architecture and the 16 nm FinFET manufacturing process—Nvidia presented a GPU with a clock frequency of 1.7 GHz that, theoretically, is expected to double LASSIE’s performances.
Abbreviations
 BDF:

Backward Differentiation Formulae
 CPU:

Central Processing Unit
 CUDA:

Compute Unified Device Architecture
 ECC:

Error Correcting Codes
 GPGPU:

Generalpurpose GPU
 GPU:

Graphics Processing Unit
 LASSIE:

LArgeScale SImulator
 ODE:

Ordinary Differential Equation
 RKF:

RungeKuttaFehlberg
 SIMD:

Single Instruction Multiple Data
 SSA:

Stochastic Simulation Algorithm
References
Aldridge BB, Burke JM, Lauffenburger DA, Sorger PK. Physicochemical modelling of cell signalling pathways. Nat Cell Biol. 2006; 8(11):1195–203.
Chou IC, Voit EO. Recent developments in parameter estimation and structure identification of biochemical and genomic systems. Math Biosci. 2009; 219(2):57–83.
Joubert W, Archibald R, Berrill M, Brown WM, Eisenbach M, Grout R, Larkin J, Levesque J, Messer B, Norman M, Philip B, Sankaran R, Tharrington A, Turner J. Accelerated application development: The ORNL Titan experience. Comput Electr Eng. 2015; 46:123–38.
Nobile MS, Cazzaniga P, Tangherloni A, Besozzi D. Graphics processing units in bioinformatics, computational biology and systems biology. Brief Bioinform. 2016;2016(bbw058).
Chylek LA, Harris LA, Tung CS, Faeder JR, Lopez CF, Hlavacek WS. Rulebased modeling: a computational approach for studying biomolecular site dynamics in cell signaling systems. Wiley Interdisci Rev Syst Biol Med. 2014; 6(1):13–36.
Chylek LA, Stites EC, Posner RG, Hlavacek WS In: Prokop A, Csukás B, editors. Innovations of the rulebased modeling approach. Dordrecht: Springer: 2013. p. 273–300.
Blinov ML, Faeder JR, Goldstein B, Hlavacek WS. A network model of early events in epidermal growth factor receptor signaling that accounts for combinatorial complexity. Biosystems. 2006; 83(2):136–51.
Chen WW, Schoeberl B, Jasper PJ, Niepel M, Nielsen UB, Lauffenburger DA, Sorger PK. Input–output behavior of ErbB signaling pathways as revealed by a mass action model trained against dynamic data. Mol Syst Biol. 2009; 5(1):239.
Blinov ML, Faeder JR, Goldstein B, Hlavacek WS. BioNetGen: software for rulebased modeling of signal transduction based on the interactions of molecular domains. Bioinformatics. 2004; 20(17):3289–91.
Lopez CF, Muhlich JL, Bachman JA, Sorger PK. Programming biological models in Python using PySB. Mol Syst Biol. 2013; 9(1):646.
Feret J, Danos V, Krivine J, Harmer R, Fontana W. Internal coarsegraining of molecular systems. Proc Natl Acad Sci USA. 2009; 106(16):6453–458.
Wilkinson D. Stochastic modelling for quantitative description of heterogeneous biological systems. Nat Rev Genet. 2009; 10(2):122–33.
Székely Jr T, Burrage K. Stochastic simulation in systems biology. Comput Struct Biotechnol J. 2014; 12(20–21):14–25.
Harris LA, Clancy P. A “partitioned leaping” approach for multiscale modeling of chemical reaction dynamics. J Chem Phys. 2006; 125(14):144107.
Eldar A, Elowitz MB. Functional roles for noise in genetic circuits. Nature. 2010; 467(7312):167–73.
Butcher JC. Numerical Methods for Ordinary Differential Equations. Chichester West Sussex: Wiley; 2008.
Higham DJ, Trefethen LN. Stiffness of ODEs. BIT Numer Math. 1993; 33(2):285–303.
Gillespie DT. Stochastic simulation of chemical kinetics. Annu Rev Phys Chem. 2007; 58:35–55.
Petzold LR. Automatic selection of methods for solving stiff and nonstiff systems of ordinary differential equations. SIAM J Sci Stat Comp. 1983; 4:136–48.
Cash JR. Backward Differentiation Formulae In: Engquist B, editor. Encyclopedia of Applied and Computational Mathematics. Berlin Heidelberg: Springer: 2015. p. 97–101.
Gillespie DT. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J Comput Phys. 1976; 22:403–34.
Besozzi D. Reactionbased models of biochemical networks In: Beckmann A, Bienvenu L, Jonoska N, editors. Pursuit of the Universal. 12th Conference on Computability in Europe, CiE 2016, Proceedings. LNCS, vol. 9709. Switzerland: Springer: 2016. p. 24–34.
Nelson DL, Cox MM. Lehninger Principles of Biochemistry. New York: W. H. Freeman Co; 2004.
Voit EO, Martens HA, Omholt SW. 150 years of the mass action law. PLoS Comput Biol. 2015; 11(1):1004012.
Fehlberg E. Classical fifth, sixth, seventh, and eighthorder RungeKutta formulas with stepsize. NASA Tech Rep R287, NASA. 1968.
Cazzaniga P, Pescini D, Besozzi D, Mauri G, Colombo S, Martegani E. Modeling and stochastic simulation of the Ras/cAMP/PKA pathway in the yeast Saccharomyces cerevisiae evidences a key regulatory function for intracellular guanine nucleotides pools. J Biotechnol. 2008; 133(3):377–85.
Ackermann J, Baecher P, Franzel T, Goesele M, Hamacher K. Massivelyparallel simulation of biochemical systems. In: Proceedings of Massively Parallel Computational Biology on GPUs, Jahrestagung der Gesellschaft Für Informatik e.V: 2009. p. 739–50.
Nobile MS, Cazzaniga P, Besozzi D, Mauri G. GPUaccelerated simulations of massaction kinetics models with cupSODA. J Supercomput. 2014; 69(1):17–24.
Zhou Y, Liepe J, Sheng X, Stumpf MP, Barnes C. GPU accelerated biochemical network simulation. Bioinformatics. 2011; 27(6):874–6.
Komarov I, D’Souza RM. Accelerating the Gillespie exact stochastic simulation algorithm using hybrid parallel execution on graphics processing units. PLoS ONE. 2012; 7(11):46693.
Komarov I, D’Souza RM, Tapia J. Accelerating the Gillespie τleaping method using graphics processing units. PLoS ONE. 2012; 7(6):37370.
Gillespie DT. Exact stochastic simulation of coupled chemical reactions. J Phys Chem. 1977; 81(25):2340–361.
Gillespie DT, Petzold LR. Improved leapsize selection for accelerated stochastic simulation. J Chem Phys. 2003; 119:8229–234.
Amara F, Colombo R, Cazzaniga P, Pescini D, CsikászNagy A, Muzi Falconi M, Besozzi D, Plevani P. In vivo and in silico analysis of PCNA ubiquitylation in the activation of the Post Replication Repair pathway in S. cerevisiae. BMC Syst Biol. 2013; 7(1):24.
Besozzi D, Cazzaniga P, Pescini D, Mauri G, Colombo S, Martegani E. The role of feedback control mechanisms on the establishment of oscillatory regimes in the Ras/cAMP/PKA pathway in S. cerevisiae. EURASIP J Bioinform Syst Biol. 2012;2012(10).
Cazzaniga P, Nobile MS, Besozzi D, Bellini M, Mauri G. Massive exploration of perturbed conditions of the blood coagulation cascade through GPU parallelization. BioMed Res Int. 2014;2014. Article ID 863298.
Intosalmi J, Manninen T, Ruohonen K, Linne ML. Computational study of noise in a large signal transduction network. BMC Bioinforma. 2011; 12(1):1–12.
Pescini D, Cazzaniga P, Besozzi D, Mauri G, Amigoni L, Colombo S, Martegani E. Simulation of the Ras/cAMP/PKA pathway in budding yeast highlights the establishment of stable oscillatory states. Biotechnol Adv. 2012; 30:99–107.
Petre I, Mizera A, Hyder CL, Meinander A, Mikhailov A, Morimoto RI, Sistonen L, Eriksson JE, Back RJ. A simple massaction model for the eukaryotic heat shock response and its mathematical validation. Nat Comput. 2011; 10(1):595–612.
Chellaboina V, Bhat SP, Haddad WM, Bernstein DS. Modeling and analysis of massaction kinetics. IEEE Control Syst. 2009; 29(4):60–78.
Jackson KR. A survey of parallel numerical methods for initial value problems for ordinary differential equations. IEEE Trans Magn. 1991; 27(5):3792–797.
Mathews JH, Fink KD. Numerical Methods Using MATLAB. Upper Saddle River: PrenticeHall Inc; 2004.
Thohura S, Rahman A. Numerical approach for solving stiff differential equations: A comparative study. J Sci Front Res Math Decision Sci. 2013; 13:7–18.
Gear CW. The control of parameters in the automatic integration of ordinary differential equations. University of Illinois UrbanaChampaign. Int Rep File 757. 1968.
BenIsrael A. A NewtonRaphson method for the solution of systems of equations. J Math Anal Appl. 1966; 15(2):243–52.
Smooke MD. Error estimate for the modified Newton method with applications to the solution of nonlinear, twopoint boundaryvalue problems. J Optim Theory Appl. 1983; 39(4):489–511.
Bartels RH, Golub GH. The simplex method of linear programming using LU decomposition. Commun ACM. 1969; 12(5):266–8.
Nvidia: cuBLAS library 7.5. 2015.
Jones E, Oliphant T, Peterson P, et al.SciPy: Open source scientific tools for Python. 2001. http://www.scipy.org/.
Nobile MS, Cazzaniga P, Besozzi D, Pescini D, Mauri G. cuTauLeaping: A GPUpowered tauleaping stochastic simulator for massive parallel analyses of biological systems. PLoS ONE. 2014; 9(3):91963.
Wilt N. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Upper Saddle River: AddisonWesley; 2013.
Weiss JN. The Hill equation revisited: uses and misuses. FASEB J. 1997; 11(11):835–41.
Le Novère N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet. 2015; 16(3):146–58.
Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P, Kummer U. COPASI  a COmplex PAthway SImulator. Bioinformatics. 2006; 22(24):3067–074.
Butcher JC. Implicit RungeKutta processes. Math Comput. 1964; 18(85):50–64.
Prothero A, Robinson A. On the stability and accuracy of onestep methods for solving stiff systems of ordinary differential equations. Math Comput. 1974; 28(125):145–62.
Butcher JC. On the implementation of implicit RungeKutta methods. BIT Numer Math. 1976; 16(3):237–40.
Cumbo F, Nobile MS, Damiani C, Colombo R, Mauri G, Cazzaniga P. COSYS: A Computational Infrastructure for Systems Biology. LNCS, Springer,in press.
Acknowledgements
We acknowledge the CINECA award under the ISCRA initiative, for the availability of high performance computing resources and support. Authors would like to thank the SYSBIO.IT Centre of Systems Biology for the support.
Funding
Not applicable.
Availability of data and materials
LASSIE is a crossplatform software, i.e., it can be compiled and executed on the main operating systems: GNU/Linux, Microsoft Windows, Apple OS/X. LASSIE is written in CUDA, hence it requires a Nvidia GPU, with CUDA version 7.5 or higher. LASSIE’s source files and binary executable files, as well as examples of reactionbased models, are available on GITHUB: https://github.com/aresio/LASSIE.
Authors’ contributions
Conceived the idea: MSN. Designed the code: AT, MSN. Implemented the code and performed the experiments: AT. Analyzed the data: AT, MSN, DB, PC. Wrote the manuscript: AT, MSN, DB, PC. Critically read the manuscript and contributed to the discussion of the whole work: GM. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Authors and Affiliations
Corresponding author
Additional files
Additional file 1
LASSIE Graphical User Interface. (PDF 397 kb)
Additional file 2
LASSIE input files and command line arguments. (PDF 221 kb)
Additional file 3
Implementation of CUDA kernels for LASSIE execution workflow. (PDF 240 kb)
Additional file 4
Simulation accuracy of LASSIE. (PDF 172 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Tangherloni, A., Nobile, M., Besozzi, D. et al. LASSIE: simulating largescale models of biochemical systems on GPUs. BMC Bioinformatics 18, 246 (2017). https://doi.org/10.1186/s1285901716660
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901716660
Keywords
 Graphics Processing Unit
 GPU computing
 Reactionbased model
 Deterministic simulation
 Numerical integration method
 LSODA
 Nvidia CUDA
 Finegrained parallelization
 Systems biology
 Rulebased model