Automatising the analysis of stochastic biochemical time-series

Background: Mathematical and computational modelling of biochemical systems has seen considerable effort devoted to the definition and implementation of high-performance mechanistic simulation frameworks. Within these frameworks it is possible to analyse complex models under a variety of configurations, eventually selecting the best setting of, e.g., the parameters for a target system.

Motivation: This operational pipeline relies on the ability to interpret the predictions of a model, often represented as simulation time-series. An efficient data analysis pipeline is therefore crucial to automatise time-series analyses, bearing in mind that errors in this phase might mislead the modeller's conclusions.

Results: For this reason we have developed an intuitive, framework-independent Python tool to automate analyses common to a variety of modelling approaches. These include the assessment of useful non-trivial statistics for simulation ensembles, e.g., the empirical estimation of master equations. Intuitive and domain-independent batch scripts allow the researcher to automatically prepare reports, thus speeding up the usual model-definition, testing and refinement pipeline.


Model specification
We denote by Y_1(t) and Y_2(t) the number of preys and predators at time t, respectively, and model the following dynamics: (i) preys reproduce and die at rates α and α/γ, respectively; (ii) predators eat preys at rate β; (iii) predators die at rate δ.
A mean-field deterministic representation of the model is constituted by the following set of coupled differential equations:

dY_1/dt = α Y_1 (1 − Y_1/γ) − β Y_1 Y_2
dY_2/dt = β Y_1 Y_2 − δ Y_2        (1)

In this case the growth of the preys is logistic, αY_1(1 − Y_1/γ), meaning that the preys grow within an environment with a carrying capacity (or plateau) proportional to γ. The term δY_2 models the death of the predators. The eating of preys by predators is given by the non-linear term βY_1Y_2; such a process, which increases the predator population and decreases the prey population, abstracts the evolutionary cycle "hunt-eat-reproduce" of each individual.

It is possible to represent this model as a non-linear birth-death stochastic process, and simulate it with the standard Gillespie algorithm [2]. In this case it is convenient to use the following reaction notation, which is an equivalent representation of model (1):

Y_1 → 2Y_1, at rate α (prey birth)
2Y_1 → Y_1, at rate α/γ (density-dependent prey death)
Y_1 + Y_2 → 2Y_2, at rate β (predation)
Y_2 → ∅, at rate δ (predator death)

In the area of stochastic modelling this representation is often written algebraically via a stoichiometry matrix and a set of propensity functions. For the four reactions above, the stoichiometry rows over (Y_1, Y_2) are (+1, 0), (−1, 0), (−1, +1) and (0, −1), with propensities proportional to αY_1, (α/γ)Y_1(Y_1 − 1), βY_1Y_2 and δY_2, respectively.

The dataset analysed in the main text is generated by simulating the above stochastic model with NoisySIM [3] (time units are days). The dataset consists of 100 independent simulations, evaluated by independent NoisySIM runs, each starting from an initial population of 100 preys and 100 predators. Each simulation is stored by NoisySIM as a csv (comma separated values) file.
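As a concrete illustration of the stochastic representation above, the following sketch implements a standard Gillespie simulation of the four reactions with NumPy. Note that the parameter values below are placeholders chosen only to make the example run: they are not the values used to generate the NoisySIM dataset, which are given in the main text.

```python
import numpy as np

# Placeholder parameters for illustration only (NOT the NoisySIM values).
alpha, gamma_, beta, delta = 1.0, 1000.0, 0.005, 0.6

# Stoichiometry: one row per reaction, columns are (preys, predators).
stoich = np.array([[+1,  0],   # Y1 -> 2 Y1        (prey birth)
                   [-1,  0],   # 2 Y1 -> Y1        (density-dependent prey death)
                   [-1, +1],   # Y1 + Y2 -> 2 Y2   (predation)
                   [ 0, -1]])  # Y2 -> 0           (predator death)

def propensities(y1, y2):
    # Combinatorial factors are folded into the rate constants for simplicity.
    return np.array([alpha * y1,
                     (alpha / gamma_) * y1 * (y1 - 1),
                     beta * y1 * y2,
                     delta * y2])

def gillespie(y0, t_max, rng):
    """One SSA trajectory; returns (times, states) arrays."""
    t, y = 0.0, np.array(y0, dtype=int)
    times, states = [t], [y.copy()]
    while t < t_max:
        a = propensities(*y)
        a0 = a.sum()
        if a0 == 0:                      # absorbing state (extinction)
            break
        t += rng.exponential(1.0 / a0)   # time to next reaction
        r = rng.choice(len(a), p=a / a0) # which reaction fires
        y = y + stoich[r]
        times.append(t)
        states.append(y.copy())
    return np.array(times), np.array(states)

rng = np.random.default_rng(0)
ts, ys = gillespie((100, 100), t_max=5.0, rng=rng)
```

Each run produces an irregularly time-stamped trace, which is exactly the kind of raw csv output the rest of this section analyses.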

Dataset analysis with pyTSA
We show and comment on, line by line, the result of executing a slightly extended version of the pyTSA script presented in the main text on the input dataset described above. For explanatory purposes, we frame the analysis around the kind of questions one might raise when analysing data with pyTSA.
To start using the tool it is sufficient to open a standard Python environment and load the library with the following command:

> import pytsa as tsa

The output returned to the console by processing the script is omitted.

How do I load my data?
pyTSA assumes that:
• every model simulation is stored in a separate csv/tsv/sbrml/. . . file;
• if the file is csv, as is the case here, each column is a variable of the model; in this case one column is time, the free variable, and the two others are the time evolution of preys and predators.
These are the basic assumptions to run the tool, and further options for data-loading are explained in detail in the pyTSA manual. The input files, in this case, are plain three-column csv files (time, preys, predators). Notice that there is no requirement that the time-series are discretised, in time, with the same granularity: pyTSA will take care of sampling uniformly from each input time-series.
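pyTSA's own resampling code is not shown here; the sketch below illustrates the underlying idea, i.e., linearly interpolating irregularly discretised traces onto a common uniform grid. The two toy traces are hypothetical.

```python
import numpy as np

def resample_uniform(t, y, grid):
    """Linearly interpolate an irregular trace (t, y) onto a uniform grid."""
    return np.interp(grid, t, y)

# Two toy traces with different, irregular time discretisations.
t_a = np.array([0.0, 0.3, 1.1, 2.0])
y_a = np.array([100., 104., 98., 101.])
t_b = np.array([0.0, 0.8, 2.0])
y_b = np.array([100., 95., 99.])

grid = np.linspace(0.0, 2.0, 5)  # common uniform grid
resampled = np.vstack([resample_uniform(t_a, y_a, grid),
                       resample_uniform(t_b, y_b, grid)])
# Both traces are now directly comparable column-by-column.
```

Once all traces share a grid, ensemble statistics become simple column-wise operations.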
The input dataset can be loaded (we assume the dataset to be stored in the current folder), and the mnemonic names time, Preys and Predators can be assigned to the columns. In this case we are loading a simple set of .csv files, so no other options are required. Once data is loaded, pyTSA can be set to export the plots we show below in different formats (csv/tsv/sbrml/. . .), as discussed in the manual (not shown here).
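pyTSA's actual loading call is documented in its manual and not reproduced here. As an illustration of the same step, the sketch below reads one three-column simulation file with pandas, using the mnemonic column names from the text; the file content is a toy stand-in, not real NoisySIM output.

```python
import io
import pandas as pd

# Toy stand-in for the content of one simulation csv file
# (three unnamed columns: time, preys, predators).
raw = "0.0,100,100\n0.5,103,97\n1.2,99,95\n"

columns = ["time", "Preys", "Predators"]  # mnemonic names from the text
trace = pd.read_csv(io.StringIO(raw), header=None, names=columns)
# A full dataset is then just a list of such frames, one per simulation file,
# e.g. built by looping over glob.glob("*.csv").
```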
Can I view the set of traces in my dataset?
Of course you can plot the data as it is. In particular one can plot all the time-series, as a function of time, and as 2D/3D phase spaces. To optimise visualisation, plots can be restricted to consider only a subset of the loaded columns or, for instance, a small time-interval.
For the sake of readability, in the following plots we show a subset of the input dataset (obtained by loading only a small percentage of the 100 input files); we restrict to time t ≤ 1000 for the plain time-series and to t < 100 for the state space, and we plot the preys/predators in two separate panels (via the flag merge). These plots suggest an oscillatory behaviour predicted by the model, which can be further investigated with pyTSA.
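pyTSA produces these plots internally; the sketch below reproduces the two-panel layout with matplotlib on synthetic oscillatory traces, merely to illustrate the kind of figure being discussed (none of the values come from the real dataset).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
t = np.linspace(0.0, 1000.0, 500)
# Synthetic traces standing in for a few loaded simulations.
preys = [200 + 80 * np.sin(2 * np.pi * t / 100 + ph) + rng.normal(0, 10, t.size)
         for ph in (0.0, 0.4, 0.8)]
predators = [100 + 40 * np.sin(2 * np.pi * t / 100 + ph - 1.0) + rng.normal(0, 8, t.size)
             for ph in (0.0, 0.4, 0.8)]

# Two separate panels, analogous to setting pyTSA's merge flag off.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for y in preys:
    ax1.plot(t, y, lw=0.8)
for y in predators:
    ax2.plot(t, y, lw=0.8)
ax1.set(title="Preys", xlabel="time (days)")
ax2.set(title="Predators", xlabel="time (days)")
fig.savefig("traces.png")
```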
What is the behaviour of the average number of preys/predators?
By averaging many time-series, pyTSA evaluates, at each time t, the expectation E[·] of a system variable Y_i, as well as its standard deviation σ(Y_i). These measures can be plotted with bars or lines, restricted to some columns or time-interval, and evaluated in the phase space, as shown below. These plots suggest that the average number of preys is double the average number of predators. This might suggest that, in general, predators oscillate, with high probability, within lower values than preys. This fact can be investigated by using the empirical estimation of the prey/predator probabilities with pyTSA.
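Once all traces share a uniform time grid, these ensemble statistics reduce to column-wise means and standard deviations. A minimal sketch on a toy ensemble (one species, Poisson noise standing in for simulation output):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy ensemble: 100 simulations x 50 uniform time points, for one species.
ensemble = rng.poisson(lam=100, size=(100, 50)).astype(float)

mean_t = ensemble.mean(axis=0)  # E[Y_i] at each time point
std_t = ensemble.std(axis=0)    # standard deviation at each time point
```

Plotting mean_t with std_t as error bars reproduces the kind of figure discussed here.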
What is the probability of finding "k" preys/predators over time?
Since the model was repeatedly simulated from its initial state, pyTSA can estimate (empirically) the probability distribution of each system variable Y_i taking some value k, at a specific time-point or over a whole time-interval. In a stochastic model such as the one described here, this is equivalent to estimating, numerically, the solution of the master equation of the system.
For instance, we might want to assess, at t = 100, the probability distribution of the number of preys and predators, and we might want to fit it (if unimodal) with a Gaussian distribution. Also, we might be interested in assessing the same kind of information for preys at various time-points, say t ∈ {25, 50, 100}. The results of the Gaussian fit (µ and σ²) for the left plot are returned by pyTSA via the console. Now, by looking at these distributions it should be evident that after 100 days the populations of preys and predators seem robust, meaning that the probability of any Y_i reaching 0 appears negligible. However, we might be interested in understanding whether and how the system oscillates over time, and whether it reaches any critical boundary region with, e.g., a very low number of individuals. Assessing this is of special interest in this case since, even if the model seems to forecast sustainability, unpredictable events (e.g., an epidemic or a drought, which are not explicitly accounted for within the model equations) happening when the species are in the critical regions might lead to extinction.
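The two analyses described here, the empirical distribution at a fixed time-point and its Gaussian fit, can be sketched as follows on toy data; for a Gaussian model the maximum-likelihood estimates of µ and σ² are simply the sample mean and (biased) sample variance.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy stand-in for the ensemble values of Preys at t = 100 (one per run).
preys_at_t = rng.normal(loc=200.0, scale=15.0, size=100)

# Empirical probability distribution (normalised histogram).
counts, edges = np.histogram(preys_at_t, bins=20, density=True)

# Gaussian fit: ML estimates of mu and sigma^2.
mu_hat = preys_at_t.mean()
var_hat = preys_at_t.var()
```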
In pyTSA you can evaluate a time-varying probability distribution (as a heatmap or a 3D surface) to answer such questions. For instance, below we show such a distribution in the interval 0 ≤ t ≤ 100, as a heatmap (in two separate panels, one per species), and for Preys as a 3D surface plot.

Empirical Probability Distribution in [0,100]
These two plots are obtained by executing the following commands:

> mydata.meq2d(100, merge=False, start=0, stop=100)
> mydata.meq3d('Preys', start=0, stop=100)

By analysing the above plots it can be observed that the system oscillates with a period of around 100 days, and that the number of predators reaches a critical region (Y_2 < 20) with non-negligible probability.
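The time-varying distribution behind such a heatmap is simply a histogram per time column; with a row-per-run ensemble matrix one can also directly estimate the probability of the critical region Y_2 < 20 at each time-point. A sketch on synthetic oscillatory data (not the NoisySIM output):

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy ensemble for Predators: 100 runs x 101 uniform time points,
# oscillating with period 100 plus noise.
t_grid = np.linspace(0.0, 100.0, 101)
ensemble = (60 + 50 * np.sin(2 * np.pi * t_grid / 100.0)
            + rng.normal(0, 5, (100, 101)))

# Column-wise histograms give the heatmap: one distribution per time point.
bins = np.linspace(ensemble.min(), ensemble.max(), 41)
heatmap = np.stack([np.histogram(ensemble[:, j], bins=bins, density=True)[0]
                    for j in range(ensemble.shape[1])], axis=1)

# Probability of the critical region Y2 < 20, as a function of time.
p_critical = (ensemble < 20).mean(axis=0)
```

Plotting heatmap with imshow (value on the y-axis, time on the x-axis) gives the kind of figure shown above.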