Automatising the analysis of stochastic biochemical timeseries
 Giulio Caravagna^{1}Email author,
 Luca De Sano^{1} and
 Marco Antoniotti^{1}
https://doi.org/10.1186/1471210516S9S8
© Caravagna et al.; licensee BioMed Central Ltd. 2015
Published: 1 June 2015
Abstract
Background
Mathematical and computational modelling of biochemical systems has seen a lot of effort devoted to the definition and implementation of highperformance mechanistic simulation frameworks. Within these frameworks it is possible to analyse complex models under a variety of configurations, eventually selecting the best setting of, e.g., parameters for a target system.
Motivation
This operational pipeline relies on the ability to interpret the predictions of a model, often represented as simulation timeseries. Thus, an efficient data analysis pipeline is crucial to automatise timeseries analyses, bearing in mind that errors in this phase might mislead the modeller's conclusions.
Results
For this reason we have developed an intuitive frameworkindependent Python tool to automate analyses common to a variety of modelling approaches. These include assessment of useful nontrivial statistics for simulation ensembles, e.g., estimation of master equations. Intuitive and domainindependent batch scripts will allow the researcher to automatically prepare reports, thus speeding up the usual modeldefinition, testing and refinement pipeline.
Keywords
Background
The study of biological systems witnessed a prominent crossfertilisation between experimental investigation and computational methods, thanks to many different modelling approaches developed within different research areas (e.g, biophysics, computational biology, logics) to describe natural systems [1]. Nowadays, many such studies can be ascribed to the area of Systems Biology, an approach characterised by studying complex interactions in natural systems and their properties as wholes, not as collections of parts [2, 3]. This holistic discipline requires us hence to consider a system at multiple simultaneous abstraction levels, e.g., from RNA to proteins up to abundant chemical signals.
In mathematical sciences various theoretical frameworks for mechanistic modelling, which can be ascribed to the broad categories of meanfield equations or stochastic processes, have been introduced to capture the physics underlying a target system, at some level of abstraction. These frameworks provide fundamental ingredients to achieve successful results in Systems Biology. These ingredients are, for instance, the ability to naturally describe multiscale interactions and components by combining different mathematical frameworks (more often termed hybrid) in a sort of orthogonal approach [4, 5].
In the specific context of natural systems, another key feature of successful frameworks is the ability to consider stochasticity. This, meant either as the random fluctuations intrinsic to the model components present in few copies or as the fluctuations induced by extrinsic sources, is recognised to have a fundamental role in many living processes [6, 7]. Stochastic phenomena arise, for instance, in the bursts of protein transcription (molecular level), in cellfate decision processes of differentiation (cellular level) and in evolutionary transitions (population level) [8–13]. Thus, interest has increased towards modelling frameworks which support one or more forms of stochasticity, e.g., Markovian Gillespielike approaches, stochastic differential equations or nonMarkovian hybrid automata to name but a few [14–21].
In general, regardless of the way in which stochasticity is embedded in a framework, many analyses rely on the evaluation of simulation ensembles, under different model configurations and parameters. Then, an efficient data analysis pipeline is set up to produce reports of the model predictions usually in the form of, e.g., probability measures which emerge "naturally" by the intrinsic random nature of the model itself. Evaluating such quantities sometimes requires a noteworthy computing time, a price that we need to pay to assess whether a model reaches an equilibrium behaviour, or if stochastic bifurcations, resonances or multistable phenomena arise.
All in all, according to the physics of the system under study (e.g., number of components, the type of interactions, and thermodynamic setting) a suitable modelling framework can be chosen and, quite likely, a simulation tool can be found in the literature. Sometimes these are coded within prepackaged scientific tools such as Matlab, R or Mathematica. However, these generalpurpose frameworks offer many functions beyond the scope of plain simulation, and thus one often codes tools in a lowerlevel programming environment to boost performance, a key issue [22–27].
Historically most of the efforts have been devoted to the definition of highperformance simulation frameworks, while less has been done to automate the inherently timeconsuming and definitely errorprone task of data analysis. Thus, adhoc analysis tools are often implemented to process the output of a variety of simulators/experiments and produce the required statistics. Obviously, when this pipeline is either inefficient or incorrectly implemented hours of computationintensive calculations can be lost, and a whole research activity slowed down. Of course, the generalpurpose tools mentioned above, e.g., Matlab, R or Mathematica can be used to perform analyses of timecourse simulation results. However, they require extra learning time to master and, in general, provide many complex functions actually unused for this purpose.
PYTSA: a Python TimeSeries Analyser
The current PYTSA version introduces the notion of dataset as a set of files in some folder, and assumes that each file represents a independent timeevolution of a model. To boost loading of huge datasets, a multiprocess implementation is adopted. In each data file, columns represent the values of a model variable to which a mnemonic name can be assigned, for clarity (see the Example below). Notice that, in principle, this input format is domainindependent, thus nonbiological data can be analysed as well. The library supports standard plotting routines and output formats, which either manipulate or not data. In the former case one might just visualise data as, e.g., single traces or 2D/3D phasespaces for some time window or model variable. In the latter case one could estimate averages, standard deviations, or probabilities of any variable. Note that probability densities of a variable either at some specific time point, or over a time interval, provide an estimation of the (solution of the) master equation for the model.
This equation, central to stochastic models, see e.g. [14], describes the probability of observing a configuration of the model over time and, in general, its solution is assessable solely via numerical simulation ensembles. If, for example, one were to model a gene regulatory network, questions such as "what is the probability of having a certain RNA concentration after some time units of network activity" could be answered by using this equation  in the Supplementary Material we show a simple example of its estimation for the case study described in the main text.
PYTSA plots these statistics with multipanel visualisation, barplots, heatmaps, 3D surfaces, normalisation and gaussianfit. All in all, despite the statistics implemented in PYTSA being rather intuitive, we believe that researchers would benefit from a tool specifically tailored to automatise, in a easy fashion and with a domainspecific language, data analysis of stochastic timetraces. Hopefully, this should eventually allow the researcher to focus on research tasks beyond data analyses, thus speeding up the usual pipeline of modeldefinition, testing and refinement.
Finally, efforts towards a more standardised definition of what a "simulation" result are under way. For instance, the SBML community has been working on several standards that aim to connect models, datasets and simulations (cfr., SEDML [29], SBRML and Teddy [30]). PYTSA fits into these frameworks by already supporting SBRML in its current version, and it will be further developed to adhere to other standards. To conclude, PYTSA was conceived to analyse systems where the variation of concentration of some molecules is reported, without spatial information. Spatial models which describe, e.g., tissues and organs, have usually more complex output formats according to the notion of "space" they implement, e.g. [24, 31–35]. If proved successful, our tool could be extended to support spatial outputs in the next versions.
Implementation
Dataprocessing PYTSA functions
Function  Synopsis 

timeseries  load a timeseries from a single file (output of a single simulation) 
dataset  load a dataset of timeseries (output of repeated simulations) 
splot  plot plain timeseries (without any processing) 
aplot, sdplot, asdplot  plot average, standarddeviation and both of them for a dataset 
phspace2d, phspace3d  plot 2D/3D phasespaces for plain timeseries 
aphspace2d, aphspace3d  plot 2D/3D phasespaces for the average of a dataset 
pdf, pdf3d  plot the probability density of a model variable at one or more (3D) timepoints (requires a dataset) 
meq2d, meq3d  estimates the master equation solution for a model variable in the form of timevarying probability density for a timeinterval (heatmap 2D or surface 3D, requires a dataset) 
The PYTSA implementation consists of 4 classes (see the technical specifications at [36]). A forkbased implementation ensures an optimal parallel dataloading based on the number of available cores, and a cachebased implementation avoids the repetition of a computation already performed on some data, to optimise speed. Preliminary testing shows that on a 4core machine parallel dataloading is 300% faster than with a sequential one. Also, loading of datasets of size exceeding 10 GB seems prohibitively slow without a parallel implementation. Further memoryoptimisation is given by NUMPY and PANDAS to efficiently store arrays and timeseries, meanwhile processing data. Also, statistical routines provided by SCIPY allow one to efficiently evaluate fits of histograms and binning (i.e., a form of quantisation used to aggregate values which fall in a given small interval, a bin, to obtain a value representative of that interval) and, finally, by using MATPLOTLIB, advanced plotediting capabilities are possible (i.e., a visualised plot can be modified by using the underlying PYTSA tool and its imported libraries, allowing the advanced user to benefit from the power of all the tools interfaced with PYTSA).
Results
We denote with X the model of a system under study, whose timeevolution can be either deterministic, stochastic or hybrid (see Figure 1). The point is to gather information from a dataset of realisations of X which describe its variables in the form of a timetrace. These datasets are often stored to disks and processed offline, and sometimes several hours are required to set up the analysis framework, load, process and interpret data. This process, very sensitive to errors, is eased by PYTSA.
Aside from plain traces visualisation, numerical statistics can be evaluated:

the average$\u27e8\mathsf{\text{X}}t\u27e9$ of n traces, and its standard deviation${\sigma}_{X}$;

the probability density at a time t of X's variables, $\mathcal{P}\left[\mathsf{\text{X}}\left(t\right)=\mathsf{\text{X}}\right]$, i.e. the probability of the system to be in a configuration × at time t;

the above probability in a time interval. This, is analogous to estimating the solution of the master equation$\partial t\mathcal{P}\left[\mathsf{\text{x}},t\right]$ which rules the timeevolution of the probability of the system to be in some configuration × [42].
Example report generation
The LotkaVolterra predatorprey models are a family of models describing the dynamics of competitive populations living in an environment. These are based on the competition between species together with their evolution and, because of their generality, these equations are often used to model, e.g., microbial population dynamics. When one analyses these kind of models is often interested in finding parameter settings which guarantee the environmental sustainability, i.e. the situation in which the two species oscillate in time, without extinction.
Here we show a simple analysis of a dataset describing 100 independent simulations of the model realised with NOISYSIM [26]. In the Additional file 1 we provide details on the model specification, the parameters used for its simulation and other kind of questions which can be easily answered with PYTSA's analysis capabilities. To analyse stability of the ecosystem we use, as rough measure, the estimation of the timevarying probability of preys/predators in the time interval [0, 100]. We show how this can be easily assessed with the following 5line PYTSA script:
1 import pytsa as tsa # load pyTSA
# import a dataset and name the columns
3 mydata = tsa.dataset(".", colnames =["time","Preys","Predators"])
# just visualize the loaded traces for t< = 1000
5 mydata.splot(columns =["Preys","Predators"], stop = 1000)
# plot the species probabilities (normalized and fit) at t = 100
7 mydata.pdf(100, columns =["Preys","Predators"], normed=True, fit=True)
# estimate the master equation in [0, 100], show it as a 2D heatmap
9 mydata.meq2d(start = 0, stop = 100)
Notice in that case the definition of the mnemonic names Preys and Predators (at dataloading time) and in the successive plot commands splot, pdf and meq2d which produce the plots shown in the Figure 2 (script processing time approx. 20 sec). When this report is generated further information is also returned within the Python environment, e.g., the minimum and maximum values for each species, the parameters µ and σ^{2} of the Gaussian fit (as performed by the scipy library with a numerical algorithm which uses explicit formulas for the maximum likelihood estimation of the parameters).
The estimation of the solution of the master equation for this system allows the modeller to make inferences about stability of the populations. In particular, as it can be observed in the figure, for the parameters used to carry out the simulations and for the considered timewindow the species oscillate with a period of about 80 days. Also, predators reach low values with nonnegligible probability, i.e., observe the area where Y_{2} < 20, which might threaten predators' survival. With this in mind it is possible to raise and answer questions such as: "What is the probability of observing less than 10 predators in the first 100 days of simulation?". Similar questions become immediate once a graphical representation of the involved probabilities is available and, in a modelrefinement loop as the one depicted in Figure 1, might lead to further questions such as "Can we find parameter values which guarantee that the probability of observing less than 10 predators is less than .001?".
Table 2
Availability and requirements  
Project name:  PYTSA (Python Timeseries Analyzer) 
Version:  0.3.8 
Homepage:  http://bimib.disco.unimib.it/ 
Operating systems:  platform independent 
Programming language:  Python 
Requirements:  Python (≥ v. 2.7), NUMPY (≥ v. 1.6.1), 
SCIPY (≥ v. 0.10.1), MATPLOTLIB (≥ v. 1.3.0),  
PANDAS (≥ v. 0.12.0) and PYTABLES (≥ v. 2.3.1).  
License:  BSD 3clause ("BSD new", 1999) 
Conclusions
In the light of automatising the common approaches to the analysis of models of biological systems, we introduced the reader to PYTSA, a novel Python TimeSeries Analyser to analyse synthetic/wet biological timecourse simulation experiments. This lightweight, simulationframeworkindependent and focused library combines several analyses, with intuitive commands, which can be pipelined with any simulation tool, allowing to generate reports in a very intuitive way. Also, PYTSA supports SBRML to load/export analysis results in a format likely to become a standard, possibly allowing the tool to be pipelined further.
Despite being in its infancy, this tool sets the basis of a novel open source dataanalysis framework for the community. PYTSA relies on standard Python APIsforscientific computing to provide an offtheshelf reusable component for the analysis of timeseries, that fits between integrated simulation/analysis environments and "general" tools like Matlab and Mathematica. Its future implementations might enlarge the class of supported input timeseries to account, for instance, for spatial models [31] which are generally characterised by more complex output types. Similarly, web services might be implemented for remoteanalysis capabilities, and support of dataexchange standard formats beyond SBRML will be considered, so to fit PYTSA within the"ecosystem" established by ongoing standardisation efforts [29, 30].
Declarations
Acknowledgements
This project was partially supported by the ASTIL program, project "RetroNet", grant n. 124514800040; U.A 053, and by NEDD Project [ID14546A Rif SAL7] Fondo Accordi Istituzionali 2009.
Declarations
Publications costs for this article were funded by Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di MilanoBicocca, Italy.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 9, 2015: Proceedings of the Italian Society of Bioinformatics (BITS): Annual Meeting 2014: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S9.
Authors’ Affiliations
References
 Bower J, Bolouri H: Computational Modeling of Genetic and Biochemical Networks. 2001, MIT press, MassachussetsGoogle Scholar
 Kitano H: Foundations of Systems Biology. 2001, MIT Press, MassachussetsGoogle Scholar
 Kitano H: Computational systems biology. Nature. 2002, 420 (6912): 206210. 10.1038/nature01254.View ArticlePubMedGoogle Scholar
 Southern J, PittFrancis J, Whiteley J, Stokeley D, Kobashi H, Nobes R, et al: Multiscale computational modelling in biology and physiology. Progress in Biophysics and Molecular Biology. 2008, 96 (13): 6089. 10.1016/j.pbiomolbio.2007.07.019.View ArticlePubMedGoogle Scholar
 Dada JO, Mendes P: Multiscale modelling and simulation in systems biology. Integr Biol (Camb). 2011, 3 (2): 8696. 10.1039/c0ib00075b.View ArticleGoogle Scholar
 Swains PS, Elowitz MB, Siggia ED: Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc Natl Acad Sci U S A. 2002, 99 (20): 1279512800. 10.1073/pnas.162041399.View ArticleGoogle Scholar
 Wilkinson DJ: Stochastic modelling for quantitative description of heterogeneous biological systems. Nature Reviews Genetics. 2009, 10 (2): 122133. 10.1038/nrg2509.View ArticlePubMedGoogle Scholar
 Mc Adams HH, Arkin A: Stochastic mechanisms in gene expression. Proc Natl Acad Sci U S A. 1997, 94 (3): 814819. 10.1073/pnas.94.3.814.View ArticleGoogle Scholar
 Blake WJ, Kaern M, Cantor CR, Collins JJ: Noise in eukaryotic gene expression. Nature. 2003, 422 (6932): 633637. 10.1038/nature01546.View ArticlePubMedGoogle Scholar
 Lestas I, Paulsson J, Vinnicombe G: Noise in gene regulatory networks. IEEE Transactions on Automatic Control. 2008, 53 (Special Issue): 189200.View ArticleGoogle Scholar
 Hoffman M, Chang HH, Huang S, Ingber DE, Loeffler M, Galle J: Noise driven stem cell and progenitor population dynamics. PLoS ONE. 2008, 3 (8): e292210.1371/journal.pone.0002922.View ArticleGoogle Scholar
 Losick R, Desplan C: Stochasticity and cell fate. Science. 2008, 320 (5872): 6568. 10.1126/science.1147888.PubMed CentralView ArticlePubMedGoogle Scholar
 Eldar A, Elowitz MB: Functional roles for noise in genetic circuits. Nature. 2010, 467 (7312): 167173. 10.1038/nature09326.PubMed CentralView ArticlePubMedGoogle Scholar
 Gillespie DT: A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics. 1976, 22 (4): 403434. 10.1016/00219991(76)900413.View ArticleGoogle Scholar
 Gillespie DT: Exact stochastic simulation of coupled chemical reactions. J Phys Chem. 1977, 81 (25): 23402361. 10.1021/j100540a008.View ArticleGoogle Scholar
 Gillespie DT: Approximate accelerated stochastic simulation of chemically reacting systems. J Chem Phys. 2001, 115 (4): 17161711. 10.1063/1.1378322.View ArticleGoogle Scholar
 Cao Y, Gillespie D, Petzold L: The slowscale stochastic simulation algorithm. J Chem Phys. 2005, 122: 01411610.1063/1.1824902.View ArticleGoogle Scholar
 Horsthemke W, Lefever R: NoiseInduced Transitions: Theory and Applications in Physics, Chemistry, and Biology. 2006, Springer, BerlinHeidelbergGoogle Scholar
 Salis H, Kaznessis YN: Accurate hybrid stochastic simulation of a system of coupled chemical or biochemical reactions. J Chem Phys. 2005, 122 (5): 5410310.1063/1.1835951.View ArticlePubMedGoogle Scholar
 Caravagna G, Mauri G, d'Onofrio A: The interplay of intrinsic and extrinsic bounded noises in biomolecular networks. PLoS ONE. 2013, 8 (2): e5117410.1371/journal.pone.0051174.PubMed CentralView ArticlePubMedGoogle Scholar
 Caravagna G, d'Onofrio A, Antoniotti M, Mauri G: Stochastic hybrid automata with noninstantaneous transitions to model biochemical systems with delays. Information and Computation. 2014, 236: 1934.View ArticleGoogle Scholar
 Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, et al: COPASI: a COmplex PAthway SImulator. Bioinformatics. 2006, 22 (24): 30673074. 10.1093/bioinformatics/btl485.View ArticlePubMedGoogle Scholar
 Ramsey S, Orrell D, Bolouri H: Dizzy: stochastic simulation of largescale genetic regulatory networks. Journal of Bioinformatics and Computational Biology. 2005, 3 (2): 415436. 10.1142/S0219720005001132.View ArticlePubMedGoogle Scholar
 Miller A, Marsh J, Reeve A, Garny A, Britten R, Halstead M, et al: An overview of the CellML API and its implementation. BMC Bioinformatics. 2010, 11 (1): 17810.1186/1471210511178.PubMed CentralView ArticlePubMedGoogle Scholar
 Sanft KR, Wu S, Roh M, Fu J, Lim RK, Petzold LR: Stochkit2: software for discrete stochastic simulation of biochemical systems with events. Bioinformatics. 2011, 27 (17): 24572458. 10.1093/bioinformatics/btr401.PubMed CentralView ArticlePubMedGoogle Scholar
 Caravagna G, Mauri G, d'Onofrio A: NoisySim: exact simulation of stochastic chemically reacting systems with extrinsic noises. Proc of the Symposium on Theory of Modeling and Simulation, Society for Computer Simulation International. 2013, 12:Google Scholar
 Hogg JS, Harris LA, Stover LJ, Nair NS, Faeder JR: Exact hybrid particle/population simulation of rulebased models of biochemical systems. PLoS Computational Biology. 2014, 10 (4): e100354410.1371/journal.pcbi.1003544.PubMed CentralView ArticlePubMedGoogle Scholar
 Dada J, Spasic I, Paton NW, Mendes P: SBRML: a markup language for associating systems biology data with models. Bioinformatics. 2010, 26 (7): 932938. 10.1093/bioinformatics/btq069.View ArticlePubMedGoogle Scholar
 Waltemath D, Adams R, Bergmann F, Hucka M, Kolpakov F, Miller A, et al: Reproducible computational biology experiments with SEDML  the simulation experiment description markup language. BMC Systems Biology. 2011, 5: 19810.1186/175205095198.PubMed CentralView ArticlePubMedGoogle Scholar
 Courtot M, Juty N, Knüpfer C, Waltemath D, Zhukova A, Dräger A, Dumontier M, et al: Controlled vocabularies and semantics in systems biology. Molecular Systems Biology. 2011, 7: 543PubMed CentralView ArticlePubMedGoogle Scholar
 Cowan AE, Moraru II, Schaff JC, Slepchenko BM, Loew LM: Spatial modeling of cell signaling networks. Methods Cell Biol. 2012, 195221. 110Google Scholar
 Bartocci E, Cherry EM, Glimm J, Grosu R, Smolka SA, Fenton FH: Toward realtime simulation of cardiac dynamics. Proceedings of the 9th International Conference on Computational Methods in Systems Biology. 2011, 103112.Google Scholar
 Graudenzi A, Caravagna G, De Matteis G, Antoniotti M: Investigating the relation between stochastic differentiation, homeostasis and clonal expansion in intestinal crypts via multiscale modeling. PLoS ONE. 2014, 9 (5): e9727210.1371/journal.pone.0097272.PubMed CentralView ArticlePubMedGoogle Scholar
 Klipp E, Liebermeister W, Helbig A, Kowald A, Schaber J: Standards in computational systems biology. 2007Google Scholar
 Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, et al: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics. 2003, 19 (4): 524531. 10.1093/bioinformatics/btg015.View ArticlePubMedGoogle Scholar
 The PYTSA Project Is Hosted at. [http://bimib.disco.unimib.it/]
 The NUMPY Project Is Hosted at. [http://www.numpy.org]
 The SCIPY Project Is Hosted at. [http://www.scipy.org]
 The MATPLOTLIB Project Is Hosted at. [http://matplotlib.org]
 The PANDAS Project Is Hosted at. [http://pandas.pydata.org]
 The PYTABLES Project Is Hosted at. [http://www.pytables.org]
 Gardiner C: Stochastic Methods: A Handbook for the Natural and Social Sciences. 2009, Springer, BerlinHeidelbergGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.