A Bayesian approach for parameter estimation in the extended clock gene circuit of Arabidopsis thaliana
© Higham and Husmeier; licensee BioMed Central Ltd. 2013
Published: 12 August 2013
The circadian clock is an important molecular mechanism that enables many organisms to anticipate and adapt to environmental change. Pokhilko et al. recently built a deterministic ODE mathematical model of the plant circadian clock in order to understand the behaviour, mechanisms and properties of the system. The model comprises 30 molecular species (genes, mRNAs and proteins) and over 100 parameters. The parameters have been fitted heuristically to available gene expression time series data and the calibrated model has been shown to reproduce the behaviour of the clock components. Ongoing work is extending the clock model to cover downstream effects, in particular metabolism, necessitating further parameter estimation and model selection. This work investigates the challenges facing a full Bayesian treatment of parameter estimation. Using an efficient adaptive MCMC proposed by Haario et al. and working in a high performance computing setting, we quantify the posterior distribution around the proposed parameter values and explore the basin of attraction. We investigate if Bayesian inference is feasible in this high dimensional setting and thoroughly assess convergence and mixing with different statistical diagnostics, to prevent apparent convergence in some domains masking poor mixing in others.
Parameter values used in these studies were either constrained (based on the available data) or fitted to multiple time series data sets (for full details see Supplementary Table S1 in ). Previously, Locke et al. constructed a cost function to quantify the agreement between an earlier version of the model and various key experimental features . They undertook an efficient global search of parameter space and showed that this optimized solution fits several but not all the experimental features.
The aim of this work is to set the scene for full Bayesian inference of the model parameters using state-of-the-art MCMC techniques. We explore the landscape of the posterior distribution around the model solution and investigate the challenges in finding the posterior distribution from increasingly distant initial starting positions in parameter space. We consider the implications of our findings on the design of efficient parameter estimation schemes for components of the circadian clock.
Mathematical model and data
where n13, m34,35, p21,23,25 and g2,6 represent the rate constants of transcription, degradation, protein translation /modification /complex formation /translocation between nucleus and cytoplasm and Michaelis-Menton constants, respectively. The full set of differential equations can be found in the supplementary material in Pokhilko et al. .
The use of synthetic data in this way enables us to assess the accuracy of the inference prediction. White Gaussian noise was added to this time series data to obtain a signal to noise ratio (SNR) of around 100. Fifty time points, one every hour over a two day period was considered realistic and suitably rich to capture key features of the data, namely the oscillatory period, amplitude and phase.
We derived posterior distributions for the model parameters under a Bayesian framework using the efficient adaptive Markov chain Monte Carlo (MCMC) algorithm described by Haario et al., implemented using their MATLAB code . This method uses a multivariate Gaussian proposal to move the exploratory chains through posterior spaces which may contain ridges or other challenging features - very likely when, as in our case, the number of parameters is high and/or parameters are correlated.
where ψn(θ) is the model output corresponding to the n th data point, yn. This score is equivalent to the negative log likelihood under a homoscedastic additive i.i.d. Gaussian noise model . For the prior, we assumed an improper uniform distribution. This is the worst-case scenario corresponding to the complete absence of complementary biological information, which was chosen deliberately so as to obtain a conservative lower bound on both the parameter estimation accuracy as well as the rate of convergence.
The MCMC chains should be run until they have satisfactorily converged. A suitable number of steps is not typically known in advance and hence convergence diagnostics are used to monitor convergence. Our chains were run for approximately 106 iterations. Four sub chains were used to establish the potential scale reduction factor (PSRF). The PSRF is a ratio reflecting the between chain variance and the within chain variance. This and other methods are discussed in . If the PSRF is large then either our estimate of variance can be decreased by more simulations or the within chain variance will increase since the simulated sequences have not yet made a full tour of the target distribution. Generally a value of 1.1 is taken to indicate reasonable confidence that the chains have converged. PSRF is estimated in parameter space but we also use it to consider whether the sum-of-squares function, equation 3 has converged in data space. The multivariate analogue (MPSRF), essentially an upper bound of PSRF, was used to consider convergence at the higher order . Posterior measures, including mean and variance are based on every 100th iteration of the last 105 iterations, pruned to reduce levels of autocorrelation.
Obtaining the numerical solution of the ODE, ψ n (θ), is an expensive component of the overall computational task so we accelerated its calculation using high speed ODE simulations available from SBTOOLBOX2 and SBPD . On average this made the calculations between 10 and 20 times faster than those based on MATLAB's built-in ODE solvers. Computations were designed to run on a multi-nodal cluster using MATLAB's parallelization facilities.
Quantification of parameter posteriors starting at true parameter values
In our first test, denoted Experiment 1, we start the MCMC exploratory chains at the model's true values and observe how far the chains will travel before they reach a stationary phase. The Euclidean distance between the posterior mean and the true value, for each parameter, is indicative of the attractive pull of the model solution. The variance of the posterior distribution summarizes the inherent uncertainty in the system. Noise in the system, arising from measurement errors, may mean that the maximum a posteriori does not lie directly over the true value.
Summary of convergence diagnostics in parameter and data space
Euclidean distance from true values at start ED0
Species Analysis PSRF computed from sum-of-squares trace plots in data space
Parameter Analysis PSRF computed from MCMC chains in parameter space
Euclidean distance between true values and posterior mean EDm
%True parameter values lying outside 5th - 95th percentile of posterior distribution
%Species PSRF <1.1
%Parameters PSRF <1.1
1. No perturbation
2. Perturbed variance = 0.0052
3. Perturbed variance = 0.012
4. Perturbed variance= 0.0152
5. Perturbed variance= 0.022
6. Perturbed variance= 0.0252
7. Perturbed variance = 0.032
8. Perturbed variance = 0.0352
9. Perturbed variance = 0.052
10. Perturbed variance = 0.12
Perturbance of the starting position further defines attraction of the posterior basin
Next, we perturbed the starting point of the exploratory adaptive MCMC chains by sampling from a Normal distribution centred on the true value with variance increasing in small steps from variance= 0.0052 (corresponding to ED0 = 0.06) to variance= 0.12 (ED0 = 0.98). We denote these experiments as Experiments 2 to 10, see Table 1. We then tried choosing initial starting parameters from a Gamma distribution, Γ (1, 2) and Γ (2, 4), denoted Experiments 11 to 13. For all but two experiments where variance were less than 0.0152 (Experiments 1 to 3) we took the initial start from inferring the initial condition to be the observed level in our data, see Table 1.
Generally as the initial starting value is perturbed the mean PSRF value for the sum-of-squares trace plots increases and the number of species converging reduces (Columns 3 and 4, Table 1). Except for the first three experiments, with no or little perturbation, ED m (Column 10, Table 1) is comparable with the ED0 (Column 2, Table 1), suggesting that the MCMC parameter chains are not venturing far from the true values. However by Experiment 9 the perturbation is such that over 10% of the parameters are converging to posterior means that are significantly far away from the true value (i.e. true value lies outside the 5 th -95 th percentile posterior distribution). This could be indicative of alternative parameter regimes giving rise to the observed data and can be tested by inspecting the log likelihood.
Parameter convergence is not correlated to Euclidean distance between starting value and true value
The reason why some parameters are not converging appears parameter specific and possibly related to individual levels of inter-correlation with other parameters or species. To test this hypothesis we investigated what happens if we remove highly correlated parameters from the analysis. We saw a reduction in the MPSRF (e.g. removing parameters with a correlation coefficient of 0.25 or above would reduce the MPSRF from 3.38 to 1.02 in Experiment 2) but more work will be required to systematically expose patterns of inter-correlations in this complex network of 30 species and over 100 parameters.
The aim of this work is to set the scene for full parameter estimation and model comparison in a Bayesian context for the circadian clock model. Previously, model parameters have been fitted heuristically [2, 4, 10]. Those model calibration exercises reproduce features of the data but cannot rule out other parameter regimes. As the clock continues to be extended to other species and to downstream activities such as metabolism it becomes increasingly important to evaluate competing scenarios, and the Bayesian approach extends naturally to model comparison. This work represents the first full Bayesian treatment of parameter estimation for the circadian clock in Arabidopsis Thaliana and informs future work for tackling this complex problem.
Our initial investigation highlights two main areas. First, we have shown that modern MCMC techniques, when implemented in a high performance computing environment, make it feasible to attempt Bayesian inference in this high-dimensional setting.
Second, apparent convergence in either data or parameter space, using diagnostic techniques, may mask poor mixing, both pairwise and at higher orders of the exploratory chains. This issue requires further investigation of the proposal function but also better coverage of the prior parameter space with a population of chains. For the simulation experiments described in the present article, we took the most unfavourable scenario of complete absence of prior information about the chemical kinetic parameter values, for which we chose improper uniform prior distributions. For most practical applications, more informative priors are usually available, derived from expert elicitation, the biological literature, databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes), and complementary experiments. We note that more informative priors can not only potentially lead to an improvement in the parameter estimation accuracy, but also to an improvement in the convergence of the Markov chains, due to the fact that they render the posterior distributions less diffuse. The estimates presented in the present study can therefore be regarded as conservative, providing performance indicators that, in practice, can potentially be improved on. Exploration of the parameter space could also be directed by introducing auxiliary information, in a systematic fashion, specific to the circadian model, such as period/amplitude or phase of the data. These considerations will allow for greater confidence in the predictions and fuller understanding of the model performance in different parameter regimes.
This work was funded by the EU under the FP7 scheme as part of the TiMeT project. We would like to thank Andrew Millar and Daniel Seaton for helpful discussions.
The publication costs for the article are covered by the TiMeT project.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 10, 2013: Selected articles from the 10th International Workshop on Computational Systems Biology (WCSB) 2013: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S10
- Zhang EE, Kay SA: Clocks not winding down: unravelling circadian networks. Nature Reviews. Molecular Cell Biology. 2010, 11: 764-776. 10.1038/nrm2995.View ArticlePubMedGoogle Scholar
- Pokhilko A, Hodge SK, Stratford K, Knox K, Edwards KD, Thomson AW, Mizuno T, Millar AJ: Data assimilation constrains new connections and components in a complex, eukaryotic circadian clock model. Molecular Systems Biology. 2010, 6: 416-PubMed CentralView ArticlePubMedGoogle Scholar
- Nusinow DA, Helfer A, Hamilton EE, King JJ, Imaizumi T, Schultz TF, Farré EM, Kay SA: The ELF4-ELF3-LUX complex links the circadian clock to diurnal control of hypocotyl growth. Nature. 2011, 475: 398-402. 10.1038/nature10182.PubMed CentralView ArticlePubMedGoogle Scholar
- Pokhilko A, Fernández AP, Edwards KD, Southern MM, Halliday KJ, Millar AJ: The clock gene circuit in Arabidopsis includes a repressilator with additional feedback loops. Molecular Systems Biology. 2012, 8: 574-PubMed CentralView ArticlePubMedGoogle Scholar
- Locke JCW, Millar AJ, Turner MS: Modelling genetic networks with noisy and varied experimental data: the circadian clock in Arabidopsis thaliana. Journal of Theoretical Biology. 2005, 234: 383-393. 10.1016/j.jtbi.2004.11.038.View ArticlePubMedGoogle Scholar
- Haario H, Laine M, Mira A, Saksman E: DRAM: Efficient adaptive MCMC. Statistics and Computing. 2006, 16: 339-354. 10.1007/s11222-006-9438-0.View ArticleGoogle Scholar
- Bishop CM: Pattern Recognition and Machine Learning. 2006, SpringerGoogle Scholar
- Brooks SP, Gelman A: General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics. 1998, 7 (4): 434-Google Scholar
- Schmidt H, Jirstrand M: Systems Biology Toolbox for MATLAB: a computational platform for research in systems biology. Bioinformatics. 2006, 22: 514-515. 10.1093/bioinformatics/bti799.View ArticlePubMedGoogle Scholar
- Locke JCW, Southern MM, Kozma-Bognár L, Hibberd V, Brown PE, Turner MS, Millar AJ: Extension of a genetic network model by iterative experimentation and mathematical analysis. Molecular Systems Biology. 2005, 1: 1-View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.