Skip to main content

BayesianSSA: a Bayesian statistical model based on structural sensitivity analysis for predicting responses to enzyme perturbations in metabolic networks

Abstract

Background

Chemical bioproduction has attracted attention as a key technology in a decarbonized society. In computational design for chemical bioproduction, it is necessary to predict changes in metabolic fluxes when up-/down-regulating enzymatic reactions, that is, responses of the system to enzyme perturbations. Structural sensitivity analysis (SSA) was previously developed as a method to predict qualitative responses to enzyme perturbations on the basis of the structural information of the reaction network. However, the network structural information can sometimes be insufficient to predict qualitative responses unambiguously, which is a practical issue in bioproduction applications. To address this, in this study, we propose BayesianSSA, a Bayesian statistical model based on SSA. BayesianSSA extracts environmental information from perturbation datasets collected in environments of interest and integrates it into SSA predictions.

Results

We applied BayesianSSA to synthetic and real datasets of the central metabolic pathway of Escherichia coli. Our result demonstrates that BayesianSSA can successfully integrate environmental information extracted from perturbation data into SSA predictions. In addition, the posterior distribution estimated by BayesianSSA can be associated with the known pathway reported to enhance succinate export flux in previous studies.

Conclusions

We believe that BayesianSSA will accelerate the chemical bioproduction process and contribute to advancements in the field.

Peer Review reports

Background

Chemical production using microbes, known as chemical bioproduction, has attracted attention as a key technology in a decarbonized society. Chemical bioproduction is expected to be essential for sustainable development [1], such as the production of medicines [2, 3], fuels [4], and foods [5], and the absorption of CO2 [6, 7]. For efficient chemical bioproduction, computational designs of metabolic networks are typically employed to reduce the cost of comprehensive wet lab experiments [8, 9].

One powerful strategy for computational design is to predict changes in metabolic fluxes when up-/down-regulating enzymatic reactions, that is, responses to enzyme perturbations [10, 11]. Up-/down-regulation of enzymatic reactions through genetic manipulations, such as modification, over-expression, and knockout [12, 13], can alter the metabolic fluxes. Increasing the flux of the target chemical leads to efficient chemical bioproduction, and prediction is an essential step in this process.

There are two types of computational methods for predicting responses to enzyme perturbations. The first is flux balance analysis (FBA)-based methods [14,15,16,17], which maximize an objective function, instead of considering kinetics. When applied to unfamiliar strains, FBA-based methods suffer from dependence on the objective function, which is typically the biomass objective function [18]. Specifically, extensive wet lab experiments need to be conducted to measure phenotypes, such as biomass, of the strain of interest and to identify the objectives of the strain in biological activities [19]. The second is kinetics-based methods, such as sensitivity analysis in kinetic models [20,21,22] and structural sensitivity analysis (SSA) [10, 23]. Sensitivity analysis in kinetic models allows predicting quantitative responses to enzyme perturbations. There are global and local sensitivity analysis [22, 24], and the local sensitivity analysis is the more typical method for applying to kinetic models, which is called metabolic control analysis [25] and has been successfully used in many applications [26,27,28]. To construct a kinetic model, it is necessary to obtain parameters and functional forms of reaction rates of all reactions in the metabolic network of interest under specific environmental conditions [20, 21]. Therefore, parameter estimation is typically essential for unfamiliar strains, whose known information is rarely available [29]. Even though many parameter estimation methods have been developed [30,31,32,33], it can still be a cumbersome process due to the high dimensionality of the parameter space [30, 34]. In contrast, SSA can predict qualitative responses, which are the signs of responses, to enzyme perturbations only from structural information of the metabolic network. SSA does not need to determine the functional forms and parameters of the reaction rates. In other words, SSA removes the burden of parameter determination entirely. Because of its parameter-free nature, SSA could be widely applicable to chemical bioproduction.

SSA is originally a method to predict qualitative responses of a chemical reaction system to perturbations in enzyme amounts (or activities) only from structural information of the reaction network. In SSA, change in the chemical concentrations/fluxes to a perturbation is given in the form of a rational function of “SSA variables,” which are defined as derivatives of reaction rates with respect to chemical concentrations. From their definitions (cf. the “Our model” section), the SSA variables vary depending on environmental conditions, such as aerobic/anaerobic states, pH levels, and nutrient availability, as well as individual differences among microbes. By the SSA theory, whether the rational function is zero or non-zero, indicating the absence or presence of a response, is unambiguously determined from the structural information of the network alone. Furthermore, the sign of the rational function, indicating a positive or negative response, could potentially be determined by considering the signs of SSA variables, which are usually deducible from general considerations or biological knowledge. However, if the range of the rational function happens to include zero, its sign becomes undeterminable. For example, when a rational function is a subtraction of two positive variables, its sign depends on the quantitative values of these two variables. We refer to such structurally undeterminable response predictions as “indefinite predictions.” Indefinite predictions are true as the general consequences of responses drawn from structural information alone, which SSA aims to obtain rather than specific consequences.

However, these indefinite predictions in SSA can present practical challenges in the application to specific species and culture environments for chemical bioproduction, due to the necessity of precisely detecting reactions to up-/down-regulate. They often arise in complex and intertwined metabolic networks. Indeed, the central metabolic pathway analyzed in this study exhibits many indefinite predictions (Supplementary Table S1). As mentioned above, the SSA variables vary depending on the environmental conditions, fluctuating in accordance with the individual differences among microbes. For specific microbial species and culture environments, constraining the possible values of SSA variables on the basis of environmental information may decrease the number of indefinite predictions. This implies that application methods of SSA using environmental information are practically beneficial for chemical bioproduction.

In this study, we propose BayesianSSA, a Bayesian statistical model that extracts environmental information from perturbation datasets collected in environments of interest and integrates it into SSA predictions. BayesianSSA considers the variables of SSA as stochastic variables, and they are estimated using the perturbation data. Although BayesianSSA introduces new parameters to estimate, it still requires fewer parameters per reaction than kinetic modeling requires. For example, for a one-substrate reaction, BayesianSSA requires one parameter while kinetic modeling with the Michaelis-Menten equation requires two parameters, \(V_{\text{max}}\) and \(K_{\text{m}}\). In addition, BayesianSSA does not need to explore the functional forms of the reaction rates. The introduction of stochastic variables in BayesianSSA brings two additional advantages. First, it allows for the consideration of the uncertainty caused by individual differences among microbes. The variables of SSA may depend not only on environmental conditions but also on individual differences among microbes, and considering their uncertainty may contribute to predictive performance. Second, it enables a probabilistic interpretation of indefinite prediction by positivity confidence values. These values are defined as the probability that the predictive response is positive. We report the results of applying BayesianSSA to synthetic and real datasets of the central metabolic pathway of E. coli. To validate the practicability of BayesianSSA and assess whether BayesianSSA can integrate environmental information into SSA predictions, we compared predictive performances between BayesianSSA and a base method, which is the same as the BayesianSSA model but utilizes an initial prior distribution without incorporating perturbation datasets, as well as a naive Bayes model on these datasets. Utilizing environmental information may enhance predictions for out-of-sample perturbations, where chemical reactions to be perturbed are not included in the sample used to fit BayesianSSA for a given target chemical. To examine this effect, we evaluated predictions of out-of-sample perturbations by BayesianSSA.

Theoretical background

Structural sensitivity analysis

Algorithm

We explain the SSA algorithm [10, 11] briefly in this section. Consider the following ordinary differential equation (ODE):

$$\begin{aligned} \frac{dx_m}{dt} = \sum ^{J}_{j=1} \nu _{m,j} F_j (k_j, \textbf{x}), \end{aligned}$$

where \(x_m\) is the concentration of the m-th metabolite, J is the number of different reactions, \(\nu _{m,j}\) is the (mj) element of the stoichiometric matrix \(\varvec{\nu }\) that indicates metabolites as rows and reactions as columns, \(F_j(\cdot )\) is the j-th reaction rate function, which shows the reaction flux, \(k_j\) is the reaction rate constant of the j-th reaction, \(\textbf{x}=(x_1,\ldots ,x_M)^{\text{T}}\) is the vector of metabolite concentrations, and M is the number of different metabolites. Under the situation where the system obeys this ODE and the assumption that \(F_j\) monotonically increases with respect to \(k_j\), the SSA algorithm can derive the rational function of a response to a perturbation of the j-th reaction. Here, the perturbation and response are represented as an operation changing \(k_j\) to \(k_j + \delta k_j\) [11] and the change in metabolite concentrations/reaction fluxes from the initial steady-state to the eventual steady-state after the perturbation, respectively.

The first step of the SSA algorithm is to make a matrix \(\textbf{R}(\textbf{r})\). The (jm) element of \(\textbf{R}(\textbf{r})\), denoted by \(r_{j,m}\), is equal to \(\partial F_j / \partial x_m\). We write non-zero elements of \(\textbf{R}(\textbf{r})\) collectively as \(\textbf{r}\in \mathbb {R}^P\), with P being the total number of non-zero values in \(\textbf{R}(\textbf{r})\). That is, the matrix \(\textbf{R}(\textbf{r})\) indicates the dependence of reactions to metabolites. For example, \(r_{j,m}>0\) if the m-th metabolite is a substrate of the j-th reaction and \(r_{j,m}=0\) otherwise. Using the matrix \(\textbf{R}(\textbf{r})\), the matrix \(\textbf{A}(\textbf{r})\in \mathbb {R}^{(J+L)\times (M+K)}\) is defined as

$$\begin{aligned} \textbf{A}(\textbf{r}):=\begin{pmatrix} \begin{array}{ccc|ccc} & & & & & \\ & \textbf{R}(\textbf{r})& & & -\textbf{C} & \\ & & & & & \\ \hline & & & & & \\ & -\textbf{D}^{\text{T}} & & & \textbf{0}_{L \times K} & \\ & & & & & \end{array} \end{pmatrix} , \end{aligned}$$

where \(\textbf{0}_{L \times K} \in \mathbb {R}^{L \times K}\) is a zero matrix, whose elements are all zero, \(\textbf{C}=(\textbf{c}_1,\ldots ,\textbf{c}_K) \in \mathbb {R}^{J \times K}\), \(\textbf{c}_k\) is the k-th basis of \(\text{ker}\,\varvec{\nu }\), which indicates the right null space of \(\varvec{\nu }\), K is the number of the bases of \(\text{ker}\,\varvec{\nu }\), \(\textbf{D}=(\textbf{d}_1,\ldots ,\textbf{d}_L) \in \mathbb {R}^{M \times L}\), \(\textbf{d}_l\) is the l-th basis of \(\text{ker}\,\varvec{\nu }^{\text{T}}\), which indicates the right null space of \(\varvec{\nu }\), and L is the number of the bases of \(\text{ker}\,\varvec{\nu }^{\text{T}}\). Note that \(\textbf{A}(\textbf{r})\) is in general a square matrix, i.e., \(J+L=M+K\). A sensitivity matrix is defined as

$$\begin{aligned} \textbf{S}(\textbf{r}):=-\textbf{A}(\textbf{r})^{-1}. \end{aligned}$$
(1)

Here, the (mj) element of \(\textbf{S}(\textbf{r})\in \mathbb {R}^{(M+K)\times (J+L)}\) is proven to be equal to a constant multiple of a quantitative response of the m-th metabolite concentration/flux to a perturbation of the j-th reaction/conserved quantity [11]. Even though we calculate the sensitivity matrix, we cannot determine the quantitative response value. However, each element of the sensitivity matrix has the same sign (positive, negative, or zero) as the corresponding quantitative response value, enabling us to discuss qualitative responses. Although flux responses are represented as sets of reactions that are non-zero in \(\textbf{c}_k\), we can obtain a flux response corresponding to each reaction by calculating linear combinations. Let \(\textbf{T}(\textbf{r})\in \mathbb {R}^{J \times (J+L)}\) be a matrix indicating the responses of each reaction flux, which can be written as

$$\begin{aligned} \textbf{T}(\textbf{r}):=\textbf{C} \textbf{S}(\textbf{r})_{M+1:M+K,1:J+L}, \end{aligned}$$

where \(\textbf{S}(\textbf{r})_{M+1:M+K,1:J+L} \in \mathbb {R}^{K \times (J+L)}\) is a block matrix of \(\textbf{S}(\textbf{r})\), which is extracted from the \((M+1)\)-th to \((M+K)\)-th rows.

Qualitative response prediction

SSA can predict qualitative responses (positive, negative, zero, or indefinite) to perturbations using only structural information, which is the metabolic network and the constraints on the \(\textbf{r}\) values. Here, we describe this in a precise way.

Let \(\textbf{Q}(\textbf{r})\in \{-1, 0, 1\}^{(M+J) \times (J+L)}\) be the qualitative response matrix, which is given by

$$\begin{aligned} \textbf{Q}(\textbf{r}):=\text{sign}{\left( { \begin{pmatrix} \begin{array}{ccc} & & \\ & \textbf{S}(\textbf{r})_{1:M,1:J+L} & \\ & & \\ \hline & & \\ & \textbf{T}(\textbf{r})& \\ & & \end{array} \end{pmatrix} }\right) } , \end{aligned}$$

where \(\text{sign}(\cdot )\) is the element-wise sign function, which returns a matrix whose element equals 1, 0, and \(-1\) if the sign of the corresponding element of the given matrix is positive, zero, and negative, respectively. The responses of the m-th metabolite concentration and the j-th reaction flux to perturbations correspond to the m-th and \((M + j)\)-th rows of \(\textbf{Q}(\textbf{r})\), respectively. We refer to the m-th row and the j-th column of \(\textbf{Q}(\textbf{r})\) as the m-th “observation target” and the j-th “perturbation target”, respectively. In addition, we call the perturbation experiment/prediction/response for the m-th observation and j-th perturbation targets the (mj) experiment/prediction/response.

Since SSA is concerned with general results of qualitative responses, the elements of \(\textbf{Q}(\textbf{r})\) are examined across all possible values of \(\textbf{r}\). It is important to note that there are typically constraints on the \(\textbf{r}\) values, such as \(r_{j,m}>0\), and \(\textbf{Q}(\textbf{r})\) is evaluated within these constraints. Let \(q_{m,j}(\textbf{r})\) denote the (mj) element of \(\textbf{Q}(\textbf{r})\) (\(m = 1,\ldots , (M+J)\), \(j = 1,\ldots ,(J+L)\)). The qualitative response for each \(q_{m,j}(\textbf{r})\) can be classified into one of four categories; (1) \(q_{m,j}(\textbf{r})\) is zero for any \(\textbf{r}\). This case is a consequence of structural properties, as explained by a theorem known as the law of localization and buffering structures [11, 35]. (2) \(q_{m,j}(\textbf{r})\) is positive for any \(\textbf{r}\). (3) \(q_{m,j}(\textbf{r})\) is negative for any \(\textbf{r}\). (4) \(q_{m,j}(\textbf{r})\) varies depending on the quantitative values of \(\textbf{r}\), making the sign of the response indefinite. For example, including a term \(r_1-r_2\) in the symbolic expression corresponding to \(q_{m,j}(\textbf{r})\) makes \(q_{m,j}(\textbf{r})\) positive if \(r_1>r_2\) and negative if \(r_1<r_2\), where \(r_i\) is the i-th element of \(\textbf{r}\). These four categories are referred to as zero, positive, negative, and indefinite, respectively.

There are several methods for evaluating \(q_{m,j}(\textbf{r})\) for all possible \(\textbf{r}\). One approach is to perform symbolic calculations, regarding \(\textbf{r}\) as symbolic variables [10]. Although this method is rigorous, it is only practical for relatively small metabolic networks due to its computational complexity. An alternative, more computationally tractable method is to draw a sufficient number of \(\textbf{r}\) samples from an arbitrary probabilistic distribution and numerically evaluating \(\textbf{Q}(\textbf{r})\) [36]. This latter approach has inspired us to introduce stochastic variables into SSA in this study.

Methods

Our model

In SSA, the predictive response is obtained by considering all cases of \(\textbf{r}\). \(\textbf{r}\) depends on \({\left\{ {k_j}\right\} }_j\) and \({\left\{ {x_m}\right\} }_m\), which, in turn, depend on environmental conditions, such as aerobic/anaerobic states, pH levels, and nutrient availability, as well as individual differences among microbes. The SSA qualitative response prediction described in the previous section thus focuses on the general consequences of responses drawn from structural information alone. We propose a Bayesian statistical model based on SSA, named BayesianSSA, to integrate environmental information extracted from perturbation data into SSA predictions. We consider a probability of each \(\textbf{r}\) value, regarding \(\textbf{r}\) as a stochastic variable. The BayesianSSA posterior probability of \(\textbf{r}\) reflects perturbation data and thus extracts environmental information. Predicting responses on the basis of the posterior distribution of \(\textbf{r}\) and positivity confidence values enables us to integrate the environmental information into SSA predictions. A positivity confidence value, which we proposed in this study, indicates the probability that the predictive response is positive. The positivity confidence values enable us to interpret indefinite predictions stochastically. We will describe the details in the “Positivity confidence value” section.

Prior distribution and likelihood

In this section, we describe the likelihood function and the prior distribution of BayesianSSA. Suppose that we have a perturbation dataset \(\textbf{y}\), which shows the signs of experimentally observed responses. We define the perturbation record, an element of \(\textbf{y}\), as follows:

$$\begin{aligned} y_i :={\left\{ \begin{array}{ll} 1 & \text{if} \; \text{increase} \\ -1 & \text{if} \; \text{decrease} \end{array}\right. } , \end{aligned}$$
(2)

where \(y_i\) is the i-th perturbation record, obtained from the \((m_i, j_i)\) experiment, and “\(\text{increase}\)” and “\(\text{decrease}\)” indicate the cases where the observation target increases and decreases when the perturbation target is perturbed, respectively (see the “Qualitative response prediction” section for the definition of observation and perturbation targets). We consider the probability that \(y_i\ne q_{m_i,j_i}(\textbf{r})\) because experimental errors may occur even if \(q_{m,j}(\textbf{r})\) makes the correct prediction, assuming that the likelihood function is the following:

$$\begin{aligned} p(y_i | \textbf{r}, \varvec{\rho }) = {\left\{ \begin{array}{ll} \rho _{m_i,j_i}& \text{if} \; y_i = q_{m_i,j_i}(\textbf{r})\\ 1 - \rho _{m_i,j_i}& \text{if} \; y_i \ne q_{m_i,j_i}(\textbf{r})\end{array}\right. } , \end{aligned}$$
(3)

where \(\rho _{m,j}\in (0, 1)\) is a parameter indicating the reliability of the (mj) experiment, and \(\varvec{\rho }\) is a matrix whose (mj) element is \(\rho _{m,j}\). The likelihood function means that the probability of \(y_i\) is \(\rho _{m_i,j_i}\) if \(\textbf{r}\) can accurately predict \(y_i\), and is \(1-\rho _{m_i,j_i}\) otherwise. In other words, BayesianSSA assumes that the result of the i-th experiment can stochastically vary in accordance with the probability of \(\rho _{m_i,j_i}\). \(\rho _{m,j}\) is different for each (mj), and each (mj) experiment is assumed to have different reliability in BayesianSSA. This assumption is reasonable because the distribution of measured values is supposed to be different for each (mj) experiment.

We consider the prior distribution of \(\rho _{m,j}\) as

$$\begin{aligned} p(\rho _{m,j}) = \mathcal {B}\left( \rho _{m,j} | a, b \right) , \end{aligned}$$

where \(\mathcal {B}\left( \cdot | a, b \right)\) denotes the beta distribution probability density function with parameters \(a \in \mathbb {R}_{>0}\) and \(b \in \mathbb {R}_{>0}\). The prior distribution of \(\textbf{r}\) should be chosen in accordance with the constraint on \(\textbf{r}\). A typical constraint on \(r_{j,m}\) is \(r_{j,m}>0\) with the m-th metabolite being the substrate of the j-th reaction, and we consider only such type of constraints in this study. We used a weighted empirical distribution with samples drawn from a log-normal distribution as the prior distribution. Specifically, the prior distribution of \(\textbf{r}\) is the following:

$$\begin{aligned} p(\textbf{r}= \textbf{r}^{(v)})&= \mathcal{W}\mathcal{E}\left( \textbf{r}=\textbf{r}^{(v)} | \textbf{w} \right) \nonumber \\&= w_v, \end{aligned}$$
(4)

where \(\mathcal{W}\mathcal{E}\left( \textbf{r}=\textbf{r}^{(v)} | \textbf{w} \right)\) denotes the weighted empirical distribution probability mass function with a stochastic variable \(\textbf{r}\), a weight parameter \(\textbf{w}={\left( {w_1, \ldots , w_V}\right) }^{\text{T}}\), and a parameter sample set \({\left\{ {\textbf{r}^{(v)}}\right\} }_{v=1}^{V}\), \(w_v>0\) and \(\sum _{v=1}^{V} w_v = 1\), and V is the size of the sample set. Here, we generated the parameter sample set \({\left\{ {\textbf{r}^{(v)}}\right\} }_{v=1}^{V}\) as follows:

$$\begin{aligned} \textbf{r}^{(v)}&\sim \mathcal{L}\mathcal{N}\left( \varvec{\mu }, \varvec{\Sigma } \right) , \end{aligned}$$

where \(\mathcal{L}\mathcal{N}\left( \varvec{\mu }, \varvec{\Sigma } \right)\) denotes the log-normal distribution with parameters \(\varvec{\mu }\in \mathbb {R}^{P}\) and \(\varvec{\Sigma }\in \mathbb {R}^{P \times P}\) (\(\varvec{\mu }\) and \(\varvec{\Sigma }\) correspond to the mean and covariance matrix parameter of the normal distribution, respectively), and \(a \sim \mathcal {D}\) indicates that a stochastic variable a is drawn from a distribution \(\mathcal {D}\).

The purpose of BayesianSSA is to obtain a better distribution \(p(\textbf{r})\) for calculating positivity confidence values (Eq. (7)). Therefore, we need the marginal posterior distribution \(p(\textbf{r}|\textbf{y})\). The marginal likelihood is obtained as

$$\begin{aligned} p(\textbf{y}|\textbf{r}=\textbf{r}^{(v)})&=\int p(\textbf{y}|\textbf{r}=\textbf{r}^{(v)}, \varvec{\rho })p(\varvec{\rho }) d\varvec{\rho }\nonumber \\&= \prod _{m=1}^{M+J}\prod _{j=1}^{J+L}\frac{\text{Beta}{\left( {\hat{a}_{m,j,v}, \hat{b}_{m,j,v}}\right) } }{\text{Beta}{\left( {a, b}\right) }}, \end{aligned}$$

where \(\text{Beta}{\left( {\cdot , \cdot }\right) }\) is the beta function, \(\hat{a}_{m,j,v}=t_{m,j,v}+ a\), \(\hat{b}_{m,j,v}=f_{m,j,v}+ b\), and \(t_{m,j,v}\) and \(f_{m,j,v}\) are the number of true and false (mj) predictions based on \(\textbf{r}^{(v)}\) in \(\textbf{y}\), respectively. Here, \(t_{m,j,v}=f_{m,j,v}=0\) for a (mj) experiment that has not been conducted.

If a continuous prior and posterior distribution is desired, one can use them and obtain samples from the posterior distribution using the Markov chain Monte Carlo (MCMC) method. Since we discretized the log-normal distribution (Eq. (4)), we can calculate the posterior distribution without approximation using MCMC (details are described in the “Calculating posterior distribution” section).

Calculating posterior distribution

The marginalized posterior distribution \(p(\textbf{r}| \textbf{y})\) in BayesianSSA can be calculated by normalizing the following formula:

$$\begin{aligned} p(\textbf{r}= \textbf{r}^{(v)}| \textbf{y})&\propto p(\textbf{y}|\textbf{r}=\textbf{r}^{(v)}) p(\textbf{r}=\textbf{r}^{(v)}) \\&= {\left( {\prod _{m=1}^{M+J}\prod _{j=1}^{J+L}\frac{\text{Beta}{\left( {\hat{a}_{m,j,v}, \hat{b}_{m,j,v}}\right) } }{\text{Beta}{\left( {a, b}\right) }}}\right) } w_v, \end{aligned}$$

We can omit calculating the constant of this formula, and we derive that

$$\begin{aligned} p(\textbf{r}= \textbf{r}^{(v)}| \textbf{y}) = \frac{g(\textbf{y}, \textbf{r}^{(v)})}{\sum _{v^{\prime }=1}^{V} g(\textbf{y}, \textbf{r}^{(v^{\prime })})}, \end{aligned}$$
(5)

where

$$\begin{aligned} g(\textbf{y}, \textbf{r}^{(v)})&:=w_v \prod _{(m, j) \in \Lambda }\text{Beta}{\left( {\hat{a}_{m,j,v}, \hat{b}_{m,j,v}}\right) } , \\ \Lambda&:={\left\{ {(m, j) \mid t_{m,j,v}+ f_{m,j,v}\ne 0}\right\} }. \end{aligned}$$

We implemented these calculations using the log-sum-exp trick.

Calculating predictive distribution

To evaluate predictive performance, we calculate the predictive probability \(p(\textbf{y}^{\text{new}}|\textbf{y})\), where \(\textbf{y}^{\text{new}}\) is a new sample that is not used in BayesianSSA fitting. Using the posterior distribution \(p(\textbf{r}, \varvec{\rho }| \textbf{y})\) (Supplementary Section S1), the predictive probability can be obtained as

$$\begin{aligned} p(\textbf{y}^{\text{new}}|\textbf{y})&= \sum _{v=1}^{V} \int p(\textbf{y}^{\text{new}}|\textbf{r}=\textbf{r}^{(v)}, \varvec{\rho }) \nonumber \\&\hspace{20mm}p(\textbf{r}=\textbf{r}^{(v)}, \varvec{\rho }|\textbf{y}) d\varvec{\rho }\nonumber \\&= \frac{\sum _{v=1}^{V} w_v\prod _{(m, j) \in \Lambda ^{\text{new}}}\text{Beta}{\left( {\hat{a}_{m,j,v}^{\text{new}}, \hat{b}_{m,j,v}^{\text{new}}}\right) }}{ \sum _{v = 1}^{V} w_{v} \prod _{(m, j) \in \Lambda ^{\text{new}}}\text{Beta}{\left( {\hat{a}_{m,j,v}, \hat{b}_{m,j,v}}\right) }} \end{aligned}$$
(6)

where

$$\begin{aligned} \hat{a}_{m,j,v}^{\text{new}}&:=\hat{a}_{m,j,v}+ t_{m,j,v}^{\text{new}}, \\ \hat{b}_{m,j,v}^{\text{new}}&:=\hat{b}_{m,j,v}+ f_{m,j,v}^{\text{new}}, \\ \Lambda ^{\text{new}}&:=\Lambda \cup {\left\{ {(m, j) \mid t_{m,j,v}^{\text{new}}+ f_{m,j,v}^{\text{new}}\ne 0}\right\} }, \end{aligned}$$

and \(t_{m,j,v}^{\text{new}}\) and \(f_{m,j,v}^{\text{new}}\) are the number of true and false (mj) predictions based on \(\textbf{r}^{(v)}\) in \(\textbf{y}^{\text{new}}\), respectively. The detailed derivation is described in Supplementary Section S2.

Bayesian updating

Bayesian updating, a procedure where the posterior distribution is used as a prior distribution for the next estimation, can be easily applied to BayesianSSA. In BayesianSSA, updating \(\hat{a}_{m,j,v}\) and \(\hat{b}_{m,j,v}\) to \(\hat{a}_{m,j,v}^{\text{new}}\) and \(\hat{b}_{m,j,v}^{\text{new}}\) is equivalent to Bayesian updating (Supplementary Section S3). This update is also derived from the fact that the final updated posterior distribution in Bayesian updating does not depend on the order of the given perturbation data due to Bayes’ theorem [37].

Positivity confidence value

The introduction of stochastic variables \(\textbf{r}\) enables interpreting indefinite predictions in SSA. We define positivity confidence values as the probabilities that the responses are positive for each indefinite prediction. The positivity confidence value of the (mj) prediction for a distribution \(p^{\star }(\textbf{r})\) is written as

$$\begin{aligned} c_{p}(m,j)&:=\mathbb {E}_{p^{\star }(\textbf{r})}[\mathbb {I}_{q_{m,j}(\textbf{r})=1}], \end{aligned}$$
(7)

where \(\mathbb {E}_{p^{\star }(\textbf{r})}[\cdot ]\) denotes the expectation with respect to \(p^{\star }(\textbf{r})\), and \(\mathbb {I}_{a=b}\) is an indicator that is equal to 1 if \(a=b\) and 0 otherwise. A higher positivity confidence value indicates that the qualitative response is more likely to be positive, and the probability of being negative is higher than that of being positive when the positivity confidence value is below 0.5. Note that we use the predictive distributions derived in the “Calculating predictive distribution” section rather than positivity confidence values to evaluate predictive performance on perturbation datasets. This is because positivity confidence values do not consider experimental error.

Used data

Metabolic network information

We utilized the central metabolic pathway of E. coli MG1655 used in a previous study [38]. This metabolic network was originally from the EcoCyc database [39] and modified by Trinh et al. [40] and Toya et al. [38]. We preprocessed this dataset as follows:

  1. 1.

    Remove the biomass objective function.

  2. 2.

    Remove metabolites that have no reactions that produce or use them.

  3. 3.

    Remove reactions that no longer have substrates or products due to the previous processes.

  4. 4.

    Integrate cytoplasm and extracellular metabolites.

The second step is necessary to apply SSA to the network because it is typically impossible to calculate the inverse of the matrix \(\textbf{A}(\textbf{r})\) (Eq. (1)) when metabolites not involved in the flow are included in the network. Here, metabolites in flow refer to metabolites that serve as both inputs and outputs of reactions included in the network. The fourth step aims to reduce computation time, and this procedure does not alter the results from SSA. After these preprocessing steps, we converted the resulting network into the stoichiometric matrix \(\varvec{\nu }\). Constraints on \(\textbf{R}(\textbf{r})\) were determined solely by the metabolic network, where we set \(r_{j,m}> 0\) if the j-th metabolite is a substrate of the m-th reaction and \(r_{j,m}= 0\) otherwise. The resulting metabolite and reaction lists are shown in Supplementary Table S2 and Supplementary Table S3. We use the abbreviations in Supplementary Table S2 and Supplementary Table S3 in the following.

Synthetic data generation

We generated synthetic data to compare BayesianSSA with random and base methods. The synthetic perturbation dataset is generated in accordance with a Bernoulli distribution with parameters generated in accordance with beta distributions. Then, we replaced all 0 with \(-1\) to match the support of the Bernoulli distribution and the range of the perturbation record \(y_i\) (Eq. (2)). We used GND, PTS, and PPC as the perturbation targets and succinate export (SUCCt) as the observation target. We used a beta distribution with an expectation is close to 0 for GND and PTS and a beta distribution with an expectation is close to 1 for PPC. These distribution settings were in accordance with previous reports [41,42,43,44]. We used \((a, b)=(0.1, 0.3)\) and \((a, b)=(0.3, 0.1)\) as parameters of beta distributions with expectations are close to 0 and 1, respectively. Note that these parameters are used only for the data generation and different from the parameters of the prior distributions, which are weakly informative. We used 30 as the number of records for each perturbation target. The obtained synthetic dataset is shown in Table 1.

Table 1 Obtained synthetic perturbation dataset

Wet lab experiments

We conducted perturbation experiments for the reactions CS, FBP, ICL, LDH, ME1, ME2, PCK, PPC, PPS, and PTA. The genes corresponding to the reactions, listed in Table 2, were introduced into the plasmid vector pLEAD5 (NIPPON GENE CO., LTD.). We amplified the gene sequences using PrimeSTAR GXL DNA Polymerase (Takara Bio Inc.) with E. coli DH5\(\alpha\) (NIPPON GENE CO., LTD.) as a template sequence. The primer sequences were derived from NCBI Genes [45]. The nucleotide sequences of DNAs were analyzed using the BigDye\(\circledR\) Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems) with SeqStudio\(^{\text{TM}}\) Genetic Analyzer (Applied Biosystems). The nucleotide sequence data were processed using GENETYX-Mac NETWORK software, version 15 (GENETYX CORPORATION). We introduced the constructed plasmid vectors into E. coli JM109 (NIPPON GENE CO., LTD.). The modified strains were aerobically cultured and anaerobically fermented at 37\(^{\circ }\)C in M9 minimal medium (Thermo Fisher Scientific Inc.). To measure succinate concentrations, we used EnzyChrom Succinate Assay Kit (#ESNT-100, BioAssay Systems) and the absorbance meter SH-8000(CORONA ELECTRIC Co.,Ltd.) at 570 nm. The resulting absorbance, which is a constant multiple of the succinate concentrations, was normalized by the optical density of the bacterial liquid. We calculated the differences between the resulting values and the absorbances of controls, which are of a strain with only pLEAD5, and the signs of the values were used as the real perturbation data. We constructed and measured three replicates for each perturbed reaction. The obtained dataset is shown in Table 2. To validate the overexpression, we performed real-time PCR using QuantStudio5 Real-time PCR system (Thermo Fisher SCIENTIFIC) and confirmed the genes were successfully overexpressed (Supplementary Table S4).

Table 2 Obtained real perturbation dataset

Evaluation

Naive Bayes model

To evaluate the effectiveness of incorporating SSA, we constructed a naive Bayes model as follows:

$$\begin{aligned} p(y_i | \varvec{\eta }) = {\left\{ \begin{array}{ll} \eta _{m_i,j_i}& \text{if} \; y_i = 1\\ 1 - \eta _{m_i,j_i}& \text{if} \; y_i \ne 1 \end{array}\right. } , \end{aligned}$$

where \(\eta _{m,j}\in (0, 1)\) is a parameter indicating the probability that \(y_i=1\) where the i-th experiment is of (mj), and \(\varvec{\eta }\) is a matrix whose (mj) element is \(\eta _{m,j}\). The predictive distribution of one new perturbation record \(y^{\text{new}}\) is as follows:

$$\begin{aligned} p(y^{\text{new}}|\textbf{y}) = {\left\{ \begin{array}{ll} \frac{n_{m,j}^{(\text{p})}+1}{n_{m,j}^{(\text{p})}+n_{m,j}^{(\text{n})}+2} & \text{if} \; y^{\text{new}}=1 \\ \frac{n_{m,j}^{(\text{n})}+1}{n_{m,j}^{(\text{p})}+n_{m,j}^{(\text{n})}+2} & \text{if} \; y^{\text{new}}\ne 1 \end{array}\right. } , \end{aligned}$$
(8)

where \(n_{m,j}^{(\text{p})}\) and \(n_{m,j}^{(\text{n})}\) are the numbers of the (mj) experiments in \(\textbf{y}\) where \(y_i=1\) and \(y_i=-1\), respectively. We used \(\mathcal {B}\left( \eta _{m,j} | 1,1 \right)\) as the prior distribution.

Cross-entropy loss

To evaluate the performance of BayesianSSA, we used the cross-entropy loss as follows:

$$\begin{aligned} \text{CE}(N) = - \sum _{i=1}^{N} \log p_i(y^{\text{new}}= y_i), \end{aligned}$$

where N is the number of trials, which means how many times data are added, and \(p_i(\cdot )\) is the i-th predictive distribution of BayesianSSA or the naive Bayes model. The i-th predictive distribution is calculated using \({\left\{ {y_{i^\prime }}\right\} }_{i^\prime = 1}^{i-1}\). We used \(p_t(y^{\text{new}}=1) = 0.5\) and \(p_t(y^{\text{new}}= y_i) = p_1(y^{\text{new}}= y_i)\) as the “random method” and “base method” to calculate the cross-entropy loss, respectively, to be compared with BayesianSSA. Here, \(p_1(y^{\text{new}}= y_i)\) indicates the initial distribution of BayesianSSA.

Results

\(m_{\text{name}}\) denotes the index of the observation target in the following. In other words, the \(m_{\text{name}}\)-th observation target is a metabolite or a reaction whose name is “name.” Similarly, \(j_{\text{name}}\) denotes the index of the perturbation target “name” in the following. Unless otherwise stated, we used \(V=10000\), \(w_v=\frac{1}{V}\), \(\varvec{\mu }=\textbf{0}_{P}\), \(\varvec{\Sigma }=\textbf{I}_{P \times P}\), and \((a, b)=(3, 1)\) as the hyper-parameters of BayesianSSA where \(\textbf{0}_{P}\) is the P-dimensional zero vector, and \(\textbf{I}_{P \times P}\) is the \(P \times P\) identity matrix.

Performance evaluation on synthetic dataset

To compare predictive performance between BayesianSSA and the base method under an ideal situation that the experimental error is low, we examined the cross-entropy loss \(\text{CE}(N)\) (cf. the “Cross-entropy loss” section) on the synthetic dataset shown in Table 1. Figure 1 shows the cross-entropy loss trajectory for each method. While cross-entropy loss values at the last trial of BayesianSSA with \((a,b)=(9,1)\), \((a,b)=(6,2)\), \((a,b)=(3,1)\), and \((a,b)=(2,1)\) are 19.6, 23.3, 21.4, and 22.3, respectively, those of the random and base methods are 62.4 and 66.6, respectively. BayesianSSA outperformed the random and base method from the perspective of prediction accuracy under the ideal situation.

Fig. 1
figure 1

Cross-entropy loss trajectory for each method on the synthetic dataset. The \(x\)- and \(y\)-axes indicate the number of trials and cross-entropy loss, respectively

Positivity confidence value transition on pseudo data

To examine the transition of positivity confidence values, we applied BayesianSSA to a pseudo perturbation dataset (shown in Table 3) based on the previous studies [41,42,43]. Figure 2 shows the transition of positivity confidence values for increasing SUCCt, i.e., \(c_p(m_{\text{SUCCt}}, \cdot )\), when fitting BayesianSSA to the pseudo perturbation dataset. We found that the positivity confidence values were updated for multiple reactions rather than one reaction. The first update (from Fig. 2a to b) shows the EDA (6PG \(\rightarrow\) G3P + PYR) positivity confidence value for the production of succinate becomes higher. The second update (from Fig. 2b to c) shows the reactions included by the flux from FBP to PEP become lower. One of the reaction candidates to increase the SUCCt flux was PPC, whose reaction formula was PEP + CO2 \(\rightarrow\) OAA. While the initial PPC positivity confidence value was 0.714 (Supplementary Table S6), the updated PPC positivity confidence value was 0.869 (Supplementary Table S7). A previous report showed the succinate production of E. coli increases when PPC was up-regulated [44], and the PPC positivity confidence value estimated by BayesianSSA is consistent with the report. Similarly, those of GND and PTS were updated from 0.530 and 0.779 to \(7.84\times 10^{-3}\) and \(3.23\times 10^{-2}\), respectively. Although these results are also consistent with the reports [41,42,43], they are naive because the information of these reports was used directly to fit BayesianSSA. The initial positivity confidence values, which were calculated on the basis of the prior distribution, and positivity confidence values updated with the pseudo perturbation dataset are shown in Supplementary Table S6 and S7, respectively.

Table 3 Pseudo perturbation dataset
Fig. 2
figure 2

Transition of positivity confidence value for increasing the succinate export flux \(c_p(m_{\text{SUCCt}}, \cdot )\). Each square indicates a metabolite in the network. Each arrow indicates a reaction, and its color shows the positivity confidence value (red and blue) or zero response (grey). The reactions corresponding to the edges in this figure are shown in Supplementary Table S5. a Values given by the initial model. b Values given by the model updated by 10 perturbation records that \(y_i = -1\) where \(m_i=m_{\text{SUCCt}}\) and \(j_i=j_{\text{GND}}\). c Values given by the model updated by 10 perturbation records that \(y_i = -1\) where \(m_i=m_{\text{SUCCt}}\) and \(j_i=j_{\text{PTS}}\) in addition to the perturbation records of b

Performance evaluation on real data

To compare the performances between BayesianSSA and the naive Bayes model, which uses only data without SSA, we examined the cross-entropy loss \(\text{CE}(N)\) (cf. the “Cross-entropy loss” section). Figure 3 shows the cross-entropy loss trajectory for each method. The perturbation records were randomly shuffled and used in each trial. The loss of BayesianSSA is comparable to that of the base method in the early trials but smaller in the late trials. While the cross-entropy loss values at the last trial of BayesianSSA with \((a,b)=(9,1)\), \((a,b)=(6,2)\), \((a,b)=(3,1)\), and \((a,b)=(2,1)\) are 14.8, 15.7, 15.4, and 16.1, respectively, those of the random method, the base method, and the naive Bayes model are 20.8, 19.0, and 17.2, respectively. BayesianSSA outperformed the random and base methods and the naive Bayes model from the perspective of prediction accuracy. While BayesianSSA made a prediction on the basis of the perturbation dataset and SSA, the prediction of the base method is only based on SSA. This result suggests that BayesianSSA can integrate environmental information of the real dataset into SSA predictions. Similarly, the difference between BayesianSSA and the naive Bayes is whether or not SSA is incorporated. This result also indicates the practicability of BayesianSSA and that incorporating SSA into statistical models improves predictive performance.

Fig. 3
figure 3

Cross-entropy loss trajectory for each method on the real dataset. The \(x\)- and \(y\)-axes indicate the number of trials and cross-entropy loss, respectively

The main difference between BayesianSSA and the naive Bayes model is in the predictions for out-of-sample perturbations, which are of new (mj) experiments. To calculate the predictive distribution of a (mj) experiment, the naive Bayes model uses only the results of (mj) experiments (Eq. (8)) while BayesianSSA uses the results of all the \((m, j) \in \Lambda\) experiments (Eq. (6)). That is, BayesianSSA can leverage data to predict the responses to out-of-sample perturbations through the \(\textbf{r}\) posterior distribution. To validate the predictive performance for out-of-sample perturbations, we examined the predictive probabilities by splitting the real dataset. For example, the predictive probabilities of the three replicates for the reaction CS were calculated by fitting BayesianSSA to the real dataset except for CS. Figure 4 shows the distribution of predictive probabilities calculated by BayesianSSA for out-of-sample perturbations. Here, each replicate was evaluated separately. The median of the distribution was 0.67, which is better than random (0.5), and this result indicates that fitting BayesianSSA contributes to the predictive performance for out-of-sample perturbations.

Fig. 4
figure 4

Distribution of predictive probabilities calculated by BayesianSSA for out-of-sample perturbations. Each value was calculated for the real dataset for new (mj) perturbation. The white dot indicates the median of the distribution

Positivity confidence values on real data

To interpret the BayesianSSA estimation results, we examined the positivity confidence values after BayesianSSA was fitted to the real data whose observation target is the succinate export flux. Figure 5 shows the positivity confidence values in the metabolic network. All positivity confidence values updated by the real perturbation dataset are shown in Supplementary Table S8. We found high positivity confidence values of THD_r and NDH (\(c_p(m_{\text{SUCCt}}, j_{\mathrm {THD\_r}})=0.95\), \(c_p(m_{\text{SUCCt}}, j_{\text{NDH}})=0.92\)), which both use NADH. NDH produces Q8H2, which is required for FRD. In the tricarboxylic acid (TCA) cycle, ICL, MDH, FRD, FUM_r, and CS show high positivity confidence values (\(>0.85\)). We also applied BayesianSSA with another prior distribution of \(\textbf{r}\), which is based on log-normal distributions with random parameters, to the real dataset (Supplementary Figure S1). All positivity confidence values using this prior distribution are shown in Supplementary Table S9. The reactions with \(c_{p}(m,j)>0.85\) are almost shared in the two BayesianSSA results. The only difference is the presence or absence of NDH (cf. Supplementary Table S8 and Supplementary Table S9). These results suggest that BayesianSSA is robust to changes in the prior distributions of \(\textbf{r}\). We have also examined other observation targets besides succinate export flux and found that they can also be updated to high positivity confidence values (Supplementary Figure S2 and S3).

Positivity confidence values calculated by the BayesianSSA posterior distribution were consistent with previous reports. The positivity confidence values of CS and ICL, which are included in the glyoxylate pathway, were high. The glyoxylate pathway was reported as an essential pathway for succinate production [46, 47]. Similarly, the reductive pathway, which includes FUM_r and FRD, was reported as another essential pathway [46, 47]. Despite these consistencies, other previous reports for several reactions, such as PPC [44] and PTS [42, 43] are inconsistent with our results.

Fig. 5
figure 5

Positivity confidence values by BayesianSSA to increase the succinate export flux on metabolic networks. The positivity confidence values were calculated by BayesianSSA fitted to real data. Each square indicates a metabolite in the network. Each arrow indicates a reaction, and its color shows the positivity confidence value (red and blue) or zero response (grey). The reactions corresponding to the edges in this figure are shown in Supplementary Table S5

Posterior distribution of \(\textbf{r}\)

To examine the differences between the prior and posterior distribution of \(\textbf{r}\), we compared \(p(\textbf{r}| \textbf{y})\) with \(p(\textbf{r})\). Figure 6 shows \(p(\textbf{r})\) and \(p(\textbf{r}| \textbf{y})\) on the first and second principal components of \({\left\{ {\textbf{r}^{(v)}}\right\} }_{v=1}^{V}\). We found that the posterior distribution had several peaks where the prior distribution only had one peak. This result indicates that the distribution of \(\textbf{r}\) is tailored to several cases of environments in which the used dataset was collected.

Fig. 6
figure 6

Contour plots of the prior (a) and posterior (b) distributions of \(\log {\textbf{r}}\). The x- and y-axes indicate the first and second principal components, respectively

Discussion

We demonstrated the effectiveness of BayesianSSA on the basis of predictive performance using the real perturbation dataset where the observation target is the succinate export flux (SUCCt). BayesianSSA outperformed the base method (Fig. 3), which is not fitted to the dataset, and this result showed BayesianSSA could integrate environmental information into SSA predictions. For validation of practicability, BayesianSSA also outperformed the random method and the naive Bayes model, and the performance of BayesianSSA for out-of-sample perturbations, which are of new (mj) experiments, was better than random. These results show that BayesianSSA can consider the relationships between different (mj) perturbations through \(\textbf{r}\), and that this consideration contributes to predictive performance.

We considered \(\rho _{m,j}\) as a parameter and set the prior distribution in this study. As another option, \(\rho _{m,j}\) can be given by an error rate of the measurement equipment used for the perturbation experiment. For example, consider a case where the error distribution of the used measurement equipment is a normal distribution with a mean parameter that equals the true value and a variance parameter \(\sigma ^2 = 1\) as an error distribution. If the experimental value obtained by the perturbation is 1, the probability that the true value is less than zero is approximately 16%. Therefore, setting \(1-\rho _{m,j}=0.16\) can make BayesianSSA consider the error distribution of the measurement equipment. In this way, we can set a certain value of \(\rho _{m,j}\). Note that we can easily calculate the posterior distribution of this model (Supplementary Section S4).

There are two directions for future work related to \(\textbf{r}\). First, the updated \(p(\textbf{r})\) may be used for response predictions in another metabolic network. When the reaction rate function \(F_j\) and the probability distribution of \(\textbf{x}\) are equal between the two metabolic networks, \(p(r_{j,m})\) can be used in another system that includes the i-th reaction and the m-th metabolite. Second, as previously discussed [10], we can consider allosteric regulation. Allosteric regulation is a type of regulation that increases/decreases reaction rates as a metabolite concentration increases [48, 49]. We can easily consider allosteric regulation by setting \(r_{j,m}\ne 0\). Technically, \(r_{j,m}< 0\) can be implemented by reversing the sign of \(r_{j,m}\) after sampling \(\textbf{r}^{(v)}\). However, we need to know the (jm) pairs that have allosteric regulation in advance, and we omitted considering allosteric regulations in this study.

There is room for choice regarding the prior distribution of \(\textbf{r}\). First, continuous distributions can be adopted. We used the empirical distribution with samples from log-normal distributions as the \(\textbf{r}\) prior distribution for all experiments (cf. Eq. (4)). As long as the constraint on \(\textbf{r}\) is satisfied, other distributions can be adopted. However, the likelihood function changes discretely (cf. Eq. (3)), and the advantage of adopting a continuous distribution with employing MCMC methods may be limited. Second, ensemble approaches for several types of prior distributions may be effective. As shown in Fig. 5 and Supplementary Figure S1, the effect of the prior distribution is not negligible when dealing with a limited sample size. Using several types of prior distributions may contribute to making robust predictions.

Although we omitted the biomass production processes in this study, considering them can improve the representation of real biological activity. One commonly used approach in FBA involves optimizing the biomass objective function [18, 19]. However, since the biomass objective function depends on the specific strains and environmental conditions [50], it is difficult to use the biomass objective function when analyzing unfamiliar strains. Another difficulty is that the biomass objective function is a pseudo-reaction that contains multiple reactions and cannot be treated kinetically. Unlike FBA-based methods, which can consider biomass production processes as a single reaction, kinetics-based methods including BayesianSSA need to faithfully model the biomass production process. That is, it is necessary to define the biomass production rate equations as in a previous study [21].

Utilizing the positivity confidence value calculation (the “Positivity confidence value” section) and Bayesian updating (the “Bayesian updating” section) in BayesianSSA, we can construct an iterative design-build-test-learn (DBTL) cycle [51] on the basis of BayesianSSA for proposals of reactions to be perturbed. Specifically, the procedure is as follows:

  1. 1.

    Calculate positivity confidence values by Eq. (7).

  2. 2.

    Obtain a proposal of which perturbation and observation targets are validated in accordance with positivity confidence values.

  3. 3.

    Conduct perturbation experiments in accordance with the proposal.

  4. 4.

    Update the posterior distributions by Eq. (5).

  5. 5.

    Return to the first step.

One advantage of this scheme is the high efficiency because the experimental validation of proposals obtained by BayesianSSA is also a process collecting data for updating the BayesianSSA posterior distribution.

Conclusions

In this study, we proposed BayesianSSA, a Bayesian statistical model based on SSA. SSA was previously developed as a method to predict qualitative responses to enzyme perturbations on the basis of the structural information of the reaction network. However, the network structural information can sometimes be insufficient to predict qualitative responses unambiguously, which is a practical issue in bioproduction applications. To address this, BayesianSSA extracts environmental information from perturbation datasets collected in environments of interest and integrates it into SSA predictions. We applied BayesianSSA to synthetic and real datasets of the central metabolic pathway of E. coli. As a result, BayesianSSA outperformed the base method, which is the same as the BayesianSSA model but utilizes an initial prior distribution without incorporating perturbation datasets. This result shows that BayesianSSA can successfully integrate environmental information extracted from perturbation data into SSA predictions. In addition, the positivity confidence values estimated by BayesianSSA for increasing the succinate export flux were consistent with the known pathways reported to enhance the flux in previous studies. We believe that BayesianSSA will accelerate the chemical bioproduction process and contribute to advancements in the field.

Availability of data and materials

Supplementary material is available from the journal website. The implementation of the algorithm is available on GitHub (https://github.com/shion-hosoda-hitachi/BayesianSSA).

References

  1. Ögmundarson Ó, Sukumara S, Herrgård MJ, Fantke P. Combining environmental and economic performance for bioprocess optimization. Trends Biotechnol. 2020;38(11):1203–14. https://doi.org/10.1016/j.tibtech.2020.04.011.

    Article  CAS  PubMed  Google Scholar 

  2. Tsopanoglou A, Jiménez del Val I. Moving towards an era of hybrid modelling: advantages and challenges of coupling mechanistic and data-driven models for upstream pharmaceutical bioprocesses. Curr Opin Chem Eng. 2021;32: 100691. https://doi.org/10.1016/j.coche.2021.100691.

    Article  Google Scholar 

  3. Rathore AS, Mishra S, Nikita S, Priyanka P. Bioprocess control: current progress and future perspectives. Life. 2021;11(6):557. https://doi.org/10.3390/life11060557.

    Article  PubMed  PubMed Central  Google Scholar 

  4. Behera B, Unpaprom Y, Ramaraj R, Maniam GP, Govindan N, Paramasivan B. Integrated biomolecular and bioprocess engineering strategies for enhancing the lipid yield from microalgae. Renew Sustain Energy Rev. 2021;148: 111270. https://doi.org/10.1016/j.rser.2021.111270.

    Article  CAS  Google Scholar 

  5. Voigt CA. Synthetic biology 2020–2030: six commercially-available products that are changing our world. Nat Commun. 2020;11(1):6379. https://doi.org/10.1038/s41467-020-20122-2.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Keasling J, Garcia Martin H, Lee TS, Mukhopadhyay A, Singer SW, Sundstrom E. Microbial production of advanced biofuels. Nat Rev Microbiol. 2021;19(11):701–15. https://doi.org/10.1038/s41579-021-00577-w.

    Article  CAS  PubMed  Google Scholar 

  7. Zhuang X, Zhang Y, Xiao A-F, Zhang A, Fang B. Applications of synthetic biotechnology on carbon neutrality research: a review on electrically driven microbial and enzyme engineering. Front Bioeng Biotechnol. 2022;10: 826008.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Wang L, Dash S, Ng CY, Maranas CD. A review of computational tools for design and reconstruction of metabolic pathways. Synth Syst Biotechnol. 2017;2(4):243–52. https://doi.org/10.1016/j.synbio.2017.11.002.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Sveshnikova A, MohammadiPeyhani H, Hatzimanikatis V. Computational tools and resources for designing new pathways to small molecules. Curr Opin Biotechnol. 2022;76: 102722. https://doi.org/10.1016/j.copbio.2022.102722.

    Article  CAS  PubMed  Google Scholar 

  10. Mochizuki A, Fiedler B. Sensitivity of chemical reaction networks: a structural approach. 1. Examples and the carbon metabolic network. J Theor Biol. 2015;367:189–202. https://doi.org/10.1016/j.jtbi.2014.10.025.

    Article  CAS  PubMed  Google Scholar 

  11. Okada T, Mochizuki A. Sensitivity and network topology in chemical reaction systems. Phys Rev E. 2017;96(2): 022322. https://doi.org/10.1103/PhysRevE.96.022322.

    Article  PubMed  Google Scholar 

  12. Fisher AK, Freedman BG, Bevan DR, Senger RS. A review of metabolic and enzymatic engineering strategies for designing and optimizing performance of microbial cell factories. Comput Struct Biotechnol J. 2014;11(18):91–9. https://doi.org/10.1016/j.csbj.2014.08.010.

    Article  PubMed  PubMed Central  Google Scholar 

  13. Guo W, Sheng J, Feng X. Mini-review: in vitro metabolic engineering for biomanufacturing of high-value products. Comput Struct Biotechnol J. 2017;15:161–7. https://doi.org/10.1016/j.csbj.2017.01.006.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Pharkya P, Maranas CD. An optimization framework for identifying reaction activation/inhibition or elimination candidates for overproduction in microbial systems. Metab Eng. 2006;8(1):1–13. https://doi.org/10.1016/j.ymben.2005.08.003.

    Article  CAS  PubMed  Google Scholar 

  15. Ranganathan S, Suthers PF, Maranas CD. OptForce: an optimization procedure for identifying all genetic manipulations leading to targeted overproductions. PLoS Comput Biol. 2010;6(4):1000744. https://doi.org/10.1371/journal.pcbi.1000744.

    Article  CAS  Google Scholar 

  16. McAnulty MJ, Yen JY, Freedman BG, Senger RS. Genome-scale modeling using flux ratio constraints to enable metabolic engineering of clostridial metabolism in silico. BMC Syst Biol. 2012;6(1):42. https://doi.org/10.1186/1752-0509-6-42.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Jiang S, Otero-Muras I, Banga JR, Wang Y, Kaiser M, Krasnogor N. OptDesign: identifying optimum design strategies in strain engineering for biochemical production. ACS Synth Biol. 2022;11(4):1531–41. https://doi.org/10.1021/acssynbio.1c00610.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Orth JD, Thiele I, Palsson BØ. What is flux balance analysis? Nat Biotechnol. 2010;28(3):245–8. https://doi.org/10.1038/nbt.1614.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Feist AM, Palsson BO. The biomass objective function. Curr Opin Microbiol. 2010;13(3):344–9. https://doi.org/10.1016/j.mib.2010.03.003.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Usuda Y, Nishio Y, Iwatani S, Van Dien SJ, Imaizumi A, Shimbo K, Kageyama N, Iwahata D, Miyano H, Matsui K. Dynamic modeling of Escherichia coli metabolic and regulatory systems for amino-acid production. J Biotechnol. 2010;147(1):17–30. https://doi.org/10.1016/j.jbiotec.2010.02.018.

    Article  CAS  PubMed  Google Scholar 

  21. Kurata H, Sugimoto Y. Improved kinetic model of Escherichia coli central carbon metabolism in batch and continuous cultures. J Biosci Bioeng. 2018;125(2):251–7. https://doi.org/10.1016/j.jbiosc.2017.09.005.

    Article  CAS  PubMed  Google Scholar 

  22. Yasemi M, Jolicoeur M. Modelling cell metabolism: a review on constraint-based steady-state and kinetic approaches. Processes. 2021;9(2):322. https://doi.org/10.3390/pr9020322.

    Article  CAS  Google Scholar 

  23. Mochizuki A. A structural approach to understanding enzymatic regulation of chemical reaction networks. Biochem J. 2022;479(11):1265–83. https://doi.org/10.1042/BCJ20210545.

    Article  CAS  PubMed  Google Scholar 

  24. Di Maggio J, Diaz Ricci JC, Soledad Diaz M. Global sensitivity analysis in dynamic metabolic networks. In: Jeżowski J, Thullie J, editors. Computer aided chemical engineering. 19 European symposium on computer aided process engineering, vol. 26. Hoboken: Elsevier; 2009. p. 1075–80. https://doi.org/10.1016/S1570-7946(09)70179-8.

    Chapter  Google Scholar 

  25. Kacser H. The control of flux. Symp Soc Exp Biol. 1973;28:65–101.

    Google Scholar 

  26. Wildermuth MC. Metabolic control analysis: biological applications and insights. Genome Biol. 2000;1(6):1031–1. https://doi.org/10.1186/gb-2000-1-6-reviews1031.

    Article  Google Scholar 

  27. Rizk ML, Liao JC. Ensemble modeling for aromatic production in Escherichia coli. PLoS ONE. 2009;4(9):6903. https://doi.org/10.1371/journal.pone.0006903.

    Article  CAS  Google Scholar 

  28. Nishiguchi H, Hiasa N, Uebayashi K, Liao J, Shimizu H, Matsuda F. Transomics data-driven, ensemble kinetic modeling for system-level understanding and engineering of the cyanobacteria central metabolism. Metab Eng. 2019;52:273–83. https://doi.org/10.1016/j.ymben.2019.01.004.

    Article  CAS  PubMed  Google Scholar 

  29. Strutz J, Martin J, Greene J, Broadbelt L, Tyo K. Metabolic kinetic modeling provides insight into complex biological questions, but hurdles remain. Curr Opin Biotechnol. 2019;59:24–30. https://doi.org/10.1016/j.copbio.2019.02.005.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Saa PA, Nielsen LK. Construction of feasible and accurate kinetic models of metabolism: a Bayesian approach. Sci Rep. 2016;6(1):29635. https://doi.org/10.1038/srep29635.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. John PCS, Strutz J, Broadbelt LJ, Tyo KEJ, Bomble YJ. Bayesian inference of metabolic kinetics from genome-scale multiomics data. PLoS Comput Biol. 2019;15(11):1007424. https://doi.org/10.1371/journal.pcbi.1007424.

    Article  CAS  Google Scholar 

  32. Gopalakrishnan S, Dash S, Maranas C. K-FIT: an accelerated kinetic parameterization algorithm using steady-state fluxomic data. Metab Eng. 2020;61:197–205. https://doi.org/10.1016/j.ymben.2020.03.001.

    Article  CAS  PubMed  Google Scholar 

  33. Zhang X, Su Y, Lane AN, Stromberg AJ, Fan TWM, Wang C. Bayesian kinetic modeling for tracer-based metabolomic data. BMC Bioinform. 2023;24(1):108. https://doi.org/10.1186/s12859-023-05211-5.

    Article  CAS  Google Scholar 

  34. Choudhury S, Moret M, Salvy P, Weilandt D, Hatzimanikatis V, Miskovic L. Reconstructing kinetic models for dynamical studies of metabolism using generative adversarial networks. Nat Mach Intell. 2022;4(8):710–9. https://doi.org/10.1038/s42256-022-00519-y.

    Article  PubMed  PubMed Central  Google Scholar 

  35. Okada T, Mochizuki A. Law of localization in chemical reaction networks. Phys Rev Lett. 2016;117(4): 048101. https://doi.org/10.1103/PhysRevLett.117.048101.

    Article  CAS  PubMed  Google Scholar 

  36. Hishida A, Okada T, Mochizuki A. Patterns of change in regulatory modules of chemical reaction systems induced by network modification. PNAS Nexus. 2023. https://doi.org/10.1093/pnasnexus/pgad441.

    Article  PubMed  PubMed Central  Google Scholar 

  37. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian data analysis. Boca Raton: CRC Press; 2013. p. 9.

    Book  Google Scholar 

  38. Toya Y, Shiraki T, Shimizu H. SSDesign: computational metabolic pathway design based on flux variability using elementary flux modes. Biotechnol Bioeng. 2015;112(4):759–68. https://doi.org/10.1002/bit.25498.

    Article  CAS  PubMed  Google Scholar 

  39. Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralta-Gil M, Karp PD. EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res. 2005;33(Database Issue):334–7. https://doi.org/10.1093/nar/gki108.

    Article  Google Scholar 

  40. Trinh CT, Unrean P, Srienc F. Minimal Escherichia coli cell for the most efficient production of ethanol from hexoses and pentoses. Appl Environ Microbiol. 2008;74(12):3634–43. https://doi.org/10.1128/AEM.02708-07.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Mienda BS, Shamsir MS, Illias RM. Model-guided metabolic gene knockout of gnd for enhanced succinate production in Escherichia coli from glucose and glycerol substrates. Comput Biol Chem. 2016;61:130–7. https://doi.org/10.1016/j.compbiolchem.2016.01.013.

    Article  CAS  PubMed  Google Scholar 

  42. Chatterjee R, Millard CS, Champion K, Clark DP, Donnelly MI. Mutation of the ptsG gene results in increased production of succinate in fermentation of glucose by Escherichia coli. Appl Environ Microbiol. 2001;67(1):148–54. https://doi.org/10.1128/AEM.67.1.148-154.2001.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Liang Q, Zhang F, Li Y, Zhang X, Li J, Yang P, Qi Q. Comparison of individual component deletions in a glucose-specific phosphotransferase system revealed their different applications. Sci Rep. 2015;5(1):13200. https://doi.org/10.1038/srep13200.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Millard CS, Chao YP, Liao JC, Donnelly MI. Enhanced production of succinic acid by overexpression of phosphoenolpyruvate carboxylase in Escherichia coli. Appl Environ Microbiol. 1996;62(5):1808–10. https://doi.org/10.1128/aem.62.5.1808-1810.1996.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. ...Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, Comeau DC, Connor R, Funk K, Kelly C, Kim S, Madej T, Marchler-Bauer A, Lanczycki C, Lathrop S, Lu Z, Thibaud-Nissen F, Murphy T, Phan L, Skripchenko Y, Tse T, Wang J, Williams R, Trawick BW, Pruitt KD, Sherry ST. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2021;50(D1):20–6. https://doi.org/10.1093/nar/gkab1112.

    Article  CAS  Google Scholar 

  46. Vemuri GN, Eiteman MA, Altman E. Effects of growth mode and pyruvate carboxylase on succinic acid production by metabolically engineered strains of Escherichia coli. Appl Environ Microbiol. 2002;68(4):1715. https://doi.org/10.1128/AEM.68.4.1715-1727.2002.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. van Heerden CD, Nicol W. Continuous and batch cultures of Escherichia coli KJ134 for succinic acid fermentation: metabolic flux distributions and production characteristics. Microb Cell Fact. 2013;12(1):80. https://doi.org/10.1186/1475-2859-12-80.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Laskowski RA, Gerick F, Thornton JM. The structural basis of allosteric regulation in proteins. FEBS Lett. 2009;583(11):1692–8. https://doi.org/10.1016/j.febslet.2009.03.019.

    Article  CAS  PubMed  Google Scholar 

  49. Changeux J-P. 50th anniversary of the word “allosteric’’. Prot Sci. 2011;20(7):1119–24. https://doi.org/10.1002/pro.658.

    Article  CAS  Google Scholar 

  50. Simensen V, Schulz C, Karlsen E, Bråtelund S, Burgos I, Thorfinnsdottir LB, García-Calvo L, Bruheim P, Almaas E. Experimental determination of Escherichia coli biomass composition for constraint-based metabolic modeling. PLoS ONE. 2022;17(1):0262450. https://doi.org/10.1371/journal.pone.0262450.

    Article  CAS  Google Scholar 

  51. Liao X, Ma H, Tang YJ. Artificial intelligence: a solution to involution of design-build-test-learn cycle. Curr Opin Biotechnol. 2022;75: 102712. https://doi.org/10.1016/j.copbio.2022.102712.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

We would like to thank Y. Mizunuma for her technical assistance with the wet lab experiments. We also thank Dr. K. Yokoyama for technical guidance. We appreciate the advice and technical support with the wet lab experiments from Dr. T. Takeya. We are grateful to Dr. A. Kandori, Dr. K. Watanabe, and Dr. S. Yabuuchi for their guidance throughout our project. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

SH, TO, AM, and MS conceptualized this study. SH devised the model, designed the algorithms, implemented the software, and performed all the computational experiments. HI performed all the wet lab experiments. TM, HI, and MT established the wet lab experimental procedures. SH, MS, TO, and AM interpreted the computational results. SH and MS investigated previous researches for biological insights. SH wrote the draft. TO, AM, MS, MT, HI, and TM revised the manuscript critically. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Shion Hosoda.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hosoda, S., Iwata, H., Miura, T. et al. BayesianSSA: a Bayesian statistical model based on structural sensitivity analysis for predicting responses to enzyme perturbations in metabolic networks. BMC Bioinformatics 25, 297 (2024). https://doi.org/10.1186/s12859-024-05921-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-024-05921-4

Keywords