- Research article
- Open Access
- Published:

# Predicting state transitions in the transcriptome and metabolome using a linear dynamical system model

*BMC Bioinformatics***volume 8**, Article number: 343 (2007)

## Abstract

### Background

Modelling of time series data should not be an approximation of input data profiles, but rather be able to detect and evaluate dynamical changes in the time series data. Objective criteria that can be used to evaluate dynamical changes in data are therefore important to filter experimental noise and to enable extraction of unexpected, biologically important information.

### Results

Here we demonstrate the effectiveness of a Markov model, named the Linear Dynamical System, to simulate the dynamics of a transcript or metabolite time series, and propose a probabilistic index that enables detection of time-sensitive changes. This method was applied to time series datasets from *Bacillus subtilis* and *Arabidopsis thaliana* grown under stress conditions; in the former, only gene expression was studied, whereas in the latter, both gene expression and metabolite accumulation. Our method not only identified well-known changes in gene expression and metabolite accumulation, but also detected novel changes that are likely to be responsible for each stress response condition.

### Conclusion

This general approach can be applied to any time-series data profile from which one wishes to identify elements responsible for state transitions, such as rapid environmental adaptation by an organism.

## Background

Biochemical systems in living cells are robust and flexible. Investigating the responses of cells (and organisms) to environmental changes typically requires a system-level analysis of the interactions between the various molecular elements (genes, enzymes, and metabolites) that comprise the system. A key step to analyze system responses to environmental changes is identifying large state changes or "transitions". A statistical method that could detect such transitions would be a powerful analytical tool for finding important factors in large-scale profiles, such as variations in gene expression.

Previous analyses of gene expression profiles have often made use of graphical models, such as Bayesian Networks [1, 2], Graphical Gaussian Modelling [3], Boolean Networks [4, 5], and Auto-Regressive models [6]. However, not many approaches have explicitly modelled observational noise. In Auto-Regressive analyses, for example, observational vectors **y**_{
t
}are recursively defined by the following Equation [6]:

**y**_{
t
}= *A* **y**_{t-1}+ *ε*_{
t
} (1)

where *y*_{
t
}is an observational vector of genes or metabolites at time *t*, *A* is an observational transition matrix, and *ε*_{
t
}is Gaussian noise. Because this model does not distinguish observational and inherent (e.g. biological) noises, identification of transition states becomes difficult in the presence of substantial noise.

Here we propose an extension of the Auto-Regressive model [6], which has been modified by the addition of reduced set of internal states, as explained in the Results and Discussion section. We chose a mathematical model, the Linear Dynamical System (LDS), as the basis of our method because it does not impose any specific requirements on the data used. LDS is expected to eliminate the confounding influence of observational noise in time series data. The model was applied to detect cellular state transitions in transcriptome and metabolome time series datasets from *Bacillus subtilis* and *Arabidopsis thaliana* maintained under stress conditions.

## Results and discussion

### Overview of our method

Our method has two steps. First, transition time points for each time series are detected using LDS, which mathematically distinguishes transitional fluctuations from experimental noise. The transition point is detected by the logarithm of the likelihood values. Here "likelihood value" means the generative probability of current data based on the condition of the past datasets. If this value is low, then the current data cannot be adequately explained by past datasets. In other words, a transition has occurred. In the second step, relevant factors such as genes and/or metabolites related to the transitions are extracted by Batch-Learning Self Organizing Mapping (BL-SOM) using changes in expression levels [7]. In summary, the LDS uses compressed information called "internal states", defined as the degenerate parameters of gene expression/metabolite accumulation profiles, to detect transitions, and then BL-SOM generates a 'Feature map', which is a two-dimensional lattice reflecting the similarity among clusters, based on the gene expression/metabolite accumulation profiles in order to visually characterize each state.

### Linear Dynamical System (LDS) for time series analyses

LDS uses internal state variables in the generative model for cellular internal state changes. These internal states correspond to the compressed description of the observed biological system prior to adding noise factors.

The total experimental dataset of the time series and the corresponding internal state are denoted by *Y*_{1:T}= {**y**_{1}, **y**_{2},..., **y**_{
T
}} and *X*_{1:T}= {**x**_{1}, **x**_{2},..., **x**_{
T
}}, respectively. Each element in these vectors is defined as:

**y**_{
t
}= (**y**_{t 1}, **y**_{t 2}, ⋯, **y**_{
tD
})' ∈ *R*^{D} (2)

**x**_{
t
}= (**x**_{t 1}, **x**_{t 2}, ⋯, **x**_{
tN
})' ∈ *R*^{N} (3)

where *t* = 1, 2,..., T is the measurement order of the time series, *D* is the dimension of vector **y**_{
t
}representing expression levels of *D* genes or metabolites, and *N* is the dimension of vector **x**_{
t
}representing internal states. To distinguish observational noise from true information on cellular transitions, we focus on two probability densities: the density between internal state variables *p*(**x**_{
t
}|**x**_{t-1}), and the density for evaluation of observational noise *p*(**y**_{
t
}|**x**_{
t
}). The proposed model is further defined as follows:

Observational equation: **y**_{
t
}= *V* **x**_{
t
}+ *η*_{
t
} (4)

Transition equation: **x**_{
t
}= *W* **x**_{t-1}+ *ε*_{
t
} (5)

where *V* is a *D* × *N* observational matrix, *W* is an *N* × *N* internal state transition matrix, *D*-dimensional vector *η*_{
t
}is observational noise, *N*-dimensional vector *ε*_{
t
}is transition noise. The vectors **x**_{1}, *ε*_{
t
}, *η*_{
t
}are generated according to the following equations:

**x**_{1} ~ *N*_{
N
}(**x**_{1}|*μ*_{1}, *σ*_{1}^{2}*I*_{
N
}) (6)

*ε*_{
t
}~ *N*_{
N
}(*ε*_{
t
}|0_{
N
}, *σ*_{
ε
}^{2}*I*_{
N
}) (7)

*η*_{
t
}~ *N*_{
D
}(*η*_{
t
}|0_{
D
}, *σ*_{
η
}^{2}*I*_{
η
}) (8)

The next step is to define the relevant probability densities. *N*_{
p
}(**x**|**m**, Σ) is a probability density function when a *p*-dimensional probabilistic vector **x** obeys a Gaussian distribution whose mean vector is **m**, and covariance matrix Σ (Equation 9).

We assume that the observational and internal transition noises are both Gaussian, and therefore the relationship is a first-order Markov process (Equation 10).

*p*(**x**_{
t
}, **y**_{
t
}| *X*_{1:t-1}, *Y*_{1:t-1}) = *p*(**y**_{
t
}|**x**_{
t
})*p*(**x**_{
t
}|*x*_{t-1}) (10)

The model parameter of (4)–(8) is defined as the parameter set *θ*:

*θ* = {*μ*_{1}, *σ*_{1}, **W**, *σ*_{
ε
}, **V**, *σ*_{
η
}} (11)

Note that the model corresponds to a Kalman Filter when *θ* is known (see also Methods section) [8].

The initial state **x**_{1} is defined as:

*p*(**x**_{1}|*θ*) = *N*_{
N
}(**x**_{1}|*μ*_{1}, *σ*_{1}^{2}*I*_{
N
}) (12)

From Equations (5) and (7), the following function is obtained:

*p*(**x**_{
t
}|**x**_{t-1}, *θ*) = *N*_{
N
}(**x**_{
t
}|**Wx**_{t-1}, *σ*_{
ε
}^{2}*I*_{
N
}) (13)

From Equations (4) and (8), the following function is obtained:

*p*(**y**_{
t
}|**x**_{
t
}, *θ*) = *N*_{
D
}(**y**_{
t
}|**Vx**_{
t
}, *σ*_{
η
}^{2}*I*_{
D
}) (14)

Using these results, the following joint probability is obtained:

The parameter optimization follows a standard EM algorithm (see Methods section).

### Criterion for detecting internal state transitions

Using the resulting estimated parameter, the log-likelihood with respect to the present time point *t* when all time points are given is defined by Equation (16):

log *L*_{
t
}= log *p*(**y**_{
t
}|*Y*_{1:t-1}, *θ*) (16)

This is calculated using the E-step formula (see Equation 23 in Methods) after parameter estimation using the Kalman filter.

When the log-likelihood value log *L*_{
t
}becomes much lower than log *L*_{t-1}, then *y*_{
t
}cannot be explained by *Y*_{1:t-1}, i.e., the cellular internal state has changed at time *t*. In this study, the point at which the log-likelihood value becomes relatively low between whole time points is defined as the state transition point. If the log-likelihood value remains low over a certain period, then the cells are changing their states continuously during that period.

### Analysis of the *Bacillus subtilis* data

We first analyzed the relationship of cell population to state transition time on transcriptome data of *Bacillus subtilis* (Figure 1). Here, the exponential growth phase and stationary phase are commonly used microbiology terms referring to the state of the cellular population, as measured by the optical density (see also Methods section). The transition from exponential growth to the stationary phase was observed in 8 culture media: Lysogeny Broth (LB), Minimum Glucose Medium (MGM), Glucose Starvation (GS), Phosphate Starvation (PS), Competence Medium (CM), Difco Sporulation Medium (DSM), Competence Sporulation Medium (CSM) and DSM plus Glucose Glutamine (DGG). We confirmed that the log-likelihood index produced by LDS was smaller at the transition time between two phases. Next, we fitted the index calculated by the model to the phase transition data. For cell populations growing under two culture conditions, namely LB (control) and MGM (limited glucose), we found that BL-SOM yielded different classification results for gene expression (Figure 2a, b). This result indicates that expression of the genes responsible for the transition varied between the different environmental conditions, although their transitions appeared similar. For cells grown in either CSM or DSM, two transition points for sporulation were detected. The first was the well-known transition from exponential growth to the stationary phase. However, the second was a novel transition detected by this approach. At the first transition in CSM at around time point 3, log-likelihood values show a sustained drop. The analysis suggests that cells take a long time to adapt to the CSM culture environment. The second transition point in the sporulation media was further investigated by analysis of Feature maps generated by BL-SOM [7]. The candidate genes for the second transition were those activated just before the transition point and repressed soon after the transition point (Table 1). These genes are listed in Table 1 and include those related to lysis of the mother cell, such as *cwlH*. Thus, the second transition corresponded to mother cell lysis [9], a type of apoptosis.

Using the analytical approach described here, we not only succeeded in detecting the well-known transition from exponential growth to the stationary phase, but also identified another, novel transition point. This result suggests the possibility that, even in periods that are assumed to be eventless, cells may be invoking their adaptive systems.

### Analysis of the *Arabidopsis* data

As described in the Methods section, we analysed changes in gene expression and metabolite accumulation in *Arabidopsis* plants following their transfer to sulfur-deficient conditions. We detected a transition between 12 and 24 hours in both gene expression and metabolite accumulation profiles in both leaves and roots. In addition, we detected a second transition at the final time point (168 hr) in the metabolite accumulation profile in roots (see Figure 3). At the transition point of 12–24 hr, glucosinolate biosynthesis was decreased in leaves and anthocyanin biosynthesis was initiated in roots. The predicted transitions obtained by this analysis are consistent with those identified previously [10], indicating that our method can reliably identify candidate genes and metabolites involved in transition points. The transition time point detected for root metabolites at the end of the experiment (168 hr) showed that even after this period of time roots of *A. thaliana* continued to change in response to sulfur deprivation, at least in terms of metabolite accumulation.

On the basis of the estimated transition results, coupled with prior knowledge and the Feature map subtraction obtained by BL-SOM, we identified metabolites whose accumulation profiles showed changes that coordinated with the predicted transition point. These metabolites were found to be involved in biochemical pathways that are critical for the response to sulfur deprivation stress, for example, glucosinolate biosynthesis in leaf and anthocyanin biosynthesis in root [10].

Our results also suggested the presence of lipid metabolic responses in *Arabidopsis* to sulfur stress. The accumulation patterns of detected ion peaks whose mass-to-charge ratio (*m/z*) values corresponded to molecular species with various acyl groups, such as phosphatidylglycerol, phosphatidylethanolamine, phosphatidylcholine, phosphatidic acid, and sulfoquinovosyl diacylglycerol, are shown in Figure 4. Because the accumulation profiles of these compounds showed similar patterns, we predict that lipid biosynthesis was also co-ordinately repressed at the transition at 24 hr.

A profile of sulfate accumulation was generated using capillary electrophoresis (Figure 4f). In comparison with the control condition, the accumulation of sulfate was strongly repressed immediately after the shift to sulfur deficiency. Under sulfur deprivation, it was believed that sulfate levels (Figure 4f) would only decrease. During the transition period from 12 to 24 hr after the shift to sulfur deficiency, however, sulfate levels temporarily ceased declining and stayed relatively constant as compared to the control.

From these results, we hypothesize that sulfate is in an active form and is distributed throughout the plant at the transition time. During this period, in order to maintain the intracellular environment, membrane lipids are temporarily degraded and re-synthesized after the transition. This suggestion is consistent with the reported decrease in lipids under conditions of sulfur starvation [11].

## Conclusion

In summary, by using a linear dynamical system, we have identified transition times in the adaptation processes of *Bacillus subtilis* and *Arabidopsis thaliana* to environmental stresses. By focusing on transition information based on a well-defined probabilistic index, we obtained novel observations on apoptosis in *Bacillus subtilis* and the regulation of lipid metabolism connected with sulfur-stress responses in *Arabidopsis thaliana*. As this approach uses probabilistic values to detect the transitions, the results are objectively supported without the risk of misinterpretation due to experimental noise. The results of this approach will enable us to more effectively design experiments specifically tailored for functional identification of genes and metabolites. By obtaining time series data with higher temporal resolution around the transition time points, we can obtain more precise information on the details of the responses. The strategy described here was successful in identifying a small number of candidate genes and metabolites, from the vast number of genes and metabolites in comprehensive "omics" databases.

## Methods

### Time series data of *Bacillus subtilis*

The *Bacillus subtilis* time series data used in the present study were obtained from microarray analysis of cells sampled from 8 different experimental conditions. The data were produced using a two-colour fluorescence cDNA microarray that included 3100 *Bacillus subtilis* genes. The LB medium was developed to maximize cellular growth, and cells grown in this medium represented the control, unstressed population. In the initial phase of culture, the cell number increases by binary division – this is called the exponential growth phase in contrast to the stationary phase where the cell number has reached equilibrium. Data were collected from cells grown at 37°C in LB medium; the total length of culture was 12 hr and sampling was performed at 8 time points. Other culture conditions were also used with the aim of inducing stress responses in the cells. Cells were grown in Minimum Glucose Medium (MGM) at 37°C for 13 hr and sampled at 8 time points. Glucose starvation (GS) was achieved by eliminating the sugar from MGM; the cells were cultured in this medium for 10 hr and were sampled at 5 time points. Phosphate starvation (PS) was achieved by eliminating phosphoric acid from the MGM; the cells were cultured in this medium for 11 hr, and were sampled at 6 time points. Some cells were grown in Competence Medium (CM), which increases the ability of the cells to ingest DNA from the external environment. The cells were grown in CM for 9 hr and were sampled at 5 time points. Some cells were grown in Competence-Sporulation medium (CSM) for 15 hr and were sampled at 13 time points. A second sporulation medium, Difco sporulation medium (DSM), was also used. Cells were grown in this medium for 12 hr and were sampled at 19 time points. We also used Difco Glucose Glutamine (DGG) medium in which glucose and glutamine have been added to DSM medium in order to inhibit sporulation. The cells were grown for 9 hr in DSM and were sampled at 6 time points.

### Time series data of *Arabidopsis thaliana*

The *Arabidopsis thaliana* data used in the present study were obtained from DNA microarray experiments and by Fourier-transform ion cyclotron resonance mass spectrometry (FT-ICR-MS), as previously described [10]. In brief, *Arabidopsis* was cultured in sulfur-sufficient control medium for three weeks, transferred to control or sulfur-deprived medium, and cultured for up to one more week. Rosette leaves and roots were harvested at 3, 6, 12, 24, 48 and 168 hr after transfer, and subjected to transcriptome and metabolome analyses [10].

Transcriptome data were obtained using the Agilent Arabidopsis 2 microarray (Agilent Technologies, Palo Alto, CA), which carries 21,500 Arabidopsis genes [10]. The data are available on ArrayExpress in EMBL-EBI [12]. Non-targeted metabolome data were obtained by FT-ICR-MS [10], which produces precise mass-to-charge ratio values (m/z values) of metabolites [13]. Metabolites were provisionally identified from their m/z values and the analytical conditions used for FT-ICR-MS.

### Parameter estimation

The test distribution is defined as *q*(*X*_{1:T}|*Y*_{1:T}, *θ*) and is used to approximate the true posterior distribution. The Kullback-Leibler divergence takes the minimum value of 0 if the two distributions are equivalent.

In Equation (17), maximization of the free energy with respect to *q* and *θ* is equal to the calculation of the maximum likelihood estimate *θ* with respect to *Y*_{1:T}.

The free energy is maximized using the Expectation-Maximization algorithm [14] consisting of the following steps:

**Step 1**. Parameter set *θ* is initialized.

**Step 2**. E-step(step 2.1) and M-step (step 2.2) are successively repeated until the free energy converges.

**Step 2.1**. E-step:

*k* is a repeat loop index. By fixing the parameter *θ*^{(k-1)}, *F* in Equation (17) is maximized with respect to *q*.

According to Equation (17), the solution is

After the fixation of parameter *θ*, the calculations needed to calculate the value of Equation (18) are as follows:

When both the data *Y*_{1:t-1}and the parameter of the prior distribution *p*(*x*_{
t
}|*Y*_{1:t-1}, *θ*) of *x*_{
t
}are given, the posterior distribution of *x*_{
t
}, given the data *Y*_{1:t}, is

Using (19), the prior distribution of *x*_{t+1}, given the data *Y*_{1:t}, is

*p*(*x*_{t+1}| *Y*_{1:t}, *θ*) = ∫*d* **x**_{
t
}*p*(**x**_{t+1}| **x**_{
t
}, *θ*)*p*(**x**_{
t
}| *Y*_{1:T}, *θ*) (20)

By successively iterating Equations (19) and (20), *p*(**x**_{
t
}| *Y*_{1:T}*θ*) with arbitrary *t* is obtained. This repeating method is called the Kalman Filter [8].

If all data are given, the following joint probability is obtained:

If the parameter of *P*(*x*_{t+1}, | *Y*_{1:T}, *θ*) is given, the following distribution is obtained:

*p*(**x**_{
t
}| *Y*_{1:T}, *θ*) = ∫*d* **x**_{t+1}*p*(**x**_{t+1}, **x**_{
t
}| *Y*_{1:T}, *θ*)

By successively iterating Equations (21) and (22), *p*(**x**_{t+1}, **x**_{
t
}|*Y*_{1:T}, *θ*) and *p*(**x**_{
t
}|*Y*_{1:T}, *θ*), which are necessary to calculate the value of Equation (18), are obtained. (22)

This repeating method is called the Kalman smoother [15].

Using the Kalman smoother, the statistical values necessary for parameter estimation are obtained.

If *p*(**x**_{
t
}|*Y*_{1:t-1}, *θ*) is given, the following likelihood is calculated:

*p*(**y**_{
t
}|*Y*_{1:t-1}, *θ*) = ∫ *d* **x**_{
t
}*p*(**y**_{
t
},|**x**_{
t
}, *θ*)*p*(**x**_{
t
}|*Y*_{1:t-1}, *θ*) (23)

Using (23), the log-likelihood is calculated as

**Step 2.2**. M-step

In this step, the value of *θ* that will maximize *F* under the condition *q*_{
x
}= *q*_{
x
}^{(k)}is calculated using Equation (25):

The objective function to be maximized is defined as

*J*(*θ*) = ∫*dX*_{1:T}*q*_{
x
}^{(k)}(*X*_{1:T})log *p*(*X*_{1:T}, *Y*_{1:T}|*θ*) (26)

which is obtained by the following equation:

and the solution of parameter *θ* is calculated that maximizes *F*.

Parameter *θ* is then updated, and the process goes back to E-step.

## References

- 1.
Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian network to analyze expression data. J Comput Biol. 2000, 7: 601-620. 10.1089/106652700750050961.

- 2.
Tamada Y, Kim S, Bannai H, Imoto S, Tashiro K, Kuhara S, Miyano S: Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics. 2003, 19: 227-236. 10.1093/bioinformatics/btg1082.

- 3.
Toh H, Horimoto K: Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics. 2003, 18: 287-297. 10.1093/bioinformatics/18.2.287.

- 4.
Akutsu T, Miyano S, Kuhara S: Algorithms for identifying Boolean networks and related biological networks based on matrix multiplication and fingerprint function. J Comput Biol. 2000, 7: 331-343. 10.1089/106652700750050817.

- 5.
Kim H, Lee JK, Park T: Boolean network using the chi-square test for inferring large-scale gene regulatory networks. BMC Bioinformatics. 2007, 8: 37-10.1186/1471-2105-8-37.

- 6.
Dewey TG, Galas DJ: Dynamic models of gene expression and classification. Funct Integr Genomics. 2001, 1: 269-278. 10.1007/s101420000035.

- 7.
Kanaya S, Kinouchi M, Abe T, Kubo Y, Yamada Y, Nishi T, Mori H, Ikemura T: Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene. 2001, 276: 89-99. 10.1016/S0378-1119(01)00673-4.

- 8.
Kalman RE, Bucy RS: New results in linear filtering and prediction theory. Trans ASME, J Basic Eng. 1961, 83D-1: 95-108.

- 9.
Nugroho FA, Yamamoto H, Kobayashi Y, Sekiguchi J: Characterization of a new sigma-K-dependent peptidoglycan hydrolase gene that plays a role in Bacillus subtilis mother cell lysis. J Bacteriol. 1999, 181: 6230-6237.

- 10.
Hirai MY, Klein M, Fujikawa Y, Yano M, Goodenowe DB, Yamazaki Y, Kanaya S, Nakamura Y, Kitayama M, Suzuki H, Sakurai N, Shibata D, Tokuhisa J, Reichelt M, Gershenzon J, Papenbrock J, Saito K: Elucidation of gene-to-gene and metabolite-to-gene networks in Arabidopsis by integration of metabolomics and transcriptomics. J Biol Chem. 2005, 280: 25590-25595. 10.1074/jbc.M502332200.

- 11.
Nikiforova VJ, Kopka J, Tolstikov V, Fiehn O, Hopkins L, Hawkesford MJ, Hesse H, Hoefgen R: Systems rebalancing of metabolism in response to sulfur deprivation, as revealed by metabolome analysis of Arabidopsis plants. Plant Physiology. 2005, 138: 304-317. 10.1104/pp.104.053793.

- 12.
EMBL-EBI. [http://www.ebi.ac.uk/arrayexpress/index.html]

- 13.
Aharoni A, Ric de Vos CH, Verhoeven HA, Maliepaard CA, Kruppa G, Bino R, Goodenowe DB: Nontargeted metabolome analysis by use of fourier transform ion cyclotron mass spectrometry. Omics. 2002, 6: 217-234. 10.1089/15362310260256882.

- 14.
Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc. 1977, 39: 1-22.

- 15.
Anderson BDO, Moore JB: Optical Filtering. 1979, NY: Prentice Hall

## Acknowledgements

This work was supported in part by Grant-in-Aids for Scientific Research from Japan Society for the Promotion of Science. We would like to thank Dr. Kazuo Kobayashi, Graduate School of Biological Sciences, Nara Institute of Science and Technology, who provided the *Bacillus* datasets. We would like to thank Dr. Masanori Arita, RIKEN Plant Science Center, for critical comments.

## Author information

## Additional information

### Competing interests

The author(s) declares that there are no competing interests.

### Authors' contributions

RM designed the LDS method and carried out the computer simulations. SK designed the BL-SOM and carried out the computer experiments. MYH supplied the Arabidopsis dataset. MY analyzed FT-MS data with RM. NO supervised the *Bacillus* experiments. KS proposed and supervised the study. All authors read and approved the final manuscript.

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

## About this article

#### Received

#### Accepted

#### Published

#### DOI

### Keywords

- Time Series Data
- Anthocyanin Biosynthesis
- Lysogeny Broth
- Linear Dynamical System
- Accumulation Profile