Predicting state transitions in the transcriptome and metabolome using a linear dynamical system model
- Ryoko Morioka^{1},
- Shigehiko Kanaya^{2},
- Masami Y Hirai^{1},
- Mitsuru Yano^{3},
- Naotake Ogasawara^{2} and
- Kazuki Saito^{1, 3}Email author
DOI: 10.1186/1471-2105-8-343
© Morioka et al; licensee BioMed Central Ltd. 2007
Received: 28 December 2006
Accepted: 18 September 2007
Published: 18 September 2007
Abstract
Background
Modelling of time series data should not be an approximation of input data profiles, but rather be able to detect and evaluate dynamical changes in the time series data. Objective criteria that can be used to evaluate dynamical changes in data are therefore important to filter experimental noise and to enable extraction of unexpected, biologically important information.
Results
Here we demonstrate the effectiveness of a Markov model, named the Linear Dynamical System, to simulate the dynamics of a transcript or metabolite time series, and propose a probabilistic index that enables detection of time-sensitive changes. This method was applied to time series datasets from Bacillus subtilis and Arabidopsis thaliana grown under stress conditions; in the former, only gene expression was studied, whereas in the latter, both gene expression and metabolite accumulation. Our method not only identified well-known changes in gene expression and metabolite accumulation, but also detected novel changes that are likely to be responsible for each stress response condition.
Conclusion
This general approach can be applied to any time-series data profile from which one wishes to identify elements responsible for state transitions, such as rapid environmental adaptation by an organism.
Background
Biochemical systems in living cells are robust and flexible. Investigating the responses of cells (and organisms) to environmental changes typically requires a system-level analysis of the interactions between the various molecular elements (genes, enzymes, and metabolites) that comprise the system. A key step to analyze system responses to environmental changes is identifying large state changes or "transitions". A statistical method that could detect such transitions would be a powerful analytical tool for finding important factors in large-scale profiles, such as variations in gene expression.
Previous analyses of gene expression profiles have often made use of graphical models, such as Bayesian Networks [1, 2], Graphical Gaussian Modelling [3], Boolean Networks [4, 5], and Auto-Regressive models [6]. However, not many approaches have explicitly modelled observational noise. In Auto-Regressive analyses, for example, observational vectors y_{ t }are recursively defined by the following Equation [6]:
y_{ t }= A y_{t-1}+ ε_{ t } (1)
where y_{ t }is an observational vector of genes or metabolites at time t, A is an observational transition matrix, and ε_{ t }is Gaussian noise. Because this model does not distinguish observational and inherent (e.g. biological) noises, identification of transition states becomes difficult in the presence of substantial noise.
Here we propose an extension of the Auto-Regressive model [6], which has been modified by the addition of reduced set of internal states, as explained in the Results and Discussion section. We chose a mathematical model, the Linear Dynamical System (LDS), as the basis of our method because it does not impose any specific requirements on the data used. LDS is expected to eliminate the confounding influence of observational noise in time series data. The model was applied to detect cellular state transitions in transcriptome and metabolome time series datasets from Bacillus subtilis and Arabidopsis thaliana maintained under stress conditions.
Results and discussion
Overview of our method
Our method has two steps. First, transition time points for each time series are detected using LDS, which mathematically distinguishes transitional fluctuations from experimental noise. The transition point is detected by the logarithm of the likelihood values. Here "likelihood value" means the generative probability of current data based on the condition of the past datasets. If this value is low, then the current data cannot be adequately explained by past datasets. In other words, a transition has occurred. In the second step, relevant factors such as genes and/or metabolites related to the transitions are extracted by Batch-Learning Self Organizing Mapping (BL-SOM) using changes in expression levels [7]. In summary, the LDS uses compressed information called "internal states", defined as the degenerate parameters of gene expression/metabolite accumulation profiles, to detect transitions, and then BL-SOM generates a 'Feature map', which is a two-dimensional lattice reflecting the similarity among clusters, based on the gene expression/metabolite accumulation profiles in order to visually characterize each state.
Linear Dynamical System (LDS) for time series analyses
LDS uses internal state variables in the generative model for cellular internal state changes. These internal states correspond to the compressed description of the observed biological system prior to adding noise factors.
The total experimental dataset of the time series and the corresponding internal state are denoted by Y_{1:T}= {y_{1}, y_{2},..., y_{ T }} and X_{1:T}= {x_{1}, x_{2},..., x_{ T }}, respectively. Each element in these vectors is defined as:
y_{ t }= (y_{t 1}, y_{t 2}, ⋯, y_{ tD })' ∈ R^{ D } (2)
x_{ t }= (x_{t 1}, x_{t 2}, ⋯, x_{ tN })' ∈ R^{ N } (3)
where t = 1, 2,..., T is the measurement order of the time series, D is the dimension of vector y_{ t }representing expression levels of D genes or metabolites, and N is the dimension of vector x_{ t }representing internal states. To distinguish observational noise from true information on cellular transitions, we focus on two probability densities: the density between internal state variables p(x_{ t }|x_{t-1}), and the density for evaluation of observational noise p(y_{ t }|x_{ t }). The proposed model is further defined as follows:
Observational equation: y_{ t }= V x_{ t }+ η_{ t } (4)
Transition equation: x_{ t }= W x_{t-1}+ ε_{ t } (5)
where V is a D × N observational matrix, W is an N × N internal state transition matrix, D-dimensional vector η_{ t }is observational noise, N-dimensional vector ε_{ t }is transition noise. The vectors x_{1}, ε_{ t }, η_{ t }are generated according to the following equations:
x_{1} ~ N_{ N }(x_{1}|μ_{1}, σ_{1}^{2}I_{ N }) (6)
ε_{ t }~ N_{ N }(ε_{ t }|0_{ N }, σ_{ ε }^{2}I_{ N }) (7)
η_{ t }~ N_{ D }(η_{ t }|0_{ D }, σ_{ η }^{2}I_{ η }) (8)
We assume that the observational and internal transition noises are both Gaussian, and therefore the relationship is a first-order Markov process (Equation 10).
p(x_{ t }, y_{ t }| X_{1:t-1}, Y_{1:t-1}) = p(y_{ t }|x_{ t })p(x_{ t }|x_{t-1}) (10)
The model parameter of (4)–(8) is defined as the parameter set θ:
θ = {μ_{1}, σ_{1}, W, σ_{ ε }, V, σ_{ η }} (11)
Note that the model corresponds to a Kalman Filter when θ is known (see also Methods section) [8].
The initial state x_{1} is defined as:
p(x_{1}|θ) = N_{ N }(x_{1}|μ_{1}, σ_{1}^{2}I_{ N }) (12)
From Equations (5) and (7), the following function is obtained:
p(x_{ t }|x_{t-1}, θ) = N_{ N }(x_{ t }|Wx_{t-1}, σ_{ ε }^{2}I_{ N }) (13)
From Equations (4) and (8), the following function is obtained:
p(y_{ t }|x_{ t }, θ) = N_{ D }(y_{ t }|Vx_{ t }, σ_{ η }^{2}I_{ D }) (14)
The parameter optimization follows a standard EM algorithm (see Methods section).
Criterion for detecting internal state transitions
Using the resulting estimated parameter, the log-likelihood with respect to the present time point t when all time points are given is defined by Equation (16):
log L_{ t }= log p(y_{ t }|Y_{1:t-1}, θ) (16)
This is calculated using the E-step formula (see Equation 23 in Methods) after parameter estimation using the Kalman filter.
When the log-likelihood value log L_{ t }becomes much lower than log L_{t-1}, then y_{ t }cannot be explained by Y_{1:t-1}, i.e., the cellular internal state has changed at time t. In this study, the point at which the log-likelihood value becomes relatively low between whole time points is defined as the state transition point. If the log-likelihood value remains low over a certain period, then the cells are changing their states continuously during that period.
Analysis of the Bacillus subtilis data
The list of transition driving genes identified in cells grown in CSM and DSM
Functional categories | Genes |
---|---|
Adaptation to typical conditions | ypeB |
Cell wall | ytcC, ykuG, ykoT, ywhE, yunA, cwlH |
Germination | yaaH, yfkQ, yndD, gerBB, gerKB, yndE, gerAC, yfkR |
Membrane bioenergetics | yhfW |
Sporulation | spoVFA, spoVAD, spoVAC, spoVAE, spoVAB, spoVK, spoIVCA, cotC, cotA, yaaH, sspE, sspB |
Transport/binding proteins and lipoproteins | ymfD, araP, yveA, ywcA, ywrK |
Detoxification | ykoY |
Regulation | sigG, splA |
Antibiotic production | yitA, yitC |
Carbohydrates and related molecules | yqiQ, adhB, yoaI, yesX, yitF, yqiQ |
Metabolism of amino acids and related molecules | aprX, spoVFA |
Metabolism of lipids | yngF |
Phage-related function | yndL, yqbO, yqbQ |
Proteins with unknown function | yodP, yheD, yhcQ, yvdQ, ytzC, yybC, ydfO, yhcV, yesV, yndM, ydfR, yngD, ykjA, yetA, yusN, yozN, yppD, ytlB, yqaN, ythQ, yycQ, yurS, yrkS, yxaG, yesJ, ysnE |
Using the analytical approach described here, we not only succeeded in detecting the well-known transition from exponential growth to the stationary phase, but also identified another, novel transition point. This result suggests the possibility that, even in periods that are assumed to be eventless, cells may be invoking their adaptive systems.
Analysis of the Arabidopsis data
On the basis of the estimated transition results, coupled with prior knowledge and the Feature map subtraction obtained by BL-SOM, we identified metabolites whose accumulation profiles showed changes that coordinated with the predicted transition point. These metabolites were found to be involved in biochemical pathways that are critical for the response to sulfur deprivation stress, for example, glucosinolate biosynthesis in leaf and anthocyanin biosynthesis in root [10].
A profile of sulfate accumulation was generated using capillary electrophoresis (Figure 4f). In comparison with the control condition, the accumulation of sulfate was strongly repressed immediately after the shift to sulfur deficiency. Under sulfur deprivation, it was believed that sulfate levels (Figure 4f) would only decrease. During the transition period from 12 to 24 hr after the shift to sulfur deficiency, however, sulfate levels temporarily ceased declining and stayed relatively constant as compared to the control.
From these results, we hypothesize that sulfate is in an active form and is distributed throughout the plant at the transition time. During this period, in order to maintain the intracellular environment, membrane lipids are temporarily degraded and re-synthesized after the transition. This suggestion is consistent with the reported decrease in lipids under conditions of sulfur starvation [11].
Conclusion
In summary, by using a linear dynamical system, we have identified transition times in the adaptation processes of Bacillus subtilis and Arabidopsis thaliana to environmental stresses. By focusing on transition information based on a well-defined probabilistic index, we obtained novel observations on apoptosis in Bacillus subtilis and the regulation of lipid metabolism connected with sulfur-stress responses in Arabidopsis thaliana. As this approach uses probabilistic values to detect the transitions, the results are objectively supported without the risk of misinterpretation due to experimental noise. The results of this approach will enable us to more effectively design experiments specifically tailored for functional identification of genes and metabolites. By obtaining time series data with higher temporal resolution around the transition time points, we can obtain more precise information on the details of the responses. The strategy described here was successful in identifying a small number of candidate genes and metabolites, from the vast number of genes and metabolites in comprehensive "omics" databases.
Methods
Time series data of Bacillus subtilis
The Bacillus subtilis time series data used in the present study were obtained from microarray analysis of cells sampled from 8 different experimental conditions. The data were produced using a two-colour fluorescence cDNA microarray that included 3100 Bacillus subtilis genes. The LB medium was developed to maximize cellular growth, and cells grown in this medium represented the control, unstressed population. In the initial phase of culture, the cell number increases by binary division – this is called the exponential growth phase in contrast to the stationary phase where the cell number has reached equilibrium. Data were collected from cells grown at 37°C in LB medium; the total length of culture was 12 hr and sampling was performed at 8 time points. Other culture conditions were also used with the aim of inducing stress responses in the cells. Cells were grown in Minimum Glucose Medium (MGM) at 37°C for 13 hr and sampled at 8 time points. Glucose starvation (GS) was achieved by eliminating the sugar from MGM; the cells were cultured in this medium for 10 hr and were sampled at 5 time points. Phosphate starvation (PS) was achieved by eliminating phosphoric acid from the MGM; the cells were cultured in this medium for 11 hr, and were sampled at 6 time points. Some cells were grown in Competence Medium (CM), which increases the ability of the cells to ingest DNA from the external environment. The cells were grown in CM for 9 hr and were sampled at 5 time points. Some cells were grown in Competence-Sporulation medium (CSM) for 15 hr and were sampled at 13 time points. A second sporulation medium, Difco sporulation medium (DSM), was also used. Cells were grown in this medium for 12 hr and were sampled at 19 time points. We also used Difco Glucose Glutamine (DGG) medium in which glucose and glutamine have been added to DSM medium in order to inhibit sporulation. The cells were grown for 9 hr in DSM and were sampled at 6 time points.
Time series data of Arabidopsis thaliana
The Arabidopsis thaliana data used in the present study were obtained from DNA microarray experiments and by Fourier-transform ion cyclotron resonance mass spectrometry (FT-ICR-MS), as previously described [10]. In brief, Arabidopsis was cultured in sulfur-sufficient control medium for three weeks, transferred to control or sulfur-deprived medium, and cultured for up to one more week. Rosette leaves and roots were harvested at 3, 6, 12, 24, 48 and 168 hr after transfer, and subjected to transcriptome and metabolome analyses [10].
Transcriptome data were obtained using the Agilent Arabidopsis 2 microarray (Agilent Technologies, Palo Alto, CA), which carries 21,500 Arabidopsis genes [10]. The data are available on ArrayExpress in EMBL-EBI [12]. Non-targeted metabolome data were obtained by FT-ICR-MS [10], which produces precise mass-to-charge ratio values (m/z values) of metabolites [13]. Metabolites were provisionally identified from their m/z values and the analytical conditions used for FT-ICR-MS.
Parameter estimation
The test distribution is defined as q(X_{1:T}|Y_{1:T}, θ) and is used to approximate the true posterior distribution. The Kullback-Leibler divergence takes the minimum value of 0 if the two distributions are equivalent.
The free energy is maximized using the Expectation-Maximization algorithm [14] consisting of the following steps:
Step 1. Parameter set θ is initialized.
Step 2. E-step(step 2.1) and M-step (step 2.2) are successively repeated until the free energy converges.
Step 2.1. E-step:
k is a repeat loop index. By fixing the parameter θ^{(k-1)}, F in Equation (17) is maximized with respect to q.
After the fixation of parameter θ, the calculations needed to calculate the value of Equation (18) are as follows:
Using (19), the prior distribution of x_{t+1}, given the data Y_{1:t}, is
p(x_{t+1}| Y_{1:t}, θ) = ∫d x_{ t }p(x_{t+1}| x_{ t }, θ)p(x_{ t }| Y_{1:T}, θ) (20)
By successively iterating Equations (19) and (20), p(x_{ t }| Y_{1:T}θ) with arbitrary t is obtained. This repeating method is called the Kalman Filter [8].
If the parameter of P(x_{t+1}, | Y_{1:T}, θ) is given, the following distribution is obtained:
p(x_{ t }| Y_{1:T}, θ) = ∫d x_{t+1}p(x_{t+1}, x_{ t }| Y_{1:T}, θ)
By successively iterating Equations (21) and (22), p(x_{t+1}, x_{ t }|Y_{1:T}, θ) and p(x_{ t }|Y_{1:T}, θ), which are necessary to calculate the value of Equation (18), are obtained. (22)
This repeating method is called the Kalman smoother [15].
Using the Kalman smoother, the statistical values necessary for parameter estimation are obtained.
If p(x_{ t }|Y_{1:t-1}, θ) is given, the following likelihood is calculated:
p(y_{ t }|Y_{1:t-1}, θ) = ∫ d x_{ t }p(y_{ t },|x_{ t }, θ)p(x_{ t }|Y_{1:t-1}, θ) (23)
Step 2.2. M-step
The objective function to be maximized is defined as
J(θ) = ∫dX_{1:T}q_{ x }^{(k)}(X_{1:T})log p(X_{1:T}, Y_{1:T}|θ) (26)
and the solution of parameter θ is calculated that maximizes F.
Parameter θ is then updated, and the process goes back to E-step.
Declarations
Acknowledgements
This work was supported in part by Grant-in-Aids for Scientific Research from Japan Society for the Promotion of Science. We would like to thank Dr. Kazuo Kobayashi, Graduate School of Biological Sciences, Nara Institute of Science and Technology, who provided the Bacillus datasets. We would like to thank Dr. Masanori Arita, RIKEN Plant Science Center, for critical comments.
Authors’ Affiliations
References
- Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian network to analyze expression data. J Comput Biol. 2000, 7: 601-620. 10.1089/106652700750050961.View ArticlePubMedGoogle Scholar
- Tamada Y, Kim S, Bannai H, Imoto S, Tashiro K, Kuhara S, Miyano S: Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics. 2003, 19: 227-236. 10.1093/bioinformatics/btg1082.View ArticleGoogle Scholar
- Toh H, Horimoto K: Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics. 2003, 18: 287-297. 10.1093/bioinformatics/18.2.287.View ArticleGoogle Scholar
- Akutsu T, Miyano S, Kuhara S: Algorithms for identifying Boolean networks and related biological networks based on matrix multiplication and fingerprint function. J Comput Biol. 2000, 7: 331-343. 10.1089/106652700750050817.View ArticlePubMedGoogle Scholar
- Kim H, Lee JK, Park T: Boolean network using the chi-square test for inferring large-scale gene regulatory networks. BMC Bioinformatics. 2007, 8: 37-10.1186/1471-2105-8-37.PubMed CentralView ArticlePubMedGoogle Scholar
- Dewey TG, Galas DJ: Dynamic models of gene expression and classification. Funct Integr Genomics. 2001, 1: 269-278. 10.1007/s101420000035.View ArticlePubMedGoogle Scholar
- Kanaya S, Kinouchi M, Abe T, Kubo Y, Yamada Y, Nishi T, Mori H, Ikemura T: Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene. 2001, 276: 89-99. 10.1016/S0378-1119(01)00673-4.View ArticlePubMedGoogle Scholar
- Kalman RE, Bucy RS: New results in linear filtering and prediction theory. Trans ASME, J Basic Eng. 1961, 83D-1: 95-108.View ArticleGoogle Scholar
- Nugroho FA, Yamamoto H, Kobayashi Y, Sekiguchi J: Characterization of a new sigma-K-dependent peptidoglycan hydrolase gene that plays a role in Bacillus subtilis mother cell lysis. J Bacteriol. 1999, 181: 6230-6237.PubMed CentralPubMedGoogle Scholar
- Hirai MY, Klein M, Fujikawa Y, Yano M, Goodenowe DB, Yamazaki Y, Kanaya S, Nakamura Y, Kitayama M, Suzuki H, Sakurai N, Shibata D, Tokuhisa J, Reichelt M, Gershenzon J, Papenbrock J, Saito K: Elucidation of gene-to-gene and metabolite-to-gene networks in Arabidopsis by integration of metabolomics and transcriptomics. J Biol Chem. 2005, 280: 25590-25595. 10.1074/jbc.M502332200.View ArticlePubMedGoogle Scholar
- Nikiforova VJ, Kopka J, Tolstikov V, Fiehn O, Hopkins L, Hawkesford MJ, Hesse H, Hoefgen R: Systems rebalancing of metabolism in response to sulfur deprivation, as revealed by metabolome analysis of Arabidopsis plants. Plant Physiology. 2005, 138: 304-317. 10.1104/pp.104.053793.PubMed CentralView ArticlePubMedGoogle Scholar
- EMBL-EBI. [http://www.ebi.ac.uk/arrayexpress/index.html]
- Aharoni A, Ric de Vos CH, Verhoeven HA, Maliepaard CA, Kruppa G, Bino R, Goodenowe DB: Nontargeted metabolome analysis by use of fourier transform ion cyclotron mass spectrometry. Omics. 2002, 6: 217-234. 10.1089/15362310260256882.View ArticlePubMedGoogle Scholar
- Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc. 1977, 39: 1-22.Google Scholar
- Anderson BDO, Moore JB: Optical Filtering. 1979, NY: Prentice HallGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.