Integrative analysis of time course metabolic data and biomarker discovery

Background Metabolomics time-course experiments provide the opportunity to understand the changes to an organism by observing the evolution of metabolic profiles in response to internal or external stimuli. Along with other omic longitudinal profiling technologies, these techniques have great potential to uncover complex relations between variations across diverse omic variables and provide unique insights into the underlying biology of the system. However, many statistical methods currently used to analyse short time-series omic data are i) prone to overfitting, ii) do not fully take into account the experimental design or iii) do not make full use of the multivariate information intrinsic to the data or iv) are unable to uncover multiple associations between different omic data. The model we propose is an attempt to i) overcome overfitting by using a weakly informative Bayesian model, ii) capture experimental design conditions through a mixed-effects model, iii) model interdependencies between variables by augmenting the mixed-effects model with a conditional auto-regressive (CAR) component and iv) identify potential associations between heterogeneous omic variables by using a horseshoe prior. Results We assess the performance of our model on synthetic and real datasets and show that it can outperform comparable models for metabolomic longitudinal data analysis. In addition, our proposed method provides the analyst with new insights on the data as it is able to identify metabolic biomarkers related to treatment, infer perturbed pathways as a result of treatment and find significant associations with additional omic variables. We also show through simulation that our model is fairly robust against inaccuracies in metabolite assignments. On real data, we demonstrate that the number of profiled metabolites slightly affects the predictive ability of the model. Conclusions Our single model approach to longitudinal analysis of metabolomics data provides an approach simultaneously for integrative analysis and biomarker discovery. In addition, it lends better interpretation by allowing analysis at the pathway level. An accompanying R package for the model has been developed using the probabilistic programming language Stan. The package offers user-friendly functions for simulating data, fitting the model, assessing model fit and postprocessing the results. The main aim of the R package is to offer freely accessible resources for integrative longitudinal analysis for metabolomics scientists and various visualization functions easy-to-use for applied researchers to interpret results.

In this section we illustrate a fully reproducible application of the iCARH package to a publicly available dataset from Brunk et al. (2016). Brunk et al. (2016) carried out a three step framework to capture experimental design, examine metabolic correlations and integrate multi-omics data respectively with each step achieved independently from the others. The dataset comprises metabolomic and proteomic time series from eight pathway engineered strains of E.coli and a wild-type E.coli DH1. Our analysis includes the eight strains where the mevalonate pathway has been adapted to produce isopentenol (I1, I2, I3), limonene (L1, L2, L3) and bisabolene (B1, B2). Each of the isopentenol-producing strains and limonene-producing strains comprise three subtypes of production performance 1, 2, 3 indicating low, medium and high production respectively with 1 being the unoptimized strain. The bisabolene-producing strain comprises two subtypes of performance 1 and 2 with 1 being the unoptimized strain as well. Time measurements were taken at 7 timepoints starting from 0h (induction) to 36h.
The dataset is subject to incompleteness and several metabolites have missing measurements. Before starting the analysis, extracellular measurements and metabolites with more than 25% of missing values where removed. About half of the remaining metabolites have 15% of missing values. Additionally, the metabolomics and proteomics data were both centered with respect to the control group. In our analysis, the control group consists in the wild-type strain and each of unoptimized strains from the isopentenol-producing strains, limonene-producing and bisabolene-producing strains whereas the cases group contains all the optimized strains. The curated dataset can be loaded from "multiomics_ Ecoli.rda" file using R> load(file="multiomics_Ecoli.rda", verbose = T) where X is the metabolomics data array with dimensions 7 timepoints × 9 strains × 29 metabolites. Y is the proteomics data array with dimensions 7 timepoints × 9 strains × 79 proteins. Z contains group information and have dimensions 7 timepoints × 9 strains and keggids are corresponding KEGG identifiers for metabolite names.

Model fitting and assessing model fit
In order to include pathway information into the model, metabolite names used in the dataset need to be mapped to KEGG metabolite identifiers 1 . In practice, this mapping was done using the MetaboAnalyst online tool (Chong et al., 2018) and corresponding KEGG identifiers stored in keggids variable. It should be noted that iCARH supports unknown KEGG identifiers entextttd by Unk<x> where <x> is an integer so that no pathway information will be considered for the corresponding metabolites. The package also supports multiple KEGG identifiers per metabolite which is not an infrequent case in metabolomics, usually occuring due to uncertainty in the exact identification (e.g. uncertainty between D-and L-alanine). To obtain the pathway adjacency matrices for the metabolomics data: R> library(iCARH) R> database = iCARH.getPathwaysMat(keggids, "sce") As some pathways (such as "Biosynthesis of antibiotics") are biologically irrelevant, we reduce the list of pathways as follows: R> keep = c (1,2,5,7,8,9,13,14,15) R> pathways = database [keep] and fit the model: We fit the model using the default value for τ, however this can be set freely by user. The pathways variable contains pathway adjacency matrices but in practice gene ontology classes or other molecular signatures instead of metabolic networks can be used. The model takes about 10 hours to fit using NUTS sampler on our Intel i7 2.8 GHz processor. A summary of the model parameters (by means of illustration, the α parameter here) can be obtained by running the iCARH.params function as below. Note that the function also checks for convergence and throws a warning if convergence diagnostics fail.  The function output has been truncated for readability (note the ...). Eff.Sample is a measure of effective sample size, and Rhat is the scale reduction factor on split chains and should be close to 1 at convergence. The output also includes means (Estimate), standard deviations (Est.Error) and Monte Carlo standard errors (MC.Error) if multiple chains are fitted. Due to the large number of metabolomic and proteomic variables, it is not convenient to display all model parameters. As iCARH.model returns a stanfit object that can be used independently of the package, diagnostic plot functions available in rstan can be applied to fit. A histogram of Rhat values can be plotted by typing plot(fit$icarh , plotfun="rhat") (See figure 1). The traceplot function displays trace plots of selected posterior distributions of model parameters of the E.coli dataset. By means of example, figure 1 show traceplots of σ γ 2 , σ β 18 , σ, φ controls  Figure 2 depicts quantile-quantiles plots of the cases and control groups of the E.coli dataset with a slightly better fit for the cases. This can be explained by the heterogeneity amongst the control strains comprising both the wild type and the engineered unoptimized strains. In addition, we perform posterior predictive checks by computing MADs and comparing our model performance to DPPCA. The texttt below runs normality checks on fit and creates quantile-quantile plots and computes MADs using iCARH: R> iCARH.checkNormality(fit) R> mads = iCARH.mad(fit) Histograms in figure 3 show distribution of the logarithm of MADs for DPPCA and iCARH. As observed in the previous analysis, iCARH shows better stability than DPPCA. In fact, as most MADs are smaller than 1 (log(MADs) is less than 0), some MADs reach the unreasonable value of 403 leading to a local maximum around 5. iCARH computed MADs show less variability and are more consistent with log(MADs) ranging from 0.4 to 1.8.

Data results
The output of the iCARH.model function is rich and there are different results that can be explored after fitting the iCARH model. We will first show how to display missing  values imputation before investigating model parameters of interest. Particularly, the function iCARH.getDataImputation returns the imputed missing values for both the metabolomics and proteomics data. In addition, iCARH.plotDataImputation provides an elegant way to plot imputed values and can take as input particular metabolite or protein indices. iCARH.plotDataImputation produces an output as in figure 4 where each subplot represents time series for one strain. Observed data points are in blue whereas inferred mean values for missing data points are in green. The posterior distribution of inferred values for a missing data point is also computed and displayed as a violin plot in orange.
One of the first effects of interest for applied researchers are longitudinal effects and the treatment effect. Longitudinal effects are captured by θ m for each metabolite and can be investigated as follows:   Non significant time dependency is indicated by a low value of θ m (closer to zero) whereas significance with respect to time scale is indicated by θ m closer to 1. Namely, in this case glyoxylate does not show significant time dependence. For this dataset, α m represents an indicator variable for the treatment effect. As the dataset was centered with respect to the control group, a non-zero shift in α m indicates significant changes in metabolite concentrations as a result of the engineered optimization of the native mevalonate pathway. In iCARH, these effects can be directly examined by running R> iCARH.plotTreatmentEffect(fit) which produces the plot in figure 6. Boxplots depict slight changes in metabolite concentrations for the engineered strains. We note that inositol triphosphate (IP) shows the highest shift whereas glyoxylate metabolism related metabolites are not significantly perturbed. In addition, intracellular central carbon metabolites, such as succinate and phosphoenolpyruvate (PEP), and citrate cycle (TCA) metabolites, namely citrate, do not exhibit informative shifts. However, other TCA cycle-related metabolites such as pyruvate (end of glycolysis) and cis-aconitate exhibit higher shifts.  To elucidate these metabolic perturbations in the context of metabolic pathways, we examine the pathway perturbation plot of the iCARH.plotPathwayPerturbation function in figure 7. Some pathways exhibit notable differences between the distribution of the pathway perturbation coefficients φ e for the control and cases groups. Most importantly, TCA cycle and pyruvate metabolism show concentrated densities of the corresponding pathway coefficient around the positive eigenvalue which potentially indicates increasing production compared to controls. These findings support results of Brunk et al. (2016) suggesting that TCA cycle reactions and pyruvate pathway are the main drivers of variation.
In order to contextualize our analysis within a biological systems level and get further insights into the underlying mechanisms, the next step of analysis consists in investigating associations between metabolic and proteomic variables. Estimates of effects of proteins on metabolite profiles are captured by β m . Overall, the model indicates higher sparsity in the β m coefficients compared to the metformin dataset. The output of the function iCARH.plotBeta for some selected proteins is shown in fig 8. Some metabolites present significant changes along with the proteomic profiles. Namely, HMGR protein is negatively associated with pyruvate which is consistent with the pyruvate pathway and positively associated with mevalonate which consistent with the mevalonate pathway. FUMA and NudB proteins exhibit positive association with oxoglutarate. FUMA is an active protein in the TCA cycle and NudB is an active protein in the mevalonate pathway. The association between these proteins and oxoglutarate are consistent with the perturbation observed in the TCA cycle in 7. To a large extent, our findings are in accordance with the results presented in Brunk et al. (2016) which supports the use of our single model to capture the different sources of variation compared to a multi-step framework assuming independence between these sources. The sparsity level observed for the E.coli data is higher than the sparsity level observed for the metformin data.