Piecewise multivariate modelling of sequential metabolic profiling data

Background Modelling the time-related behaviour of biological systems is essential for understanding their dynamic responses to perturbations. In metabolic profiling studies, the sampling rate and number of sampling points are often restricted due to experimental and biological constraints. Results A supervised multivariate modelling approach with the objective to model the time-related variation in the data for short and sparsely sampled time-series is described. A set of piecewise Orthogonal Projections to Latent Structures (OPLS) models are estimated, describing changes between successive time points. The individual OPLS models are linear, but the piecewise combination of several models accommodates modelling and prediction of changes which are non-linear with respect to the time course. We demonstrate the method on both simulated and metabolic profiling data, illustrating how time related changes are successfully modelled and predicted. Conclusion The proposed method is effective for modelling and prediction of short and multivariate time series data. A key advantage of the method is model transparency, allowing easy interpretation of time-related variation in the data. The method provides a competitive complement to commonly applied multivariate methods such as OPLS and Principal Component Analysis (PCA) for modelling and analysis of short time-series data.


Background
Metabolic profiling, (also referred to as metabonomics [1] or metabolomics [2]) is a rapidly developing field in which the levels of hundreds to thousands of low molecular weight metabolites are simultaneously profiled in biofluids, cells and tissues. The methodology is wellestablished for characterizing disease states, toxicity and differences in physiological condition [1,[3][4][5][6][7][7][8][9][10] and for extracting metabolite patterns associated with these conditions. In many experiments the biological system is followed over time, generating a multivariate metabolic time course. For example, staging of a disease process may be more important than merely determining its presence or absence. The ability to accurately define and predict disease stage also has obvious application in assessment of response to therapeutic intervention. Ideally such timeseries data should be well sampled in the time domain and have an adequate statistical experimental design, which is an essential component for the outcome of the study and quality of data [11]. However, for practical reasons collection of an optimal dataset is not always possible.
In the case of multivariate time-series data with low sample rate, many classical statistical methods are not appropriate for analysis and characterization of the time related variance due to the low number of time-points and in some cases inability to handle multivariate data. Principal Component Analysis (PCA) [12] has previously been applied for the analysis of time-series data in metabonomics [13][14][15][16][17], allowing visualization of the main timerelated patterns of variation. PCA has the objective of describing the main variance in the data in a low-dimensional subspace spanned by a few linear components. Since PCA does not explicitly model time-related variation, it does not provide an optimal representation of time-related data. In addition, the PCA model usually has more than one PCA component, making subtle changes described by multiple components hard to interpret. Partial Least Squares (PLS) regression [18] with the time as regressand has also been used for analysis of time-series metabonomic data [19,14,15,20], but the assumption of a linear relation between descriptor variables and time is valid only under some specific circumstances, but not in the general case. The PARAFAC model [21,22] provides a generalisation of PCA to data matrices of higher dimension. In this case, the three-way structure of the data consists of [Animal × Time × Variables]. The reason why 3way methods are not used in this study is because the NMR spectral profile (over all animals) do not preserve neither the rank, nor the spectral profile over all time points, which violates the 3-way method assumption of tri-linearity. Smilde et al. [23] described a generalization of the ANOVA approach to the multivariate case for data generated from an experimental design, labelled as ANOVA-Simultaneous Component Analysis (ASCA), with application to time-series data. However, in ASCA the time related effects are assumed to be linear in relation to time, which is rarely a valid assumption, neither does the ASCA method providing a predictive model. For short and univariate time-series, piecewise linear modelling methods can be used to describe progression over a timeseries, which is similar in some aspects to the method proposed here for the multivariate case. Other statistical methods applied for the analysis and modelling of timeseries data in omics biology include Clustering [24], Dynamic Bayesian Networks [25] and Batch Statistical Process Control [14,26,27]. Applications of time-series analysis have been described by Trygg and Lundstedt [11] in a review of chemometric techniques applied in metabonomics, and some of the current issues with regard to analysis of time-series gene expression data were reviewed by Bar-Joseph [28].
Here a new method for piecewise multivariate modelling of time-series spectroscopically generated metabolic data is proposed, which can be used for characterization and modelling of short (less than 20 time points) and sparsely sampled (sampling frequency is low relative the timescale of the events studied) time-series data of high dimension. The method is well suited for analysis of spectral metabolite profiles where variables are intrinsically multicolinear, but is also generally applicable to other types of omics data. The suggested method also provides descriptive information, enables visualization and establishes a predictive model based on time-related variance, putting focus on effects seen between local time-points. The proposed method is based on multivariate piecewise models, where each sub-model describes changes occurring between neighbouring time points in a series of time frames over the time course, here the piecewise model is an Orthogonal Projections to Latent Structures [29] (OPLS) model. Overall, the set of sub-models describe the time-related changes over the full time also encompassing the modelling of non-linear changes in relation to time. Visualization of the piecewise multivariate model can be accomplished by investigation of sub-models separately, cumulatively over all time frames and as a time-trajectory. One can interpret the local changes as the rate of metabolic change in the time course. This aspect of explicitly investigating the multivariate characteristics of change, together with the magnitude of change over time in a biological system, has not been explored previously to the knowledge of the authors. In addition we show how this approach can be used for prediction of the time-point along a time-series, based upon measured metabolic profiles. Prediction of time-point by the model could be used for monitoring disease stage over time as well as for evaluation of the efficacy of an intervention, e.g. by assessing change in predicted disease stage after an intervention.
The paper is organized as follows. A brief introduction to the OPLS method is given, followed by a detailed description of then proposed piecewise multivariate modelling of sequential data. Finally the method is demonstrated on both simulated data and metabolic profiling data and results are compared to results from PCA analysis as well as linear OPLS regression modelling.

Algorithm
With the objective of describing the time-related variance in the data, a set of multivariate piecewise models is estimated, describing the transitions between metabolic states in neighbouring time points, using the OPLS algorithm. Each model establishes a function for the transition between two time points will be called a sub-model in the following sections. A distinction is made between the piecewise approach, consisting of a set of OPLS sub-models, and the OPLS regression approach where the descriptor matrix is regressed against the time using all time points in a common model, thus, assuming a linear relationship between the data and time. Let the matrix X [N × K] (for N observations and K descriptor variables) represent the matrix of descriptor variables, where each observation, e.g. a metabolic profile, is a row-vector of X. The data vector for time-point i for individual n is denoted by x n,i . Y is the response matrix [N × M] (for N observations and M response variables). T represents the total number of time-points, resulting in T-1 sub-models. Throughout the paper matrices are represented by bold uppercase letter, vectors bold lower-case, scalars are represented as italics, p(·) represents a probability density and tr(·) is the trace function.

PLS and OPLS methods
Partial Least Squares regression (PLS) [18] has been used successfully for estimation of multivariate regression and discriminant models in many applications, especially in cases when descriptor variables are multicollinear and noisy, and when the number of variables exceeds the number of observations, which is common for e.g. spectroscopic and other omics data. For data with systematic variation, which is orthogonal to the regressand, the number of PLS components required for an optimal predictive model normally exceeds the rank of the Y-matrix. In such cases, the Orthogonal Projections to Latent Structures (OPLS) method [29], which has an integrated Orthogonal Signal Correction filter [30][31][32][33] specifically designed for PLS, will benefit the analysis. This allows the estimation of an optimal model (in the predictive sense) with a single predictive component for the single Y-variable case, contrary to the PLS model which may have several components if structured Y-orthogonal noise is present in data. This property of the OPLS algorithm, guaranteeing a single predictive component for the single Y-variable case, is utilized in the method described here. It confers an advantage compared to other similar multivariate projection methods, in terms of clearer interpretation of the model and enabling a straightforward extension to the piecewise model described here. The simplicity of interpretation is due to the separate modelling of correlated components and Y-orthogonal components in the OPLS model.

Estimation of piecewise sub-models
Estimation of a multivariate sub-model between time point i and i+1 can be treated as a discriminant analysis problem between two time points, describing the time (Y) as a function of the descriptor matrix (X). Let the X i [N i × K] matrix consist of training data from time t = i and t = i+1 with N i observations, and let the Y i [N i × 1] matrix to be a dummy matrix of zeros and ones, indicating which observation belongs to time point t = i and t = i+1 respectively.
The OPLS algorithm decomposes X i into a predictive weight vector, w p,i [K × 1], describing the direction in the K-dimensional space between the two time points (i and i+1), and a predictive score vector, t p,i [N i × 1], representing the orthogonal projection of X onto w p,i (Equation 1). If Y-orthogonal variance is present in the data, the optimal predictive PLS model would include more than one PLS component, which in the OPLS model is equivalent to the estimation of additional Y-orthogonal components in addition to the predictive component in the model (Equation 1). This results in the guarantee of a single predictive component w p,i , describing the discriminative (locally time related) direction in X i , and A o Y-orthogonal components with loading matrix P o,i [K × A o ] and score matrix , describing the systematic Y-orthogonal variation present in the data, if any.
The Y-orthogonal variance may be analyzed further, either separately or together with the X residuals (E i ) to understand the variance patterns present in the data that are not time-related, which may provide information of systematic instrumentation errors or biological variation not directly related to time but which may still be of value.
In Equation 1, w p,i may be interpreted as the direction of change in the 'metabolic space' in the local time frame, describing the transition between two neighbouring time points, i and i+1. w p,i may also be interpreted as an approximation to the derivative of the time dependent function of the metabolic state. w p,i only describes a direction but does not contain any information quantifying the magnitude of the change in each time frame. An intuitive measurement of the magnitude of change would be the Euclidean norm of the predictive score vector ||t p,i ||. However the Euclidean norm may be affected by even moderate outliers and is therefore not an optimal choice. Instead we use the median distance in the score space as the metric for the magnitude of change (Equation 2).
w dist,i is then defined as w p,i weighted by a scalar (d i ) defining the magnitude of the change in the local time frame i (Equation 3), thus incorporating the information about the direction as well as magnitude of change. w dist will be referred to as the magnitude weights and may be used as a way of describing and visualizing the profile of timerelated change in any given sub-model. w dist is also comparable in magnitude between the different sub-models, contrary to w p , which is scaled to unit norm.

Interpretation of time related changes in model
By applying elementary vector algebra we also define the cumulative w dist , w dist,cum , which represents the total time related changes as described by the sub-models between t = 1 and t = T (Equation 4). This provides useful information for interpretation and visualization of the time related changes described by the sub-models over the whole time-series.
w dist,cum provides information on the overall change from a given reference point (e.g. t = 1). This enables us to not only track the changes in the local time frames, but also to depict the accumulated change over the time course. w dist,cum may prove to be useful for investigations of systems where there is a change occurring from a homeostatic state or when studying the recovery over time after a perturbation to establish whether the system returns either to the biological state prior to the perturbation, or alternatively, to a new state. A return back to the original biological state would in result in a w dist.cum vector close to a vector of zeros. For visualization purposes, and to summarize the changes described by w dist.cum vectors, PCA may be applied on the W dist,cum matrix to visualize the major patterns of time-related variation in a low dimensional subspace, describing the main changes in the timeseries. The low dimensional representation of the time points provides an overview of the relationship and similarity between the temporal states, or stages, rather than maximizing the amount of modelled variation in the original data, hence providing a less noisy visualization of the time-related variation in the data compared to a conventional PCA trajectory.

Prediction of time point
Time predictions for new observations are carried out in two steps. First the sub-model that best fits the new observation is established and within this sub-model a more detailed time prediction is then made. The decision of which sub-mode fits best the test-set observation is determined by two likelihoods. The first is p T2 (t p,test |m i ), the likelihood for a test-set observation (represented by the predicted score, t p,test ) to fit to the sub-model (with set of parameters m i ), based on the score (t p ), where the likelihood is based upon Hotellings T 2 statistic estimated from the training data. The second is based upon analysis of the model residual vector (e). Using the distribution of the residuals from the training set we can calculate the Q-statistics [34], or alternatively DmodX [35], which shows similar characteristics. The Q-statistic (Equation 5), is based upon the sums of squares of the residuals, which is used to estimate the likelihood p Q (e|m i ) for the test-set observation based on the Q-statistic from the training data. Q-statistics for residual analysis were described by Jackson [36,37]. Equations 5-8 describe how the parameter c, which follows an approximate N(0,1) distribution, can be calculated [36,37], leading us to the calculation of P Q (e|m i ) (Equation 9). In Equation 5 e test represents the residual vector for a test-set observation. In equation 6, Σ Ei is the covariance matrix of E i , which is the residual matrix for the training data for sub-model i. In equation 9 c' represents an instance of c as calculated in Equation 8 for a specific test-set observation.
p Q (e test |m i ) = p(c<=c'), c~N(0,1) Here p Q (e|m i ) and p T2 (t p |m i ) are treated as independent, which should be acceptable in most cases of application. The joint probability for the new observation to belong to

Testing
The simulated data set To illustrate some of the properties of piecewise multivariate modelling approach a tractable example based on simulated data was used, which has both linear and nonlinear time-related variation present. A spectral-like data set with 200 spectral variables, 11 time points, and 100 replicates for each time point was simulated using a bilinear model. The data contain two time dependent components, one non-linearly (u 1 ) and one linearly (u 2 ) related to time ( Figure 1A) in addition to a constant component (u 3 ) which contain only random variation, described in Equation 14. Each one of these three components is related to a specific spectral profile (p 1 , p 2 , p 3 ) ( Figure  1B). Random variation (ε ~ N(0,0.1)) was added to the time dependent latent variables for each time point and each observation.

PRINCIPAL COMPONENT ANALYSIS
Results from PCA analysis of the simulated data is shown in Figure 2A-  2A) (data was mean-centred prior to analysis). The loading plots for component one and two (Figure 2B and 2C) show that the two sources of variation in the simulated data are slightly confounded between the two PCA components.

OPLS REGRESSION
We investigate the same data set using the OPLS regression approach and regress the simulated spectral data against time, using one predictive component. Figure 3A shows OPLS predictive weights, indicating the predictive (Y-related) variation that is described by the OPLS model. As expected, only the leftmost peak in the spectral profile is given any weight in the model, while the time related, but non-linear component, is not present (compare with Figure 1). Predictions of a test-set (an independently drawn sample from the same distribution and with the same number of observations as the simulated training data set) using the OPLS model give a Root-Mean-Square Error of Prediction (RMSEP) of 9.7% ( Figure 3B).

PIECEWISE MULTIVARIATE MODELLING
In the piecewise model, the cumulative magnitude weights W dist,cum illustrate the cumulative change in the system over the time-series ( Figure 4A). Figure 4B illustrates the relative magnitude of change (d i ) in each submodel, allowing the magnitude of change in each submodel to be established. Deviations from the expected symmetrical magnitude of change profile ( Figure 4B) for the simulated data set are a direct effect of the Gaussian noise present in the data set. It can be seen that d i approximately traces the gradient of the non-linear time-related component ( Figure 1A) as expected. For the time-point predictions, the probability of sub-model membership was calculated for each of the test-set observations. The probabilities for all time points from one time-series is shown as an example in Figure 4C. The lack of symmetry seen in Figure 4C is because each sub-model (x-axis) represents a model between two neighbouring time-points. Therefore, an observation (y-axis) has the potential to fit fairly well into both of the adjacent sub-models, or one of them. The time predictions (RMSEP = 6.5%) based on the best fitting sub-model are displayed in Figure 4D in the form of a boxplot, indicating successful predictions. The larger prediction errors observed for time-points 4-8 (Figure 4D), compared to earlier and later time-points, are due to lower signal to noise level for these time-points. This is an effect of the time related component U 2 ( Figure  1A), which has a lower amount of time-related change over these time-points, while the noise level remains constant over the time course. The RMSEP is similar to the OPLS regression model, which is expected in this case since there is one latent variable in the simulated data that is linearly related to time. Crucially, the linear OPLS regression only models the linearly related time variation in the spectral profile, while the piecewise model shows both the linear and non-linear time-related variation in the data. This provides a model demonstrating a more complete representation of the time-related variation, enabling a more comprehensive interpretation.

The mercury II chloride data set
To test the method on real data, we used data from a renal toxicity study using mercury II chloride to induce a proximal tubular damage [38] in the rat. This is a 1 H NMR based metabonomic study of rat urine with data from seven time points (pre-dose, 0 h, 8 h, 24 h, 48 h, 72 h, 96 h) and including ten animals in total. Prior to analysis the data were pre-processed using standard methods. First the spectra were interpolated to a common chemical shift Simulated data set visualized as a PCA-trajectory scale using cubic spline interpolation. The region corresponding to water and urea resonances (δ 4.5 -6) was excluded from each spectrum and the spectral intensity was subsequently integrated over adjacent δ 0.04 ppm width bins. Each spectrum was normalized to the total sum of 100 units to reduce the overall dilution effect due to inter animal variability in urine excretion rates. A typical integrated NMR spectrum after pre-processing is shown in Figure 5. After 48 hours five animals were sacrificed, rendering N = 5 animals to be left in the study after 48 h.

PRINCIPAL COMPONENT ANALYSIS
Results from PCA analysis of the HgCl 2 data is shown in Figure 6A-C visualized as a PCA trajectory plot, where the centroids in the score space are calculated for each timepoint and then connect to form the trajectory ( Figure 6A). Figure 6B and Figure 6C show the loadings for each one of the two PCA components calculated. In many instances the pattern of change between time points is a combination of variation in more than one PCA component, making it hard to interpret the unique pattern of change over different regions of the time-series, especially when these changes may be subtle.  The predicted time based on cross-validation, as described in the previous section, plotted against the true time point for each animal in the study is shown in Figure 9A (RMSECV = 23.2%), showing a different pattern of prediction results compared to the OPLS time regression model ( Figure 7B). We note that the predictions are less variable in the early time points compared to the OPLS regression approach. However, in the latter part of the time-course the OPLS regression model performs quite similarly to the Modelling and prediction of the simulated data set   T1-T2   T2-T3   T3-T4   T5-T6   T6-T7   T7-T8   T8-T9   T9-T10   T10-T11   T4-T5   A  B C D piecewise approach, indicating that there is some variation in the metabolite profiling data linearly related to time. For the time points up to 48 h the predictions are relatively good, but as the observation decrease from n = 10 to n = 5 after 48 h, the model becomes less reliable as evidenced by the poorer predictions. However, important information can still be extracted, for example animal 18 ( Figure 7B) is clearly a non-responder, which was confirmed by histopathology data, showing no renal damage, and clinical chemistry data, showing osmolality and glucose levels in the same range as control animals, indicating a negligible effect of the toxin on this animal. Figure  9B shows the probabilities for all time-points of animal 16 for fitting each sub-model, as an example of how submodels are chosen.

OPLS REGRESSION
To provide an overview of the time-related changes, PCA was used to visualize W dist.cum as a trajectory ( Figure 9C). The main information provided by this plot is for assessment of whether the perturbed system has returned to its starting point, and to provide a visualization of the overall time-events. In Figure 9C we can see a trend towards the metabolite profile returning to a state close to that prior to the administration of HgCl 2 . However, over the study duration the recovery was not complete, which may either be due to presence of irreversible injury inflicted by the toxin or that the time span over which the study was carried out was too short to allow observation of a full recovery.

Implementation
A R package [39] implementing the method described is available upon request from the corresponding author.

Discussion
We have presented a framework for analysis of short timeseries multivariate data, typical of post-genomic (omic) biology, using piecewise multivariate modelling and we demonstrated the method on metabolic profiling data. However, this approach is applicable to other types of omics data as well. The proposed method facilitates a transparent model allowing straightforward interpretation of time related variation in the data over the time course. Prediction of the time-point for a new sample is possible. The method has applications in areas such as monitoring and prediction of disease progression or the effects of a perturbation over time, allowing for evaluation of different types of interventions. The piecewise approach makes no assumption of linear relationship between the data and time and is therefore ideal for the analysis of non-linear time-related events in a biological system, as exemplified in the analysis of the simulated and the exemplar HgCl 2 nephrotoxic data sets.
In comparison to PCA, the proposed method provides a more detailed picture of the time-related events including small and local changes in the time domain. In addition, the predictive properties of the proposed method can be utilised for prediction of different stages of time-dependent biological events, such as disease or a toxic perturbation studied over time. In comparison to linear multivariate regression methods (e.g. PLS and OPLS), using the time as a response variable, the piecewise multivariate approach also models non-linear time related variation. This renders a model framework describing additional time-related variation with the potential to improve prediction and interpretation if non-linear variation is dominant, in this way providing a complement to both PCA and OPLS regression against time. PCA provides an overview of the main variation in the data, while OPLS regression against time models monotonic increasing or decreasing signals over the time course. The piecewise OPLS approach provides detailed information of time related effects seen locally over the time-course as well as non-linear time-related variation.
The proposed method does not exploit autocorrelation structures in the time-series and does not providing a tool for forecasting, as do methods like Auto-Regressive Moving Average (ARMA) [40]. One reason why ARMA cannot be applied successfully to the type of data described here is the restricted number of time-points available. Non-linear modelling approaches, e.g. Artificial Neural Networks or kernel based regression methods such as Kernel PLS [41], have properties that in some cases provide models Typical integrated 1 H NMR spectrum from the HgCl 2 data set Figure 5 Typical integrated 1 H NMR spectrum from the HgCl 2 data set. with better prediction results, however, these models are often hard or impossible to interpret in relation to the descriptor variables. For example, in the case of the mercury II chloride data set, a Kernel PLS model provided only a marginally better prediction result, with a RMSECV value of 20.0% (using a Gaussian kernel function with sigma = 0.016 and three Kernel PLS components), compared to the piecewise multivariate model. However, in many biological applications of predictive modelling it is essential to have access to transparent models that allow interpretation, rendering the proposed method beneficial compared to less transparent alternatives. Another possible limitation of the proposed method is to handle timeseries samples that are severely unsynchronized, e.g. high variability in response time between animals after a perturbation.
Mercury II chloride toxicity data visualized by PCA  A new perspective on the time dynamic data was achieved by placing the focus on the analysis and comparison of data based on the changes over the time-series, i.e. the derivative of the time dependent function. This approach provides new insights into the dynamics of the biological system, which may otherwise have been overlooked. The method could also provide the basis for fast and largescale comparison of biological responses studied over time due to different types of perturbations. This can be accomplished by means of comparison of the piecewise weights between sub-models, which can be seen as a multivariate representation of the time-dependent events taking place in the biological system.
The common case of data sampled in a synchronized fashion has mainly been discussed in this paper. The method could easily be modified to handle cases where individuals in the training set are not sampled at the same time Mercury II chloride toxicity data set modelled by piecewise multivariate modelling  δ H Sub-model (i) Mercury II chloride toxicity data modelled by piecewise multivariate modelling points, but where the sampling time is known. In this case the sub-model will be a regression model instead of a discriminant model. In such cases, the boundaries of the local time frames will be chosen so that they are sufficiently local and overlapping in the time domain. If changes between subsequent time points are very small and noisy, while the number of time-points is not limiting, the same approach can be applied by treating some neighbouring time points together in a common timeframe when estimating sub-models.

Conclusion
Given short time-series data of high dimensionality, the proposed multivariate piecewise approach provides more detailed information compared to other commonly applied multivariate methods for analysis of postgenomic data. The temporal resolution for interpretation of the model is enhanced in the sense that it is easier to conclude which changes occur over time and when they occur, improving the interpretation of the data and providing a tool for the understanding of the biological system. The method also allows time predictions, which is an important feature in many biological and clinical applications, where time may represent e.g. disease stage and interventions are evaluated in relation to the disease stage.

Authors' contributions
MR developed the method, evaluated it and drafted the manuscript. OC, TL and TMDE scrutinized the method. EH, JKN and JT supervised the project. All authors read and approved the final manuscript.