Metabolomics relies extensively upon the multivariate analysis of data. Unsupervised data mining tools such as principal component analysis (PCA) [2, 3] and hierarchical clustering [4, 5], and supervised methods such as partial least squares discriminant analysis (PLS-DA) [1, 6], are commonly used to search for patterns and other features within metabolomic data sets. Many multivariate techniques evaluate possible relationships between samples by examining the variance of the data [3, 7], the most notable being PCA, for which principal components (PCs) are calculated along the directions of maximum variance. Hence the structure of the variance within a metabolomics data set can have a major effect on the output of the multivariate analyses. It is therefore important to assess (and modify appropriately) the variance structure of a metabolomics data set prior to multivariate analysis. Variation between samples can be broadly classified into one of two types: 'technical' and 'biological' [2, 8]. Technical variance is created by the experimental procedure and includes sample preparation and analytical measurement errors, whilst biological variance is the inherent variation between samples created by genetic differences, pathological or environmental factors, etc. Because technical variance contributes no useful information for discriminating between biological sample classes, ideally it would not contribute to any multivariate analysis.
Data processing methods can be used to alter the structure of the variance of experimental data sets, helping to focus the subsequent multivariate analysis onto more biologically relevant information arising from the biological variance [2]. Common processing methods in nuclear magnetic resonance (NMR) spectroscopy based metabolomics include autoscaling [1] and Pareto scaling [1]. The generalised logarithm (glog) has also been investigated, but only in a limited number of studies [11]. For each variable in the spectrum, the glog transforms the intensity at that point to a value dependent on both the original intensity and the value of a transform parameter. The equation for the glog transform is shown below, where y represents the untransformed data, λ is the transform parameter, and z is the transformed data.

$$z = \ln\left(y + \sqrt{y^{2} + \lambda}\right)$$
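As an illustration of how the transform acts on individual intensities, the following minimal sketch assumes the commonly cited form of the glog, z = ln(y + sqrt(y^2 + λ)), with λ > 0. The function names, the example intensities and the value of λ are hypothetical and are not taken from the studies cited here.

```python
import math


def glog(y, lam):
    """Generalised logarithm of one intensity y with transform parameter lam > 0.

    Assumed form: z = ln(y + sqrt(y^2 + lam)).
    """
    root = math.sqrt(y * y + lam)
    if y < 0:
        # Algebraically identical to ln(y + root), but avoids subtracting
        # two nearly equal numbers when y is strongly negative.
        return math.log(lam / (root - y))
    return math.log(y + root)


def glog_spectrum(intensities, lam):
    """Apply the glog to every variable (data point) of one spectrum."""
    return [glog(y, lam) for y in intensities]


# Hypothetical intensities: a negative noise point, a zero, and peaks of
# increasing size. The value of lam is arbitrary here, not calibrated.
spectrum = [-0.5, 0.0, 2.0, 50.0, 1000.0]
lam = 1.0
transformed = glog_spectrum(spectrum, lam)
```

Under this form the transform is approximately linear for intensities that are small relative to sqrt(λ) and approximately ln(2y) for large intensities, and, unlike an ordinary logarithm, it remains defined for zero and negative values.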
Autoscaling is a processing technique in which the variance of each variable is scaled to unity and the mean of each variable is set to zero. In Pareto scaling, each variable's intensity is scaled by the square root of the standard deviation of that variable, producing a data set in which the variance still differs from variable to variable, but the range of variances across each spectrum is much reduced relative to the initial, unscaled data.
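The two scaling methods can be sketched as follows for a small data matrix in which rows are samples (spectra) and columns are variables. This is a minimal sketch under stated assumptions: the function name and the example values are hypothetical, the sample standard deviation (n − 1 denominator) is used, and Pareto scaling is taken to include mean-centring, as is conventional, although the description above mentions only the division step.

```python
import math
import statistics


def scale(data, method="auto"):
    """Scale each variable (column) of a samples-by-variables matrix.

    method="auto":   mean-centre, then divide by the standard deviation.
    method="pareto": mean-centre, then divide by sqrt(standard deviation).
    Assumes every variable varies across samples (non-zero deviation).
    """
    if method not in ("auto", "pareto"):
        raise ValueError("method must be 'auto' or 'pareto'")
    scaled_columns = []
    for column in zip(*data):
        mean = statistics.mean(column)
        sd = statistics.stdev(column)  # sample standard deviation
        divisor = sd if method == "auto" else math.sqrt(sd)
        scaled_columns.append([(value - mean) / divisor for value in column])
    # Transpose back to samples-by-variables.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical data: three samples, one small and one large peak.
data = [[1.0, 100.0],
        [2.0, 300.0],
        [3.0, 500.0]]
autoscaled = scale(data, "auto")
pareto_scaled = scale(data, "pareto")
```

In this example the raw variances of the two variables are 1 and 40000. After autoscaling both are 1, whereas after Pareto scaling they are 1 and 200, so the large peak still carries more variance but the range is far narrower than in the unscaled data.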
The glog is a transformation that was originally applied to microarray data [12, 13] and is based on the two-component error model. Specifically, the measurement error of an observation is characterised by two components: one that is proportional to the intensity of the measurement, and a second, additive component that characterises the noise. The glog transform was first applied to one-dimensional (1D) NMR data by Purohit et al. Unlike autoscaling and Pareto scaling, the glog transform initially requires a single parameter to be calibrated from a series of 'technical replicates'. These replicates must be recorded from one biological sample that has been divided into five or more portions, each of which is subject to independent sample preparation and NMR analysis. The variance within this data set arises solely from technical sources [2, 8], and it is upon this variance that the glog transform is calibrated. Hence, when the glog is then applied to a biological data set it effectively reduces the amount of technical variance present, leaving the biological variance to dominate any subsequent multivariate analysis. To date, the glog transform has not been compared against other processing techniques used in metabolomics. Furthermore, the calibrated glog transform has only been tested using a single, relatively small 1D NMR data set.

Recently, due largely to severe peak congestion in 1D NMR spectra, there has been a significant increase in the range of 2D NMR experiments conducted in metabolomics [15–21]. Although many of these experiments require substantially longer acquisition times and so are not appropriate for high throughput metabolomics, 2D J-resolved (JRES) spectroscopy has been shown to provide spectra with low peak congestion and high metabolite specificity in a short acquisition time [15, 20]. Consequently, several multivariate analyses of 1D projections of 2D JRES spectra have been reported [15, 18–21]. To our knowledge, the applicability of the glog transform, including the initial calibration of the function using technical replicates, has not been evaluated for these 1D projections of 2D JRES spectra, nor for the analysis of the intact 2D JRES spectra.
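To make the idea of calibrating on technical replicates concrete, the sketch below picks the transform parameter by a simple grid search, choosing the λ for which the replicate-to-replicate variance is most uniform across variables. This is a deliberately simplified stand-in and should not be read as the calibration procedure used in the cited studies, which this passage does not specify. The glog form, the selection criterion, the candidate grid and the synthetic replicates (generated from an assumed two-component error model) are all assumptions made for illustration.

```python
import math
import random
import statistics


def glog(y, lam):
    """Assumed glog form: z = ln(y + sqrt(y^2 + lam)), with lam > 0."""
    root = math.sqrt(y * y + lam)
    # The y < 0 branch is algebraically identical but numerically safer.
    return math.log(lam / (root - y)) if y < 0 else math.log(y + root)


def variance_spread(replicates, lam):
    """Coefficient of variation of the per-variable technical variances.

    replicates: list of spectra (rows) recorded from one biological sample.
    A small value means the variance is similar for every variable.
    """
    variances = [
        statistics.variance([glog(y, lam) for y in column])
        for column in zip(*replicates)
    ]
    return statistics.stdev(variances) / statistics.mean(variances)


def calibrate_lambda(replicates, candidates):
    """Return the candidate lam giving the most uniform technical variance."""
    return min(candidates, key=lambda lam: variance_spread(replicates, lam))


# Synthetic technical replicates: six spectra of six variables, with a
# multiplicative error (about 10%) plus additive noise (standard deviation 1).
random.seed(0)
true_levels = [0.0, 0.0, 5.0, 20.0, 100.0, 500.0]
replicates = [
    [mu * math.exp(random.gauss(0.0, 0.1)) + random.gauss(0.0, 1.0)
     for mu in true_levels]
    for _ in range(6)
]
candidates = [10.0 ** k for k in range(-2, 7)]
best_lam = calibrate_lambda(replicates, candidates)
```

The selected `best_lam` would then be held fixed and applied to the biological data set, which is the sense in which the transform is calibrated on technical variance alone.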
Here, we first aimed to evaluate comprehensively the glog transform against two other commonly used scaling methods in NMR metabolomics, as well as against unscaled data. This evaluation was conducted using three disparate data sets to confirm the broad applicability of the approach: urine samples to discriminate between two dog breeds, muscle tissue extracts to discriminate between hypoxia and normoxia in marine mussels, and liver tissue extracts to discriminate between fish collected from two different rivers. The performance of each scaling method (autoscaling, Pareto and glog) was assessed by conducting PCA of each of the processed two-class data sets and then calculating the sensitivities and specificities obtained by applying linear discriminant analysis (LDA) to each of the resulting PCA scores plots. The effect of each scaling method upon the ability to discover potential metabolic biomarkers was also investigated, by selecting the largest peaks in the PCA loadings plots and then evaluating whether the corresponding peaks in the NMR spectra differed significantly in intensity between the biological classes. Secondly, we aimed to evaluate the applicability of the glog transform for 1D NMR spectra, 1D projections of 2D JRES spectra, and intact 2D JRES spectra. This enabled the first NMR metabolomics study of intact 2D JRES spectra, including the reconstruction of the PCA loadings plot to a 2D format analogous to the JRES spectra, which is anticipated to be of significant benefit in easing metabolite identification. During this second aim we also sought to extend the glog transform to reduce the deleterious effects of noise.
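The sensitivity and specificity calculation can be illustrated with a reduced, one-dimensional case. The sketch below assumes that samples are classified on a single PC score, that the two classes have equal priors and a shared variance (in which case two-class LDA reduces to assigning each sample to the nearer class mean), and that performance is computed on the same samples used to fit the classifier. The scores, the class labels and the function names are hypothetical, and the actual analysis may have used more than one PC or a different validation scheme.

```python
import statistics


def lda_1d_predict(scores, labels, positive):
    """Two-class LDA on one score axis (equal priors, shared variance).

    Returns True for each sample assigned to the positive class, that is,
    each sample whose score lies nearer the positive-class mean.
    """
    pos_mean = statistics.mean(s for s, l in zip(scores, labels) if l == positive)
    neg_mean = statistics.mean(s for s, l in zip(scores, labels) if l != positive)
    return [abs(s - pos_mean) < abs(s - neg_mean) for s in scores]


def sensitivity_specificity(predicted_positive, labels, positive):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(predicted_positive, labels))
    tp = sum(1 for p, l in pairs if p and l == positive)
    fn = sum(1 for p, l in pairs if not p and l == positive)
    tn = sum(1 for p, l in pairs if not p and l != positive)
    fp = sum(1 for p, l in pairs if p and l != positive)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical PC1 scores for a two-class data set.
scores = [1.5, 1.9, 2.3, -0.6, -2.1, -1.7, -0.2, 0.4]
labels = ["hypoxia"] * 4 + ["normoxia"] * 4
predicted = lda_1d_predict(scores, labels, positive="hypoxia")
sensitivity, specificity = sensitivity_specificity(predicted, labels, "hypoxia")
```

For these example scores, one sample from each class falls on the wrong side of the midpoint between the class means, so both the sensitivity and the specificity are 0.75.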