Three data sets were used to highlight the broad applicability of the generalised log transformation across multiple biological species and sample types. The three data sets comprised spectra of mammalian (canine) urine, extracts of marine mussel adductor muscle, and extracts of fish liver. The preparation, NMR analysis and processing of each is described below.
Sample Preparation and Collection of NMR Spectra
Canine urine
Free-catch urine samples were collected over several days from two breeds of dog (17 samples from three male Labradors and 20 samples from four male Miniature Schnauzers), frozen at -80°C, and subsequently prepared and analysed using the methods described elsewhere [22]. Briefly, urine was diluted in a sodium phosphate buffer (pH 7.0; 100 mM final concentration) containing sodium 3-trimethylsilyl-2,2,3,3-d4-propionate (TMSP; 1 mM final concentration), 0.2% sodium azide and 8% D2O. The sample pH was then manually adjusted to 7.05 (± 0.05) using 1 M HCl or NaOH. Samples were analyzed on a Bruker 500 MHz NMR spectrometer equipped with a 5 mm cryoprobe and BACS-60 automatic sample changer. 2-D 1H, 1H JRES NMR spectra were collected with 16 increments using excitation sculpting to suppress the water resonance, and transients were processed using methods described previously [15], yielding 1-D skyline projections of the JRES spectra (termed pJRES). In addition to the 37 individual urine samples, an additional pooled sample was split into 5 fractions and each of these was then prepared and analysed separately, using the same methods as above. This provided the spectra of technical replicates needed to calibrate the generalised log transform.
Mussel adductor muscle
Muscle tissues were dissected from two groups of Mediterranean mussels (Mytilus galloprovincialis), the first group being hypoxic (i.e. oxygen deficient; n = 12) and the second group normoxic (n = 15). The tissues were prepared using a methanol:chloroform extraction protocol as recently reported [23]. Polar extracts were dried and then resuspended in 100 mM sodium phosphate buffer (pH 7.0; 1 mM TMSP; 10% D2O). 1-D 1H NMR spectra of the polar metabolites were collected, as described previously [24]. Similar to the canine urine study, an additional pooled tissue sample was homogenised, split into 6 fractions and then each fraction was extracted and analysed separately, providing spectra of technical replicates.
Fish liver
European flounder (Platichthys flesus) were sampled from the River Alde, UK (n = 20) and the River Tyne, UK (n = 18). Liver tissue was rapidly dissected and then extracted using the methanol:chloroform protocol as above [23]. All polar extracts were dried and resuspended in 90% H2O and 10% D2O with sodium phosphate buffer (100 mM; pH 7.0) containing 0.5 mM TMSP. 2-D 1H, 1H JRES NMR spectra were collected using methods described above. Again, an additional pooled tissue sample was homogenised, split into 5 fractions and then each was extracted and analysed separately, providing spectra of technical replicates.
Technical Replicates
It should be noted that for all data sets, the technical replicates form an integral part of calibrating the glog transformation. A minimum of five or six replicates should be generated from a single homogenous pool of the relevant biological material for each data set. Ideally, this pool of biological material is formed by mixing several smaller amounts of different samples from all experimental classes (e.g., control and stressed).
Data Processing
The 1-D, pJRES and 2D JRES NMR spectra were converted to an appropriate format for multivariate analysis using custom-written ProMetab software [15] running within MATLAB (version 7.1; The MathWorks, Natick, MA). All spectra were sectioned into 1960 chemical shift bins between 0.2 and 10.0 ppm, corresponding to a bin width of 0.005 ppm. Note that the 2D JRES spectra were not "binned" along the J coupling dimension at this stage of the processing. Next, a series of bins were removed from each data set: for the canine urine from 4.50–6.45 ppm (residual water and urea); for the mussel adductor muscle from 4.70–5.15 ppm (residual water) and 7.60–7.76 ppm (chloroform); and for fish liver from 4.60–5.20 ppm (residual water). The spectra for each data set were then normalised to a total spectral area of unity for ease of comparison between samples. Next, due to slight pH-induced chemical shift variations of some peaks between samples, groups of bins were each compressed into single bins: for the canine urine ten regions were compressed between 2.40–2.425, 2.52–2.57, 2.66–2.71, 2.935–2.955, 2.96–2.98, 3.105–3.130, 3.72–3.77, 3.955–3.990, 7.08–7.20 and 8.00–8.18 ppm; for the mussel adductor muscle between 7.08–7.10 and 7.84–7.875 ppm; and for fish liver five regions were compressed between 7.74–7.77, 7.77–7.79, 7.94–7.955, 7.97–8.03 and 8.23–8.25 ppm. Compression regions were chosen by visually inspecting the superimposed NMR spectra and then selecting regions of the spectra that showed pH or matrix induced chemical shift variation. Finally, for the fish liver only, the increments of each intact 2D JRES spectrum (i.e. the rows of the 2D data matrix representing each spectrum) were concatenated into a single row vector of dimension 232,448 containing the intensities of each bin in the spectrum, allowing the JRES spectra to be analysed in a similar manner to the 1D and pJRES spectra, described below.
Scaling Methods
After each data set was binned, normalised and bin compressed – and for the intact 2D JRES spectra, concatenated – the following scaling techniques were applied:
Autoscaling
The variance of each bin was scaled to unity by dividing the intensity of each bin by the standard deviation of that bin; note that mean centring was not applied yet.
Pareto scaling
The intensity of each bin was divided by the square root of the standard deviation of that bin; again, mean centring was not applied at this point.
Glog transformation
The glog transformation is given in equation (1), where z is the intensity of the transformed data, y is the intensity of each original bin, and λ is a parameter that affects the gradient of the function (see Figure 11). This parameter must be found prior to using the transformation as it is specific to each type of biological sample and set of NMR conditions.
In order to calculate the transform that minimises the technical variance, λ is calibrated using technical replicates generated from a single pooled biological sample. The replicate spectra are processed in exactly the same manner as the biological data set, i.e. normalisation, compression regions etc, to ensure all technical variance is accounted for when calibrating the glog parameters. The calibration was achieved using a maximum likelihood method proposed by Rocke and Durbin [25]. To avoid scaling artefacts arising from the change in scale between the untransformed and transformed variables the Jacobian, of the glog function is used as a scaling factor as described by Rocke and Durbin. Here however, we have used an alternative scaling function which maintains most of the properties of the Jacobian but is computationally more robust; shown in equation (2), where z
i
represents bin i and n is the total number of bins contained within the spectrum being scaled.
(2)
The parameter λ was optimised by minimising the variance, S, (3) over k technical replicates and all n bins in the Jacobian-scaled data vectors w
j
= z
j
J, giving a measure of all variance contained within the technical replicates.
(3)
Here, is calculated as the mean spectrum of all scaled and transformed technical replicates, w
j
. Minimising the variance S thus gives an optimal value for λ.
The optimisation of λ is achieved via the Nelder-Mead unconstrained non-linear minimization routine in the MATLAB optimisation toolbox. The optimised λ value was then used to transform the binned intensities of each spectrum in the full biological data set. The MATLAB code developed here is included as additional file 2.
The extended glog is given in equation (4) where an extra transformation parameter y0 has been added. As illustrated in Figure 11, y0 shifts the transformation function so that the bins with the lowest intensities are scaled by the section of the glog function which has a relatively small slope.
(4)
The parameter y0 was calibrated by first estimating the noise contained within the spectra of technical replicates. The noise level was set to the smallest standard deviation of 32 equally sized regions across the spectra [26]. The shift y0 of the glog function was then determined by calculating the point in glog where the slope of the function increases, i.e. by calculating the point where the second derivative of z in equation (5) has its maximal value. This point was typically set to three times the noise value of the spectrum. This shift ensures that the noise of the spectrum is minimally scaled by the flat region of the glog function while the larger intensity bins remain transformed to the higher values. Thus the noise is effectively suppressed relative to those bins corresponding to low and medium intensity peaks.
(5)
Since y0 depends on the choice of λ the optimisation of λ must be carried out first followed by the calculation of y0. In some cases it may be necessary to optimise λ a second time after y0 has been set, in particular for very noisy data.
For both calibration methods described here, the minimisation routine was terminated when the absolute change in λ was less than a predetermined value (here 1 × 10-16) or a maximum number of iterations was completed (here 1 × 103). Table 1 contains the optimised λ and y0 values for the glog and extended glog transformations for each of the three biological samples investigated.
Analysis of Models
Each unscaled or scaled data set was then mean centred and PCA performed using PLS_Toolbox (Eigenvector Research, Inc., Wenatchee, WA, USA). Next, using the Discriminant Analysis Toolbox (Michael Kiefte, Dalhousie University, Canada [27]) Fisher's LDA was applied to the first and second PCs of the PCA scores plot, producing a decision region for each two-class problem. This decision region was then used to construct classification statistics (sensitivities and specificities) to evaluate the effects of the scaling techniques upon each data set (Table 2). Leave-one-out cross-validation was performed on the PCA-LDA models to assess the robustness of the analyses (Table 2). The coefficient of variation (CV) for each bin, given as the standard deviation divided by the mean, was calculated for each set of technical replicates, excluding bins with an intensity lower than the estimated noise level of the spectrum (i.e., the CV was calculated using only those bins that contained peaks). The median and range of these CVs were calculated for each of the three data sets. Additionally, PCA loadings plots for the 1D and pJRES data (Figures 5 and 7) were produced by constructing the linear combination of the loadings along PC1 and PC2 that is perpendicular to the LDA decision line. The loadings plot for the 2D JRES experiment, shown in 2D matrix format to mimic an intact 2D JRES spectrum (Figures 10B and 10C), was reconstructed from the row vector containing the loadings of the concatenated spectra. To evaluate the discriminatory potential of metabolic biomarkers discovered in the loadings plots, one-way analysis of variance (ANOVA) was performed on each of the 5 bins with the largest absolute loadings values, for each data set and method of scaling.