Automics: an integrated platform for NMR-based metabonomics spectral processing and data analysis
 Tao Wang^{1, 2},
 Kang Shao^{4},
 Qinying Chu^{5},
 Yanfei Ren^{2},
 Yiming Mu^{6},
 Lijia Qu^{2},
 Jie He^{4},
 Changwen Jin^{1, 2, 3} and
 Bin Xia^{1, 2, 3}
DOI: 10.1186/1471-2105-10-83
© Wang et al; licensee BioMed Central Ltd. 2009
Received: 06 October 2008
Accepted: 16 March 2009
Published: 16 March 2009
Abstract
Background
Spectral processing and post-experimental data analysis are the major tasks in NMR-based metabonomics studies. Commercial and freely licensed software tools are available to assist with these tasks, but because each package generally focuses on specific tasks, researchers usually have to use several packages in a single study. It would be beneficial to have a highly integrated platform in which these tasks can be completed within one package. Moreover, with an open source architecture, newly proposed algorithms or methods for spectral processing and data analysis can be implemented much more easily and accessed freely by the public.
Results
In this paper, we report an open source software tool, Automics, which is specifically designed for NMR-based metabonomics studies. Automics is a highly integrated platform that provides functions covering almost all stages of NMR-based metabonomics studies. It provides high-throughput automatic modules implementing recently proposed algorithms, together with powerful manual modules, for 1D NMR spectral processing. In addition to spectral processing functions, powerful features for data organization, data preprocessing, and data analysis have been implemented. Nine statistical methods can be applied to analyses: feature selection (Fisher's criterion), data reduction (PCA, LDA, ULDA), unsupervised clustering (K-Mean), and supervised regression and classification (PLS/PLS-DA, KNN, SIMCA, SVM). Moreover, Automics has a user-friendly graphical interface for visualizing NMR spectra and data analysis results. The functional ability of Automics is demonstrated with an analysis of a type 2 diabetes metabolic profile.
Conclusion
Automics facilitates high-throughput 1D NMR spectral processing and high-dimensional data analysis for NMR-based metabonomics applications. Using Automics, users can complete spectral processing and data analysis within one software package in most cases. Moreover, with its open source architecture, interested researchers can further develop and extend this software based on the existing infrastructure.
Background
Since Nicholson et al. introduced the terminology [1], metabonomics has entered a period of rapid development. Metabonomics is now widely applied in research areas such as drug toxicology, biomarker discovery, gene function study, functional genomics, natural products research, and molecular pathology [2–5]. Metabonomics studies rely strongly on multiple analytical techniques, which afford a wide range of information for the metabolic characterization of biological samples [6–10]. Based on the acquired spectra, data models can be constructed by statistical analysis, pattern recognition methods, and machine learning methods to explain the dynamic activities of metabolites in organisms. Because of the quantity and complexity of the spectroscopic data, a major challenge of metabonomics studies is data processing and data interpretation [11]. Software tools therefore play a significant role in metabonomics studies, and considerable effort has gone into software development [12].
Nuclear magnetic resonance (NMR) is widely used in metabonomics studies. Compared to other analytical techniques, NMR has the advantages of fully quantitative analysis and minimal sample preparation [13]. In the "classical" NMR-based metabonomics approach, the post-experimental data handling that follows NMR data collection, including NMR spectral processing, data preprocessing and data analysis, is critical for obtaining good results. To assist these procedures, several software tools have been released, such as AMIX (Bruker Biospin, Germany), KnowItAll (Bio-Rad, USA), Chenomx NMR Suite (Chenomx, Canada) and HiRes [14]. NMRPipe, a widely used traditional NMR data processing software tool [15], now also provides some metabonomics-related features. To our knowledge, most of the existing metabonomics tools are commercial products, with the exception of HiRes, which is freely licensed. These software tools provide many functions for spectral processing and for comprehensive identification and quantification of metabolites. Some of them also provide features for basic data analysis, such as principal component analysis (PCA). However, because of the complexity of NMR data and the variety of application purposes, further data analysis procedures are usually required, such as filtering out unwanted variations in a dataset (e.g. background noise or uncorrelated variation in the data model) and generating and applying predictive classification or regression models. To complete these tasks, researchers usually have to invoke other advanced chemometrics or statistical tools. Commercial software packages such as MATLAB (Mathworks, USA), SIMCA-P (Umetrics, Sweden) and SPSS (SPSS, USA) are frequently used by researchers.
In this report, we introduce a new software tool, Automics, which is to our knowledge the first highly integrated open source software designed specifically for metabonomics. Automics runs on the Microsoft Windows platform and is developed with Visual C++. It provides features for almost all stages of metabonomics studies, including NMR spectral processing (high-throughput automatic modules and convenient manual modules), data organization, data preprocessing (four data filtering methods), and data analysis (nine data analysis methods), along with other useful functions such as statistical total correlation spectroscopy (STOCSY), an expression calculator and database resource exploring. Automics enables researchers to carry out most of their studies within a single software tool and thus avoid the extensive training needed to use several different tools properly. Furthermore, given the keen interest of researchers in developing new algorithms for spectral processing, data preprocessing and data analysis, Automics can serve as a framework for quickly implementing these new algorithms and other useful features, owing to its open source architecture. Because it provides basic data structures, data management and low-level functions, Automics allows interested developers to focus on the core algorithms instead of implementing infrastructure.
Implementation
Overview of software system
Spectra format conversion
Automics takes Bruker (Bruker Biospin, Germany) format raw FIDs and XWIN-NMR processed spectra by default. For NMR data collected on spectrometers from other vendors, FID format conversion should be carried out with a conversion module. Before conversion, a metadata file is first created in a format definition module. This file contains parameters for the source format: FID filename, acquisition parameter filename, spectral width (Hz), chemical shift of the spectral center (carrier position), observe frequency, optional FID file header size, byte order (little/big endian) and variable type of data points in computer memory (integer, 16/32 bits; float, 32/64 bits). Based on these parameters, Automics can convert raw FIDs in a variety of existing NMR FID formats, such as those from Varian (Varian Inc., USA) and JEOL (JEOL Ltd, Japan), to the Bruker FID data format and rewrite them in Bruker-style directories.
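As an illustration of how the header size, byte order and variable type parameters could drive the decoding of a raw FID, a minimal Python sketch follows. It is not the Automics conversion module: the names `FidFormat` and `decode_fid` are hypothetical, and the assumption that data points are stored as interleaved real/imaginary pairs is ours and does not hold for every vendor format.

```python
import struct
from dataclasses import dataclass

@dataclass
class FidFormat:
    """Hypothetical holder for a subset of the format-definition parameters."""
    header_size: int = 0      # bytes to skip before the first data point
    big_endian: bool = True   # byte order of the source file
    kind: str = "int32"       # one of: int16, int32, float32, float64

# struct type codes for the supported data point types
_CODES = {"int16": "h", "int32": "i", "float32": "f", "float64": "d"}

def decode_fid(raw: bytes, fmt: FidFormat) -> list[complex]:
    """Decode raw bytes into complex points (assumes interleaved real/imaginary values)."""
    order = ">" if fmt.big_endian else "<"
    code = _CODES[fmt.kind]
    body = raw[fmt.header_size:]
    size = struct.calcsize(order + code)
    count = len(body) // size
    values = struct.unpack(f"{order}{count}{code}", body[:count * size])
    # pair consecutive values as (real, imaginary); a trailing odd value is dropped
    return [complex(values[k], values[k + 1]) for k in range(0, count - 1, 2)]
```

Writing the decoded points back out in the Bruker layout would be a separate step and is not shown here.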
Manual spectral processing
Although high throughput automatic spectral processing is one of the major goals and an important feature of Automics, we believe a powerful and convenient manual processing module is still necessary. For situations when the automatic method does not work well (e.g. processing severely distorted spectra), the manual method may be an effective way to correct spectra.
Automics provides an easy-to-use manual spectral processing module. For spectral visualization, features such as spectral editing, peak labeling, peak information browsing, moving and zooming are supported. To change the display properties of the visualized spectrum, a dialog can be used to set properties such as line color, line width, line style and background color. For spectral processing, floating tool bars can be used to continuously adjust the zero-order and first-order phases for interactive phase correction, or to adjust the coefficients of a selected fitting function (polynomial, sine or exponential) for interactive baseline correction. Other commonly used features, such as referencing, peak picking and spectral derivative, are also supported.
High throughput automatic spectral processing
Fast Fourier transform
FFT converts NMR signals from the time domain to the frequency domain. This module can perform both complex FFT and real FFT. In addition, DC offset correction, zero filling, application of a window function with a specified line-broadening factor, and removal of a digital filter imposed on the FID (such as that from a Bruker Avance spectrometer) are also carried out in this module.
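The order of operations in this module can be sketched as follows. This is a minimal illustration under our own assumptions, not the Automics code: the DC offset is estimated from the tail of the FID, the window is the conventional exponential exp(-π·LB·t), zero filling doubles the size to a power of two, and digital filter removal is omitted. The names `fft` and `process_fid` are hypothetical.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 complex FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def process_fid(fid, sw_hz, lb_hz=0.3, tail_fraction=0.25):
    """DC correction, exponential window, zero filling, then FFT."""
    n = len(fid)
    tail = fid[int(n * (1 - tail_fraction)):]
    dc = sum(tail) / len(tail)            # DC offset estimated from the FID tail
    dwell = 1.0 / sw_hz                   # time between complex points
    damped = [(p - dc) * math.exp(-math.pi * lb_hz * k * dwell)
              for k, p in enumerate(fid)]
    size = 1
    while size < 2 * n:                   # zero fill to a power of two >= 2n
        size *= 2
    damped += [0j] * (size - n)
    return fft(damped)
```

A synthetic FID containing a single decaying resonance at one quarter of the spectral width should then give a spectrum whose largest point sits one quarter of the way through the output.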
Automatic phase correction
Traditionally, phase correction is accomplished manually by trial and error until the real part of the Fourier-transformed spectrum appears globally as a pure NMR absorption spectrum, so the corrected spectrum depends on the operator's experience. Several automatic methods have been proposed to estimate the zero-order and first-order phases [16–19]. The global methods, such as maximization of the spectrum integral, minimization of the spectrum entropy [20], and a recently patented method by Bruker (which uses a fingerprint of the first derivative as the objective function for the real part of the spectrum and tunes the spectrum until its real part best matches the fingerprint), require extensive iterative computing and are time consuming. Other methods, such as those based on the dispersion versus absorption relationship (DISPA) [21, 22], on symmetrizing lines [23] and on phase angle measurement from peak areas (PAMPAS) [24], share the common feature that they determine the zero-order and first-order phases by linear regression over a set of selected peaks. For these methods, it is critical to find a set of appropriate isolated individual peaks. However, in metabonomics studies, NMR signals from hundreds of metabolites in the samples often cause severe peak overlap, which may affect the accuracy of the phase correction.
Besides the two global methods implemented in Automics (maximization of the spectrum integral and minimization of the spectrum entropy), we have introduced another method for automatic phase correction that is easier to implement. This method does not require detection of isolated individual peaks and is efficient for processing large numbers of similar spectra in metabonomics studies. In a 1D ^{1}H NMR spectrum with a good baseline, the regions near the two ends of the spectrum are normally free of signals. These regions usually belong to the baseline, and they should have a nearly horizontal straight line shape after correction. Based on this principle, enough information can be acquired to calculate the zero-order (phc0) and first-order (phc1) phases. The method consists of the following steps:
(1) Define two pairs of small regions, T_{i} and T_{j}, and T_{k} and T_{l} (i, j, k and l are the center positions of the regions, in data points), each with a window length L (for example, 30 data points). T_{i} and T_{j} lie in the higher-frequency baseline region of the spectrum, and T_{k} and T_{l} lie in the lower-frequency baseline region. Sum the data points in each region to obtain the real parts (R_{i}, R_{j}, R_{k}, R_{l}) and the imaginary parts (I_{i}, I_{j}, I_{k}, I_{l});
(2) Determine two phase errors (θ_{0}, θ_{1}) at positions (i + j)/2 and (k + l)/2. Because all four regions belong to the baseline, and the distance between the two regions of each pair is very small, the two regions of each pair should have approximately the same phase error (with a distance of 100 data points and a 300° first-order phase error, the difference between T_{i} and T_{j} is usually smaller than 4°). Therefore, T_{i} and T_{j} should have nearly the same intensity after correction, as should T_{k} and T_{l}. The two phase errors can thus be calculated from the following equations:
R_{i}cos(θ_{0}) + I_{i}sin(θ_{0}) = R_{j}cos(θ_{0}) + I_{j}sin(θ_{0})
R_{k}cos(θ_{1}) + I_{k}sin(θ_{1}) = R_{l}cos(θ_{1}) + I_{l}sin(θ_{1})
Therefore, the phase errors can be expressed as follows, where m and n are integers:
θ_{0} = arctan((R_{i} − R_{j})/(I_{j} − I_{i})) + mπ
θ_{1} = arctan((R_{k} − R_{l})/(I_{l} − I_{k})) + nπ
(4) Correct the spectrum with the determined phc0 and phc1.
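Steps (1) and (2) can be sketched in a few lines of Python. This is only an illustration of the arctangent relation for one pair of regions, with hypothetical function names; it returns the principal value and does not resolve the multiple-of-π ambiguity, and it does not cover the conversion of θ_{0} and θ_{1} into phc0 and phc1.

```python
import cmath
import math

def region_sum(spectrum, center, window=30):
    """Sum `window` complex points approximately centred on index `center`."""
    start = center - window // 2
    return sum(spectrum[start:start + window])

def pair_phase_error(spectrum, a, b, window=30):
    """Principal value of the phase error from one pair of baseline regions.

    Solves R_a cos(t) + I_a sin(t) = R_b cos(t) + I_b sin(t) for t,
    i.e. t = arctan((R_a - R_b) / (I_b - I_a)), defined up to a multiple of pi.
    """
    sa = region_sum(spectrum, a, window)
    sb = region_sum(spectrum, b, window)
    num = sa.real - sb.real
    den = sb.imag - sa.imag
    if den == 0.0:
        return math.pi / 2
    return math.atan(num / den)

# Synthetic check: a flat real baseline with a drifting imaginary part,
# distorted by a constant phase error of 0.4 rad.
flat = [complex(1.0, 0.01 * n) for n in range(400)]
observed = [c * cmath.exp(0.4j) for c in flat]
theta = pair_phase_error(observed, 50, 150)
```

In this synthetic case the two regions have equal real parts once the distortion is undone, so `theta` recovers the 0.4 rad that was applied.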
Automatic baseline correction
The current version of Automics provides two methods for automatic baseline correction: linear fitting and nonparametric recognition. The linear fitting method uses predefined positions in the spectrum to calculate the coefficients of a linear function, which are then used to construct a baseline. Our experience has shown that this method works well in most cases, despite its simplicity.
A nonparametric method was implemented with a variant of Sergey's algorithm [25]. It consists of two steps. The first is baseline recognition. To decide whether a data point belongs to the baseline, the first derivative of the spectrum is calculated; it distinguishes sharp peaks from hump regions in the spectrum and helps to recognize baseline regions. A data point is considered to be on the baseline if the absolute intensity of the corresponding point in the derivative spectrum is below a predefined noise threshold. The second step is to construct a smoothed baseline from the recognized data points using a moving convolution window. The baseline is then subtracted from the original spectrum, giving the baseline-corrected spectrum.
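The linear fitting idea can be illustrated with a short Python sketch. It assumes an ordinary least-squares fit through the intensities at the predefined positions, which is our choice for the example; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def linear_baseline_correct(y, anchor_idx):
    """Subtract a straight baseline fitted at the predefined point indices."""
    slope, intercept = fit_line(anchor_idx, [y[k] for k in anchor_idx])
    return [v - (slope * k + intercept) for k, v in enumerate(y)]

# A tilted baseline with one peak at index 50; anchors sit in signal-free regions.
spectrum = [0.5 + 0.01 * k for k in range(100)]
spectrum[50] += 10.0
corrected = linear_baseline_correct(spectrum, [5, 10, 90, 95])
```

After correction the signal-free points are close to zero and the peak keeps its height above the baseline.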
Peak alignment
Frequency shifts due to unstable experimental and instrumental conditions are one of the main sources of unwanted variation in subsequent data analysis. These variations obscure pattern discovery and impede the performance of data analysis. Peak alignment is an essential step for removing the effects of such variations from spectral datasets. Spectral referencing, which sets the inner reference peak (DSS/TSP) of each spectrum to 0 ppm, can be regarded as a simple global method for peak alignment. This method shifts the entire spectrum based on the position of the same reference peak, so spectra with global peak misalignments are well aligned. However, it is not sufficient for correcting individual peak misalignments, such as those in spectra of urine samples with variable solution conditions. Several methods have been proposed to solve this problem. A genetic algorithm can align peaks in automatically selected segments of each spectrum to the corresponding peaks in a preselected reference spectrum [26]. A principal component analysis method can identify and adjust individual peak variations by examining the correlation between peak-derivative shapes and the second or higher order principal components (PCs) [27]. Because these two methods deal with every data point in the regions of interest, both are time consuming. Automics implements a fuzzy warping method [28]. This method detects the maximal positions of peaks in each spectrum and aligns them to a reference spectrum according to their similarity, which is determined with a fuzzy Gaussian function. It is more efficient than the two approaches mentioned above because the vector of processed peaks is much smaller than the full spectrum.
Bucket/binning and normalization
Bucket/binning is a commonly used technique for digitizing a spectrum into a row vector. It has the advantages of minimizing misalignment effects and reducing data dimensionality (usually from several thousand data points to several hundred bins) for further analysis. However, it also leads to lower data resolution. In the extreme case, each bin contains a single data point and the full resolution is retained. The quality of data generated in this way relies heavily on accurate peak alignment, and subsequent analyses such as element-based leave-one-out cross-validation may carry a heavy computing burden.
Automics provides both full-resolution (data point) bucket/binning and traditional bucket/binning with equal bin width (Fig. 3–D). For the second option, we implement a method to determine an appropriate bin width that balances resolution and dimensionality. First, the peak widths of all identified peaks in each spectrum are determined. Then, the average peak width of the "sharper peaks" half is used as the bin size. Peak finding and peak width determination are carried out as follows:
(1) Noise filtering: use a Savitzky-Golay filtering window to smooth the spectrum by removing high-frequency noise with a predefined noise threshold.
(2) Peak finding: Among the data points whose intensities are above the threshold, find all maximal points. A maximal point is defined as a point with a number of adjacent consecutive data points on both sides that all have smaller intensities than this point; meanwhile, the intensities of these points on each side are in descending order.
(3) Peak width determining: For each side of a maximal point, the total number of data points whose intensities are in descending order is counted. The sum of the two numbers for both sides is used as peak width for this peak.
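Steps (2) and (3), together with the "sharper half" rule, can be sketched as follows. The sketch assumes the spectrum has already been smoothed, takes the minimum number of descending points per side (`min_side`) as a free parameter of our own, and uses hypothetical function names.

```python
def descending_run(y, start, step):
    """Count consecutive points, moving away from `start`, whose intensities keep decreasing."""
    count, k = 0, start
    while 0 <= k + step < len(y) and y[k + step] < y[k]:
        k += step
        count += 1
    return count

def find_peaks(y, threshold, min_side=2):
    """Return (index, width) for each maximal point above the threshold."""
    peaks = []
    for k in range(len(y)):
        if y[k] <= threshold:
            continue
        left = descending_run(y, k, -1)
        right = descending_run(y, k, +1)
        if left >= min_side and right >= min_side:
            peaks.append((k, left + right))   # width = descending points on both sides
    return peaks

def bin_width_from_peaks(peaks):
    """Average width (in data points) of the sharper half of the peaks."""
    widths = sorted(w for _, w in peaks)
    sharper = widths[:max(1, len(widths) // 2)]
    return sum(sharper) / len(sharper)

y = [0, 0, 1, 3, 6, 3, 1, 0, 0, 0, 0, 1, 2, 4, 7, 9, 7, 4, 2, 1, 0, 0]
peaks = find_peaks(y, threshold=2.5)
width = bin_width_from_peaks(peaks)
```

For this toy vector the two maxima have widths of 6 and 10 data points, so the suggested bin width is 6 data points.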
In addition to the two methods mentioned above, an intelligent adaptive binning method was also implemented in Automics [29]. This method recursively identifies bin edges in existing bins and requires minimal user input. It can largely circumvent problems such as the loss of information due to low resolution, the occurrence of artifacts caused by frequency shifts and the presence of noise variables. Generally, each row vector produced by bucket/binning must be normalized before further data analysis. Four normalization methods are available in Automics: normalizing against the total spectral area, the maximum peak area, the inner reference peak area, or a specific peak area. After bucket/binning and normalization, the resulting data matrix can be saved to a comma-delimited text file or exported directly to a worksheet for further analysis in Automics.
Data organization and data preprocessing
A worksheet module was developed in Automics for data organization. Data preprocessing and data analysis procedures are all based on the data in the active worksheet. Automics can import and export data files in text format or Microsoft Excel spreadsheet format. Commonly used editing functions and some basic statistical analyses (column-based statistics, row-based statistics and matrix standardization) are supported.
To remove undesirable systematic variations from the spectroscopic data before data analysis, four commonly used data filter methods were integrated into Automics: multiplicative signal correction (MSC) [30], standard normal variate transform (SNV) [31], direct orthogonal signal correction (DOSC) [32] and orthogonal projections to latent structures (OPLS) [33]. DOSC is a variant of the well-known orthogonal signal correction (OSC) algorithm [34]. It is a powerful method for removing, from the observation variables (X matrix), structured variation that is orthogonal to the response variables (Y matrix). However, in some cases not all of the structured Y-orthogonal variation needs to be removed; only the irrelevant variation that creates problems for PLS (or other regression methods) should be removed. OPLS is a generic hybrid OSC+PLS method that takes the objective of the PLS regression model into account and removes Y-orthogonal variation when necessary.
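The two simpler filters, SNV and MSC, can be illustrated with a short Python sketch based on their usual textbook definitions: SNV centers each spectrum and divides it by its own standard deviation, and MSC regresses each spectrum on the mean spectrum and removes the fitted offset and slope. The function names are hypothetical and the code is not taken from Automics.

```python
from statistics import fmean, stdev

def snv(row):
    """Standard normal variate: centre one spectrum and scale by its own SD."""
    m, s = fmean(row), stdev(row)
    return [(v - m) / s for v in row]

def msc(rows):
    """Multiplicative signal correction against the mean spectrum."""
    ref = [fmean(col) for col in zip(*rows)]       # mean spectrum as reference
    ref_mean = fmean(ref)
    sxx = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for row in rows:
        row_mean = fmean(row)
        # least-squares fit: row ~ intercept + slope * ref
        slope = sum((r - ref_mean) * (v - row_mean) for r, v in zip(ref, row)) / sxx
        intercept = row_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in row])
    return corrected
```

When the spectra differ only by an additive offset and a multiplicative factor, MSC maps every one of them onto the mean spectrum.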
Data analysis module
After data preprocessing, data analysis modules can be invoked to analyze the data and build data models. Automics provides nine different pattern recognition methods for data analysis. These include a feature (variable) selection method (Fisher's criterion (FC) [35]), data reduction methods (principal component analysis (PCA), linear discriminant analysis (LDA) [36], uncorrelated linear discriminant analysis (ULDA) [37, 38]), an unsupervised clustering method (K-Mean clustering (K-Mean) [39]), and supervised regression and classification methods (partial least squares analysis (PLS) [40, 41], K nearest neighbor classification (KNN) [42], soft independent modeling of class analogy (SIMCA) [43] and support vector machine (SVM) [44]).
FC
Fisher's criterion is a feature selection technique. The general purpose of feature selection is to find significant features (variables) in the original data space in order to produce a better prediction result. Irrelevant features that introduce noise should be eliminated. The importance of an individual feature for discriminating different groups in the training dataset is expressed by Fisher's ratio, which is the ratio of the between-class variance to the within-class variance for the training group. A feature with a larger Fisher's ratio is more important for classification. Users can select a number of features with the largest Fisher's ratios for further analysis.
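A minimal Python sketch of this ranking follows. It uses one common multi-class form of the ratio (between-class sum of squares divided by within-class sum of squares), which is our assumption; the exact expression used in Automics is not given here, and the function names are hypothetical.

```python
def fisher_ratio(values, labels):
    """Between-class over within-class sum of squares for a single feature."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    overall = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - overall) ** 2 for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return between / within if within > 0 else float("inf")

def top_features(X, labels, count):
    """Indices of the `count` columns of X with the largest Fisher's ratios."""
    ratios = [fisher_ratio([row[j] for row in X], labels) for j in range(len(X[0]))]
    return sorted(range(len(ratios)), key=lambda j: ratios[j], reverse=True)[:count]

X = [[1.0, 5.0], [1.2, 1.0], [3.0, 4.0], [3.1, 2.0]]
labels = [0, 0, 1, 1]
best = top_features(X, labels, 1)
```

In this toy matrix the first column separates the two classes and the second does not, so the first column is ranked highest.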
PCA and PLS
These are two commonly used multivariate analysis techniques in metabonomics studies. The main purpose of PCA is to eliminate the collinearity problem and thereby reduce the dimensionality of the original feature (variable) space. It is an unsupervised method used to reveal the internal structure of datasets in an unbiased way. PLS is a supervised method for regression. The overall goal of PLS is to maximize the covariance between the predictor space and the response space, and then use the predictor matrix to predict responses in the population. With a cutoff for predicted responses, PLS regression can be used for classification and discrimination analysis (PLS-DA). PCA and PLS reveal the contribution of each variable to the separation between groups through loadings or regression coefficients.
LDA and ULDA
LDA is a well-known technique for dimension reduction and feature extraction that is closely related to PCA. Unlike PCA, LDA is a supervised method. It aims to find an optimal transformation that maps the data into a lower-dimensional space with minimized within-class distance and maximized between-class distance, thus achieving the best separation between two or more classes of observations. A variant algorithm, ULDA, was proposed to overcome the singularity limitation of LDA. ULDA employs the generalized singular value decomposition to handle singular data. The advantage of this method is that features in the transformed space are uncorrelated, which makes it attractive for feature dimension reduction. The work on plasma fatty acid metabolic profiling by Yi et al. [45] showed that ULDA feature reduction achieves better discrimination than PCA and PLS data reduction, which suggests that ULDA is a good complement to the commonly used PCA and PLS methods.
KMean
K-Mean is an unsupervised method for grouping the samples in a dataset into a fixed number (k) of groups by their similarity. The similarity is defined by a distance. Several distance measures can be used in the K-Mean method, such as the Euclidean distance, the Manhattan distance, or the correlation coefficient. The K-Mean method implemented in Automics can be used as an initial step to quickly detect outliers and assign class indicators to observations (samples) in metabonomics studies.
KNN and SIMCA
KNN and SIMCA are two supervised methods for classification based on data similarity. In the KNN method, each sample in the test dataset is classified by a majority vote of its neighbors, that is, it is assigned to the class most common among its k nearest neighbors. The k value is a positive integer, typically small (such as 3, 5 or 7). If k = 1, the sample is simply assigned to the class of its nearest neighbor (NN). SIMCA works as follows: PCA is first performed on each independent group in the dataset, and a sufficient number of principal components are retained to account for most of the variation within each class. Hence, a principal component model is used to represent each class in the dataset. Finally, samples in the test dataset are assigned to one of the established models on the basis of their best fit to the respective model.
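The KNN voting rule can be written in a few lines of Python. This sketch assumes the Euclidean distance and breaks ties by the order in which classes are first encountered; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, sample, k=3):
    """Assign `sample` to the majority class among its k nearest training rows."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], sample))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["control", "control", "control", "patient", "patient", "patient"]
label = knn_predict(train_X, train_y, [0.2, 0.3], k=3)
```

Here the three nearest neighbors of the new sample all belong to the control group, so it is assigned to that class.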
SVM
SVM was integrated into Automics as another powerful classification tool. SVM is based on rigorous statistical learning theory, and it has been used in a wide range of classification problems, for example with proteomics and genomics data. SVM takes a set of features (variables) as input and outputs a classification or a regression vector. It maps input vectors into a higher-dimensional feature space using a kernel function. The training procedure finds a hyperplane in the feature space that optimally separates the training vectors of two classes, and it identifies the support vectors that contribute most to the classification. When a new feature vector (a sample or row vector in metabonomics) is input, its class membership is predicted according to the side of the hyperplane on which it falls.
Before data analysis methods are invoked, new datasets must be created from the data in the active worksheet through a dialog interface (Fig. 3–D). This dialog is used to construct the training dataset and the testing dataset and to define the variables in them (i.e. the variables in the X and Y matrices). Four options are available for scaling variables (column vectors): central scaling, auto scaling (UV scaling), Pareto scaling and no scaling.
In addition to implementing these data analysis algorithms, Automics provides a convenient way to visualize the parameters of data models in 2D scatter plots, line plots, or column plots. These parameters usually include scores, loadings, explained variances (R^{2}, Q^{2}), the residual matrix, Hotelling's T^{2}, etc. Plot properties (color, legend, title, footnote, axis scale, etc.) can be conveniently changed. These features in Automics are comparable to those in commercial statistics software.
The performance of many pattern recognition methods is related to the inner data relationships and the data structure. Automics is flexible enough to combine different data preprocessing methods (noise filtering, feature selection, dimension reduction) with different classification methods, yielding combined classifiers such as FC/KNN, ULDA/KNN, FC/PLS-DA, OPLS/PLS and FC/DOSC/SVM, to facilitate different metabonomics applications and achieve better results.
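The four scaling options can be illustrated with their usual chemometrics definitions: central scaling subtracts the column mean, UV scaling additionally divides by the column standard deviation, and Pareto scaling divides by the square root of the standard deviation. The Python sketch below assumes these definitions and uses hypothetical names; it is not the Automics implementation.

```python
import math
from statistics import fmean, stdev

def scale_column(col, method="uv"):
    """Scale one variable (column vector) by the chosen method."""
    if method == "none":
        return list(col)
    mean = fmean(col)
    centered = [v - mean for v in col]
    if method == "center":
        return centered
    sd = stdev(col)
    if method == "uv":
        return [c / sd for c in centered]
    if method == "pareto":
        return [c / math.sqrt(sd) for c in centered]
    raise ValueError(f"unknown scaling method: {method}")

column = [1.0, 2.0, 3.0, 4.0, 5.0]
uv = scale_column(column, "uv")
pareto = scale_column(column, "pareto")
```

After UV scaling every variable has unit standard deviation, whereas Pareto scaling leaves a standard deviation equal to the square root of the original one, so large peaks keep somewhat more weight than small ones.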
Results and discussion
Automatic spectral processing
The new automatic phase correction algorithm we proposed worked well for several different spectral datasets. We also examined this method on a set of spectra that contained significant zero-order and first-order phase distortions ranging from 20° to 300°, and the errors in the two phase values were less than 8° (data not shown). Because our method does not use peaks for determining phases, it is not affected by peak shape, digitization rate or peak overlap. The signal-to-noise ratio of a spectrum has little influence on phase determination, owing to our summation procedure. In practice, the baseline regions selected for determining phase errors do not always form a horizontal straight line; sometimes they are slightly tilted and have small slope angles. However, these angles can be determined from a corrected reference spectrum and then applied to uncorrected spectra as prior knowledge for compensation. The main drawback of this method is that it requires a baseline that is not severely distorted. Nevertheless, a spectrum that cannot be phased correctly because of a severe baseline problem may indicate an abnormal situation in the NMR experiment, which should not happen very frequently. Therefore, this drawback should not be a significant problem in metabonomics studies.
A metabonomics application using Automics
We have tested Automics for several application datasets. Here, with an example on the study of metabolic profile in type 2 diabetes, we provide an overview of the validity and the ability of Automics.
Sample preparation
Human blood samples were collected from 41 healthy adults and 57 patients with type 2 diabetes mellitus from No. 304 Hospital in Beijing. The ages of the patients were between 21 and 79 years (44 ± 17, mean ± SD). All the samples were collected under the same clinical conditions before breakfast. The blood samples were first allowed to clot in plastic tubes for about 1 hour at room temperature, and then aliquots of serum were collected and stored at −80°C until assayed. Right before the NMR experiment, each serum sample (150 μl) was diluted with 300 μl of 50 mM PBS buffer (pH 7.0), along with the addition of 50 μl D_{2}O and 3 μl DSS.
NMR experiment
All the 1D ^{1}H NMR spectra were collected at a temperature of 298 K on a Bruker Avance 600 MHz NMR spectrometer using the Bruker pulse sequence NOESY PRESAT, which can be depicted as: RD-90°-t_{1}-90°-t_{m}-90°-acquisition. RD represents a relaxation delay of 1.5 s during which the water resonance is selectively irradiated, and t_{1} is a fixed time interval. During the mixing time t_{m} (150 ms), the water resonance is irradiated for a second time. For each sample, 32 scans were collected into 16 K data points with a spectral width of 9615.4 Hz.
Spectral processing
Raw NMR FIDs from 41 healthy samples and 57 diabetic patient samples were processed in Automics using automatic spectral processing module, including Fast Fourier Transform (correction of DC offset, exponential window function with a line broadening factor of 0.3 Hz), phase correction (new method we proposed), baseline correction (linear fitting method) and peak alignment (global shift method). These automatic spectral processing procedures produced a good result (data not shown) and no further manual correction was carried out. All the processed spectra were data reduced to 470 segments between 0.2 ppm and 10.0 ppm using bucket/binning module with a bin width of 0.02 ppm. Due to the strong solvent signal, the spectral region between 4.6 ppm and 5.0 ppm was excluded.
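The figure of 470 segments follows directly from the stated range, bin width and excluded region, as the short Python check below shows. Edges are expressed as integers in units of 0.01 ppm to avoid floating-point rounding; the function name is hypothetical.

```python
def count_bins(start, stop, width, exclude=None):
    """Count equal-width bins in [start, stop), skipping bins inside `exclude`.

    All arguments are integers in units of 0.01 ppm.
    """
    total = 0
    for lo in range(start, stop, width):
        hi = lo + width
        if exclude and lo >= exclude[0] and hi <= exclude[1]:
            continue  # bin lies entirely within the excluded solvent region
        total += 1
    return total

all_bins = count_bins(20, 1000, 2)                        # 0.2 to 10.0 ppm, 0.02 ppm bins
kept_bins = count_bins(20, 1000, 2, exclude=(460, 500))   # drop 4.6 to 5.0 ppm
```

The full range gives 490 bins, and removing the 20 bins between 4.6 and 5.0 ppm leaves 470.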
PLS analysis together with DOSC and OPLS
Comparison of different classification methods in Automics
Table 1. Comparison of different classification methods
Method           Recognition rate   Prediction rate   Sensitivity      Specificity      Accuracy rate
Ctrl/PLS-DA      95.5% (63/66)      75.0% (24/32)     91.2% (52/57)    85.4% (35/41)    89.8% (88/98)
UV/PLS-DA        98.5% (65/66)      78.1% (25/32)     91.2% (52/57)    92.7% (38/41)    91.8% (90/98)
DOSC/PLS-DA      100% (66/66)       84.4% (27/32)     93.0% (53/57)    95.1% (40/41)    94.9% (93/98)
OPLS/PLS-DA      100% (66/66)       81.3% (26/32)     91.2% (52/57)    95.1% (40/41)    93.9% (92/98)
FC/DOSC/PLS-DA   98.5% (65/66)      90.6% (29/32)     94.7% (54/57)    97.6% (40/41)    95.9% (94/98)
KNN (K = 3)      95.5% (63/66)      71.9% (23/32)     84.2% (48/57)    92.7% (38/41)    87.8% (86/98)
SIMCA            90.9% (60/66)      75.0% (24/32)     87.7% (50/57)    82.9% (34/41)    85.7% (84/98)
FC/KNN (K = 3)   95.5% (63/66)      81.3% (26/32)     93.0% (53/57)    87.8% (36/41)    90.8% (89/98)
SVM              100% (66/66)       81.3% (26/32)     94.7% (54/57)    92.7% (38/41)    93.9% (92/98)
DOSC/SVM         100% (66/66)       87.5% (28/32)     94.7% (54/57)    97.6% (40/41)    95.9% (94/98)
FC/SVM           100% (66/66)       90.6% (29/32)     96.5% (55/57)    97.6% (40/41)    96.9% (95/98)
FC/DOSC/SVM      100% (66/66)       96.9% (31/32)     100% (57/57)     97.6% (40/41)    99.0% (97/98)
Without data preprocessing, the number of correctly classified samples in the testing set (prediction rate) decreases in the following order: SVM gives the best result (prediction rate 81.3%); SIMCA and PLSDA show similar results (prediction rate 75.0%); and KNN produces the worst result (prediction rate 71.9%). With data preprocessing methods (FC was used to select the top 30 significant features from the original variables; DOSC and OPLS were used to remove orthogonal variations for the first component) applied to the dataset, all the classifiers show better prediction performance. For example, DOSC/PLSDA gives an improved prediction rate of 84.4%; OPLS/PLSDA, DOSC/SVM and FC/SVM also show improved prediction rates of 81.3%, 87.5% and 90.6%, respectively. These demonstrate the general ability of DOSC and OPLS for removing noise from data set. As shown in Table 1, FC/DOSC/SVM has the best prediction result (prediction rate 96.9%), indicating that combining different data preprocessing techniques can improve the prediction ability. Although OPLS together with PLS analysis has advantages such as improved interpretability and informative orthogonal variations explain [33], it shows nearly the same prediction rate as DOSC processed PLS model on this dataset.
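The percentages in Table 1 can be reproduced from the counts in parentheses. The Python sketch below assumes the following reading of the columns, which is inferred from the denominators: recognition rate refers to the 66 training samples, prediction rate to the 32 testing samples, sensitivity to the 57 patients, specificity to the 41 healthy controls, and accuracy to all 98 samples. The function names are hypothetical.

```python
def rate(correct, total):
    """Percentage of correctly classified samples, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)

def summarise(train, test, patients, controls):
    """Each argument is a (correct, total) pair of sample counts."""
    return {
        "recognition": rate(*train),
        "prediction": rate(*test),
        "sensitivity": rate(*patients),
        "specificity": rate(*controls),
        "accuracy": rate(train[0] + test[0], train[1] + test[1]),
    }

# Counts from the FC/DOSC/SVM row of Table 1
fc_dosc_svm = summarise(train=(66, 66), test=(31, 32),
                        patients=(57, 57), controls=(40, 41))
```

For the FC/DOSC/SVM row this gives 100.0, 96.9, 100.0, 97.6 and 99.0, matching the tabulated values.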
Whether or not data preprocessing is used, SVM gives the best prediction result compared with PLSDA, KNN and SIMCA on this dataset. The superiority of SVM in prediction suggests that not only the well-known collinear relationships but also nonlinear relationships may exist in this dataset. To our knowledge, there are very few applications of SVM in metabonomics studies. The better performance of SVM on our dataset is consistent with the result of Bullinger et al. [51], in whose study SVM also performed better in the prediction of breast cancer. Although the classification ability of a classifier depends on the inner structure of the dataset, we believe SVM is a competitive classifier for metabonomics studies and will be widely used in this field. In this example, SIMCA and KNN did not perform better than the commonly used PLSDA classifier, presumably because they focus on the similarity within a class. In addition, UV scaling of the data gives better prediction ability than mean-centered scaling in the PLSDA model (Table 1).
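The difference between the two scaling options compared above can be sketched as follows. This is an illustration under stated assumptions, not the Automics implementation: mean centering subtracts each variable's mean, while UV (unit variance) scaling additionally divides by the variable's sample standard deviation, so that intense and weak peaks contribute equally. The function names are hypothetical.

```python
from statistics import mean, stdev


def _columns(X):
    """Transpose a list of rows into a list of columns."""
    return [list(col) for col in zip(*X)]


def mean_center(X):
    """Subtract the column mean from every entry."""
    means = [mean(col) for col in _columns(X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def uv_scale(X):
    """Mean-center, then divide each column by its sample standard deviation."""
    cols = _columns(X)
    means = [mean(col) for col in cols]
    sds = [stdev(col) for col in cols]  # n - 1 denominator; needs >= 2 samples
    return [
        # Constant columns (sd == 0) are left at zero after centering.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in X
    ]


# Two variables on very different intensity scales.
X = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
print(mean_center(X))  # second column still dominates in magnitude
print(uv_scale(X))     # both columns now have unit variance
```

After mean centering alone, the second variable still varies a hundred times more than the first and would dominate a PLS model; after UV scaling both variables carry the same weight, which is one plausible reason for the difference reported in Table 1.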
Other related aspects and the future of Automics
Currently, Automics provides a module for conveniently exploring database resources. This function is helpful when researchers want to look up the structure and chemical shift information of metabolites in available databases such as the Madison Metabolomics Consortium Database (MMCD) [52] and the Human Metabolome Database (HMDB) [53]. The statistical total correlation spectroscopy (STOCSY) analysis method has also been implemented in Automics. This method takes advantage of the multicollinearity of the intensity variables in a set of 1D ^{1}H spectra to generate a matrix of the intensity correlations among peaks across the whole dataset [54]. The 2D contour plot implemented in Automics can be used to display and analyze the correlation matrix as a pseudo 2D NMR spectrum.
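The correlation matrix at the core of STOCSY can be sketched in a few lines. The example assumes the matrix holds Pearson correlation coefficients between every pair of intensity variables (spectral points or buckets) across the set of spectra; the function name `correlation_matrix` is hypothetical and the code is not taken from Automics.

```python
from math import sqrt


def correlation_matrix(spectra):
    """Pearson correlation between every pair of columns.

    spectra: list of rows, one row per spectrum, one column per intensity variable.
    """
    n = len(spectra)
    p = len(spectra[0])
    centered = []
    for j in range(p):
        col = [row[j] for row in spectra]
        m = sum(col) / n
        centered.append([v - m for v in col])
    norms = [sqrt(sum(v * v for v in col)) for col in centered]
    corr = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            if norms[i] == 0 or norms[j] == 0:
                r = 0.0  # a constant variable has no defined correlation
            else:
                dot = sum(a * b for a, b in zip(centered[i], centered[j]))
                r = dot / (norms[i] * norms[j])
            corr[i][j] = corr[j][i] = r
    return corr


# Four toy spectra with three intensity variables; the first two variables
# rise and fall together, as two peaks of the same metabolite would.
spectra = [[1, 2, 5], [2, 4, 3], [3, 6, 4], [4, 8, 1]]
for row in correlation_matrix(spectra):
    print([round(r, 2) for r in row])
```

Variables with a correlation close to 1 are candidates for peaks belonging to the same molecule, and plotting the full matrix as contours gives the pseudo 2D spectrum mentioned in the text.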
There is still plenty of room for improving the functionality and usability of Automics. For example, some commonly used data analysis approaches such as O2PLS will be implemented in the near future. A more user-friendly interface is also planned for future development. As Automics was designed with a free, open architecture, interested researchers are encouraged to implement new algorithms and extend the software on the basis of the existing infrastructure. We also expect that the use of Automics by metabonomics researchers will bring us valuable feedback and suggestions for improving the platform.
Conclusion
In this paper, we introduced Automics, the first open source software tool with highly integrated modules specifically designed for NMR-based metabonomics applications. This tool covers almost all stages of the NMR-based metabonomics study workflow. The spectral processing modules in Automics are efficient and convenient both for processing a large number of spectra and for processing a single spectrum offline without commercial software. In addition, features such as data organization, data preprocessing and a wide range of data analysis techniques for multivariate data analysis, classification and regression have been implemented in Automics. Some of the useful data analysis methods in Automics (such as SVM) are not available in widely used commercial software tools, such as SIMCA-P (Umetrics, Sweden). Automics enables researchers to complete spectral processing and data analysis in one software package. Moreover, Automics could be applied to metabonomics data generated by other analytical techniques (such as mass spectrometry), owing to its flexible and independent module design. More details about the usage of Automics can be found in the well-documented HTML help [see Additional file 1].
Availability and Requirements
Project name: Automics
Project home page: http://code.google.com/p/automics/
Operating system: Windows 2000/NT/XP/2003
Programming language: Visual C++
License: open source under GNU license
Abbreviations
NMR: Nuclear Magnetic Resonance
MSC: Multiplicative Signal Correction
SNV: Standard Normal Variate Transform
DOSC: Direct Orthogonal Signal Correction
OPLS: Orthogonal Projection to Latent Structures
FC: Fisher's Criterion
PCA: Principal Component Analysis
LDA: Linear Discriminant Analysis
ULDA: Uncorrelated Linear Discriminant Analysis
PLS: Partial Least Squares
KNN: K Nearest Neighbors
SIMCA: Soft Independent Modeling of Class Analogy
SVM: Support Vector Machine
STOCSY: Statistical Total Correlation Spectroscopy
Declarations
Acknowledgements
All NMR experiments were carried out at the Beijing Nuclear Magnetic Resonance Center (BNMRC), Peking University. This research was supported by Grant 2009CB521703 from 973 Program of China and Grant 30125009 from NSFC to BX.
References
1. Nicholson JK, Lindon JC, Holmes E: 'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica. 1999, 29 (11): 1181–1189. 10.1080/004982599238047.
2. Rochfort S: Metabolomics reviewed: A new "Omics" platform technology for systems biology and implications for natural products research. J Nat Prod. 2005, 68 (12): 1813–1820. 10.1021/np050255w.
3. Kell DB: Systems biology, metabolic modelling and metabolomics in drug discovery and development. Drug Discov Today. 2006, 11 (23–24): 1085–1092. 10.1016/j.drudis.2006.10.004.
4. Nicholson JK, Wilson ID: Understanding 'global' systems biology: Metabonomics and the continuum of metabolism. Nat Rev Drug Discov. 2003, 2 (8): 668–676. 10.1038/nrd1157.
5. Nicholson JK, Connelly J, Lindon JC, Holmes E: Metabonomics: a platform for studying drug toxicity and gene function. Nat Rev Drug Discov. 2002, 1 (2): 153–161. 10.1038/nrd728.
6. Lisec J, Schauer N, Kopka J, Willmitzer L, Fernie AR: Gas chromatography mass spectrometry-based metabolite profiling in plants. Nat Protoc. 2006, 1 (1): 387–396. 10.1038/nprot.2006.59.
7. Wilson ID, Plumb R, Granger J, Major H, Williams R, Lenz EA: HPLC-MS-based methods for the study of metabonomics. J Chromatogr B. 2005, 817 (1): 67–76. 10.1016/j.jchromb.2004.07.045.
8. Shockcor JP, Nichols A, Antti H, Plumb RS, Castro-Perez JM, Major H, Preece S: LC-MS/MS approach to 'metabonomics' – What can it do for drug discovery/development? Drug Metab Rev. 2003, 35 (Suppl 1): 11.
9. Dunn WB, Ellis DI: Metabolomics: Current analytical platforms and methodologies. Trac-Trend Anal Chem. 2005, 24 (4): 285–294. 10.1016/j.trac.2004.11.021.
10. Soga T, Ohashi Y, Ueno Y, Naraoka H, Tomita M, Nishioka T: Quantitative metabolome analysis using capillary electrophoresis mass spectrometry. J Proteome Res. 2003, 2 (5): 488–494. 10.1021/pr034020m.
11. Shulaev V: Metabolomics technology and bioinformatics. Brief Bioinform. 2006, 7 (2): 128–139. 10.1093/bib/bbl012.
12. Katajamaa M, Oresic M: Data processing for mass spectrometry-based metabolomics. J Chromatogr A. 2007, 1158 (1–2): 318–328. 10.1016/j.chroma.2007.04.021.
13. Griffin JL: Metabonomics: NMR spectroscopy and pattern recognition analysis of body fluids and tissues for characterisation of xenobiotic toxicity and disease diagnosis. Curr Opin Chem Biol. 2003, 7 (5): 648–654. 10.1016/j.cbpa.2003.08.008.
14. Zhao Q, Stoyanova R, Du SY, Sajda P, Brown TR: HiRes – a tool for comprehensive assessment and interpretation of metabolomic data. Bioinformatics. 2006, 22 (20): 2562–2564. 10.1093/bioinformatics/btl428.
15. Delaglio F, Grzesiek S, Vuister GW, Zhu G, Pfeifer J, Bax A: Nmrpipe – a Multidimensional Spectral Processing System Based on Unix Pipes. Journal of Biomolecular Nmr. 1995, 6 (3): 277–293. 10.1007/BF00197809.
16. Montigny F, Elbayed K, Brondeau J, Canet D: Automatic Phase Correction of Fourier-Transform Nuclear-Magnetic-Resonance Spectroscopy Data and Estimation of Peak Area by Fitting to a Lorentzian Shape. Anal Chem. 1990, 62 (8): 864–867. 10.1021/ac00207a019.
17. Balacco G: A New Criterion for Automatic Phase Correction of High-Resolution Nmr-Spectra Which Does Not Require Isolated or Symmetrical Lines. J Magn Reson Ser A. 1994, 110 (1): 19–25. 10.1006/jmra.1994.1175.
18. Miyabayashi N: Automatic Phase Correction of Nuclear-Magnetic-Resonance Spectrum. Bunseki Kagaku. 1995, 44 (7): 549–554.
19. Witjes H, Melssen WJ, Zandt HJAI, Graaf van der M, Heerschap A, Buydens LMC: Automatic correction for phase shifts, frequency shifts, and lineshape distortions across a series of single resonance lines in large spectral data sets. J Magn Reson. 2000, 144 (1): 35–44. 10.1006/jmre.2000.2021.
20. Chen L, Weng ZQ, Goh LY, Garland M: An efficient algorithm for automatic phase correction of NMR spectra based on entropy minimization. J Magn Reson. 2002, 158 (1–2): 164–168. 10.1016/S10907807(02)000691.
21. Wachter EA, Sidky EY, Farrar TC: Calculation of Phase-Correction Constants Using the Dispa Phase-Angle Estimation Technique. J Magn Reson. 1989, 82 (2): 352–359.
22. Craig EC, Marshall AG: Automated Phase Correction of Ft Nmr-Spectra by Means of Phase Measurement Based on Dispersion Versus Absorption Relation (Dispa). J Magn Reson. 1988, 76 (3): 458–475.
23. Heuer A: A New Algorithm for Automatic Phase Correction by Symmetrizing Lines. J Magn Reson. 1991, 91 (2): 241–253.
24. Dzakula Z: Phase angle measurement from peak areas (PAMPAS). J Magn Reson. 2000, 146 (1): 20–32. 10.1006/jmre.2000.2123.
25. Golotvin S, Williams A: Improved baseline recognition and modeling of FT NMR spectra. J Magn Reson. 2000, 146 (1): 122–125. 10.1006/jmre.2000.2121.
26. Forshed J, Schuppe-Koistinen I, Jacobsson SP: Peak alignment of NMR signals by means of a genetic algorithm. Anal Chim Acta. 2003, 487 (2): 189–199. 10.1016/S00032670(03)005701.
27. Stoyanova R, Nicholls AW, Nicholson JK, Lindon JC, Brown TR: Automatic alignment of individual peaks in large high-resolution spectral data sets. J Magn Reson. 2004, 170 (2): 329–335. 10.1016/j.jmr.2004.07.009.
28. Wu W, Daszykowski M, Walczak B, Sweatman BC, Connor SC, Haseldeo JN, Crowther DJ, Gill RW, Lutz MW: Peak alignment of urine NMR spectra using fuzzy warping. J Chem Inf Model. 2006, 46 (2): 863–875. 10.1021/ci050316w.
29. De Meyer T, Sinnaeve D, Van Gasse B, Tsiporkova E, Rietzschel ER, De Buyzere ML, Gillebert TC, Bekaert S, Martins JC, Van Criekinge W: NMR-based characterization of metabolic alterations in hypertension using an adaptive, intelligent binning algorithm. Anal Chem. 2008, 80 (10): 3783–3790. 10.1021/ac7025964.
30. Geladi P, Macdougall D, Martens H: Linearization and Scatter-Correction for near-Infrared Reflectance Spectra of Meat. Appl Spectrosc. 1985, 39 (3): 491–500. 10.1366/0003702854248656.
31. Barnes RJ, Dhanoa MS, Lister SJ: Standard Normal Variate Transformation and De-Trending of near-Infrared Diffuse Reflectance Spectra. Appl Spectrosc. 1989, 43 (5): 772–777. 10.1366/0003702894202201.
32. Westerhuis JA, de Jong S, Smilde AK: Direct orthogonal signal correction. Chemometr Intell Lab. 2001, 56 (1): 13–25. 10.1016/S01697439(01)001022.
33. Trygg J, Wold S: Orthogonal projections to latent structures (OPLS). J Chemometr. 2002, 16 (3): 119–128. 10.1002/cem.695.
34. Wold S, Antti H, Lindgren F, Ohman J: Orthogonal signal correction of near-infrared spectra. Chemometr Intell Lab. 1998, 44 (1–2): 175–185. 10.1016/S01697439(98)001099.
35. Wu W, Guo Q, Jouan-Rimbaud D, Massart DL: Using contrasts as data pretreatment method in pattern recognition of multivariate data. Chemometr Intell Lab. 1999, 45: 12. 10.1016/S01697439(98)001919.
36. Desantis F, Pagliuca A: On Factor-Analysis and Fishers Linear Discriminant-Analysis. Cybernet Syst. 1982, 13 (1): 77–91. 10.1080/01969728208927690.
37. Ye JP, Janardan R, Li Q, Park H: Feature reduction via generalized uncorrelated linear discriminant analysis. Ieee T Knowl Data En. 2006, 18 (10): 1312–1322. 10.1109/TKDE.2006.160.
38. Yang WH, Dai DQ, Yan H: Feature extraction and uncorrelated discriminant analysis for high-dimensional data. Ieee T Knowl Data En. 2008, 20 (5): 601–614. 10.1109/TKDE.2007.190720.
39. Jain AK, Dubes RC: Algorithms for clustering data. 1988, Boston: Prentice Hall Press
40. Wold S, Kettanehwold N, Skagerberg B: Nonlinear Pls Modeling. Chemometr Intell Lab. 1989, 7 (1–2): 53–65. 10.1016/01697439(89)80111X.
41. Wold S, Ruhe A, Wold H, Dunn WJ: The Collinearity Problem in Linear-Regression – the Partial Least-Squares (Pls) Approach to Generalized Inverses. Siam J Sci Stat Comp. 1984, 5 (3): 735–743. 10.1137/0905052.
42. Shakhnarovich G, Darrell T, Indyk P: Nearest-Neighbor Methods in Learning and Vision. 2006, Cambridge: The MIT Press
43. Wold S, Johansson E, Jellum E, Bjornson I, Nesbakken R: Application of Simca Multivariate Data-Analysis to the Classification of Gas-Chromatographic Profiles of Human-Brain Tissues. Anal Chim Acta-Comp. 1981, 5 (3): 251–259. 10.1016/S00032670(01)831998.
44. Vapnik VN: The Nature of Statistical Learning Theory. 2000, New York: Springer Press, 2nd edition
45. Yi LZ, Yuan DL, Che ZH, Liang YZ, Zhou ZG, Gao HY, Wang YM: Plasma fatty acid metabolic profile coupled with uncorrelated linear discriminant analysis to diagnose and biomarker screening of type 2 diabetes and type 2 diabetic coronary heart diseases. Metabolomics. 2008, 4 (1): 30–38. 10.1007/s1130600700987.
46. Daykin CA, Foxall PJD, Connor SC, Lindon JC, Nicholson JK: The comparison of plasma deproteinization methods for the detection of low-molecular-weight metabolites by H1 nuclear magnetic resonance spectroscopy. Anal Biochem. 2002, 304 (2): 220–230. 10.1006/abio.2002.5637.
47. Brindle JT, Antti H, Holmes E, Tranter G, Nicholson JK, Bethell HWL, Clarke S, Schofield PM, McKilligin E, Mosedale DE: Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using H1-NMR-based metabonomics. Nat Med. 2002, 8 (12): 1439–1444. 10.1038/nm802.
48. Bergman RN, Ader M: Free fatty acids and pathogenesis of type 2 diabetes mellitus. Trends Endocrin Met. 2000, 11 (9): 351–356. 10.1016/S10432760(00)003234.
49. Drexel H, Aczel S, Marte T, Rein P, Koch L, Schmid F, Langer P, Hoefle G, Saely CH: High triglycerides, low HDL cholesterol, and small LDL particles are the main lipid risk factors in coronary patients with type 2 diabetes. Circulation. 2006, 114 (18): 883.
50. Wang C, Kong HW, Guan YF, Yang J, Gu JR, Yang SL, Xu GW: Plasma phospholipid metabolic profiling and biomarkers of type 2 diabetes mellitus based on high-performance liquid chromatography/electrospray mass spectrometry and multivariate statistical analysis. Anal Chem. 2005, 77 (13): 4108–4116. 10.1021/ac0481001.
51. Bullinger D, Frohlich H, Klaus F, Neubauer H, Frickenschmidt A, Henneges C, Zell A, Laufer S, Gleiter CH, Liebich H: Bioinformatical evaluation of modified nucleosides as biomedical markers in diagnosis of breast cancer. Anal Chim Acta. 2008, 618 (1): 29–34. 10.1016/j.aca.2008.04.048.
52. Cui Q, Lewis IA, Hegeman AD, Anderson ME, Li J, Schulte CF, Westler WM, Eghbalnia HR, Sussman MR, Markley JL: Metabolite identification via the Madison Metabolomics Consortium Database. Nat Biotechnol. 2008, 26 (2): 162–164. 10.1038/nbt0208162.
53. Wishart DS, Tzur D, Knox C, Eisner R, Guo AC, Young N, Cheng D, Jewell K, Arndt D, Sawhney S: HMDB: the human metabolome database. Nucleic Acids Res. 2007, 35 (Database issue): D521–D526. 10.1093/nar/gkl923.
54. Cloarec O, Dumas ME, Craig A, Barton RH, Trygg J, Hudson J, Blancher C, Gauguier D, Lindon JC, Holmes E: Statistical total correlation spectroscopy: An exploratory approach for latent biomarker identification from metabolic H1 NMR data sets. Anal Chem. 2005, 77 (5): 1282–1289. 10.1021/ac048630x.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.