MetAssimulo:Simulation of Realistic NMR Metabolic Profiles
© Muncey et al. 2010
Received: 9 June 2010
Accepted: 6 October 2010
Published: 6 October 2010
Skip to main content
© Muncey et al. 2010
Received: 9 June 2010
Accepted: 6 October 2010
Published: 6 October 2010
Probing the complex fusion of genetic and environmental interactions, metabolic profiling (or metabolomics/metabonomics), the study of small molecules involved in metabolic reactions, is a rapidly expanding 'omics' field. A major technique for capturing metabolite data is 1H-NMR spectroscopy and this yields highly complex profiles that require sophisticated statistical analysis methods. However, experimental data is difficult to control and expensive to obtain. Thus data simulation is a productive route to aid algorithm development.
MetAssimulo is a MATLAB-based package that has been developed to simulate 1H-NMR spectra of complex mixtures such as metabolic profiles. Drawing data from a metabolite standard spectral database in conjunction with concentration information input by the user or constructed automatically from the Human Metabolome Database, MetAssimulo is able to create realistic metabolic profiles containing large numbers of metabolites with a range of user-defined properties. Current features include the simulation of two groups ('case' and 'control') specified by means and standard deviations of concentrations for each metabolite. The software enables addition of spectral noise with a realistic autocorrelation structure at user controllable levels. A crucial feature of the algorithm is its ability to simulate both intra- and inter-metabolite correlations, the analysis of which is fundamental to many techniques in the field. Further, MetAssimulo is able to simulate shifts in NMR peak positions that result from matrix effects such as pH differences which are often observed in metabolic NMR spectra and pose serious challenges for statistical algorithms.
No other software is currently able to simulate NMR metabolic profiles with such complexity and flexibility. This paper describes the algorithm behind MetAssimulo and demonstrates how it can be used to simulate realistic NMR metabolic profiles with which to develop and test new data analysis techniques. MetAssimulo is freely available for academic use at http://cisbic.bioinformatics.ic.ac.uk/metassimulo/.
In the postgenomic era there has been a massive growth in 'omics' techniques investigating different levels of biological organisation. Metabolic profiling (or metabonomics/metabolomics) is a key area of systems biology research focussing on high-throughput identification and quantification of metabolites, small molecules (≤ 1500 Da) involved in metabolism . When trying to relate genes to the overall function of a system, the metabolome (the complete set of metabolites) more closely reflects the activities of the organism at a functional level than, for example, the transcriptome . Metabolic fluxes are not only regulated by gene expression, but also by additional factors, which include the abundance of metabolites as substrates and products . Therefore metabolic profiling adds another dimension to our understanding of biological systems.
The chemical shift (δ) of each resonance is dependent upon the local magnetic field experienced by each nucleus. This local field is dependent on the degree to which molecular orbitals shield the influence of the external spectrometer field. Thus the chemical shift can reflect the chemical structure of the metabolite. The position of each peak is measured relative to that of an internal standard in a scale of parts per million (ppm) . A commonly used internal standard is 3-(Trimethylsilyl)-Propionic acid-D4, sodium salt (TSP).
Spin-spin coupling causes NMR resonances to split into multiplet patterns due to magnetic interactions between nearby nuclei.
Integrated peak area is proportional to the number of observed 1H nuclei (assuming there are no differential relaxation effects) and allows quantification of the metabolite concentrations.
The NMR spectrum of a complex mixture can be well approximmated by a linear combination of the spectra of pure compounds, potentially thousands of metabolites. Biofluid spectra can be treated as K dimensional objects, in which each dimension represents the concentration of a single metabolite . This super-posed structure is exploited in our simulation method, detailed in the 'Implementation' section.
Metabolic NMR spectra are highly complex and the field benefits greatly from the application of machine learning and statistical tools to extract information. Pattern recognition analyses such as Principal Components Analysis (PCA) have long been combined with NMR to investigate normal and pathological metabolic states . Data processing methods are being developed to extract metabolite information and concentrations from raw spectra, allowing automation of spectral processing. Development of advanced mathematical, statistical and computational methods are also essential for characterisation of the metabolic state, delineation of metabolic changes over time and the efficient identification of potential biomarkers. There are a wide variety of diseases where key changes in metabolites have been deduced e.g. cancer, diabetes, hypertension etc. [8–10]. However, as algorithms and methods are developed, they need to be refined and validated to ensure results will be biologically meaningful. It is hard to effect this without using test datasets where the true answers are known; this can be accomplished using simulation techniques. An alternative approach is to design artificial mixtures of metabolites which are prepared and analysed in identical fashion to real samples. However this is expensive in terms of man power and instrument time, and offers few advantages over in silico simulation when assessment of analytical procedures is not required. The purpose of MetAssimulo is to simulate datasets of realistic NMR spectra with known parameters in order to test data analysis techniques, hypotheses and experimental designs. Few methods for generating simulated NMR datasets have appeared in the literature to date [11, 12]. Most model a limited number of metabolites, make no attempt to reproduce realistic levels of metabolites, and do not allow for between-metabolite or 'inter-metabolite' correlations ( excepted) and do not always model peak positional shifts. It is common to fit Lorentzian peak shapes in an attempt to characterize spectral peaks, e.g . However, this ignores the fact that peak shapes in real NMR profiles are variable and can be far from ideal. Here we outline a novel approach making use of individual standard metabolite data extracted from the Human Metabolome Database (HMDB)  and a local NMR standard spectra database (NSSD). Many metabolic profiling labs host their own NSSD appropriate to the biological systems and sample types they work with and thus the simulations can be tailored to virtually any sample type or organism as required. In this work human urine is used as the example biofluid as it is one of the most widely used in the field and, in healthy subjects, has no protein or lipid content, both of which make the simulation more complex. MetAssimulo is written in MATLAB with a graphical interface allowing the user to alter processing parameters and add new standard spectra as needed. The software is freely available along with an example NSSD of 48 metabolites commonly found in normal human urine. We stress that this list of metabolites and their concentration means and standard deviations does not constitute a definitive description of human urine; such a goal is beyond the scope of this paper. It is provided for the sole purpose of demonstrating the capabilities of the software.
Each individual metabolite spectrum is sampled at a series of n uniformly spaced data points. The overall spectrum is made up of pairs of data points, (x, y) = (x i ,y i ) i = 1,..., n , where y i = y(x i ); n is set by the user, and x i defines a point on the ppm grid.
In real NMR spectra, the signal intensity is affected by the extent to which the observed nuclei are allowed to relax before each observation. In MetAssimulo we do not currently attempt to simulate the effects of differential inter-molecular relaxation. However, intra-molecular relaxation effects are accounted for by the fact that experimentally obtained pure compound spectra are used to form the mixture spectra.
Parameters can be altered either in the MetAssimulo GUI or within the parameter file 'parameters.txt'. Default values for parameters are given in the Supplementary Material. The interface provides the user with several different processing options. For example the second group ('cases') may be specified as fold change ratios of the concentrations of the first group and the user can specify whether to produce output with or without peak shifts or both. The user also chooses whether to include inter-metabolite correlation (pairwise correlations between metabolites) or not; either as a textfile whose entries can be altered using the interface or constructed from scratch in the Correlation GUI.
The Human Metabolome Database (HMDB)  contains information about more than 2180 metabolites found in humans and includes literature data relating to normal and abnormal concentrations in biofluids. Metabocards is the flat file download of the entire database, available at http://www.hmdb.ca. Also required is the HMDB set of NMR Peak Lists (containing locations of individual peaks for metabolites) which is available in a downloadable zip-file. In constructing the template of normal human urine concentrations various problems of incompleteness and/or ambiguity were encountered. For example, in many HMDB entries the metabolite concentration mean and standard deviation is unavailable, or simply a range is given. In these cases the standard deviation was estimated by dividing the mean (or 'half-range') by 1.95. There are instances where a metabolite is identified as present in urine, but a normal concentration value is not available. We have attempted to rectify as many of these discrepancies as possible in the provided concentration file by cross-checking with other sources, i.e. literature articles, but do not claim the result represents a complete description of human urine; it does, however serve to demonstrate the software.
The quality of MetAssimulo simulations is also dependent on the quality and coverage of the NSSD used, as well as the peak shift settings affecting multiplet detection. By distributing an NSSD it is not our intention to provide a comprehensive NMR standard database but merely an initial set of common metabolite spectra with which users can begin to make their own simulations. Many users will wish to add their own locally acquired standard spectra for metabolites specific to their areas of interest and we have provided functionality to do this.
There are a number of input files that are required for MetAssimulo.
Concentration files* are needed for both groups of metabolites, these detail the mean and standard deviation of the concentration for each metabolite.
An NMR Standard Spectral Database (NSSD) comprising standard 1D 1H-NMR spectra for metabolites is essential. MetAssimulo is designed to work with any metabolite database set out in the Bruker file format. Standard spectra of 48 of the most abundant metabolites in normal human urine is distributed with MetAssimulo.
Experiment file identifying the experiments to use in the metabolite database, as one metabolite may have many spectra, taken at different pH for example.
Proton file listing the number of protons, p k observed for each metabolite, k.
Multiplet data files* specifying the position of each peak in a multiplet for each metabolite in order to incorporate simulated peak shifts. Known pKa values and acid/base limits can also be included.
Inter-metabolite correlations can be input via a text file or the GUI.
Synonym files* that allow MetAssimulo to match metabolites in the HMDB data to those in the NSSD.
Parameter file containing the default values or simulation parameters (alterable in the GUI).
*Can be generated automatically using 'Format HMDB Data' within MetAssimulo.
Examples of all input files in the appropriate format are included with the MetAssimulo distribution. Much of the required input data can be generated using the in-built function 'Format HMDB Data' (accessed via the GUI) which should be run as an initial 'setup'. It produces the files necessary for conversion between the local database and HMDB synonyms, data required for peak shift simulation and a raw template of concentration data for 'normal' urine. The normal urine concentration file provided with the distribution has been optimized to provide realistic values and correct a number of errors found in the current version of the HMDB whilst reducing the number of metabolites used in order to decrease processing time.
Initially, a set of metabolite concentrations is simulated for the case and control groups, based on the mean and standard deviations in the concentration file. Next, the required spectra from the NSSD must be loaded. Even 1H-NMR spectra of standard pure compounds contain a number of complexities, such as chemical and electronic noise, phase and baseline errors, contaminants and water suppression residuals. Thus it is ususally necessary to preprocess these spectra into a form suitable for combining into the final metabolic profiles.
Significant inter-metabolite correlations, here assumed to be linear pairwise correlations between metabolites, are often found within the field of metabolic profiling so were considered an important feature to incorporate into the simulation. Where inter-metabolite correlations are required, the concentrations are simulated by sampling from the appropriate multivariate normal distribution. Rejection sampling is utilised to ensure non-negative concentrations. Using the method detailed in  the nearest positive semidefinite correlation matrix is calculated given user-specified pairwise correlations. The covariance matrix is constructed using the metabolite standard deviations and specified correlations, and the diagonal entries are increased sufficiently to ensure positive-definiteness. Any necessary alterations to the correlation and covariance matrices are output for inspection.
After the concentrations have been simulated the standard spectra of the metabolites are read in. Each spectrum consists of chemical shift in ppm, x and intensity, y. Spectra are then linearly interpolated onto a ppm grid of user-specified resolution.
Exclusion regions, corresponding to the location of the internal standard peak (default < 0.2 ppm ) and the residual water peak (default 4.5 ppm - 6.0 ppm ), are set to zero. In urine, the urea signal (between 5.4 ppm and 6.0 ppm ), the most abundant proton-containing metabolite , can be problematic particularly when water-suppression methods are used. Water-suppression is usually imperfect and the resulting residual peaks (near to the urea signal) are not dealt with easily by baseline correction algorithms . Often, the urea and water peaks are combined into one exclusion region lying between 4.5 ppm and 6 ppm (default exclusion region, but can be adjusted by the user). Excluding these areas of the spectrum helps reduce sensitivity to artifacts.
It is easier to distinguish peaks in a spectrum when the baseline is featureless , however, spectra can have distorted baselines due to imperfections in the detection process . Curved baselines can be a major source of error and so a correction is carried out on the raw spectrum using a moving average . This method involves splitting the data into windows of size ω (default is 0.3125 ppm), defined by the user, then using the median within the window to estimate the baseline. In order to alter the baseline without losing metabolite peaks, a threshold is set by dividing the maximum height by a user specified parameter (default is 10). All the intensities found below this threshold are corrected by subtracting the estimated baseline.
M is the median of the intensities, y i . All intensities appearing below this limit,l, are set equal to it.
Noise from each standard metabolite spectrum will remain in the final mixture spectrum, reducing the overall signal to noise ratio. Kernel smoothing is used to reduce the noise in each individual metabolite spectrum. This process estimates the smooth function underlying the noisy data using a weighted mean of surrounding data points with weights defined according to the choice of kernel. Whilst the default kernel type is 'Normal', the user may also choose from a number of options and also alter the bandwidth (given as number of data points), controlling the degree of smoothing required. Since smoothing the whole spectrum woiuld increase the peak widths, only intensities below a user-defined threshold (a percentage of the maximum intensity, default 0.8) are subject to kernel smoothing.
where δ ij is the un-shifted position of peak i of metabolite j in ppm (known),
η ij is the amount the peak is shifted in ppm (generated by Eq.4),
pH is the pH of sample (simulated or input),
pKa j is the pKa of metabolite j (simulated or input) assumed here to be the same for all peaks of a given metabolite,
a ij is the position of peak i of metabolite j in the acid limit (ppm) (simulated or input),
b ij is the position of peak i of metabolite j in the basic limit (ppm) (simulated or input).
After this process, the spectrum is then smoothed again in order to suppress any unwanted artifacts created by the peak shift.
where is the intensity in the preprocessed, unnormalised standard spectrum.
Even after preprocessing, the signal to noise levels of the individual metabolite spectra may vary, so therefore the final signal to noise ratio cannot be controlled perfectly. However, adding noise in this way allows the simulation of mixture spectra with a wide variety of signal to noise ratios. After adding the random noise, ε(δ) ~ N (0, σ n ), kernel smoothing is used on the composite spectrum to reproduce the effect of apodization on real spectra .
In this section some example outputs from MetAssimulo will be shown. Simulations of normal urine were run using the optimized template with parameters set to the default values and using the NSSD consisting of 48 spectra recorded at 600 MHz 1H observation frequency.
There are currently simulation programs in different areas of post-genomic science, such as SNP simulators that are being used in whole genome association studies [22, 23]. MetAssimulo is a valuable addition to these tools, enabling the simulation of realistic 1H NMR spectra of complex biological mixtures including group-wise variation, intermetabolite correlations and peak positional variation. However, there are areas which could be enhanced. Any simulator of this kind is limited by the sources of data available. The HMDB only contains information about metabolite concentrations in humans, therefore further user input or other metabolite databases may be needed to address other organisms. Human urine is the default setting for MetAssimulo, but given the numerous alterable parameters, it is easy to simulate profiles for other species and biofluids.
Project name: MetAssimulo
Project home page: http://cisbic.bioinformatics.ic.ac.uk/metassimulo/
Operating system(s): Platform independent
Programming language: MATLAB
Other requirements: MATLAB
Graphical User Interface
Human Metabolome Database
Nuclear Magnetic Spectroscopy
NMR Standard Spectra Database
HM and RJ acknowledge financial support from an MRC capacity building studentship. MDI and TE were partially supported by the Biotechnology and Biological Sciences Research Council (Grant Ref.BB/E20372/1).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.