Data collection, curation and pre-processing
The reference libraries information used in this study was collected from two publicly available resources, namely 876 metabolite peak lists (experimental) from the Human Metabolome Database, version 2.5 [28], and 448 metabolite peak lists from Madison Metabolomics Consortium Database [29]. The data was curated and re-formatted. The curation process consisted of manually adjusting the formatting inconsistencies (mostly found in the NMR Peaklist HMDB data files), that were further automatically processed and summarized into indexed tables for fast access. Specific metabolite data extracted from HMDB metabocard files was also re-organized and indexed to further facilitate its usage on the web server.
The data has been re-formatted as follows: (i) a set of peaks (ppm and height pairs) has been re-formatted with a 0.01 ppm precision and assigned to each metabolite. These pieces of information are subsequently used for metabolite identification and, (ii) each metabolite has been assigned a source of provenience (HMDB, MMCD), a type (drug, food additive, mammalian, microbial, plant, synthetic/industrial chemical), the pH of the sample in which it was measured (3.00 - 10.00), the solvent (water, CDCl3, CD3OD, 5% DMSO), and the frequency of the NMR machine (400 MHz, 500 MHz, 600 MHz).
Figures 8, 9, 10 and 11 show the frequencies of metabolites from HMDB and MMCD data that share the same peak coordinates and the distribution of the number of peaks for all metabolites. The number of metabolites sharing the same peak coordinate is much higher for HMDB reference metabolites than for the ones in MMCD, while the number of metabolites in HMDB is roughly only about two times higher than MMCD.
Datasets
The datasets used in this study consist of both, synthetic and experimental data. The synthetic data can be categorized as: 1H-NMR spectra (_s), list of peaks extracted externally (with MNova) from spectra (_f) and mixtures of peaks obtained directly from HMDB peak lists (_p). The following synthetic data sets are considered:
Single metabolites
-
SYN1_s: Acetylcholine HMDB spectrum,
-
SYN1_f: Acetylcholine HMDB spectral peaks,
-
SYN1_p: Acetylcholine HMDB database peaks,
-
SYN2_s: 17a-Estradiol HMDB spectrum,
-
SYN2_f: 17a-Estradiol HMDB spectral peaks,
-
SYN2_p: 17a-Estradiol HMDB database peaks,
-
SYN3_s: Cholesterol HMDB spectrum,
-
SYN3_f: Cholesterol HMDB spectral peaks,
-
SYN3_p: Cholesterol HMDB database peaks,
-
SYN4_s: D-Glucose HMDB spectrum,
-
SYN4_f: D-Glucose HMDB spectral peaks,
-
SYN4_p: D-Glucose HMDB database peaks.
Multiple metabolites
-
SYN5_s: Mixture of 13 metabolites spectrum,
-
SYN5_f: Mixture of FIDs for 13 metabolites from HMDB spectral peaks,
-
SYN5_p: Mixture of 13 metabolites HMDB peaks.
The 13 metabolites included in SYN5 are: Choline, Glutathione, L-Alanine, L-Glutamic Acid, L-Glutamine, L-Leucine, L-Valine, L-Asparagine, L-Isoleucine, L-Lactic acid, L-Proline, Succinic acid and Taurine.
The experimental data includes:
-
EXP1: 1H-NMR spectrum of a spiked-in urine sample from a healthy individual obtained from Zheng et al. [21]. The spectrum was measured on a 500 MHz Bruker Avance NMR spectrometer. The five spiked-in metabolites in this sample are: Taurine, Hippuric acid, Nicotinate, Malic acid and Oxoglutaric acid. The remaining metabolites in the mixture are not known.
-
EXP2: An experimental mixture of 5 metabolites measured on a 270 MHz Jeol spectrometer at 25°C. Mixture was prepared under N2 atmosphere, in 99.8% D2O with reference substance (3-(trimethylsilyl)propionic-2,2,3,3-d4 acid, sodium salt, 98% atom% D, TMSP, at 0.6 mM). The five metabolites included in this sample are: Creatine, D-Glucose, Citric acid, Phosphocholine and Accetylcholine.
Peak identification and noise removal
The user input for MetaboHunter consists of a complete spectrum in ASCII format structured on two columns, first one representing the spectral point position (ppm) and the second column representing the corresponding intensity of the signal. Based on a user-provided threshold for noise level, i.e. the minimum intensity level above which peaks should be considered, the data is pre-processed so that data points with intensity below the threshold are removed and the peaks are automatically identified (see Figure 12).
Exact and variable peaks matching
In MetaboHunter, a metabolite is considered to be present if at least one of its peaks matches a peak in the test sample and the computed significance score is greater than a user-set threshold (default is 0.5).
While the significance score proposed in [6] computes the ratio between the number of matched peaks and the total number of peaks for each metabolite, here we propose a metabolite significance score that is able to differentiate between matches of metabolites with low number of peaks as opposed to those with greater number of peaks. The significance score is computed as the ratio between: (i) the number of matched metabolite peaks and (ii) one plus the total number of metabolite peaks (Equation 1). For example, one metabolite with 2 out of 4 peaks matching a sample gets a score of 0.4, while another metabolite with 5 out of 10 peaks matching the same sample will get a higher score of 0.45, thus being ranked higher.
Equation 1. A novel scoring function used by all three methods implemented in MetaboHunter.
Exact matches for metabolite peak locations can be considered when chemical shifts of metabolites in the sample are assumed to be the same as in the reference database. Thus we provide an implementation of this approach based on the significance score described above. Methods MH2 and MH3 allow variations in chemical shifts and take as input a user-set parameter τ representing the maximal range allowed. Each metabolite peak is matched against all sample peaks p that fall within the interval [p-τ, p+τ] and the corresponding significance score is calculated and reported.
Metabolite fingerprinting methods
For the detection of metabolites using 1H-NMR spectra of sample mixtures, we use three methods.
The first method (MH1) relies on exact matching of peaks from the mixture with reference metabolite spectrum peaks collected and carefully curated from two publicly available databases (HMDB and MMCD). The method starts by computing and sorting initial scores using Equation 1 based on peak matches between the test spectrum and all the library spectra. Next, the matched metabolites are then screened based on the desired user features, such as metabolite type, sample pH, solvent, and NMR frequency and only those with scores larger than the desired confidence threshold are displayed. The method is more accurate if the spectral measurements of the mixture and of the metabolites in the database are performed under the same NMR and sample conditions, otherwise, peak position changes can lead to false metabolite identifications.
Pseudo-code for MH1
Input: A spectrum (or set of peaks) S of a mixture of metabolites M = {m
1
, m
2
,..., m
n
}, a library L with k metabolites {l
1
, l
2
,..., l
k
}, a noise threshold η, a ranking threshold r, and a set of metabolite features F.
Output: A set of metabolites predicted to be in the mixture.
Method:
-
1.
Pre-process S: remove all peaks in S, whose amplitudes are below η.
-
2.
for i = 1:k
-
3.
Compute significance score S
i
for each library metabolite l
i
without horizontal peak drift
-
4.
Output metabolite l
i
if S
i
≥ r and l
i
features obey F.
-
5.
end for
To partially alleviate this inconvenience, a second method (MH2) was proposed that allows inexact one dimensional match of peak chemical shifts (measured in parts per million - ppm) via horizontal peak drifts by a not constrained user-defined margin (e.g. a small chemical shift variation of ± 0.005 ppm or no variation at all). This feature offers more flexibility than HMDB NMR Search [28] and MetaboMiner [6] towards counter-acting the effects of measurement and pre-processing variability introduced by factors such as: different instruments, pH values and solvents. In this case, the number of matches typically increases with increased tolerance level around a given peak location, thus increasing also the number of false positives. The main difference in functionality between MH2 and MH1 resides in an adjusted matching criterion for computing the scores obtained with Equation 1. Two peaks are considered to match if their shift position is within the given "shift tolerance" parameter value, and thus the scores obtained with MH2 are typically higher than those computed with MH1.
Pseudo-code for MH2
Input: A spectrum (or set of peaks) S of a mixture of metabolites M = {m
1
,m
2
,...,m
n
}, a library L with k metabolites {l
1
,l
2
,...,l
k
}, a noise threshold η, a ranking threshold r, a user-set peak drift margin δ, and a set of metabolite features F.
Output: A set of metabolites predicted to be in the mixture.
Method:
-
1.
Pre-process S: remove all peaks in S, whose amplitudes are below η.
-
2.
for i = 1:k
-
3.
Compute significance score S
i
for each library metabolite l
i
with δ horizontal peak drift
-
4.
Output metabolite l
i
if S
i
≥ r and l
i
features obey F.
-
5.
end for
The third method (MH3) consists of a greedy selection approach that enforces mutual exclusion of peaks via an iterative coordinate removal for each selected peak. In the same fashion as for MH1, MH3 starts by computing and sorting initial scores using Equation 1 based on peak matches between the test spectrum and all the library spectra. Next, it proceeds in an iterative manner by selecting the library metabolite with the highest score and highest number of peaks (in case of a tie) and then removing the corresponding peaks for the selected metabolite from the remaining pool of unassigned peaks. The method stops when no spectral peaks remain to be assigned. This approach favours the early selection of metabolites with higher scores and higher number of peaks, thus decreasing the number of false positives (through iterative removal of remaining peak coordinates) at the cost of increasing the number of false negatives. False negatives typically happen in this method for metabolites in the mixture with a high degree of overlapping peak coordinates. MH3 also allows a user-defined shift tolerance around peak locations (not evaluated here and thus experimental). We also note that MH3 could be further improved so that to better mimic the processing of human experts, if after iteratively removing the corresponding spectrum of one metabolite, the remaining spectrum will be further adjusted to take into account the effect of the removed components on the mixture spectrum. Nevertheless, supporting information that could help with this step is not available at this time.
Pseudo-code for MH3
Input: A spectrum (or set of peaks) S of a mixture of metabolites M = {m
1
,m
2
,...,m
n
}, a library L with k metabolites {l
1
,l
2
,...,l
k
}, a noise threshold η, a ranking threshold r, a user-set peak drift margin δ, and a set of metabolite features F.
Output: A set of metabolites predicted to be in the mixture.
Method:
-
1.
Pre-process S: remove all peaks in S, whose amplitudes are below η.
-
2.
for i = 1:k
-
3.
Compute significance score S
i
for each library metabolite l
i
with δ horizontal peak drift
-
4.
end for
-
5.
Sort metabolites in reverse order based on their significance scores
-
6.
while (number of peaks in S ≠ 0)
-
7.
for i = 1:k
-
8.
Select metabolite l
i
with highest significance score with/without δ horizontal peak drift
-
9.
Output metabolite l
i
if S
i
≥ r and l
i
features obey F.
-
10.
Remove all peaks from S corresponding to metabolite l
i
-
11.
end for
-
12.
Re-sort metabolites in reverse order based on their updated significance scores
-
13.
end while
Given the two reference libraries used as support for the proposed methods, we append the suffixes "_HMDB" and "_MMCD" to the corresponding method names.
Results evaluation
Similarly to the methodology described in [6], we perform our analysis by defining the confusion matrix (Figure 13), which includes:
-
True Positives (TP): The number of metabolites correctly identified as being present in the mixture.
-
True Negatives (TN): The number of metabolites correctly identified as not being present in the mixture. This includes the total number of metabolites in the library minus TP, FP and FN.
-
False Positives (FP): The number of metabolites incorrectly identified as being part of the mixture.
-
False Negatives (FN): The number of metabolites being present in the mixture but unidentified by the method.
Each of the measures defined above requires a cut-off threshold defined as the first N metabolites predicted to be in the mixture. The size of the metabolite libraries, upon which the TN is defined, is 876 for our HMDB library, 448 for our MMCD library and 916 for the online HMDB NMR Search DB.
The following statistics are used for performance evaluation of the peak identification methods proposed in this article:
Unless otherwise specified, all results obtained with MetaboHunter and reported in this article were obtained using water as solvent and zero shift tolerance.
Run times and computational settings
The absolute CPU time for each method was measured on a PC with one 2.8 GHz Intel Pentium 4 CPU, 512 KB cache and 1 GB RAM running Mandriva Linux release 2010.2 (kernel 2.6.33.7). The obtained run time values for MH3_HMDB range from 2.1 seconds for input peak list files with 5 metabolites to 7 seconds for files with 100 metabolites, while for MH1_HMDB the run times range from 3.4 seconds to 5.7 seconds for the same input files. The run time estimates were averaged over 10 runs for each method and for each input peak list file. When MMCD is used as reference library, the average run times decrease in average with 3.5 seconds for input files with peak lists corresponding to mixtures of 100 metabolites. The decrease in run time is caused by the reduced size of the MMCD reference library (448 spectra) compared to HMDB (867 spectra).
Throughout the paper we used default settings for all publicly available software, such as HMDB NMR Search, BMRB, and Chenomx Profiler, when not specified otherwise. In the case of the free evaluation copy of Chenomx Profiler, we could only test it on data sets where FID files were accessible (SYN1-SYN4), since the software does not accept lists of peaks as input.
Web user interface
MetaboHunter was implemented in Perl and its graphical user interface was developed in PHP, thus being browser independent. There are 4 functional views in MetaboHunter: (i) a Processing View, (ii) a Search Results View, (iii) a Plot View and, (iv) a Peaks Hit Map view. When the user accesses MetaboHunter web site, the Processing View (Figure 14) is the default view. The user can provide its input as a 2 column full spectrum or list of peaks, which can be uploaded as a text file or be copied and pasted in a text box.
MetaboHunter has a comprehensive output for identified metabolites in input spectra or lists of peaks. The output is depicted in the Search Results View (Figure 15), which includes a ranked list of metabolite IDs, names and taxonomic origin, direct links to their descriptive original web pages, scoring information, and links to spectral plots and peaks hit map visualization of selected results. The results displayed in the Search Results View can be downloaded as text.
The Plot and Peaks Hit Map Views depicted in Figures 16 and 17 permit users to visualize the overlap between selected peaks corresponding to specific metabolites and the input data, as well as, the shared and disjoint peak locations for selected metabolites in the output. The plots are automatically generated using the Highcharts interactive JavaScript library and can be exported as PDF files.