An empirical Bayes model using a competition score for metabolite identification in gas chromatography mass spectrometry
© Jeong et al; licensee BioMed Central Ltd. 2011
Received: 12 July 2011
Accepted: 10 October 2011
Published: 10 October 2011
Mass spectrometry (MS) based metabolite profiling has become increasingly popular in scientific and biomedical studies, primarily due to recent technological developments such as comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (GCxGC/TOF-MS). Nevertheless, the identification of metabolites from complex samples is subject to errors. Statistical and computational approaches that improve the accuracy of the identifications and of the false positive estimates are greatly needed. We propose an empirical Bayes model that accounts for a competition score in addition to the similarity score to tackle this problem. The competition score characterizes the propensity of a candidate metabolite to be matched to some spectrum, based on the similarity of the metabolite's spectrum to the other spectra in the library searched against. The competition score allows the model to properly assess the evidence on the presence/absence status of a metabolite based on whether or not the metabolite is matched to some sample spectrum.
With a mixture of metabolite standards, we demonstrated that our method has better identification accuracy than four other existing methods. Moreover, our method provides a reliable false discovery rate estimate. We also applied our method to data collected from the plasma of a rat and identified metabolites from the plasma while controlling the false discovery rate.
We developed an empirical Bayes model for metabolite identification and validated the method through a mixture of metabolite standards and rat plasma. The results show that our hierarchical model improves identification accuracy as compared with methods that do not structurally model the involved variables. The improvement in identification accuracy is likely to facilitate downstream analysis such as peak alignment and biomarker identification. Raw data and result matrices can be found at http://www.biostat.iupui.edu/~ChangyuShen/index.htm
The metabolome is the collection of small-molecule metabolites in an organism or biological system, typically under 1000 daltons. The network of metabolic reactions, in which the outputs of one enzymatic reaction are inputs to other reactions, is a key component of cellular physiology. In addition, the interactions of metabolites with larger biomolecules (e.g. proteins) are critical for many important biological processes. Therefore, metabolomics, the study of all metabolites in a system, has in its own right great implications for scientific and biomedical advancement [2, 3].
With the development of mass spectrometry technology, particularly in combination with comprehensive two-dimensional gas chromatography (GCxGC), which substantially improves separation capacity, a large number of metabolites can now be identified at a time. Identification is performed by comparing the spectra of those metabolites with spectra of known identity. However, these identifications are subject to errors due to experimental noise, incompleteness of the library, technical limitations and so on. There is therefore a great need to improve the accuracy of both the identifications and the estimates of false positives at the data analysis stage, as the validity and efficiency of downstream analyses rely on the quality of the identifications.
To our knowledge, there have been relatively few developments along this line compared with similar analysis issues in mass spectrometry based proteomics [5, 7–9]. Several studies have addressed spectra registration (or alignment) for comprehensive two-dimensional GC data [10–13]. Some of these studies did not address the identification issue; instead, they assumed that the metabolite identifications made by the ChromaTOF software were correct and used those results directly for alignment. In addition, no model that analyzes the score distribution has been developed to improve the accuracy of metabolite identification. In this paper, we propose an empirical Bayes model that analyzes the similarity score distribution to improve the accuracy of metabolite identifications and their confidence measures for GCxGC/TOF-MS data. The model integrates the information coming from each step of the experiment and produces a confidence measure of identification in the form of a posterior probability. The advantages of our method include: (i) the posterior probability allows straightforward estimation of the false discovery rate (FDR) and serves as the confidence measure; (ii) metabolites in the library that are not matched to any spectrum are also assigned a confidence measure regarding their presence/absence status in the sample; (iii) integration of different sources of evidence may provide better identification accuracy. A major novelty of our method is the inclusion of a competition score (b_j) for each metabolite j in the library, which is defined based on all spectra in the library. The competition score measures the propensity of j to be matched to some sample spectra through the resemblance of the spectrum of metabolite j to other spectra. As explained in detail in the Methods section, the competition score provides useful information for discerning true from false positives.
In what follows, we provide a detailed description of the model and demonstrate its utility through the analysis of a mixture of metabolite standards and a real data set generated by GCxGC/TOF-MS. For terminological clarity, our identification is compound identification via database search, which is also known as putative annotated identification. For simplicity, we use the word "identification" throughout the article.
Experiment 1: Mixture of metabolite standards
A mixture of 35 amino acids, fatty acids and organic acids was prepared in pyridine and derivatized with 100 μL of N-Methyl-N-(Tert-Butyldimethylsilyl) trifluoroacetamide (MTBSTFA). All GCxGC/TOF-MS analyses were performed on a LECO Pegasus 4D time-of-flight mass spectrometer (TOF-MS). The acquired data were then processed with a user-defined data processing method. LECO ChromaTOF software version 3.41 was used for instrument control and spectrum deconvolution. A total of 3286 sample spectra were obtained from this experiment.
Experiment 2: Rat plasma data
Metabolites were extracted from a 100 μL rat plasma sample using 900 μL of organic solvent mixture (methanol:water = 8:1) and further derivatized with MTBSTFA. All GCxGC/TOF-MS analyses were performed on a LECO Pegasus 4D time-of-flight mass spectrometer (TOF-MS). The acquired data were processed with a user-defined data processing method. The same software, LECO ChromaTOF version 3.41, was used for instrument control and spectrum deconvolution. A total of 1122 sample spectra were obtained from the experiment.
More details about both experiments are provided in Additional file 1.
Library and database search
where <A, B> is the inner product of spectra A and B and || · || is the Euclidean norm. To calculate the competition score b_j, we conducted all pairwise comparisons among the 2052 spectra in the library and considered two values for the threshold h, h = 30 and h = 40 (see the Methods section for the definitions of b_j and h). We compared each of the 3286 sample spectra with the 2052 library spectra using the cosine score and assigned the metabolite with the best (i.e. smallest) score to that sample spectrum.
For the analysis of Experiment 2, we used the library obtained from the Automated Mass Spectral Deconvolution and Identification System (AMDIS) software, which includes 3540 spectra. AMDIS is software for GC-MS data interpretation from the National Institute of Standards and Technology (NIST). Again, b_j was calculated from all pairwise comparisons of spectra in the library, and the database search assigned the metabolite with the best cosine score to each sample spectrum.
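The database search described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that the cosine score is reported as the angle in degrees between two intensity vectors, so that a smaller value means a closer match, which is consistent with the statement that the best score is the smallest. The function names `cosine_angle` and `assign_best_match` and the toy spectra are hypothetical.

```python
import math

def cosine_angle(a, b):
    """Angle in degrees between spectra a and b (assumed form of the score)."""
    inner = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos = inner / (norm_a * norm_b)
    # Guard against rounding slightly outside [-1, 1] before taking arccos.
    cos = max(-1.0, min(1.0, cos))
    return math.degrees(math.acos(cos))

def assign_best_match(sample_spectra, library):
    """Map each sample spectrum index to (library key, score) with the smallest score."""
    assignments = {}
    for i, spectrum in enumerate(sample_spectra):
        best_key = min(library, key=lambda k: cosine_angle(spectrum, library[k]))
        assignments[i] = (best_key, cosine_angle(spectrum, library[best_key]))
    return assignments

# Toy intensity vectors on a shared m/z grid (made-up values).
library = {"met_A": [10.0, 0.0, 5.0], "met_B": [0.0, 8.0, 1.0]}
samples = [[9.0, 1.0, 4.0], [1.0, 7.0, 0.0]]
matches = assign_best_match(samples, library)
```

Each sample spectrum receives exactly one assignment, while a library metabolite may be assigned to several sample spectra or to none, which is why the model later has to handle multiple similarity scores per metabolite.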
Mixture of metabolite standards
Definitions of terms such as FDR, sensitivity, specificity and ROC curve are provided in the Additional file.
Number of claimed positives and FDR
Rationale behind the use of competition score and multiple similarity scores
In this section, we explain the benefits of the two main advantages of our model: the use of the competition score and the use of multiple similarity scores for the same metabolite. For this purpose, we present two illustrative examples based on Experiment 1. In the first example, we highlight the role of the competition score in separating positives from negatives. To this end, we consider two hypothetical metabolites, j = 1, 2, both of which are assigned to two sample spectra with the same similarity scores, 40 and 50. However, their competition scores differ, with (b1, b1*) = (0.1, 0.49) and (b2, b2*) = (1.08, 1.15); that is, the only difference between the two hypothetical metabolites is their competition scores. Any difference in the posterior probability is therefore attributable to the difference in the competition score. Based on our model, the posterior probabilities for the two metabolites are 0.63 and 0.94, respectively. The competition score thus allows the confidence in a metabolite to be adjusted in addition to the similarity scores. To understand the difference in posterior probabilities, note that, given the same similarity scores, the confidence in a metabolite is driven by the likelihood ratio of equation (4) to equation (3) as a function of b_j and b_j*. Metabolite 2 has higher confidence than metabolite 1 because the relative increase in the likelihood of being matched due to the presence of metabolite 2 is higher than that for metabolite 1. To summarize, the evidence on the presence or absence of a metabolite that comes from its being matched to some sample spectrum should be quantified according to the capability of the metabolite's spectrum to mimic other spectra in the library, and according to how the likelihood of being matched changes from absence to presence.
In the second example, we illustrate how our model treats multiple similarity scores. More precisely, we highlight the effect of multiple similarity scores on the confidence of metabolite identification. What if we used a single average of those multiple scores instead of the individual scores? To answer this question, we selected two metabolites with b_j = 0 from the mixture of metabolite standards, to exclude the effect of the competition score: one true positive, TP1 (CAS number: 107715-91-3), and one true negative, TN1 (CAS number: 6066826). Since we know which metabolites exist in the sample in this case, a true positive is a metabolite in the library that exists in the sample and is claimed as positive. TP1 was assigned to 8 sample spectra, with similarity scores marked by red * in Figure 3. TN1 was assigned to 2 sample spectra, marked by blue *. The larger * represents the average score for each metabolite. The naive method cannot discriminate between the two metabolites because they have similar average scores, both close to 56. However, our empirical Bayes model yields posterior probabilities of 0.99 and 0.01 for TP1 and TN1, respectively. Our model performs well here because all individual scores, not just their average, are incorporated into the model and play a key role in producing the posterior probability. The point is that taking into account the number of matches and the distribution of the similarity scores provides rich information for discriminating true from false positives.
Two elements of the model, in layers 2 and 4, need more explanation; a detailed description of the model is given in the Methods section. In layer 2, the quadratic form of the conditional probability function was inspired by the results of logistic regression. More specifically, we investigated the relationship between the competition score and the status of metabolites. For example, we fitted logistic regression models (linear and quadratic) to 2000 true negatives and found that the quadratic function fitted well, i.e., the p-values for the parameters corresponding to b_j and b_j^2 were less than 0.0001. In layer 4, the characterization of the score function was based on the distributional behavior of the similarity scores, which varies from data set to data set. For the data we analyzed, we considered a three-component normal mixture, and the model worked reasonably well. Note that the data-specific properties of the score distribution led us to use a different allocation of mixture components for each data set. For the mixture of metabolite standards, we took f_T to be a two-component normal mixture and f_F a single normal distribution. In contrast, for the rat plasma data, we took f_F to be a two-component normal mixture and f_T a single normal distribution.
The composition of the library influences the identification process. Obviously, the quality and the kinds of spectra included in the library affect how well identification can be made. However, this aspect is complicated and beyond the scope of this paper; it could be a separate study in its own right. Nevertheless, we did make observations on the effect of the size of the library on the quality of identifications. When the library includes many more potential false positives, identification of the components of the standard mixture becomes much more difficult, as there are "false positives" that closely resemble the true positives. The posterior probabilities in this case tend to be much lower. Note that this is not a failure of our model itself; it arises simply because the matching score is no longer sufficiently discriminating.
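To make the quadratic form in layer 2 concrete, the sketch below evaluates a logistic function with linear and quadratic terms in the competition score. It is only an illustration under stated assumptions: the response is taken to be whether a metabolite is matched to some sample spectrum, the function name `match_probability` is hypothetical, and the coefficients are made-up values, not estimates from the paper.

```python
import math

def match_probability(b, beta0, beta1, beta2):
    """Logistic model with linear and quadratic terms in the competition score b."""
    eta = beta0 + beta1 * b + beta2 * b * b
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for illustration only (not fitted values).
coef = dict(beta0=-2.0, beta1=3.0, beta2=-1.0)
probs = [match_probability(b, **coef) for b in (0.0, 0.5, 1.0)]
```

With a negative quadratic coefficient, as assumed here, the probability rises with the competition score but flattens as the score grows; a purely linear logit cannot express that curvature.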
Database-search-based algorithms have been a popular approach to mass spectrometry-based high-throughput metabolite profiling. Due to the complexity of the experimental procedure and the dynamic nature of the fragmentation process, identifications of metabolites in GCxGC/TOF-MS are subject to errors. The accuracy of the identifications and of the false positive estimates is critical for the error control of downstream in silico or experimental investigations. During the database search, each candidate metabolite faces competition from other metabolites in the library to be the top hit. On the other hand, a large number of sample spectra also offers many opportunities for a candidate metabolite to be falsely matched. Taking into account the competition and opportunity associated with each candidate metabolite allows one to extract evidence on the presence/absence status more properly, based on whether or not a candidate metabolite is matched to some sample spectra. This type of evidence adds another dimension of information to the similarity score for assessing the confidence of metabolite identifications. In this article, we proposed the concept of a competition score to characterize the magnitude of competition and opportunity for each candidate metabolite. The competition score and similarity score are integrated by an empirical Bayes model to yield a confidence measure in the form of a posterior probability.
Since our method is a novel model-based approach to metabolite identification in GCxGC/TOF-MS data, there is no other model-based method with which to compare it. We therefore compared it with four other methods, in particular the naive method, which is based solely on the similarity score. The experiment with the mixture of metabolite standards demonstrated that our model provides more accurate metabolite identifications than the other methods. Just as controlling the type I error is very important in classical statistical problems, so is controlling the false discovery rate in high-throughput data [14, 17, 18]. As seen in Figure 4, our estimate of the FDR is reliable and conservative. In terms of sensitivity, a moderate sensitivity (about 0.6) is retained even at an extremely high level of specificity (greater than 0.999). It should be noted that the primary goal of high-throughput data analysis is to select promising candidates for downstream target-oriented experimental studies. Therefore, false positives are of more concern to the investigator than false negatives. In this sense, we consider the FDR to be of higher priority than sensitivity in data analysis.
It is conceivable that the conditional probability of being a correct match given the presence status (i.e. Pr[W_jl = 1 | Y_j = 1, Z_j = 1]) may also depend on the competition score or on some other, more appropriate measure. Extending our model in this direction could lead to further improvement in accuracy.
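As an illustration of how posterior probabilities lead to an FDR estimate, the sketch below uses a common estimator from the empirical Bayes literature: the average posterior probability of absence among the metabolites claimed as positive. Whether this is exactly the estimator used in the paper is an assumption. The function name `estimated_fdr` and the 0.5 threshold are hypothetical, and the four posterior values are simply the illustrative probabilities quoted in the two examples above.

```python
def estimated_fdr(posteriors, threshold):
    """Return (estimated FDR, number claimed) for metabolites with posterior >= threshold.

    The FDR estimate is the mean of (1 - posterior) over the claimed positives.
    """
    claimed = [p for p in posteriors if p >= threshold]
    if not claimed:
        return 0.0, 0
    return sum(1.0 - p for p in claimed) / len(claimed), len(claimed)

# Posterior probabilities of presence (illustrative values only).
posteriors = [0.99, 0.94, 0.63, 0.01]
fdr, n_claimed = estimated_fdr(posteriors, threshold=0.5)
```

Raising the threshold claims fewer metabolites and lowers the estimated FDR, which is the trade-off between FDR and sensitivity discussed above.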
Other identification methods
Since there is no model-based approach other than our method, we compare our method with four other identification methods: the naive method, the NIST MS dot product, the weighted dot product, and the composite similarity, based on the methods developed by Stein and Scott. Among the four methods, the naive method has the best performance in terms of ROC curve and FDR; the comparison results are provided in the Additional file. For comparison, we therefore focus on the naive method, which uses the same similarity scores that are incorporated in our model-based method. As mentioned, the similarity score is calculated by comparing library and sample spectra. Given the similarity scores, the naive method consists of three steps. First, we find the best match for each sample spectrum, i.e. the library spectrum with the best score in the comparison of sample and library spectra. Second, for metabolites matched to more than one sample spectrum, we take the average of the scores. Third, we apply a cutoff value to the average score and claim each metabolite as either positive or negative. Metabolites in the library that are not matched to any sample spectrum are considered absent from the sample.
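The three steps of the naive method can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that the best matches from step 1 are already available as (metabolite, score) pairs, that a smaller score is better so that a metabolite is claimed as positive when its average score is at or below the cutoff, and the function name `naive_identification`, the identifiers and the cutoff value are all hypothetical.

```python
from collections import defaultdict

def naive_identification(best_matches, library_ids, cutoff):
    """Classify library metabolites from best-match scores (smaller = better).

    best_matches: (metabolite_id, score) pairs, one per sample spectrum (step 1).
    Returns a dict mapping each library metabolite to True (positive) or False.
    """
    scores = defaultdict(list)
    for metabolite_id, score in best_matches:
        scores[metabolite_id].append(score)

    calls = {}
    for metabolite_id in library_ids:
        matched = scores.get(metabolite_id)
        if not matched:
            # Not matched to any sample spectrum: considered absent.
            calls[metabolite_id] = False
            continue
        # Step 2: average the scores; step 3: apply the cutoff.
        average = sum(matched) / len(matched)
        calls[metabolite_id] = average <= cutoff
    return calls

# Toy input: metabolite A is matched twice, B once, C never (made-up scores).
best_matches = [("A", 20.0), ("A", 30.0), ("B", 60.0)]
calls = naive_identification(best_matches, ["A", "B", "C"], cutoff=40.0)
```

Because only the average enters the decision, two metabolites with the same average score receive the same call regardless of how many spectra they matched, which is the limitation illustrated by TP1 and TN1.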
where N is the number of spectra in the library.
Hence, we assume that, conditional on Y_j = 1 and Z_j = 1, the correctness of each assignment of j to some sample spectrum follows an independent Bernoulli distribution. This layer allows our model to account for situations where, although a metabolite j is present in a sample, its assignment to some spectrum is not always correct. In other words, the match of a true positive to a sample spectrum may not always be correct.
where f is the mixture of f_T and f_F, the probability density functions of the scores of the correct and incorrect matches, respectively, and ϕ_T and ϕ_F are the corresponding parameters.
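A two-component score mixture of this kind can be sketched as below. The sketch is deliberately simplified and rests on assumptions: f_T and f_F are each taken to be a single normal density (in the paper one of them is itself a two-component normal mixture, depending on the data set), the mixing weight and all parameter values are made up, and the function names are hypothetical. It shows how Bayes' rule turns a similarity score into the probability that a match is correct.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(score, weight_true, phi_true, phi_false):
    """f(score) = w * f_T(score) + (1 - w) * f_F(score), with normal components."""
    return (weight_true * normal_pdf(score, *phi_true)
            + (1.0 - weight_true) * normal_pdf(score, *phi_false))

def prob_correct_given_score(score, weight_true, phi_true, phi_false):
    """Bayes' rule: share of the mixture density contributed by f_T."""
    numerator = weight_true * normal_pdf(score, *phi_true)
    return numerator / mixture_density(score, weight_true, phi_true, phi_false)

# Hypothetical (mean, sd) parameters; smaller scores indicate better matches.
phi_true = (30.0, 10.0)
phi_false = (60.0, 10.0)
p_mid = prob_correct_given_score(45.0, 0.5, phi_true, phi_false)
```

In the full model this score layer is combined with the presence, match and correctness layers, so the value computed here is only one ingredient of the posterior probability for a metabolite.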
This work was supported by the National Institutes of Health [1R01GM087735 to J.J., X.Z., and C.S.], an Indiana University Showalter Research Trust Funding Award to J.J. and C.S., the Department of Defense grant [BC030400 to J.J. and C.S.] and the Department of Energy grant [DE-EM0000197 to S.K.].
1. Oliver SG, Winson MK, Kell DB, Baganz F: Systematic functional analysis of the yeast genome. Trends in Biotechnology. 1998, 16: 373-378. 10.1016/S0167-7799(98)01214-1.
2. Roessner U, Bowne J: What is metabolomics all about?. BioTechniques. 2009, 46: 363-365. 10.2144/000113133.
3. Bathen TF, Sitter B, Sjobakk TE, Tessem M, Gribbestad IS: Magnetic resonance metabolomics of intact tissues: A biotechnological tool in cancer diagnostics and treatment evaluation. Cancer Res. 2010, 70: 6692-6696. 10.1158/0008-5472.CAN-10-0437.
4. Dettmer K, Aronov PA, Hammock BD: Mass spectrometry-based metabolomics. Mass Spectrometry Reviews. 2007, 26: 51-78. 10.1002/mas.20108.
5. Stein SE: An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. J of the American Society for Mass Spectrometry. 1999, 10: 770-781. 10.1016/S1044-0305(99)00047-1.
6. Fraga CG, Prazen BJ, Synovec RE: Objective data alignment and chemometric analysis of comprehensive two-dimensional separations with run-to-run peak shifting on both dimensions. Analytical Chemistry. 2001, 73: 5833-5840.
7. Kind T, Fiehn O: Seven golden rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics. 2007, 8: 105. 10.1186/1471-2105-8-105.
8. Zhang X, Oh C, Riley CP, Buck C: Current Status of Computational Approaches for Protein Identification Using Tandem Mass Spectra. Current Proteomics. 2007, 4: 121-130. 10.2174/157016407783221349.
9. Shen C, Wang Z, Shankar G, Zhang Z, Li L: A hierarchical statistical model to assess the confidence of peptides and proteins inferred from tandem mass spectrometry. Bioinformatics. 2008, 24: 202-208. 10.1093/bioinformatics/btm555.
10. Wang B, Fang A, Heim J, Bogdanov B, Pugh S, Libardoni M, Zhang X: DISCO: distance and spectrum correlation optimization alignment for two-dimensional gas chromatography time-of-flight mass spectrometry-based metabolomics. Anal Chem. 2010, 82: 5069-5081. 10.1021/ac100064b.
11. Oh C, Huang X, Regnier FE, Buck C, Zhang X: Comprehensive two-dimensional gas chromatography/time-of-flight mass spectrometry peak sorting algorithm. Journal of Chromatography. 2008, 1179: 205-215. 10.1016/j.chroma.2007.11.101.
12. Pierce KM, Wood LF, Wright BW, Synovec RE: A comprehensive two-dimensional retention time alignment algorithm to enhance chemometric analysis of comprehensive two-dimensional separation data. Analytical Chemistry. 2005, 77: 7735-7743. 10.1021/ac0511142.
13. Kim S, Fang A, Wang B, Jeong J, Zhang X: An optimal peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry using mixture similarity measure. Bioinformatics. 2011, 27: 1660-1666. 10.1093/bioinformatics/btr188.
14. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J of the Royal Statistical Society B. 1995, 57: 289-300.
15. Stein SE, Scott D: Optimization and Testing of Mass Spectral Library Search Algorithms for Compound Identification. J of the American Society for Mass Spectrometry. 1994, 5: 859-866. 10.1016/1044-0305(94)87009-8.
16. Stein SE: Estimating probabilities of correct identification from results of mass spectral library searches. J of the American Society for Mass Spectrometry. 1993, 5: 316-323.
17. Efron B, Tibshirani R: Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology. 2002, 23: 70-86. 10.1002/gepi.1124.
18. Efron B: False discovery rate control. Large-scale inference.
19. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J of the Royal Statistical Society B. 1977, 39: 1-38.
20. Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes analysis of a microarray experiment. J of the American Statistical Association. 2001, 96: 1151-1160.
21. Newton MA, Noueiry A, Sarkar D, Ahlquist P: Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004, 5: 155-176. 10.1093/biostatistics/5.2.155.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.