- Research article
- Open Access
Model-based peak alignment of metabolomic profiling from comprehensive two-dimensional gas chromatography mass spectrometry
© Jeong et al; licensee BioMed Central Ltd. 2012
- Received: 3 October 2011
- Accepted: 8 February 2012
- Published: 8 February 2012
Comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (GCxGC/TOF-MS) has been used for metabolite profiling in metabolomics. However, there is still much experimental variation to be controlled including both within-experiment and between-experiment variation. For efficient analysis, an ideal peak alignment method to deal with such variations is in great need.
Using experimental data of a mixture of metabolite standards, we demonstrated that our method has better performance than other existing method which is not model-based. We then applied our method to the data generated from the plasma of a rat, which also demonstrates good performance of our model.
We developed a model-based peak alignment method to process both homogeneous and heterogeneous experimental data. The unique feature of our method is the only model-based peak alignment method coupled with metabolite identification in an unified framework. Through the comparison with other existing method, we demonstrated that our method has better performance. Data are available at http://stage.louisville.edu/faculty/x0zhan17/software/software-development/mspa. The R source codes are available at http://www.biostat.iupui.edu/~ChangyuShen/CodesPeakAlignment.zip.
- Homogeneous Data
- Peak Pair
- Rank Distance
- Peak Match
- Peak Alignment
Comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (GCxGC/TOF-MS) has been employed in analysis of complex samples in metabolomics studies, especially in cancer [1–3]. However, experiment output still suffers from substantial variations, challenging the data interpretation in these studies. For this reason, the methodological development at the data analysis stage has become a crucial research area, which is still at its infancy. Throughout the paper, we mean experiment by technical run, which is performed by machine.
In practice, there are two types of experiment variations: variation within an experiment and variation between experiments. The need to control variation within an experiment has been reported by many researchers [4–6]. Theoretically, all instrument signals generated by one metabolite species should be reported as a single peak within an experiment. However, in reality, multiple peak entries can occur due to the abnormality in peak shape, high sensitivity of the peak detection algorithm, and experimental cause such as short modulation time [4–6]. To reduce such variation, peak merging should be done before proceeding to any subsequent analysis. On the other hand, variation between experimental runs is induced by factors such as difference in experiment configuration and run-to-run variations. Between-experiment variation usually is much higher in magnitude than within-experiment variation. Specifically, retention time (RT) depends heavily on experimental setup by the nature of experiment. For analysis purpose, different retention times of the same metabolite between experiments should be adjusted for further analysis. The process of such an alignment is usually referred to as peak alignment. Typically, a peak alignment process includes two steps: peak matching where the identity of peaks from each experiment is matched, and RT adjustment that is based on the results of the peak matching.
Several studies addressed the alignment issue of metabolomic profiling from GCxGC/TOF-MS experiment [4–9]. From a methodological perspective, we can classify them into three categories. In the first generation methods [7–9], alignment implies RT adjustment, which is solely based on data of the retention time without the input of metabolite identification. For example, algorithms for aligning local region of interest were introduced in Fraga et al. and Mispelaar et al. [7, 8]. And then, algorithm to align entire chromatogram in two-dimensional GC was suggested by Pierce et al. . However, the limitation of those methods is that aligning metabolites by using two-dimensional retention times only may produce false alignment because some metabolites with similar chemical functional groups have similar retention times in both GC dimensions . For this reason, the second generation methods have been developed exploiting two different types of information: closeness in two-dimensional retention times and spectrum similarity [4–6]. Three well-known methods of this generation are MSort , DISCO , and mSPA . Since DISCO is a modified version of MSort, they are similar in many respects. Thus, our focus is on the difference between mSPA and the other two. First of all, the process of peak alignment in mSPA includes metabolite matching only, without reference to RT adjustment while the other two address both. Second, Kim et al.  defined and used a mixture similarity score, weighted average of RT distance and spectra similarity for peak matching while the other two used both information sequentially in each step. Third, as a spectrum similarity measure, Kim et al.  used dot product and the other two used Pearson's correlation coefficient. More comparison among these three methods can be found in .
Our method, as a third generation method, is unique in that it is model-based approach. Compared to the second generation methods, our method is different in many respects. Compared to mSPA, our method considers rank distance of retention time instead of Euclidean distance which is used in mSPA. As a spectrum similarity measure, we use cosine score, angle between two spectra while mSPA uses dot product which is the cosine value of the angle. Also, our method covers both homogeneous and heterogeneous data while mSPA can handle homogeneous data only. For clarity, when we get data under the exactly same experiment configuration, we call it homogeneous. Otherwise, we call it heterogeneous. Most of all, our method uses posterior probability for metabolite matching based on an empirical Bayes model. The mSPA, however, defines an ad hoc likelihood function and maximises the function. Compared with DISCO, there are some aspects in common: both methods (1) can be applied to homogeneous and heterogeneous data, (2) address both peak matching and RT adjustment. On the other hand, they differ in four key ways: (1) our method does not need any RT transformation at the pairwise peak matching stage, (2) we use posterior probability as a matching confidence, (3) we use lattice-wise method for RT adjustment, not peak-wise, (4) as a similarity measure, we use mixture similarity score with cosine score involved, but DISCO use Pearson's correlation coefficient.
Since DISCO can be applied to heterogeneous as well as homogeneous data, we compare our method with DISCO. In what follows, we provide a brief description of the model. Then we demonstrate the performance of our method with a mixture of standard compounds and a rat plasma data.
Two different types of experiments are analyzed in this study: a mixture of standard compounds (Experiment I) and a rat plasma (Experiment II). We have three sets of homogeneous data from Experiment I corresponding to three different temperature gradients, respectively: dataset1 (5°C/min) with 10 replicates, dataset2 (7°C/min) with 2 replicates and dataset3 (10°C/min) with 4 replicates. To produce a heterogeneous dataset by using three homogeneous datasets available, we selected one technical replicate from each dataset and combined them, which is called dataset4. Thus, we have 10, 2, 4, 3 technical replicates in each dataset from Experiment I. From Experiment II, we have 5 homogeneous technical replicates but, no heterogeneous output. More details of experiments are given in Additional file 1.
Overview of algorithm
To help understanding of our results, we summarize our algorithm briefly because the order of results follows that of our algorithm. After peak merging, we select two experiment outputs and calculate matching confidence of peaks in the form of posterior probability through the empirical Bayes model. Based on these matching confidence (Equation 8 in Methods section), we select metabolite pairs with high matching confidence by applying cutoff value to the posterior probability. We then continue the same pair-wise process for all other experiment outputs and generate representative landmark peaks. Given landmark peaks, we adjust RT of all peaks with respect to these peaks.
Summary of Experiment I: number of compounds after/before peak merging
R 2 5
R 3 5
R 4 5
R 5 5
R 6 5
R 7 5
R 8 5
R 9 5
R 10 5
R 2 7
R 2 10
R 3 10
R 4 10
Summary of Experiment II: number of compounds after/before peak merging
Here we choose threshold value (h = 40) which is used to calculate a j, b j and b j * in layer 2 of our model (see Methods section). Also, we use weight (w = 0.1) for mixture score and apply cutoff value of 0.9 to matching confidence, posterior probability of correct match for landmark peak selection.
In Experiment I, 11, 40, 28 and 24 landmark peaks were selected for each dataset, respectively. In Experiment II, 31 landmark peaks were selected.
Peak alignment results
Averaged best F1 score: our method v.s. DISCO
Based on the results, it is clear that the performance of our method is better than DISCO except homogeneous case from Experiment I. As seen in the results, our method has better performance for complex data. More precisely, the difference in performance is getting bigger as the complexity of the data increases.
To investigate the discordance in the results of both methods, we manually checked the alignment results for homogeneous data from Experiment I (left panel in Figure 5). Since 67 common peak pairs were aligned by both methods, our focus was on the other parts. The four peaks aligned by our method, which have different names given by ChromaTOF, were further analyzed based on raw image data. It is clear that one of them is correctly aligned. However, the other three might be incorrect. In addition to that, six peak pairs which had the same compound name by ChromaTOF, but not aligned by our method were also manually examined. Among them, ChromaTOF made wrong decision for three of them while the other three are correct.
To further investigate the discordant results, we selected two peaks (CAS 193-39-5 and 629-92-5) in target. They were aligned by both method but, aligned to different compounds in library (denoted by * in Additional file 1, Table S8). Those peak pairs aligned by GS were not the best peak pairs by our method, but the second best. For example, a compound with CAS(193-39-5) in target peak list was aligned to the compound with CAS(191-24-2) in reference peak list by our method, not to the compound with CAS(193-39-5) because peak pair (193-39-5 and 191-24-2) had a similarity score of 9.67 while peak pair (193-39-5 and 193-39-5) had a similarity score of 9.69 (i.e., in our method, a pair with the smaller similarity score is selected). Similarly, even though the compound (CAS 629-92-5) was assigned to different compound in library, the difference in score was not substantial. More details about the four and six peaks aligned by each method only are summarized in Additional file 1 (see Section 4.5).
Compared to the existing peak alignment method DISCO that can also be applied to heterogeneous as well as homogeneous data, our method has many unique properties. First, our method is a model-based approach. Second, unlike the DISCO, our method does not need any type of data transformation such as z-score transformation at any stage of the process. For example, in case of heterogeneous data, DISCO first transforms retention times using z-scores in order to reduce the retention time variation among different GCxGC-MS experiments. In other words, data transformation converts heterogeneous data into (pseudo) homogeneous data. Although a well-defined transformation will definitely improve the peak alignment (especially, for heterogeneous data) through the reduction of false positives caused by retention time variation, it is not easy to find an optimal transformation function that can handle all kinds of variation. Therefore, requirement of transformation is usually considered as a downside and it is avoided. Third, once representative landmark peaks are obtained, we make grid net using their retention times and adjust retention times of all peaks with respect to the grid net sequentially in both dimensions (all peaks simultaneously in each dimension).
With a mixture of metabolite standards and a rat plasma data, it is shown that our method has good performance in terms of F1 score. The F1 score requires gold standard (GS) and we used ChromaTOF identification results as a tentative gold standard. In the case of mixture of metabolite standards, even though more variations are involved in heterogeneous data, F1 score is always good regardless of data type. On the other hand, F1 score for complicated data from rat plasma is not that much high even though it is homogeneous. Compared to standard mixture data, although we see some decrease in F1 value for complicated data, the value from our method is still higher than that of the other existing method.
We compared our matching results with the tentative GS. Through the comparison, even though there is some discordance, we see high level of concordance in matching results by both methods, resulting in high F1 score for our method.
In this paper, we developed a model-based peak alignment method to handle experiment variations, which can be applied to both homogeneous and heterogeneous experiments. Our method utilises a part of the output of ChromaTOF software as input data. The workflow of the method consists of two steps: pairwise peak matching and retention time adjustment. Due to the use of landmark peak lists composed of peak pairs with high matching confidence, our approach produces good quality of peak alignment.
In the peak alignment context, the excellent performance of our method at the data processing stage will have an enormous positive effect on subsequent analysis. For example, even though experiments are performed under the different experiment configurations or even at different times, the data aligned by our method can be used as input for further analysis: for example, time course metabolomic data analysis. Thus, the area to which our method can be applied might be extended to metabolite biomarker finding and metabolite clustering. Furthermore, as a future study, we we will study the relationship between peak alignment and peak identification to improve the accuracy of both preprocessing.
Our empirical Bayes model for pairwise peak matching between technical runs utilises the same structural hierarchy as the model constructed for metabolite identifications in . Here we briefly review the model. Suppose that we have two experiment outputs and consider one of them as reference and the other as target in the context of peak alignment.
We consider a hierarchical statistical model with four layers. All layers together address the process of our algorithm. Here is a brief overview of our hierarchical model: (i) we first check if a compound in library is in sample, (ii) depending on the information given in (i), we check if the compound is matched to any compound in sample, (iii) we then check if the match is correct because our matching using similarity score is not always correct, (iv) we finally estimate the distribution of similarity score.
where N is the number of the peaks in the reference.
where is a mixture similarity score between peaks q and k in the reference, C is the set of peaks in the reference, and I(·) is the indicator function.
where b j * includes metabolite j itself as a neighbor to account for the fact that Y j = 1.
Since is between 0 and 1, this implies that our matching is not always correct even though metabolite j is true positive.
where f is the mixture of densities f T and f F that are the distributions of the scores of the correct matches and incorrect matches, respectively, and ϕ T and ϕ F are corresponding parameters.
Rationale behind the model: the rationale behind the use of a logistic function in layer 2 results from logistic regression. In other words, we investigated the relationship between Z and corresponding competition scores by logistic regression. Then, we found that quadratic function is statistically significant (see Section 1-3 in Additional file 1). Note that score function (f) consists of two score density functions: f = πf T + (1-π)f F. According to the distribution of observed scores, the specification of the score functions is decided. Here we assume normality. According to the distribution of observed scores, either f T or f F could be assumed a normal mixture. The parameter vector to be estimated is More details about model description can be found in .
where is the estimated parameter vector. Note that this matching confidence plays a key role in the first step of peak alignment procedure.
Mixture similarity score
where S(A, B) is mixture similarity score between two peaks A and B. Note that w is weight (0≤w≤1), D is rank distance based on retention time, and C is cosine score, angle between two peaks in high dimensional space (see Additional file 1 for details). Clearly, there is unit imbalance between RT distance and spectrum similarity. To balance them, we rescale rank distance (D) and angle (C) in Equation (9). Since considering rank of RT as a measure of closeness in RT reduces false positive rate , we take into account the elution order of RT in Equation (9). When calculating similarity score, we prefer small value of w(0 ≤ w ≤ 0.5) in order for spectrum similarity to play a important role in similarity score.
Peak alignment algorithm
Pairwise peak matching
Retention time adjustment
Where Similarly, new value for the second dimensional retention time can be obtained. In summary, retention time adjustment process consists of two sequential steps: the first dimension RT adjustment and the second dimension RT adjustment. RTs for all peaks in each dimension are aligned simultaneously with respect to the reference lattice, which is constructed based on representative landmark peaks. More details are provided in Additional file 1.
This work was supported by the National Institutes of Health [1R01GM087735 to J.J., X.Z., and C.S.] and an Indiana University Showalter Research Trust Funding Award to J.J and C.S. and the Department of Defense grant [BC030400 to J.J. and C.S.] and the Department of Energy grant [DE-EM0000197 to S.K.]
- Kind T, Tolstikov V, Fiehn O, Weiss RH: A comprehensive urinary metabolomic approach for identifying kidney cancer. Analytical Biochemistry 2007, 363: 185–195. 10.1016/j.ab.2007.01.028View ArticlePubMedGoogle Scholar
- Sreekumar A, Poisson LM, Rajendiran T, Khan AP, Cao Q, Yu J, Laxman B, Mehra R, Lonigro RJ, Li Y, Nyati MK, Ahsam A, Kalyana-Sundaram S, Han B, Cao X, Byun J, Omenn GS, Ghosh D, Pennathur S, Alexander DC, Berger A, Shester JR, Wei JT, Varambally S, Beecher C, Chinnaiyan AM: Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression. Nature 2009, 457: 910–915. 10.1038/nature07762PubMed CentralView ArticlePubMedGoogle Scholar
- Hu JD, Tang HQ, Zhang Q, Fan J, Hong J, Gu JZ, Chen JL: Prediction of gastric cancer metastasis through urinary metabolomic investigation using GC/MS. World Journal of Gastroenterololy 2011, 17: 727–734. 10.3748/wjg.v17.i6.727View ArticleGoogle Scholar
- Oh C, Huang X, Regnier FE, Buck C, Zhang X: Comprehensive two-dimensional gas chromatography/time-of-flight mass spectrometry peak sorting algorithm. Journal of Chromatography 2008, 1179: 205–215. 10.1016/j.chroma.2007.11.101PubMed CentralView ArticlePubMedGoogle Scholar
- Wang B, Fang A, Heim J, Bogdanov B, Pugh S, Libardoni M, Zhang X: ISCO: distance and spectrum correlation optimization alignment for two-dimensional gas chromatography time-of-flight mass spectrometry-based metabolomics. Anal Chem 2010, 82: 5069–5081. 10.1021/ac100064bPubMed CentralView ArticlePubMedGoogle Scholar
- Kim S, Fang A, Wang B, Jeong J, Zhang X: An optimal peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry using mixture similarity measure. Bioinformatics 2011, 27: 1660–1666. 10.1093/bioinformatics/btr188PubMed CentralView ArticlePubMedGoogle Scholar
- Fraga CG, Prazen BJ, Synovec RE: Objective data alignment and chemometric analysis of comprehensive two-dimensional separations with run-to-run peak shifting on both dimensions. American Chemical Society 2001, 73: 5833–5840.Google Scholar
- Mispelaar VG, Tas AC, Smilde AK, Schoenmakers PJ, van Asten AC: Quantitative analysis of target components by comprehensive two-dimensional gas chromatography. Journal of Chematography A 2003, 1019: 15–29. 10.1016/j.chroma.2003.08.101View ArticleGoogle Scholar
- Pierce KM, Wood LF, Wright BW, Synovec RE: A comprehensive two-dimensional retention time alignment algorithm to enhance chemometric analysis of comprehensive two-dimensional separation data. Analytical Chemistry 2005, 77: 7735–7743. 10.1021/ac0511142View ArticlePubMedGoogle Scholar
- Horai H, Arita M, Kanaya S, Nihei Y, Ikeda T, Suwa K, Ojima Y, Tanaka K, Tanaka S, Aoshima K, Oda Y, Kakazu Y, Kusano M, Tohge T, Matsuda F, Sawada Y, Hirai YM, Nakanishi H, Ikeda K, Akimoto N, Maoka T, Takahashi H, Ara T, Sakurai N, Suzuki H, Shibata D, Neumann S, Lida T, Tanaka K, Funatsu K, Matsuura F, Soga T, Taguchi R, Saito K, Nishioka T: MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrometry 2010, 45: 703–714. 10.1002/jms.1777View ArticleGoogle Scholar
- Jeong J, Shi X, Zhang X, Kim S, Shen C: An empirical Bayes model using a competition score for metabolite identification in gas chromatography mass spectrometry. BMC Bioinformatics 2011, 12: 392. 10.1186/1471-2105-12-392PubMed CentralView ArticlePubMedGoogle Scholar
- Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J of the Royal Statistical Society B 1977, 39: 1–38.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.