MathDAMP: a package for differential analysis of metabolite profiles
© Baran et al; licensee BioMed Central Ltd. 2006
Received: 02 June 2006
Accepted: 13 December 2006
Published: 13 December 2006
With the advent of metabolomics as a powerful tool for both functional and biomarker discovery, the identification of specific differences between complex metabolite profiles is becoming a major challenge in the data analysis pipeline. The task remains difficult, given the datasets' size, complexity, and common shifts in migration (elution/retention) times between samples analyzed by hyphenated mass spectrometry methods.
We present a Mathematica (Wolfram Research, Inc.) package MathDAMP (Mathematica package for Differential Analysis of Metabolite Profiles), which highlights differences between raw datasets acquired by hyphenated mass spectrometry methods by applying arithmetic operations to all corresponding signal intensities on a datapoint-by-datapoint basis. Peak identification and integration are thus bypassed and the results are displayed graphically.
To facilitate direct comparisons, the raw datasets are automatically preprocessed and normalized in terms of both migration times and signal intensities. A combination of dynamic programming and global optimization is used for the alignment of the datasets along the migration time dimension.
The processed datasets and the results of direct comparisons between them are visualized using density plots (axes represent migration time and m/z values while peaks appear as color-coded spots) providing an intuitive overall view. Various forms of comparisons and statistical tests can be applied to highlight subtle differences. Overlaid electropherograms (chromatograms) corresponding to the vicinities of the candidate differences from any result may be generated in a descending order of significance for visual confirmation. Additionally, a standard library table (a list of m/z values and migration times for known compounds) may be aligned and overlaid on the plots to allow easier identification of metabolites.
Our tool facilitates the visualization and identification of differences between complex metabolite profiles according to various criteria in an automated fashion and is useful for data-driven discovery of biomarkers and functional genomics.
The identification of specific differences between metabolite profiles plays a prominent role in metabolomic data analysis and can be useful for the discovery of biomarkers or the characterization of specific biological activities. Hyphenated mass spectrometry methods (GC-MS, LC-MS, CE-MS, etc.) are among the most common analytical tools for metabolomics. Most produce large datasets that are not easily interpretable using the software provided by most instrument manufacturers. The common data analysis workflow, starting from raw data, usually includes the detection of peaks, their integration, matching of corresponding peaks across datasets, and subsequent multivariate analysis. Several tools enabling automation of the procedure are available [2–7], but the overall task still proves challenging given the datasets' size, complexity, common shifts in migration times between datasets, and the need to identify metabolites. In addition, some of these tools either provide only partial solutions (generation of integrated peak lists) or were developed for a specific type of analysis (e.g. GC-MS), and some alignment algorithms may not be very robust when migration time differences are large and the composition of samples is highly variable. Moreover, automated peak picking and integration remain an important challenge, complicated by the wide range of peak intensities, the sometimes poor separation of compounds, and the resulting distorted peak shapes, which lead to multiple incorrect assignments of differences. While visual exploration of the raw data has been used to complement automated data analysis, this often comes at the expense of convenience and versatility. Direct chromatogram comparisons bypass peak picking and integration to select areas of interest from raw data or to locate differences between metabolite profiles.
To apply direct chromatogram comparisons as a complement or an alternative to the multivariate analysis of integrated peak lists, automation of the processing of raw data along with suitable visualization and metabolite identification methods are desirable. With MathDAMP, we provide a complete series of such tools, capable of providing an overall view of the differences between metabolite profiles according to different criteria. The functionality of the package is demonstrated with CE-MS data, which is particularly challenging due to the more significant migration time shifts, but the tools can be used for other types of hyphenated mass spectrometry methods as well.
Differences between metabolite profiles in MathDAMP are highlighted by applying arithmetic operations to all corresponding signal intensities from whole raw datasets on a datapoint-by-datapoint basis. To facilitate this, the datasets are processed into rectangular matrices and normalized in terms of both migration time and signal intensities. The results are visualized on density plots (also referred to as color maps or heat maps) providing a global view of the differences between samples. The main features of the package are briefly outlined below. A detailed description of the implementation and usage is part of the online documentation.
Raw datasets are binned along the m/z dimension to a specified resolution upon loading. Baselines may be subtracted by fitting the individual electropherograms to any user-specified function (first order polynomial by default) by robust nonlinear regression as described by Ruckstuhl et al. However, the regression is performed in a global fashion in our implementation. Following baseline subtraction, noise may be removed from individual electropherograms by leveling to 0 all signal intensities falling within a threshold. By default, the threshold is calculated for every electropherogram as a specific multiple (5) of the standard deviation of signal intensities from a specified region of the electropherogram where no signals are expected (1–3 min). The datasets may be smoothed by applying predefined or user-specified smoothing filters to all electropherograms in the dataset. Additionally, the datasets may be cropped along both the m/z and migration time dimensions.
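The default noise-removal rule can be sketched in Python as follows. This is only an illustration under stated assumptions, not the package's Mathematica code: the function name `remove_noise` is hypothetical, the sample standard deviation is used, the electropherogram is assumed to be baseline-subtracted already, and "within a threshold" is read as an absolute intensity at or below the threshold.

```python
from statistics import stdev


def remove_noise(times, intensities, noise_window=(1.0, 3.0), multiple=5.0):
    """Set to 0 every intensity whose magnitude does not exceed the noise threshold.

    The threshold is `multiple` times the standard deviation of the
    intensities recorded inside `noise_window` (in minutes), a region
    where no real signal is expected.
    """
    low, high = noise_window
    noise = [y for t, y in zip(times, intensities) if low <= t <= high]
    if len(noise) < 2:
        raise ValueError("noise window must contain at least two datapoints")
    threshold = multiple * stdev(noise)
    # Assumption: the comparison uses the absolute value of the intensity.
    return [y if abs(y) > threshold else 0.0 for y in intensities]
```

Applied to each electropherogram (one m/z bin over time), this leaves only signals that clearly stand out from the estimated noise level.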
A standard library table (a list of m/z values and migration times for known compounds) may be aligned to the reference dataset using the same procedure. The aligned standard library table can later be used to annotate the plots or for the automatic localization of the peaks of the internal standard as described below.
Signal intensities in the sample datasets may be normalized according to a specified list of normalization coefficients (e.g. originating from the sample weights). Additionally, the signal intensities may be normalized according to the peaks of the internal standard. These peaks are then integrated in all datasets after alignment. The location of the peak of the internal standard in the reference dataset may be either specified explicitly or it may be extrapolated from the aligned standard library table. In the latter case, the user specifies only the name of the compound to serve as the internal standard.
Following dataset normalization, various forms of direct comparisons may be performed to find differences between two or multiple datasets. Arithmetic operations or statistical tests are then applied to all corresponding signal intensities. The resulting dataset(s) has the same structure and dimensions as the compared datasets. Any processed, normalized, or result dataset can be easily exported as text or in binary format using Mathematica's built-in data export functionality.
To compare two datasets, the simplest way to highlight differences between them is to subtract the corresponding signal intensities (absolute difference). Alternatively, dividing this difference by the larger of the two signal intensities provides a measure of the relative difference. Multiplying the corresponding signal intensities of the absolute and relative difference results highlights differences significant in both absolute and relative terms (absolute × relative difference).
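The three two-dataset comparisons can be written down directly. The Python sketch below assumes datasets stored as equally sized lists of rows with non-negative intensities; all function names are hypothetical, and returning 0 where both intensities are 0 is an assumption made to avoid division by zero. Note that the literal product of the absolute and relative differences is never negative, so in this reading the direction of the change has to be taken from one of the other two results.

```python
def absolute_difference(a, b):
    """Signed difference between two corresponding intensities."""
    return a - b


def relative_difference(a, b):
    """Difference divided by the larger of the two intensities."""
    larger = max(a, b)
    # Assumption: where both intensities are 0 the result is defined as 0.
    return (a - b) / larger if larger > 0 else 0.0


def absolute_times_relative(a, b):
    """Product of the absolute and the relative difference."""
    return absolute_difference(a, b) * relative_difference(a, b)


def compare(dataset_a, dataset_b, operation):
    """Apply `operation` to all corresponding datapoints of two datasets."""
    return [[operation(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(dataset_a, dataset_b)]
```

The result of `compare` has the same shape as its inputs, so it can be displayed on the same kind of density plot as the datasets themselves.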
Outlier signals within a group of datasets may be located by calculating z-scores or by quartile analysis. A specified number of outliers may be removed from a set of corresponding signal intensities prior to z-score calculation. This limits the disproportionate influence of potential outliers on the mean and standard deviation of the set, which may otherwise lead to undesirably low z-score values. The result for the quartile-based analysis, an alternative to z-scores, is calculated as the difference between a specific signal intensity value and the third quartile (if the value is greater than the third quartile) or the first quartile (if the value is smaller than the first quartile) of the corresponding set, divided by the interquartile range. The result is set to 0 if the value lies between the first and the third quartile.
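Both outlier measures can be illustrated for a single set of corresponding signal intensities. The Python sketch below is not the package's implementation. It assumes that the outliers removed before the z-score calculation are the values farthest from the median, that the sample standard deviation is used, and that quartiles are computed with the inclusive method of Python's `statistics.quantiles`; the source specifies none of these details, and the function names are hypothetical.

```python
import math
from statistics import mean, median, quantiles, stdev


def z_scores(values, n_remove=0):
    """z-score of every value; mean and SD are computed after dropping
    the `n_remove` values farthest from the median (an assumption)."""
    med = median(values)
    kept = sorted(values, key=lambda v: abs(v - med))[:len(values) - n_remove]
    mu, sd = mean(kept), stdev(kept)  # stdev needs at least two kept values
    if sd == 0:
        return [0.0 if v == mu else math.copysign(math.inf, v - mu)
                for v in values]
    return [(v - mu) / sd for v in values]


def quartile_score(value, values):
    """Distance of `value` beyond the nearer quartile, in units of the IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    if q1 <= value <= q3:
        return 0.0
    distance = value - q3 if value > q3 else value - q1
    iqr = q3 - q1
    return distance / iqr if iqr > 0 else math.copysign(math.inf, distance)
```

For the set 2, 4, 6, 8, 30, removing one outlier before computing the mean and standard deviation raises the z-score of the value 30 considerably, which is the effect the outlier removal is meant to achieve.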
An F ratio (one-way ANOVA) is used to find differences between multiple groups of replicated datasets. The F ratio, t-score, z-score, and quartile-based result datasets may be smoothed to suppress signals resulting from individual coinciding noise-related signal intensities. Details and template notebooks for the differential analysis approaches described above are part of the online documentation. Additionally, any custom function to process corresponding signal intensities from datasets under comparison may be defined to highlight any difference or any pattern of interest.
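A datapoint-wise F ratio can be sketched in Python with the textbook one-way ANOVA formula. The function names and the nested-list data layout are assumptions for illustration, as is the handling of the degenerate case where all replicates within every group are identical.

```python
import math
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F ratio for one datapoint.

    `groups` holds one list of replicate intensities per group; at least
    two groups and more values than groups are required.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        # Assumption: no within-group variance gives 0 or infinity.
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def f_ratio_map(grouped_datasets):
    """Apply `f_ratio` to all corresponding datapoints.

    `grouped_datasets[g][r]` is replicate r of group g, stored as a list
    of rows; all datasets must have identical dimensions.
    """
    first = grouped_datasets[0][0]
    return [[f_ratio([[d[row][col] for d in group]
                      for group in grouped_datasets])
             for col in range(len(first[0]))]
            for row in range(len(first))]
```

The output of `f_ratio_map` has the same dimensions as each input dataset, matching the statement that result datasets share the structure of the compared datasets.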
Results and discussion
The binning of raw datasets along the m/z dimension facilitates their processing into a rectangular matrix format. For datasets with high resolution along the m/z dimension (such as those originating from TOF instruments), binning provides a significant decrease in size and resolution suitable for visual inspection. Choosing a wider bin size may, however, lead to an undesirable dilution of weak signals in noise. This can be overcome by first performing the baseline subtraction and noise removal on datasets binned using a narrow binning window. The resulting datasets may then be binned to a resolution suitable for visual inspection.
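The two-stage binning described above can be sketched as follows. This Python example assumes that a raw scan is a list of (m/z, intensity) pairs and that intensities falling into the same bin are summed; the source does not state how MathDAMP combines them, and the function names are hypothetical.

```python
def bin_spectrum(peaks, mz_min, mz_max, bin_width):
    """Sum the intensities of one scan into fixed-width m/z bins.

    Returns one row of the rectangular dataset matrix covering
    [mz_min, mz_max); values outside that range are ignored.
    """
    n_bins = int(round((mz_max - mz_min) / bin_width))
    row = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int((mz - mz_min) // bin_width)
        if 0 <= index < n_bins:
            row[index] += intensity
    return row


def rebin(row, factor):
    """Merge every `factor` adjacent bins, e.g. after noise removal."""
    return [sum(row[i:i + factor]) for i in range(0, len(row), factor)]
```

In this scheme a dataset would first be binned narrowly with `bin_spectrum`, cleaned by baseline subtraction and noise removal, and only then coarsened with `rebin` for visual inspection.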
A representative set of peaks is picked from the datasets for the purpose of migration time alignment. We modified the peak picking algorithm described above to use the maximum vertical distance from the line connecting two neighboring strategic points as the criterion for a new strategic point. This was done to avoid the necessity to normalize the migration time scale and the signal intensity scale prior to peak picking. However, when the vertical distance is used, many datapoints within a peak fulfill the criterion of being above the vertical distance threshold. Excessive strategic point selection is suppressed by specifying a minimum distance between neighboring strategic points. The minimum distance is, by default, set to a quarter of a typical peak width so that the excluded time windows in the vicinity of the strategic points corresponding to the peak top and to the peak base partially or almost completely overlap.
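One possible reading of this strategic-point selection is sketched below in Python. It is an assumption-laden illustration, not the package's algorithm: the selection starts from the two end points of an electropherogram, repeatedly adds the datapoint with the largest vertical distance from the line joining its neighboring strategic points, and stops when no eligible datapoint exceeds the threshold. The function name and parameters are hypothetical.

```python
def strategic_points(times, signal, threshold, min_gap):
    """Return sorted indices of strategic points of one electropherogram.

    A datapoint is eligible only if it lies at least `min_gap` (in time
    units) away from both neighboring strategic points and its vertical
    distance from the line connecting them exceeds `threshold`.
    """
    points = [0, len(signal) - 1]
    while True:
        best_index, best_distance = None, threshold
        for left, right in zip(points, points[1:]):
            span = times[right] - times[left]
            for i in range(left + 1, right):
                if (times[i] - times[left] < min_gap
                        or times[right] - times[i] < min_gap):
                    continue
                # Height of the connecting line at times[i].
                line = signal[left] + ((signal[right] - signal[left])
                                       * (times[i] - times[left]) / span)
                distance = abs(signal[i] - line)
                if distance > best_distance:
                    best_index, best_distance = i, distance
        if best_index is None:
            return points
        points = sorted(points + [best_index])
```

For a single triangular peak, the sketch returns the two end points, the peak top, and one point at each side of the peak base; the minimum distance keeps the remaining datapoints on the peak flanks from being selected as well.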
Results of the migration time normalization/alignment are shown in Figure 1. As can be seen, the migration time shifts are significant between the original samples and the trend toward larger shifts with increasing migration time is apparent (Figure 1a). The quality of the overall alignment procedure can be seen in Figure 1b. The quality of the alignment is rather uniform over time. Isolated symbols do not correspond to misaligned peaks but rather to peaks higher than the peak picking threshold present only in certain datasets.
The alignment procedure described above proved robust to the presence of a large number of non-corresponding peaks between two aligned datasets. A small number of corresponding peaks picked from the datasets proved sufficient to find the optimal parameters of the time shift function. Given the robustness of the alignment procedure, missing peaks or erroneous peak picking do not significantly affect the quality of the alignment.
An iterative two-step alignment procedure proved beneficial for datasets with significant time shifts between corresponding peaks. To achieve a good alignment, a small gap penalty value is desirable to limit the number of non-corresponding peaks that are close enough in corresponding electropherograms to fall within the gap penalty and thus affect the alignment. However, when a small gap penalty is used for the alignment of datasets with large time shifts, the optimization procedure may not find the region of convergence to the global optimum. Therefore, a larger gap penalty value is used first to generate an approximate alignment. The second alignment is then performed with a smaller gap penalty value, starting from the parameters of the time shift function obtained from the first, approximate alignment. The dynamic time warping (DTW) implementation of MathDAMP employs explicit time shift function specification. Generic time shift functions (such as polynomials or splines) may be specified for dataset alignment. Alternatively, time shift functions incorporating a priori knowledge about the expected time shifts, such as the migration time normalization function for CE, may be used. Using explicit time shift functions provides the ability to control the flexibility or rigidity of time warping. This may prove beneficial for some applications, as an improvement in the alignment results of unconstrained DTW through the introduction of rigid slope constraints has been reported. Additionally, other existing alignment algorithms [5, 17–20] that could also be implemented in MathDAMP may hold advantages for specific applications.
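The idea of a coarse alignment followed by a refinement can be illustrated with a deliberately simplified Python sketch. Several things here are assumptions and not taken from the package: peaks are reduced to sorted lists of migration times from a single electropherogram, the time shift function is the linear form t' = offset + slope × t, a plain grid search stands in for the global optimization, and all names, grid ranges, and default gap penalties are hypothetical.

```python
def alignment_cost(reference, sample, gap_penalty):
    """Minimum cost of aligning two sorted peak-time lists (dynamic programming).

    Matching two peaks costs their time difference; leaving a peak
    unmatched costs `gap_penalty`.
    """
    previous = [j * gap_penalty for j in range(len(sample) + 1)]
    for i, ref_time in enumerate(reference, start=1):
        current = [i * gap_penalty]
        for j, sample_time in enumerate(sample, start=1):
            current.append(min(
                previous[j - 1] + abs(ref_time - sample_time),  # match
                previous[j] + gap_penalty,      # reference peak unmatched
                current[j - 1] + gap_penalty,   # sample peak unmatched
            ))
        previous = current
    return previous[-1]


def fit_time_shift(reference, sample, gap_penalty, offsets, slopes):
    """Grid search for the linear time shift with the lowest alignment cost."""
    candidates = ((alignment_cost(reference, [a + b * t for t in sample],
                                  gap_penalty), a, b)
                  for a in offsets for b in slopes)
    _, offset, slope = min(candidates)
    return offset, slope


def two_step_alignment(reference, sample, coarse_gap=2.0, fine_gap=0.2):
    """Approximate fit with a large gap penalty, then refine with a small one."""
    offsets = [-2.0 + 0.5 * k for k in range(9)]
    slopes = [0.8 + 0.05 * k for k in range(9)]
    a0, b0 = fit_time_shift(reference, sample, coarse_gap, offsets, slopes)
    # The refinement searches only the neighborhood of the coarse result.
    offsets = [a0 + 0.05 * k for k in range(-5, 6)]
    slopes = [b0 + 0.005 * k for k in range(-5, 6)]
    return fit_time_shift(reference, sample, fine_gap, offsets, slopes)
```

Because the refinement grid contains the coarse result itself, the second step can never produce a worse cost at the small gap penalty than the first step did. A peak without a counterpart in the other list contributes a single gap penalty, so with a small penalty it has little influence on the fitted parameters.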
As described above, the method to identify differences in profiles is not based on integrated peak lists and thus avoids common quantitative errors introduced by this task. It is important to realize that the described peak picking is used only for the purpose of alignment, which, as described above, is very robust to possible peak picking errors.
The density plot visualizations provide an overall intuitive view of the differences between samples (Figure 2). Peaks appear as colored spots of intensity corresponding to the magnitude of the differences between samples/groups. As described above, multiple alternative approaches for highlighting a difference of interest are available. Since different approaches may possess different strengths and weaknesses, evaluating them in parallel decreases the chance of missing an important difference.
The proper alignment of the datasets is a necessary prerequisite to obtain clear results. Misaligned peaks can lead to ambiguous signals on the density plots (e.g. appearing as red-blue doublets of opposite polarity), but these can fortunately be ruled out as false positives by visually exploring the overlaid electropherograms of the top candidate differences (Figure 3). The confirmation plots are thus an essential and easy way to identify false-positive signals, since at this point there are no simple means to automate this process. Specific sources of ambiguous signals and possible ways to suppress their occurrence, as well as potential strengths and weaknesses of alternative approaches for different kinds of differential analysis, are described in more detail in the respective example notebooks, which are part of the online documentation.
The ability of MathDAMP to identify specific differences in complex metabolite profiles has been successfully demonstrated, leading to the discovery of a biomarker for liver oxidative stress as well as facilitating enzyme activity detection and discovery in known and non-characterized proteins using non-targeted CE-MS analysis of complex substrate cocktails. Overall, the MathDAMP tools can be seen as complementary to other methods that make use of integrated peak lists to find differences in profiles using downstream multivariate analysis. While both approaches may have limitations, MathDAMP can considerably simplify differential analysis of metabolite profiles that are very similar and is very robust to widely varying migration times and irregular peak shape.
The MathDAMP package is capable of highlighting differences between complex metabolite profiles according to various criteria in an automated fashion. Since the whole (preprocessed and normalized) raw datasets are compared, the possibility of loss of information (e.g. due to common but unavoidable mistakes in peak picking or peak identification) is limited. MathDAMP differs from most other existing tools by combining very robust migration time normalization with a point-by-point approach to the identification of differences in profiles. It also facilitates the identification of metabolites and provides multiple ways in which data and differences in profiles can be visualized and analyzed. Moreover, the open architecture of the MathDAMP modules allows users to adjust multiple options to fit particular purposes or types of analytical method and offers extensive customizability for any user willing to manipulate the Mathematica code and quickly implement desirable new features. The current release does not provide quantification of compounds per se, something that is planned for future development. However, it is especially well-suited to highlight subtle differences between complex metabolite profiles.
Availability and requirements
Project name: MathDAMP
Project home page: http://mathdamp.iab.keio.ac.jp/
Operating system(s): Platform independent
Programming language: Mathematica
Other requirements: Mathematica 4.2 or higher
The authors would like to thank Tsuyoshi Iwasaki, Gabor Bereczki, Masahiro Sugimoto, and Yuji Kakazu of the Institute for Advanced Biosciences and Takamasa Ishikawa and Yuki Ueno of Human Metabolome Technologies, Inc. for technical help and support. This work was supported in part by grants from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) including Leading Project for Biosimulation and the 21st Century COE Program entitled "Understanding and Control of Life's Function via Systems Biology" as well as research funds from Tsuruoka City and the Yamagata Prefectural Government.
- Goodacre R, Vaidyanathan S, Dunn WB, Harrigan GG, Kell DB: Metabolomics by numbers: acquiring and understanding global metabolite data. Trends Biotechnol 2004, 22: 245–252. 10.1016/j.tibtech.2004.03.007
- Duran AL, Yang J, Wang L, Sumner LW: Metabolomics spectral formatting, alignment and conversion tools (MSFACTs). Bioinformatics 2003, 19: 2283–2293. 10.1093/bioinformatics/btg315
- Tikunov Y, Lommen A, de Vos CH, Verhoeven HA, Bino RJ, Hall RD, Bovy AG: A Novel Approach for Nontargeted Data Analysis for Metabolomics. Large-Scale Profiling of Tomato Fruit Volatiles. Plant Physiol 2005, 139: 1125–1137. 10.1104/pp.105.068130
- Katajamaa M, Miettinen J, Oresic M: MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 2006, 22: 634–636. 10.1093/bioinformatics/btk039
- Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G: XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 2006, 78: 779–787. 10.1021/ac051437y
- Broeckling CD, Reddy IR, Duran AL, Zhao X, Sumner LW: MET-IDEA: data extraction tool for mass spectrometry-based metabolomics. Anal Chem 2006, 78: 4334–4341. 10.1021/ac0521596
- Nordstrom A, O'Maille G, Qin C, Siuzdak G: Nonlinear data alignment for UPLC-MS and HPLC-MS based metabolomics: quantitative analysis of endogenous and exogenous metabolites in human serum. Anal Chem 2006, 78: 3289–3295. 10.1021/ac060245f
- Katz JE, Dumlao DS, Clarke S, Hau J: A new technique (COMSPARI) to facilitate the identification of minor compounds in complex mixtures by GC/MS and LC/MS: tools for the visualisation of matched datasets. J Am Soc Mass Spectrom 2004, 15: 580–584. 10.1016/j.jasms.2003.12.011
- Shellie RA, Welthagen W, Zrostlikova J, Spranger J, Ristow M, Fiehn O, Zimmermann R: Statistical methods for comparing comprehensive two-dimensional gas chromatography – time-of-flight mass spectrometry results: metabolomic analysis of mouse tissue extracts. J Chromatogr A 2005, 1086: 83–90. 10.1016/j.chroma.2005.05.088
- Ruckstuhl AF, Jacobson MP, Field RW, Dodd JA: Baseline subtraction using robust local regression estimation. J Quant Spectrosc Radiat Transfer 2001, 68: 179–193. 10.1016/S0022-4073(00)00021-2
- Wallace WE, Kearsley AJ, Guttman CM: An Operator-Independent Approach to Mass Spectral Peak Identification and Integration. Anal Chem 2004, 76: 2446–2452. 10.1021/ac0354701
- Soga T, Baran R, Suematsu M, Ueno Y, Ikeda S, Sakurakawa T, Kakazu Y, Ishikawa T, Robert M, Nishioka T, Tomita M: Differential metabolomics reveals ophthalmic acid as an oxidative stress biomarker indicating hepatic glutathione consumption. J Biol Chem 2006, 281: 16768–16776. 10.1074/jbc.M601876200
- Reijenga JC, Martens JH, Giuliani A, Chiari M: Pherogram normalization in capillary electrophoresis and micellar electrokinetic chromatography analyses in cases of sample matrix-induced migration time shifts. J Chromatogr B Analyt Technol Biomed Life Sci 2002, 770: 45–51. 10.1016/S0378-4347(01)00527-8
- Bylund D, Danielsson R, Malmquist G, Markides KE: Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC modelling of liquid chromatography-mass spectrometry data. J Chromatogr A 2002, 961: 237–244. 10.1016/S0021-9673(02)00588-5
- Tomasi G, Andersson C: Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data. J Chemom 2004, 18: 231–241. 10.1002/cem.859
- Nielsen NPV, Carstensen JM, Smedsgaard J: Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. J Chromatogr A 1998, 805: 17–35. 10.1016/S0021-9673(98)00021-1
- Eilers PH: Parametric time warping. Anal Chem 2004, 76: 404–411. 10.1021/ac034800e
- Pierce KM, Wood LF, Wright BW, Synovec RE: A comprehensive two-dimensional retention time alignment algorithm to enhance chemometric analysis of comprehensive two-dimensional separation data. Anal Chem 2005, 77: 7735–7743. 10.1021/ac0511142
- Prince JT, Marcotte EM: Chromatographic Alignment of ESI-LC-MS Proteomics Data Sets by Ordered Bijective Interpolated Warping. Anal Chem 2006, 78: 6140–6152. 10.1021/ac0605344
- Saito N, Robert M, Kitamura S, Baran R, Soga T, Mori H, Nishioka T, Tomita M: Metabolomics approach for enzyme discovery. J Proteome Res 2006, 5: 1979–1987. 10.1021/pr0600576
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.