Statistical methods and resources for biomarker discovery using metabolomics
BMC Bioinformatics volume 24, Article number: 250 (2023)
Metabolomics is a dynamic tool for elucidating biochemical changes in human health and disease. Metabolic profiles provide a close insight into physiological states and are highly sensitive to genetic and environmental perturbations. Variation in metabolic profiles can inform mechanisms of pathology, providing potential biomarkers for diagnosis and assessment of the risk of contracting a disease. With the advancement of high-throughput technologies, large-scale metabolomics data sources have become abundant. As such, careful statistical analysis of intricate metabolomics data is essential for deriving relevant and robust results that can be deployed in real-life clinical settings. Multiple tools have been developed for both data analysis and interpretation. In this review, we survey statistical approaches and corresponding statistical tools that are available for discovery of biomarkers using metabolomics.
Overview of metabolomics
The term metabolome was first coined in 1998 and became widely established in the early 2000s. Metabolomics profiling is a high-throughput technique that quantifies the levels of endogenous metabolites in a sample (biological fluids, tissues, etc.). The study of metabolites, or metabolite profiling, has been gaining popularity in the past decade, thanks to recent advances in analytical platforms such as Fourier-transform infrared spectrometry (FT-IR), nuclear magnetic resonance (NMR), and mass spectrometry (MS) coupled to separation techniques such as gas chromatography (GC–MS), liquid chromatography (LC–MS), Fourier-transform mass spectrometry (FT-MS), ultra-high performance liquid chromatography (UPLC–MS), capillary electrophoresis (CE–MS), inductively coupled plasma (ICP–MS), ion chromatography (IC–MS), etc. Metabolites are key molecules in cellular functions. Many biological disturbances involve a cascade of metabolic changes, making metabolites close descriptors of the phenotype. Two main analytical techniques are used to quantify metabolites (in a cell, tissue, or body fluids): NMR and MS [5,6,7], through a process that can be untargeted or targeted. Untargeted metabolomics is a comprehensive technique measuring all metabolites in a sample without bias, including unknown chemical compounds. It is best suited for hypothesis-generating studies and leads to novel biomarker discovery, although the identification and categorisation of unknown compounds remains a great challenge. Targeted metabolomics, on the other hand, quantifies chemically known and annotated metabolites. Typically, the measured metabolites are labelled by comparing their masses to known compounds from spectral databases, which, in addition to characteristic MS or NMR properties, also contain information about nomenclature, compound concentrations, biological locations, enzymes and mutation data (see Table 1).
Since its introduction, metabolomics has been used in a wide range of applications such as health and disease biomarker and enzyme discoveries, food and nutrition, and plant biotechnology, to name a few. Metabolomics has proven to be a valuable tool in biomedical research, enabling the assessment of disturbances in biological systems caused by environmental factors, aiding in the diagnosis of diseases, and facilitating the identification of biomarkers. Biomarkers, short for biological markers, are objective indicators that provide information about cellular or organismal processes and can be used to characterize patients in a clinical setting. Properties such as high specificity, sensitivity, repeatability, and clinical usefulness are necessary for a good biomarker. The process of biomarker validation entails in vitro and in vivo research followed by clinical trials in human cohorts. Biomarker discovery using metabolomics is considered an improvement over traditional diagnostic approaches due to its sensitivity and specificity. Metabolites have been found to be suitable molecular biomarkers in several studies; for instance, an untargeted metabolomics approach was used to show that non-alcoholic fatty liver diseases (NAFLD), which range in severity from simple steatosis to complex hepatocellular carcinoma, are each characterised by a distinct metabolic profile [13, 14]. Furthermore, metabolomics has shown its potential in the diagnosis, management and early screening of oral cancer, pancreatic cancer, and breast cancer. Additionally, it was shown that recurrence can be monitored using metabolite biomarkers in various cancer patients [18,19,20]. Beyond cancer, metabolomic studies have investigated potential biomarkers associated with fitness, telomere length, cardiovascular demand, steroid profile, etc. in elite athletes.
Other studies evaluated biomarkers of metabolic diseases such as polycystic ovary syndrome, insulin resistance [26,27,28,29], and diabetes (see Table 2 for examples of biomarkers from the mentioned studies). With the recent outbreak of COVID-19, emerging metabolomics data have provided insights into COVID-19 pathogenesis in patients with pre-existing chronic conditions such as diabetes, hypertension, hypothyroidism, etc. and revealed biomarkers linked to mechanisms of disease progression, severity, and side-effects of COVID-19 in affected individuals [31,32,33,34,35,36,37,38].
Early biochemical investigations, in the field of metabolomics, featured a low number of measured analytes to ease the interpretation of results [43, 44]. Today, information systems have matured tremendously, and many tools have been developed to assist in analysing and interpreting high throughput metabolomics data. With the continuous advances in instrumental techniques, adopting the correct statistical approach remains critical for proper interpretation and optimal utilization of data. The purpose of this review is to provide an overview of metabolomics data analysis in current research, with special emphasis on methods available for biomarker discovery in human disease.
Metabolomics: analytical challenges and pre-processing
Like other omics fields, the workflow of metabolomics comprises (i) experimental design, (ii) sample collection and preparation, (iii) data retrieval/acquisition and pre-processing, and (iv) data analysis and interpretation [45, 46]. Experimental design aids in tightening confidence intervals, minimising confounders and controlling the obvious sources of variation. Sample collection, preparation and data retrieval are the stages where systematic and random errors occur, although these can be controlled to some extent via a strict work environment and protocol design. It is during the pre-processing stage that the spectral data are converted to the abundance of metabolites in each sample, a crucial link between raw data measurement and statistical analysis. Typical pre-processing steps include deconvolution, library-based identification, and alignment, which can be performed by a variety of analytical tools (refer to Table 3). For untargeted metabolomics, this step represents a major challenge due to the lack of spectra for the novel metabolites detected. However, methods to characterize the unknowns are being continuously explored. For example, knowledge-guided multi-layer networks (KGMN), developed by Zhou et al., were used in untargeted metabolomics to enable global metabolite identification from knowns to unknowns by integrating a knowledge-based metabolic reaction network, an MS/MS similarity network and a global peak correlation network. A global network optimization approach, NetID, was recently developed by Chen et al. to annotate untargeted LC–MS data. NetID develops chemically meaningful peak-peak correlations, improves peak assignment accuracy, and creates a single network connecting most observed ion peaks, even for peaks missing MS spectra.
Statistical machine learning-based methods are geared towards the identification of unknowns based on feature similarity with the knowns. For instance, (MP-)IOKR, MetFrag and CSI:FingerID employ fragmentation trees to learn rules for subclustering of metabolites. Methods like MetFusion were developed to provide access to large spectral databases such as MassBank, allowing for improved optimization of predictive models.
Statistics in metabolomics
In addition to the analysis challenges encountered with omics data in general, such as high variable dimensionality and intercorrelation, metabolomics data are particularly prone to noise and can be influenced by environmental factors, diet and exercise, as well as sample handling and batch measurement. Metabolomics data are also characterised by a greater extent of missingness, which can complicate multivariate analysis and classification techniques. As a consequence, careful application of appropriate statistical methods is required; otherwise, crucial information may be lost or false trends/models may be identified.
The format of metabolomics data is typically a data matrix, with metabolite abundances and samples given in columns and rows or vice versa. Even though metabolomics profiling is highly sought after, there are no standard protocols established for the statistical analysis of the data produced. In this review, we discuss some of the statistical approaches widely adopted in recent studies. A simple schematic representation of the steps involved in metabolomics data analysis is depicted in Fig. 1.
The metabolomics data matrix is prone to elevated metabolite missingness for several reasons, most notably the inability to measure metabolite levels below the detection limit, as well as technical errors such as peak misalignment or metabolite structural instability. General statistical techniques for multiple imputation have traditionally been applied to metabolomics data, but more tailored approaches that acknowledge the frequent non-random pattern of missingness in metabolomics have recently been developed. One example is MetabImpute, an R package that can assess whether values are missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). Metabolites with a high proportion of missing values are commonly filtered out before imputation; there is no consensus on the right filter percentage, but cut-offs have traditionally been chosen within the range of 20–50% of metabolite missingness [69, 70]. Imputation is crucial when multivariate techniques, including classification, are applied to metabolomics data.
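The filtering and imputation step can be illustrated with a minimal sketch. It assumes missing values are encoded as `None`, uses a hypothetical 30% cut-off (a value inside the 20–50% range quoted above), and fills remaining gaps with half of the smallest observed value, a simple stand-in for values below the detection limit. The function name and the imputation rule are illustrative assumptions and do not reproduce the MetabImpute method.

```python
def filter_and_impute(matrix, max_missing=0.3):
    """Filter sparse metabolites, then impute the remaining gaps.

    matrix: dict mapping metabolite name -> list of abundances (None = missing).
    max_missing: hypothetical cut-off for the tolerated missing fraction.
    """
    cleaned = {}
    for name, values in matrix.items():
        observed = [v for v in values if v is not None]
        missing_fraction = 1 - len(observed) / len(values)
        if not observed or missing_fraction > max_missing:
            continue  # too sparse to impute reliably, so drop the metabolite
        fill = min(observed) / 2  # half-minimum imputation (assumed rule)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


# Made-up example values, one list entry per sample.
example = {
    "glucose": [5.1, 4.8, None, 5.5],      # 25% missing: kept and imputed
    "unknown_x": [None, None, 0.2, None],  # 75% missing: dropped
}
result = filter_and_impute(example)
```

Half-minimum imputation is only reasonable when missingness is driven by the detection limit (MNAR); for values missing at random, a model-based imputation would usually be preferred.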
Complex metabolomics data are heteroscedastic and right-skewed and require normalization. The go-to method for correcting the skewness is log-transformation. Furthermore, filtering of overly heterogeneous or bad-quality samples is good practice to avoid the propagation of errors throughout the dataset and can be achieved by means of multivariate techniques such as principal component analysis and clustering. Data normalization, based on aligning the median or, more generally, quantiles, is crucial to eliminate unwanted between-sample variation. It should be noted that using an inappropriate method to normalize/transform the data will produce poor results and may affect the ranking of relevant metabolites. Additionally, data points should only be removed if there are valid biological justifications for considering them outliers. It is possible to produce a model that seems to work well by excluding difficult-to-model data points, but that is not actually representative of the real biological system.
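Log-transformation followed by median alignment can be sketched in a few lines. The example assumes strictly positive abundances and represents each sample as a plain list; the function names and the toy values are illustrative only.

```python
import math
import statistics


def log_transform(sample):
    """Natural-log transform of strictly positive abundances."""
    return [math.log(v) for v in sample]


def median_align(samples):
    """Shift each log-scale sample so that all samples share the same median."""
    medians = [statistics.median(s) for s in samples]
    target = statistics.mean(medians)
    return [[v - m + target for v in s] for s, m in zip(samples, medians)]


raw = [
    [10.0, 100.0, 1000.0],  # sample A
    [20.0, 200.0, 2000.0],  # sample B: same profile, twice the overall signal
]
normalized = median_align([log_transform(s) for s in raw])
```

After alignment the two toy samples coincide, because on the log scale a constant dilution factor becomes an additive shift that the median alignment removes.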
Two main statistical approaches are available for metabolite differential level analysis: univariate and multivariate. Combination of both methodologies is common to metabolomic biomarker-based studies but this review shall focus on the advantages and increased power gained from multivariate analysis (MVA).
MVA is an essential part of metabolomics data analysis. Biological systems are not limited to single variable changes between healthy and diseased states. Investigation of system level changes is pivotal to deriving definitive conclusions about a certain condition and its potential biomarkers. MVA techniques incorporate all variables simultaneously and assess the relationships among them as well as their joint contribution to the phenotype under study.
Unsupervised and supervised models of multivariate analysis are currently employed in metabolomics. One popular unsupervised technique is principal component analysis (PCA), which identifies orthogonal components in the data based on linear combinations of correlated features. Due to its unsupervised nature, PCA serves little purpose in biomarker discovery on its own. However, PCA components are often fed into univariate tests as a means of correcting for hidden, unmeasured confounder effects. Moreover, PCA is often used as a checkpoint during QC to screen for outlier data points. For example, when Al-Khelaifi et al. conducted PCA to obtain a global perspective of the data, they noted that PC1 captured the extent of haemolysis between the samples, while PC2 suggested an effect of exercise. Incorporation of these components in the regression model greatly improved the detection of marker metabolites in association with the biological groups of interest to their study.
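To make the idea of a principal component concrete, the sketch below extracts the first component of a small mean-centered matrix by power iteration on the covariance matrix. This is a pure-Python illustration for tiny datasets, not the algorithm used by any particular tool; real analyses would rely on an established numerical library. The function name and toy data are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit loading vector, sample scores) for PC1.

    rows: list of samples, each a list of feature values.
    Uses power iteration on the sample covariance matrix.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    vec = [1.0] * p  # starting guess; assumed not orthogonal to PC1
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    scores = [sum(c[j] * vec[j] for j in range(p)) for c in centered]
    return vec, scores


# Toy data: the second feature is exactly twice the first.
loadings, scores = first_principal_component(
    [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
)
```

The per-sample `scores` are the quantities that would be added as covariates to a regression model when correcting for hidden confounders, as in the haemolysis example above.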
Partial least squares discriminant analysis (PLS-DA) is a supervised MVA technique that has been incorporated in numerous metabolomics studies for the discovery of biomarkers in different health conditions [75,76,77]. PLS-DA seeks an optimal decomposition of the predictor variables X that best explains the response variable Y [73, 78, 79]. An extension of PLS-DA called OPLS-DA (“orthogonal” PLS-DA) has also gained popularity [80,81,82,83,84]. This model partitions the variance into parts that are predictive of the experimental groups and parts that are uncorrelated with them (such as noise), also referred to as ‘orthogonal’ [85,86,87]. OPLS-DA therefore yields models that are easier to interpret than those of PLS-DA [88, 89]. Once the PLS/OPLS-DA model is built, the VIP (variable influence on projection) measure can be obtained for the metabolites based on their association with the identified predictive components. Certain studies use VIP > 1 as a threshold and select the metabolites for further analysis using linear regression models to correct for measured confounders.
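The VIP measure can be computed from the component weights of a fitted PLS model and the amount of Y-variance each component explains. The sketch below uses the commonly cited definition, VIP_j = sqrt(p · Σ_a SS_a (w_ja / ‖w_a‖)² / Σ_a SS_a); the input values are made up and would in practice come from a fitted model, and the function name is an assumption rather than the API of any package.

```python
import math


def vip_scores(weights, explained_ss):
    """Variable influence on projection for each of p variables.

    weights: list of components, each a list of p variable weights (w_a).
    explained_ss: Y-variance explained by each component (SS_a).
    """
    p = len(weights[0])
    total_ss = sum(explained_ss)
    scores = []
    for j in range(p):
        acc = 0.0
        for w, ss in zip(weights, explained_ss):
            norm_sq = sum(x * x for x in w)
            acc += ss * (w[j] ** 2) / norm_sq  # share of component a due to j
        scores.append(math.sqrt(p * acc / total_ss))
    return scores


# Hypothetical one-component model with two metabolites.
vip = vip_scores(weights=[[3.0, 4.0]], explained_ss=[1.0])
selected = [j for j, v in enumerate(vip) if v > 1]
```

With this definition the squared VIP values average to 1 across variables, which is why VIP > 1 is a natural threshold for an above-average contribution.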
Support vector machine (SVM) is a supervised machine-learning algorithm that can be used for regression and classification of non-linear data. SVM can detect non-linear relationships in the data that do not comply with the assumptions of PCA and OPLS, making it versatile. It identifies support vectors, the samples on the margin between two classes, to search for a maximum-margin hyperplane. The use of kernels simplifies the separation of classes in difficult cases by providing a non-linear solution in the original space. SVM has different extensions for classification with overlapping groups, multi-class classification, regression, and other specializations. Importantly, once the separating hyperplane is found, feature importance values can be derived, which can aid in biomarker discovery. However, basic SVM algorithms are not time-efficient when tuning complex separating hyperplanes, as they do not take into account prior knowledge about the probability of class membership. Although SVM often produces good results, faster and more stable methods can outperform it [90, 91]. A further limitation of SVM is that it is natively restricted to binary classification. Alternative methods have been proposed to extend SVM to multi-class problems, but these models are built by breaking the dataset down into units of binary groups, which causes oversimplification and may lead to uninformative models.
Random forest (RF) is a supervised machine learning method based on decision trees. It is considered an excellent classifier for its ease of implementation, speed, stability, robustness against overfitting and, most importantly, its ability to handle datasets with imbalanced classes/groups. Developed by Breiman, RF is a combination of decision trees, with each tree trained using a random subset of the data and input features. The algorithm uses a bootstrap sampling technique to select the data subsets. A simple RF with random features is created by randomly selecting a small group of input variables (with a fixed size) at each node to split on. The RF algorithm is highly adaptable to real-world datasets as it remains unaffected by scaling and normalization. However, major challenges include the need for extensive tuning of default parameters by the researcher to produce the best model, and the difficulty of visualizing the decision trees [95, 96]. Importantly, whilst classifying samples, RF performs a variable selection step, which helps reduce the search space and aids in pinpointing candidate metabolite biomarkers.
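The two sources of randomness described above, bootstrap sampling of the samples for each tree and random selection of a fixed-size feature subset at each split, can be sketched as follows. This shows only the sampling logic, not tree growing or voting; the function names and sizes are illustrative assumptions.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; return (in-bag, out-of-bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features, subset_size, rng):
    """Pick the candidate features considered at a single tree node."""
    return sorted(rng.sample(range(n_features), subset_size))


rng = random.Random(42)  # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_sample(n_samples=20, rng=rng)
candidates = random_feature_subset(n_features=100, subset_size=10, rng=rng)
```

The out-of-bag samples are never seen by the tree trained on the corresponding bootstrap sample, so they can serve as an internal test set for estimating classification error.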
Variational autoencoders (VAE) are an unsupervised deep learning method that operates by encoding input data into a non-linear, lower-dimensional latent space that can be used to reproduce the original data with minimal loss of information. VAE has recently been advocated for use with metabolomics data to learn transferable latent representations, which can help expose clusters of samples with specific metabolite levels.
The classification methods outlined in this review can be prioritised based on the research question and the characteristics of the data. SVM can handle binary and multinomial data with non-linear relationships between variables. RF, on the other hand, is compatible with continuous and categorical data and is used to create an ensemble of decision trees that can capture complex interactions between features, while being robust to outliers and to the choice of normalization technique. VAE is notably novel in the field of metabolomics, and any added advantage of its use is yet to be shown.
OPLS/OPLS-DA is an excellent choice for small-sized and highly correlated data with few groups of samples. It can explain most of the variation in the data by reducing the high dimensionality into predictive and orthogonal latent variables. It handles the missing values in the data efficiently and is robust to outliers. One can argue that both RF and OPLS-DA are a good starting point for exploring metabolomics data due to their ease of use and interpretability. Table 4 provides an overview of the methods, with their strengths and weaknesses, to be considered in metabolomics data analysis.
It is important to note that classification, prediction and biomarker discovery methods for metabolomics data extend to other models, including logistic regression models, LASSO, CCPLS, ASCA+ and APCA+ (extensions of ANOVA to multivariate classes), multivariate curve resolution (MCR), neural networks, Gaussian mixture modelling, etc. More details about these methods and how they have been deployed in the field of metabolomics can be found in [99,100,101,102,103].
Validation of model performance
Several metrics exist for the assessment of model performance. With OPLS and PLS models, typical measures are R2, which captures the goodness of fit, and Q2, which measures the predictive ability of the model, defined as the agreement between cross-validated predictions and the original data. OPLS further splits R2X into R2Xp and R2Xo, which respectively measure the explained sum of squares of the Y-predictive and Y-uncorrelated parts of X. A Q2 > 0.4 is taken to indicate satisfactory predictability of the model [105, 106]. Q2 and R2 values closer to 1 indicate a more reliable model, while a large discrepancy between the two scores indicates an unreliable model. Permutation tests compare Q2 against values obtained from models built on shuffled class labels and make it possible to calculate significance (p-values) for these MV models [108,109,110].
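R2 and Q2 share the same form, one minus the ratio of a residual sum of squares to the total sum of squares; they differ only in whether the predictions come from the fitted model or from cross-validation. The sketch below uses made-up predictions and one common convention for the permutation p-value (adding one to numerator and denominator); the function names are assumptions.

```python
def explained_fraction(observed, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    total_ss = sum((y - mean_obs) ** 2 for y in observed)
    residual_ss = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    return 1 - residual_ss / total_ss


def permutation_p_value(observed_stat, permuted_stats):
    """Share of label-permuted models scoring at least as well as the real one."""
    hits = sum(1 for s in permuted_stats if s >= observed_stat)
    return (hits + 1) / (len(permuted_stats) + 1)


y = [0.0, 0.0, 1.0, 1.0]                # class labels coded as 0/1
fitted = [0.1, 0.2, 0.8, 0.9]           # predictions on training samples
cross_validated = [0.3, 0.4, 0.6, 0.7]  # predictions on held-out samples
r2 = explained_fraction(y, fitted)            # goodness of fit
q2 = explained_fraction(y, cross_validated)   # predictive ability
p_value = permutation_p_value(q2, [0.1, -0.2, 0.6, 0.0])
```

In this toy case R2 is about 0.9 and Q2 about 0.5; a gap of that size is the kind of discrepancy that should prompt a closer look at possible overfitting.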
The Brier score is another performance measure, which assesses the accuracy of binary outcome predictions by calculating the mean squared difference between the actual outcome and the predicted probability. A perfect model has a score of 0, and a non-informative model that always predicts a probability of 0.5 has a score of 0.25. Harrell’s C-index is a performance measure used with survival analyses. The index is closely related to Kendall's tau statistic, depends on the censoring distribution, and considers the rankings of pairs of subjects in the data. The index ranges from 0 to 1 (indicating worst to best performance), and a value of 0.6 or higher is considered acceptable for clinical datasets.
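Both measures are short enough to write out directly. The sketch below is a simplified illustration: the C-index version treats a pair as comparable only when the subject with the shorter follow-up time had an observed event, and it ignores tied follow-up times. Function names and data are assumptions.

```python
def brier_score(outcomes, probabilities):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((p - y) ** 2
               for y, p in zip(outcomes, probabilities)) / len(outcomes)


def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs whose risk ordering matches survival.

    times: follow-up times; events: 1 if the event was observed, 0 if censored;
    risk_scores: higher score means higher predicted risk.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


uninformative = brier_score([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
c_perfect = harrell_c_index([1, 2, 3, 4], [1, 1, 0, 1], [4, 3, 2, 1])
```

A model that assigns the same risk score to every subject obtains a C-index of 0.5 under this definition, which is the reference point for random ranking.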
The receiver operating characteristics (ROC) curve analysis assesses the specificity and sensitivity of a potential biomarker by plotting the true positive rate (y axis) as a function of the false positive rate (x axis). It produces the area under the curve (AUC) measure that indicates the ability of a biomarker to distinguish between two study groups. Multivariate receiver operating characteristic analysis (MultiROC) is an extension of ROC analysis that allows for different combinations of biomarkers to be clinically explored [114, 115] and is compatible with the inherent nature of multivariate classifiers such as PLS/OPLS-DA models.
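The construction of a ROC curve and its AUC can be sketched for a single candidate biomarker: every observed value is tried as a threshold, and the area is obtained by the trapezoidal rule. Labels are assumed to be coded 0/1 with higher scores indicating the positive group; names and values are illustrative.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs over all thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [0, 0, 1, 1]               # 0 = control, 1 = case
biomarker = [0.1, 0.4, 0.35, 0.8]   # made-up metabolite levels
curve = roc_points(labels, biomarker)
area = auc(curve)
```

An AUC of 0.5 corresponds to a biomarker that separates the groups no better than chance, and 1.0 to perfect separation.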
There are other cross-validation procedures employed in predictive analysis such as leave-n-out, Monte Carlo cross-validation (MCCV), corrected-MCCV (CMCCV) etc. For detailed information, readers are referred to Sammut et al. and Xu et al. [116, 117].
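As an illustration of Monte Carlo cross-validation, the sketch below generates repeated random train/test partitions of the sample indices. It covers only the splitting step; the function name, the 30% test fraction and the number of repeats are assumptions chosen for the example.

```python
import random


def monte_carlo_splits(n_samples, n_repeats, test_fraction, seed=0):
    """Yield (train_indices, test_indices) for repeated random hold-out splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


splits = list(monte_carlo_splits(n_samples=10, n_repeats=5, test_fraction=0.3))
```

Unlike k-fold cross-validation, the test sets of different repeats may overlap, so the number of repeats can be chosen independently of the test-set size.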
The metrics outlined above have been instrumental in assessing the performance of MV classification methods to ensure validity and reliability of the results. For example, a study by Chen et al. compared four classifiers, PCA, SVM, LDA and RF, using several methods including cross-validation, R2/Q2 plots, ROC curves and Pearson correlation. RF was found to be associated with better performance with respect to sample classification and biomarker selection.
Tools available for the statistical analysis of metabolomics data
Several tools are available for data analysis in metabolomics. The tools required for highly intricate metabolomics data analysis should be able to handle the large data size, perform pre-processing steps adequately, conduct statistical methods to identify significantly different metabolites, and provide striking visualization techniques such as heatmaps, correlation and pathway networks. We intend to cover some of the widely used tools that provide data pre-processing, univariate and multivariate methodologies used for biomarker discovery. Table 3 provides a quick view of the methods available in the tools discussed below.
(i) MetaboAnalyst: Extensive web-based toolkit for complete analysis of metabolomics data. It provides multiple statistical workflows for one-factor, two-factor/time-series and meta-analysis data formats, which include univariate (t-tests, ANOVA) and multivariate (PCA, PLS-DA, OPLS-DA) methods. The latest version (MetaboAnalyst 5.0) is more user-friendly than its predecessor. It contains a biomarker discovery option using ROC analyses, with straightforward data input and user-defined options for pre-processing steps and normalization. This web-based platform has been utilized in various studies for biomarker identification due to its ease of use [119,120,121,122,123].
(ii) MZmine 3:
Built on the success and popularity of MZmine 2, MZmine 3 is an open-source platform for data pre-processing and analyses with LC-MS in mind. The updated version has focused on improving the user-friendly graphics with the original eight modules.
To our knowledge, PhenoMeNal is a comprehensive and unmatched tool that brings metabolomics to cloud computing after Galaxy. The ongoing generation of immense volumes of data requires cloud-based tools that reduce the load on personal or workplace environments by storing the data in cloud space. Data analysis tools are tested and stored as Docker containers. PhenoMeNal has successfully developed sophisticated data analysis workflows, which reduce the burden on the researcher.
(v) SECIMTools (SouthEast Center for Integrated Metabolomics):
Designed to complement both of the previous Galaxy metabolomics tools, Galaxy-M and Workflow4metabolomics, SECIMTools starts from a table of features and proceeds through quality control (QC) and advanced statistical assessment. It has four major functionalities: data pre-processing, QC, data analysis and utilities. A guide to using the Galaxy interface of SECIMTools can be found at https://ctsi-secim.sites.medinfo.ufl.edu/files/2015/08/7_7_2015_Galaxy_UserGuide.pdf
Developed by Sartorius AG, SIMCA is the tool of choice for multivariate analysis in many studies. It is user-friendly, offers multiple interactive visualization methods, and can fit models that best suit the data at hand, perform ROC analysis and analyse multiple datasheets, to name a few features. For metabolomics, it makes the investigation of metabolites with significantly different abundances and of metabolite pathways (if present in the datasheet) associated with experimental groups, the examination of relationships between variables and the quick identification of potential biomarkers relatively easy for non-programmers. SIMCA contains built-in cross-validation steps that provide the predictive ability of the model. However, this tool is not suited to univariate analysis, is not open source, and requires purchase of a license prior to use.
(vii) R (R Foundation for Statistical Computing, Vienna, Austria) packages for metabolomics:
For statisticians who are well-versed in programming languages, R is the best option for metabolomics data analysis as it provides a more flexible work environment as opposed to rigid online tools with limited user-defined options. There are several packages for normalization, imputation, univariate hypothesis testing, multivariate exploratory analysis in R.
A powerful R-based tool for processing LC-MS data, using retention time correction, peak identification and matching to derive the necessary information. It can be combined with base R functions to perform all the statistical methods needed for a comprehensive data analysis.
The R package corresponding to the web-based MetaboAnalyst, with more adjustable programming features that give the user autonomy over metabolomics data analysis.
(c) MAIT (Metabolite Automatic Identification Toolkit):
Provides a comprehensive end-to-end analysis for LC-MS data, although it is more suited to peak identification and annotation. Parametric and non-parametric univariate statistical tools and multivariate analyses such as PLS-DA are available with a user-defined grouping option.
Performs simple t-tests, ANOVA and PCA, and combines functional information and the associated gene names of the metabolites in the dataset using KEGG. It was developed for inexperienced R users to analyze metabolite count data. The input must contain KEGG IDs for the data to be processed. The package contains multiple visualization techniques such as boxplots, heatmaps, volcano plots, etc.
Multi-level analysis is available in this package, which includes pre-processing, metabolite annotation, uni- and multivariate analyses, ML (machine learning) and selection of significant features.
Quality control processing and statistical analysis of metabolomics, lipidomics and proteomics data can be performed using pmartR. Analyses such as transformation, normalization, simple univariate tests, summarising PCA and correlation analyses are available.
Built with non-programmers in mind, muma provides user-friendly stepwise univariate and multivariate analysis in R. Data pre-processing, imputation, data exploration through various visualizations and statistical analysis are available in this package.
Limitations of statistics in biomarker discovery
Biomarkers are measured indicators of biological and/or pathogenic processes, or of response to therapies. Metabolite biomarkers can be quantified more cheaply than other types of biomarkers. The number of metabolite biomarkers discovered is increasing rapidly owing to improvements in analytical procedures, but most are not in practical use due to limitations in experimental design, statistical rigor, and efficacy [130, 131]. Biomarkers in clinical practice should be easy to quantify and should bring value in relation to early detection of disease, improvement in treatment outcomes, reduction in the reliance on expensive treatment options, or decrease in disease-related fatalities. Unfortunately, appropriate biomarkers with appreciable specificity and sensitivity are hard to come by. Combining a variety of distinct biomarkers is one possibility for improving the overall specificity [132, 133]. Present-day metabolomics has benefited substantially from improved study designs that have reduced demographic differences and sources of bias. This approach has been applied to all sorts of study designs, including interventional, observational, and multi-tiered designs. Studies enrolling participants with balanced demographic attributes in a multi-cohort setting should have a sufficient sample size to meet the requirements for adequate statistical power. Improved prospective trials are required to verify the ability of biomarkers to detect physiological changes before onset of the phenotype. Validation of biomarkers has so far been carried out in small, uncontrolled trials. Large-scale validation remains inadequate, leading to very few metabolomics biomarkers finding their way into clinical practice [136, 137]. More insights on ways in which metabolomics research can be advanced to meet the challenges of biomarker discovery can be found in Poste et al.
This mini review has introduced the reader to standard methodologies and easy-to-use tools for the analysis of metabolomics datasets and biomarker discovery. Metabolite biomarkers are attracting growing interest in the omics field as they depict a phenotype that is as close as possible to the physiological or pathological state. In the future, we expect the evolution of existing statistical methods to provide even deeper insights into metabolite biomarkers from the larger perspective of systems biology and precision medicine. In this context, biomarkers identified using multi-omics techniques can broaden the scope of individualized treatment plans by providing markers for patient stratification, early diagnosis, prediction, progression monitoring, etc. To this end, advanced statistical and machine learning methods are being developed to provide effective approaches for multi-omics data integration. Aligning the biological information from multi-level omics analysis has the advantage of reducing noise and provides an extra level of biomarker validation. More importantly, integration with genotype data can help distinguish biomarkers associated with causal effects from those of a secondary nature, which occur because of the disease or pathology of interest, as well as from those contributed by the environment. Methods for stratification of patients into homogeneous groups with unified analyte levels, such as supervised biclustering [132, 133], have recently been applied in the field of transcriptomics and offer an interesting opportunity for metabolomics to enter the field of precision medicine.
In parallel to technological advancement, progress in computational and statistical analysis is also required to tackle some of the remaining limitations in the field of metabolomics, notably with regard to annotation/identification of unknown compounds with untargeted metabolomics. Machine learning approaches are of great value in this respect and can offer better performance with improved and more accurate information on compound masses, retention time, fragment mass spectra, and isotopic properties.
It should be noted that all statistical methods incorporated in the field of omics are simply hypothesis creators, essentially shortening a seemingly limitless list of metabolites to a manageable set whose properties and merits should be evaluated by downstream experimental work. Standardization of validation protocols including replication and experimental validation in animal models is essential for metabolite biomarkers to make their way to pre-clinical settings.
Availability of data and materials
Abbreviations
Area under the curve
Capillary electrophoresis–mass spectrometry
Corrected Monte Carlo cross-validation
Fourier transform–mass spectrometry
Gas chromatography–mass spectrometry
Inductively coupled plasma mass spectrometry
Ion chromatography–mass spectrometry
Knowledge-guided multi-layer networks
Kyoto encyclopedia of genes and genomes
Liquid chromatography–mass spectrometry
Linear discriminant analysis
Monte Carlo cross-validation (MCCV)
Nuclear magnetic resonance
Receiver operating characteristics
Ultra-high performance liquid chromatography–mass spectrometry
Oliver SG, et al. Systematic functional analysis of the yeast genome. Trends Biotechnol. 1998;16(9):373–8.
Griffin JL. The Cinderella story of metabolic profiling: Does metabolomics get to go to the functional genomics ball? Philos Trans R Soc Lond B Biol Sci. 2006;361(1465):147–61.
Clish CB. Metabolomics: an emerging but powerful tool for precision medicine. Cold Spring Harb Mol Case Stud. 2015;1(1):a000588.
Macedo AN, et al. Analytical platforms for mass spectrometry-based metabolomics of polar and ionizable metabolites. Adv Exp Med Biol. 2021;1336:215–42.
Schrimpe-Rutledge AC, et al. Untargeted metabolomics strategies-challenges and emerging directions. J Am Soc Mass Spectrom. 2016;27(12):1897–905.
Wang JH, Byun J, Pennathur S. Analytical approaches to metabolomics and applications to systems biology. Semin Nephrol. 2010;30(5):500–11.
Johnson CH, Ivanisevic J, Siuzdak G. Metabolomics: beyond biomarkers and towards mechanisms. Nat Rev Mol Cell Biol. 2016;17(7):451–9.
Wishart DS, et al. HMDB 5.0: the human metabolome database for 2022. Nucl Acids Res. 2021;50(D1):D622–31.
Guijas C, et al. METLIN: a technology platform for identifying knowns and unknowns. Anal Chem. 2018;90(5):3156–64.
Gomez-Casati DF, Zanor MI, Busi MV. Metabolomics in plants and humans: applications in the prevention and diagnosis of diseases. Biomed Res Int. 2013;2013:792527.
Strimbu K, Tavel JA. What are biomarkers? Curr Opin HIV AIDS. 2010;5(6):463–6.
Kotlowska A, Szefer P. Recent advances and challenges in steroid metabolomics for biomarker discovery. Curr Med Chem. 2019;26(1):29–45.
Masarone M, et al. Untargeted metabolomics as a diagnostic tool in NAFLD: discrimination of steatosis, steatohepatitis and cirrhosis. Metabolomics. 2021;17(2):12.
Masoodi M, et al. Metabolomics and lipidomics in NAFLD: biomarkers and non-invasive diagnostic tests. Nat Rev Gastroenterol Hepatol. 2021;18(12):835–56.
Ishikawa S, et al. Identification of salivary metabolomic biomarkers for oral cancer screening. Sci Rep. 2016;6:31520.
Kobayashi T, et al. A novel serum metabolomics-based diagnostic approach to pancreatic cancer. Cancer Epidemiol Biomarkers Prev. 2013;22(4):571–9.
Tenori L, et al. Serum metabolomic profiles evaluated after surgery may identify patients with oestrogen receptor negative early breast cancer at increased risk of disease recurrence. Results from a retrospective study. Mol Oncol. 2015;9(1):128–39.
Loras A, et al. Bladder cancer recurrence surveillance by urine metabolomics analysis. Sci Rep. 2018;8(1):9172.
Zhang F, et al. Metabolomics for biomarker discovery in the diagnosis, prognosis, survival and recurrence of colorectal cancer: a systematic review. Oncotarget. 2017;8(21):35460–72.
Alberice JV, et al. Searching for urine biomarkers of bladder cancer recurrence using a liquid chromatography-mass spectrometry and capillary electrophoresis-mass spectrometry metabolomics approach. J Chromatogr A. 2013;1318:163–70.
AlMuraikhy S, et al. Comparing the metabolic profiles associated with fitness status between insulin-sensitive and insulin-resistant non-obese individuals. Int J Environ Res Public Health. 2022;19(19).
Al-Muraikhy S, et al. Metabolic signature of leukocyte telomere length in elite male soccer players. Front Mol Biosci. 2021;8:727144.
Al-Khelaifi F, et al. Metabolic profiling of elite athletes with different cardiovascular demand. Scand J Med Sci Sports. 2019;29(7):933–43.
Tarkhan AH, et al. Comparing metabolic profiles between female endurance athletes and non-athletes reveals differences in androgen and corticosteroid levels. J Steroid Biochem Mol Biol. 2022;219:106081.
Diboun I, et al. Metabolomic profiling of pregnancies with polycystic ovary syndrome identifies a unique metabolic signature and potential predictive biomarkers of low birth weight. Front Endocrinol (Lausanne). 2021;12:638727.
Diboun I, et al. Metabolomics of lean/overweight insulin-resistant females reveals alterations in steroids and fatty acids. J Clin Endocrinol Metab. 2021;106(2):e638–49.
Al-Sulaiti H, et al. Metabolic signature of obesity-associated insulin resistance and type 2 diabetes. J Transl Med. 2019;17(1):348.
Al-Sulaiti H, et al. Triglyceride profiling in adipose tissues from obese insulin sensitive, insulin resistant and type 2 diabetes mellitus individuals. J Transl Med. 2018;16(1):175.
Helaleh M, et al. Association of polybrominated diphenyl ethers in two fat compartments with increased risk of insulin resistance in obese individuals. Chemosphere. 2018;209:268–76.
Diboun I, et al. Metabolic profiling of pre-gestational and gestational diabetes mellitus identifies novel predictors of pre-term delivery. J Transl Med. 2020;18(1):366.
Song JW, et al. Omics-driven systems interrogation of metabolic dysregulation in COVID-19 pathogenesis. Cell Metab. 2020;32(2):188–202.e5.
Shen B, et al. Proteomic and metabolomic characterization of COVID-19 patient sera. Cell. 2020;182(1):59–72.e15.
Hasan MR, Suleiman M, Pérez-López A. Metabolomics in the diagnosis and prognosis of COVID-19. Front Genet. 2021;12:721556.
Shi D, et al. The serum metabolome of COVID-19 patients is distinctive and predictive. Metabolism. 2021;118:154739.
Dogan HO, et al. Understanding the pathophysiological changes via untargeted metabolomics in COVID-19 patients. J Med Virol. 2021;93(4):2340–9.
Elrayess MA, et al. Metabolic signatures of type 2 diabetes mellitus and hypertension in COVID-19 patients with different disease severity. Front Med (Lausanne). 2021;8:788687.
Diboun I, et al. Identification of prognostic metabolomic biomarkers at the interface of mortality and morbidity in pre-existing TB cases infected with SARS-CoV-2. Front Cell Infect Microbiol. 2022;12:929689.
Taleb S, et al. Predictive biomarkers of intensive care unit and mechanical ventilation duration in critically-ill coronavirus disease 2019 patients. Front Med (Lausanne). 2021;8:733657.
Zhu Q, et al. Metabolomic analysis of exosomal-markers in esophageal squamous cell carcinoma. Nanoscale. 2021;13(39):16457–64.
Morine Y, et al. Essential amino acids as diagnostic biomarkers of hepatocellular carcinoma based on metabolic analysis. Oncotarget. 2022;13(1):1286.
Liesenfeld DB, et al. Metabolomics and transcriptomics identify pathway differences between visceral and subcutaneous adipose tissue in colorectal cancer patients: the ColoCare study. Am J Clin Nutr. 2015;102(2):433–43.
Zhang C, et al. Metabolomic profiling identified serum metabolite biomarkers and related metabolic pathways of colorectal cancer. Dis Markers. 2021;2021:6858809.
Bhattacharya M, et al. Single-run separation and detection of multiple metabolic intermediates by anion-exchange high-performance liquid chromatography and application to cell pool extracts prepared from Escherichia coli. Anal Biochem. 1995;232(1):98–106.
Tweeddale H, Notley-McRobb L, Ferenci T. Effect of slow growth on metabolism of Escherichia coli, as revealed by global metabolite pool (“Metabolome”) analysis. J Bacteriol. 1998;180(19):5109–16.
Manchester M, Anand A. Metabolomics: Strategies to define the role of metabolism in virus infection and pathogenesis. Adv Virus Res. 2017;98:57–81.
Nalbantoglu S. Metabolomics: basic principles and strategies. In: Molecular medicine. IntechOpen; 2019.
Korman A, et al. Statistical methods in metabolomics. Methods Mol Biol. 2012;856:381–413.
Mastrangelo A, et al. From sample treatment to biomarker discovery: a tutorial for untargeted metabolomics based on GC-(EI)-Q-MS. Anal Chim Acta. 2015;900:21–35.
Zhou Z, et al. Metabolite annotation from knowns to unknowns through knowledge-guided multi-layer metabolic networking. Nat Commun. 2022;13(1):6656.
Chen L, et al. Metabolite discovery through global annotation of untargeted metabolomics data. Nat Methods. 2021;18(11):1377–85.
Brouard C, et al. Magnitude-preserving ranking for structured outputs. In: Zhang M-L, Noh Y-K, editors. Proceedings of the Ninth Asian Conference on Machine Learning. Proceedings of Machine Learning Research: PMLR; 2017. p. 407–22.
Ruttkies C, Neumann S, Posch S. Improving MetFrag with statistical learning of fragment annotations. BMC Bioinformatics. 2019;20(1):376.
Dührkop K, et al. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci U S A. 2015;112(41):12580–5.
Gerlich M, Neumann S. MetFusion: integration of compound identification strategies. J Mass Spectrom. 2013;48(3):291–8.
Horai H, et al. MassBank: a public repository for sharing mass spectral data for life sciences. J Mass Spectrom. 2010;45(7):703–14.
MetaboAnalyst 5.0. Available from: https://www.metaboanalyst.ca/.
MZmine 3. Available from: http://mzmine.github.io/.
MetaboLyzer. Available from: https://sites.google.com/a/georgetown.edu/fornace-lab-informatics/home/metabolyzer.
PhenoMeNal. Available from: https://phenomenal-h2020.eu/home/.
SECIMTools. Available from: http://secim.ufl.edu/secim-tools/secim-galaxy-tools/.
XCMS Online. Available from: https://xcmsonline.scripps.edu/.
Fernández-Albert F, et al. An R package to analyse LC/MS metabolomic data: MAIT (Metabolite Automatic Identification Toolkit). Bioinformatics. 2014;30(13):1937–9.
Tiffany CR, Bäumler AJ. omu, a metabolomics count data analysis tool for intuitive figures and convenient metadata collection. Microbiol Resour Announc. 2019;8(15):e00129-e219.
Costa C, Maraschin M, Rocha M. An R package for the integrated analysis of metabolomics and spectral data. Comput Methods Progr Biomed. 2016;129:117–24.
Stratton KG, et al. pmartR: quality control and statistics for mass spectrometry-based biological data. J Proteome Res. 2019;18(3):1418–25.
Gaude E, et al. muma, an R package for metabolomics univariate and multivariate statistical analysis. Curr Metabol. 2013;1(2):180–9.
Davis TJ, et al. Addressing missing data in GC x GC metabolomics: Identifying missingness type and evaluating the impact of imputation methods on experimental replication. Anal Chem. 2022;94(31):10912–20.
Payne TG, et al. A signal filtering method for improved quantification and noise discrimination in fourier transform ion cyclotron resonance mass spectrometry-based metabolomics data. J Am Soc Mass Spectrom. 2009;20(6):1087–95.
Bijlsma S, et al. Large-scale human metabolomics studies: a strategy for data (pre-) processing and validation. Anal Chem. 2006;78(2):567–74.
Antonelli J, et al. Statistical workflow for feature selection in human metabolomics data. Metabolites. 2019;9(7):143.
Dillon WR, Goldstein M. Multivariate analysis: methods and applications. New York: Wiley; 1984.
Chen Y, Li EM, Xu LY. Guide to metabolomics analysis: a bioinformatics workflow. Metabolites. 2022;12(4):357.
Barker M, Rayens W. Partial least squares for discrimination. J Chemom. 2003;17(3):166–73.
Broughton-Neiswanger LE, et al. Urinary chemical fingerprint left behind by repeated NSAID administration: discovery of putative biomarkers using artificial intelligence. PLoS ONE. 2020;15(2):e0228989.
Lopez-Hernandez Y, et al. Targeted metabolomics identifies high performing diagnostic and prognostic biomarkers for COVID-19. Sci Rep. 2021;11(1):14732.
Kelly RS, et al. Partial least squares discriminant analysis and Bayesian networks for metabolomic prediction of childhood asthma. Metabolites. 2018;8(4):68.
Worley B, Powers R. PCA as a practical indicator of OPLS-DA model reliability. Curr Metabolomics. 2016;4(2):97–103.
Brereton RG, Lloyd GR. Partial least squares discriminant analysis: taking the magic away. J Chemom. 2014;28(4):213–25.
Tonoyan NM, et al. Alterations in lipid profile upon uterine fibroids and its recurrence. Sci Rep. 2021;11(1):11447.
Minale G, et al. Characterization of metabolites in plasma, urine and feces of healthy participants after taking brahmi essence for twelve weeks using LC-ESI-QTOF-MS metabolomic approach. Molecules. 2021;26(10):2944.
Liu H, et al. UHPLC-Q-Orbitrap-HRMS-based global metabolomics reveal metabolome modifications in plasma of young women after cranberry juice consumption. J Nutr Biochem. 2017;45:67–76.
Pang Z, et al. Serum metabolomics analysis of asthma in different inflammatory phenotypes: a cross-sectional study in Northeast China. Biomed Res Int. 2018;2018:2860521.
Do E, et al. Metabolomic analysis of healthy human urine following administration of glimepiride using a liquid chromatography-tandem mass spectrometry. Transl Clin Pharmacol. 2017;25:67.
Gromski PS, et al. Influence of missing values substitutes on multivariate analysis of metabolomics data. Metabolites. 2014;4(2):433–52.
Broadhurst DI, Kell DB. Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics. 2006;2(4):171–96.
Steuer AE, Brockbals L, Kraemer T. Metabolomic strategies in biomarker research-new approach for indirect identification of drug consumption and sample manipulation in clinical and forensic toxicology? Front Chem. 2019;7:319.
Wiklund S, et al. Visualization of GC/TOF-MS-based metabolomics data for identification of biochemically interesting compounds using OPLS class models. Anal Chem. 2008;80(1):115–22.
Kim K, et al. Urine metabolomics analysis for kidney cancer detection and biomarker discovery. Mol Cell Proteomics. 2009;8(3):558–70.
Chen T, Cao Y, Zhang Y, Liu J, Bao Y, Wang C, Jia W, Zhao A. Random forest in clinical metabolomics for phenotypic discrimination and biomarker selection. Evid-Based Complementary Altern Med. 2013. https://doi.org/10.1155/2013/298183.
Liland KH. Multivariate methods in metabolomics: from pre-processing to dimension reduction and statistical analysis. TrAC Trends Anal Chem. 2011;30(6):827–41.
Hsu CW, Lin CJ. A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw. 2002;13(2):415–25.
Amaratunga D, Cabrera J, Lee YS. Enriched random forests. Bioinformatics. 2008;24(18):2010–4.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Gromski PS, et al. A tutorial review: metabolomics and partial least squares-discriminant analysis: a marriage of convenience or a shotgun wedding. Anal Chim Acta. 2015;879:10–23.
Riekeberg E, Powers R. New frontiers in metabolomics: from measurement to insight. F1000Res. 2017;6:1148.
Gomari DP, et al. Variational autoencoders learn transferrable representations of metabolomics data. Commun Biol. 2022;5(1):645.
Thiel M, Féraud B, Govaerts B. ASCA+ and APCA+: extensions of ASCA and APCA in the analysis of unbalanced multifactorial designs. J Chemom. 2017;31(6):e2895.
Tian X, et al. Towards enhanced metabolomic data analysis of mass spectrometry image: multivariate curve resolution and machine learning. Anal Chim Acta. 2018;1037:211–9.
Olsson M, et al. Metabolomics analysis for diagnosis and biomarker discovery of transthyretin amyloidosis. Amyloid. 2021;28(4):234–42.
Efimenko M, Ignatev A, Koshechkin K. Review of medical image recognition technologies to detect melanomas using neural networks. BMC Bioinform. 2020;21(11):270.
Perng W, et al. Metabolomic profiles and development of metabolic risk during the pubertal transition: a prospective study in the ELEMENT Project. Pediatr Res. 2019;85(3):262–8.
Vasquez MM, et al. Least absolute shrinkage and selection operator type methods for the identification of serum biomarkers of overweight and obesity: simulation and application. BMC Med Res Methodol. 2016;16(1):154.
Worley B, Powers R. Multivariate analysis in metabolomics. Curr Metabol. 2013;1(1):92–107.
Zheng X, et al. Metabolic signature of pregnant women with neural tube defects in offspring. J Proteome Res. 2011;10(10):4845–54.
Cai H-L, et al. Metabolomic analysis of biochemical changes in the plasma and urine of first-episode neuroleptic-naïve schizophrenia patients after treatment with risperidone. J Proteome Res. 2012;11(8):4338–50.
Bevilacqua M, Bro R. Can we trust score plots? Metabolites. 2020;10(7):278.
Triba MN, et al. PLS/OPLS models in metabolomics: the impact of permutation of dataset rows on the K-fold cross-validation quality parameters. Mol BioSyst. 2015;11(1):13–9.
Szymanska E, et al. Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies. Metabolomics. 2012;8(Suppl 1):3–16.
Eriksson L, Trygg J, Wold S. CV-ANOVA for significance testing of PLS and OPLS® models. J Chemom. 2008;22(11–12):594–600.
Pepe MS, et al. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol. 2004;159(9):882–90.
Zhang Y, et al. SurvBenchmark: comprehensive benchmarking study of survival analysis methods using both omics data and clinical data. bioRxiv. 2021:2021.07.11.451967.
Shultz EK. Multivariate receiver-operating characteristic curve analysis: prostate cancer screening as an example. Clin Chem. 1995;41(8 Pt 2):1248–55.
Rahman MA, et al. LC-HRMS based non-targeted metabolomic profiling of wheat (Triticum aestivum L.) under post-anthesis drought stress. Am J Plant Sci. 2017;8:3024–61.
Tyagi R, et al. Urine metabolomics based prediction model approach for radiation exposure. Sci Rep. 2020;10(1):16063.
Leave-one-out cross-validation. In: Sammut C, Webb GI, editors. Encyclopedia of machine learning. Boston: Springer US; 2010. p. 600–1.
Xu Q-S, Liang Y-Z, Du Y-P. Monte Carlo cross-validation for selecting a model and estimating the prediction error in multivariate calibration. J Chemom. 2004;18(2):112–20.
Chen T, et al. Random forest in clinical metabolomics for phenotypic discrimination and biomarker selection. Evid Based Complement Alternat Med. 2013;2013:298183.
Sun Y, et al. Metabolomics signatures in type 2 diabetes: a systematic review and integrative analysis. J Clin Endocrinol Metab. 2020;105(4):1000.
Schmidt JC, et al. Metabolomics as a truly translational tool for precision medicine. Int J Toxicol. 2021;40(5):413–26.
Yao M, et al. Identification of biomarkers for preeclampsia based on metabolomics. Clin Epidemiol. 2022;14:337–60.
Lai W, Du D, Chen L. Metabolomics provides novel insights into epilepsy diagnosis and treatment: a review. Neurochem Res. 2022;47(4):844–59.
Luo J, et al. Human plasma metabolomics identify 9-cis-retinoic acid and dehydrophytosphingosine levels as novel biomarkers for early ventricular fibrillation after ST-elevated myocardial infarction. Bioengineered. 2022;13(2):3334–50.
Pluskal T, et al. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinform. 2010;11:395.
Mak TD, et al. MetaboLyzer: a novel statistical workflow for analyzing postprocessed LC–MS metabolomics data. Anal Chem. 2014;86(1):506–13.
Peters K, et al. PhenoMeNal: processing and analysis of metabolomics data in the cloud. GigaScience. 2018;8(2):giy149.
Kirpich AS, et al. SECIMTools: a suite of metabolomics data analysis tools. BMC Bioinform. 2018;19(1):151.
R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2013.
Goldansaz SA, et al. Livestock metabolomics and the livestock metabolome: a systematic review. PLoS ONE. 2017;12(5):e0177675.
Trivedi DK, Hollywood KA, Goodacre R. Metabolomics for the masses: the future of metabolomics in a personalized world. New Horiz Transl Med. 2017;3(6):294–305.
Broadhurst D, et al. Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies. Metabolomics. 2018;14(6):72.
Nezhad MZ, et al. SUBIC: a supervised bi-clustering approach for precision medicine. In: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). 2017.
Yi H, et al. COBRAC: a fast implementation of convex biclustering with compression. Bioinformatics. 2021;37(20):3667–9.
Tolstikov V, et al. Current status of metabolomic biomarker discovery: impact of study design and demographic characteristics. Metabolites. 2020;10(6):224.
Munafo MR, et al. A manifesto for reproducible science. Nat Hum Behav. 2017;1:0021.
Poste G. Bring on the biomarkers. Nature. 2011;469(7329):156–7.
Kohler I, et al. Integrating clinical metabolomics-based biomarker discovery and clinical pharmacology to enable precision medicine. Eur J Pharm Sci. 2017;109:S15–21.
Pedersen HK, et al. A computational framework to integrate high-throughput “-omics” datasets for the identification of potential mechanistic links. Nat Protoc. 2018;13(12):2781–800.
The authors would like to thank the Qatar National Research Fund (QNRF) for funding this project. Open access funding is provided by the Qatar National Library.
Open Access funding provided by the Qatar National Library. This research was funded by the Qatar National Research Fund (QNRF), grant number NPRP13S-1230-190008.
Ethics approval and consent to participate
Consent to publish
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Anwardeen, N.R., Diboun, I., Mokrab, Y. et al. Statistical methods and resources for biomarker discovery using metabolomics. BMC Bioinformatics 24, 250 (2023). https://doi.org/10.1186/s12859-023-05383-0