Skip to main content

MetHoS: a platform for large-scale processing, storage and analysis of metabolomics data



Modern mass spectrometry has revolutionized the detection and analysis of metabolites but likewise, let the data skyrocket with repositories for metabolomics data filling up with thousands of datasets. While there are many software tools for the analysis of individual experiments with a few to dozens of chromatograms, we see a demand for a contemporary software solution capable of processing and analyzing hundreds or even thousands of experiments in an integrative manner with standardized workflows.


Here, we introduce MetHoS as an automated web-based software platform for the processing, storage and analysis of great amounts of mass spectrometry-based metabolomics data sets originating from different metabolomics studies. MetHoS is based on Big Data frameworks to enable parallel processing, distributed storage and distributed analysis of even larger data sets across clusters of computers in a highly scalable manner. It has been designed to allow the processing and analysis of any amount of experiments and samples in an integrative manner. In order to demonstrate the capabilities of MetHoS, thousands of experiments were downloaded from the MetaboLights database and used to perform a large-scale processing, storage and statistical analysis in a proof-of-concept study.


MetHoS is suitable for large-scale processing, storage and analysis of metabolomics data aiming at untargeted metabolomic analyses. It is freely available at: Users interested in analyzing their own data are encouraged to apply for an account.


Metabolomics is a unique part of modern life science and molecular biology due to its multidisciplinary requirements: knowledge from biology, chemistry, physics as well as mathematics and statistics needs to be integrated. It deals with the quantification and identification of small molecules called metabolites (< 1500 Da) which are the intermediate and ending products in cellular processes of an organism [1]. The key technology to investigate the metabolites that are abundant in an organism or a tissue is mass spectrometry, where state-of-the-art methods allow gaining tens of thousands of mass spectra within a few minutes which in turn characterize hundreds of potential compounds. The entirety of these metabolites in a cell at a specific moment can be considered as a high-dimensional molecular snapshot of the organism which carries an imprint of all genetic, epigenetic and environmental factors. Thus, one of the main goals of metabolomics research is to bridge the gap between the genotype and phenotype in order to get a complete picture of the internal structure and behavior of a cell [2]. Metabolomics has a wide application in many different fields such as toxicology assessment, nutritional genomics, biomarker discovery and identification, drug development and disease prognosis [3].

High-throughput metabolomics experiments, generally, follow an untargeted approach which is characterized by the simultaneous measurement of a large number of metabolites from each sample, thus analyzing the global metabolomic profile [4]. In this, raw datasets obtained from compound separation and detection techniques, such as Gas Chromatography (GC) or Liquid Chromatography (LC) coupled to Mass spectrometry (MS), are transformed to quantitative metabolite information [5,6,7]. The general processing strategy includes noise filtering and baseline reduction [8] followed by peak detection [9] and deconvolution [10], chromatographic alignment [11], identification of metabolomic features [12], substitution of missing values [13], normalization [14] and statistical analysis [15, 16].

To support researchers in the complicated and complex analytical workflow, a large number of software tools have been developed. Some of the available tools are focused either on the quantification or identification of metabolites. For instance, iMet-Q [17] and apLCMS [18] deal with the quantification step while Metabolyzer [19] with the identification of metabolites. Similarly, MetaBox [20] puts the focus only on the statistical analysis that follows the processing step. Other software packages include both the quantification and identification step like XCMS [21], MetAlign 3.0 [22], MZmine 2 [23], MAVEN [24], mzMatch [25] or MS-Dial [26]. However, they either lack statistical capabilities or are not web-based, which confines them at the usage of computational resources. There are also web-based tools like MeltDB [27], XCMS Online [28] and MetaboAnalyst [29] that not only offer support in data storage and retrieval but also analytical tools for quantification, identification and statistical analysis. However, none of these tools is able to deal with large collections of data sets, referred to as large-scale metabolomics. In the context of this work, we categorize data sets ranging from more than a couple of hundreds to thousands of files to be a large-scale metabolomics data set. For instance, MeltDB is a semi-automated system not prepared for big amounts of data and XCMS Online places the focus on processing with restrictions in the storage and limited statistics, while the recently released MetaboAnalyst 5.0 at least supports up to 200 files.

With repositories giving access to hundreds or even thousands of files from different experiments we see a clear demand for an easy-to-use software solution capable of handling large amounts of metabolomics data in short processing and analysis times. In recent years, first big data frameworks and libraries have been developed which have proven to be capable of dealing with such amounts of data. PhenoMeNal [30] is a framework that can handle large volumes of data as it utilizes a cluster of computers. However, it follows the Infrastructure-as-a-Service cloud model and requires prior knowledge of cloud computing concepts, which most metabolomics researchers may not be familiar with. In a similar manner, workflow4metabolomics [31] is a repository for Galaxy-based workflows that covers many aspects of metabolomics data processing and analysis. Yet again this requires set up of the technical infrastructure and does not include a scalable storage solution. In this study, we present MetHoS, a ready-to-use web-based platform for large-scale processing, storage and analysis of metabolomics data sets. MetHoS is based on big data frameworks and provides users efficient and user-friendly handling of their own experimental metabolomics data. With Apache Spark [32], Apache Cassandra [33] and KNIME (Konstanz Information Miner) [34] as our fundament we propose a different way of handling large-scale datasets with the prospect of handling even largest amounts of data.


Application architecture

Cloud computing allows for the parallel execution of tasks on a large number of virtual machines. Moreover, it allows for scalability: if the size of the problem increases, more machines can easily be added. MetHoS supports horizontal scaling, thus ensuring performance by not being limited to the capacity of a single unit and redundancy with no single point of failure. It is written in Java and utilizes a set of software tools that enable parallel processing, distributed storage and distributed analysis on the cloud (Fig. 1). MetHoS uses a variable number of computer nodes and has been designed for OpenStack, the popular open-source software platform for cloud computing.

Fig. 1
figure 1

This figure shows the data-flow during the processing and analysis steps of the combination of Apache Spark, Apache Cassandra and KNIME

The general processing of uploaded metabolomics data is relying on the software KNIME, which is a well-known open-source data workflow engine and analytics platform. Apache Cassandra, a NoSQL database for distributed storage, was selected to store the results of the processing across the cluster. We integrated Apache Spark, a fast in-memory cluster-computing engine, that can perform distributed processing and analysis of big amounts of data. Finally, we combined all three software tools together with the Spring Application framework providing users a friendly and easy-to-understand web-based application.

Project and data management

MetHoS provides sophisticated user and project management, in which uploaded samples can be grouped to create experiments in already pre-defined projects. The owner of a project has the ability to edit it and also manage access rights. MetHoS makes use of OpenStack’s object storage through the swift API, where the experiments are stored in a shared storage space (Fig. 1).

An experiment in MetHoS refers to a biological experiment that consists of many biological replicates. After uploading, experiments can be processed (quantification and identification step) by selecting one of the automated KNIME workflows. In result, each workflow is also responsible for storing the results straight into the Cassandra database. Apache Spark verifies fair job-scheduling and workload distribution, e.g. in terms of KNIME workflow processing, giving thus the ability to process thousands of experiments in a matter of hours.

Results of the processing are presented in the View section. The representation can be either in a form of a table by selecting an experiment and observing all the identified and unidentified metabolomic features or with boxplots by selecting more experiments and observing the metabolomic features among them. Various statistical tests enable further investigation and analysis of the processed metabolomics data within an experiment but, in particular, across a multitude of experiments. The results are presented and visualized with the help of the D3 Javascript library.

Parallel processing

The architecture and design of MetHoS are based on flexibility allowing for an easy integration of workflows implemented in KNIME Analytics Platform. KNIME has integrated OpenMS [35], an open-source software C++ library for management and analyses of metabolomics data, allowing for quantification and identification of metabolites, against spectral databases, in each sample or group of samples.

Once the raw chromatographic data are uploaded to the Openstack object storage, they can be selected for processing (Fig. 1, step 1) with a pre-defined workflow. The Spark master, which is a single coordinator that acquires cluster nodes (Spark Workers), receives the request and distributes the tasks to each node (Fig. 1, steps 2 and 3). Then, each Spark Worker activates a Spark Executor, an agent responsible for carrying out the task and activating KNIME on every node in order to process the selected experiments (Fig. 1, steps 4 and 5).

MetHos supports the well-known and accepted open-source file formats .mzML, .mzData and .mzXML as inputs and produces information about identified and unidentified metabolites which is automatically being stored in the Cassandra database (Fig. 1, step 6). The workflows currently provided in MetHoS target MS1 and MS2 level data and support identification by exact mass search as well as spectral matching. Figure 2 gives an example of two of these workflows which cover all steps required to analyze metabolomics data: conversion of mass spectra in an appropriate format, the quantification and normalization (Additional file 4: Table S1), the control of ionization mode, corrections of retention time distortions (Additional file 5: Table S2 and Additional file 6: Table S3), the identification of metabolites and the storage of metabolite measurements into the Cassandra database. For the quantification of metabolites both workflows rely on the software tool FeatureFinderMetabo [36] of OpenMS. Concerning the identification of metabolites the first workflow uses the AccurateMassSearch algorithm, which identifies metabolite features by comparing their exact mass to databases like HMDB [37], MassBank [38] and MoNA:Fiehn [39], while the second workflow uses the MetaboliteSpectralMatcher [40], which identifies small molecules from tandem MS spectra using a spectral library such as MassBank.

Fig. 2
figure 2

This figure shows the steps of two of the workflows currently implemented in MetHoS. a Identification by exact mass, b Identification by spectral matching

Distributed storage

The main advantage of a NoSQL database over an SQL database is horizontal scalability and distributed storage, giving the possibility to store any amount of data, just limited by the number of utilized storage nodes. Apache Cassandra is such a NoSQL distributed database. Its masterless architecture appoints it unique among other NoSQL databases as it ensures high availability of data at all times. In combination with adjusting the replication factor to three in our Cassandra model, we are able to provide a no single point of failure model.

Cassandra-specific, our data model (Fig. 3) was modeled around queries, with the experiment column family being the most important where the processing results of millions of metabolomic features are stored with the KNIME workflows. Furthermore, with an appropriate partitioning within Cassandra, it is made sure that data is equally distributed and stored efficiently.

Fig. 3
figure 3

This figure shows the conceptual data model of the Cassandra database

Distributed statistical analysis

Apache Spark is a fast, distributed in-memory large-scale data processing engine providing powerful Machine Learning algorithms, with their own library, which perform statistical tests in a distributed manner. MetHoS uses the Apache Spark Machine Learning library (Spark MLlib) in combination with the recently developed data structure called Dataset, which provides the convenience of an RDD, less memory consumption and automatic optimization. For every statistical test of our application, Apache Spark creates jobs which are comprised from tasks that are distributed from the Spark master and executed from the Spark executors on data partitions. Each executor of a node has been assigned a number of cores and the more cores can be used the more tasks can be performed in parallel.

The statistical methods for analysis available in MetHoS are:

  • Basic statistics (mean and standard deviation)

  • Metabolite filtering (mean and standard deviation between two defined groups of experiments)

  • Pearson correlation

  • Spearman correlation

  • Principal component analysis (PCA)

  • Clustering (K-means, Bisecting k-means)

In our analyses, we provide a set of choices depending on the desired depth of analysis. Specific metabolites can be selected to set the focus of the analysis. If required, handling of missing values can be incorporated. In the default settings, analyses are conducted on any common metabolite that exists in every selected experiment. Respectively, for the replicate level, the list of the metabolites that exist at least once in every replicate of the selected experiments will be selected for analysis. Missing values strategies include replacement with zero, mean or median. Results of an analysis, e.g. basic statistics such as mean and standard deviation values of all or common metabolites in experiments can be exported in .csv format for further analysis and observation.


The capabilities of MetHoS are presented with more detail in the following evaluation and a use case in which we performed a large-scale processing, storage and statistical analysis in thousands of experiments downloaded from the MetaboLights database belonging to 38 different studies [36, 41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73].

Evaluation of scalability

Processing thousands of experiments in a linear manner with the traditional methods would take several days or weeks depending on the complexity and size of the files. In order to prove the efficiency of parallel processing and the horizontal scalability of MetHoS, we processed 200 experiments originating from a study of MetaboLights repository, under the study identifier MTBLS28, with a variable number of Spark workers (Fig. 4). The results indicated that the more worker nodes are added in the cluster, the faster the processing is completed.

Fig. 4
figure 4

This figure shows the scalability of MetHoS with 1, 2, 4, 8 and 16 Spark workers compare to the time it takes to process 200 experiments. The processing was performed 3 times on the same 200 experiments for every number of workers

Use case data

In our comprehensive use case, we extended to all the samples of studies from 2012 to 2020 which were downloaded and grouped in experiments, after excluding problematic and corrupted files. They originate from mass spectrometry experiments of the human organism which contain .mzData, .mzML and .mzXML files (Additional file 7: Table S4). Uploading the experiments in MetHoS resulted in 4827 experiments occupying approximately 1.1 Tb of disk space in the Openstack object storage space.


The processing with MetHoS lasted approximately 12 hours and ended up quantifying and identifying more than 2 billion metabolite features. The results were automatically stored in MetHoS (Fig. 5a).

Fig. 5
figure 5

a Web interface of a project. b K-means clustering on all 4827 experiments (re-scaled). c PCA analysis of 144 experiments originating from whole blood, blood plasma and erythrocyte samples and 57 experiments originating from urine samples. d Pearson Correlation of 90 experiments on 112 compounds

Analysis results

For our analysis, we performed a k-means clustering in all 4.827 experiments using 15 clusters, with all metabolites, on experiment level and replacing missing values with the mean (Fig. 5b, Additional file 1: Fig. S1). Results show that two big clusters were formed, clusters 1 and 14 (Additional file 8: Table S5, Additional file 9: Table S6). Although there are similarities between several urine samples and blood or blood plasma or blood serum samples, the majority of samples in cluster 1 belong to blood samples and for cluster 14 to urine samples. Clusters 3, 6, 7, 8, 11, 12, 13 and 15 consist exclusively of urine samples, while clusters 2 and 9 only of solvent samples. Moreover, cluster 4 consists only of blood serum samples and cluster 10 only of blood plasma samples. Samples originating from lung, feces, renal tubule or cerebrospinal fluid have been split almost equally in clusters 1 and 14 indicating the formation of two distinct groups possibly originating from two different conditions in each group of experiments. Furthermore, all samples originating from THP-1-cell and Breath have been clustered together in cluster 1 and all samples from MCF-10A-cell and umbilical vein endothelial cell line in cluster 14. Finally, cluster 5 suggests similarities between some solvent and urine samples.

We selected 201 experiments (Additional file 7: Table S4) of blood and urine and performed a Principal Component Analysis on experiment level, selecting all metabolites, replacing missing values with zero and using the z-score normalization. The analysis ended up differentiating the urine samples from the blood samples successfully in an interactive PCA plot (Fig. 5c). It is shown that the intensities of 40.101 metabolites from 201 experiments took part in the calculation of the PCA, while metabolites that were not present in all the experiments are depicted in the table next to the PCA plot.

Afterwards, 90 experiments (Additional file 7: Table S4) were selected, containing blood samples from 30 individuals, 15 young and 15 elderly (30 whole blood, 30 erythrocyte and 30 blood plasma samples). Pearson Correlation was implemented on 112 compounds (Additional file 10: Table S7) that according to literature [47], show age-related increases or decreases while replacing the missing values with zero (Fig. 5d).

The results indicated that Fructose 6-phosphate and Glucose 6-phospate are highly positively correlated. The same stands for 2-Phosphoglyceric acid and 2,3-Diphosphoglyceric acid which have a very strong correlation. Closely correlated are also Nicotinamide Adenine Dinucleotide (NAD) and Nicotinamide Adenine Dinucleotide Phosphate (NADP). Last but not least, there were missing values of NADP and Uridine Triphosphate which indicates that they were not present in all 90 experiments.

Thereafter, we selected 45 of the 90 aforementioned experiments that contain samples originating from the 15 young individuals and performed a k-means clustering with 10 clusters, on metabolite level and replacing missing values with zero. The same was implemented for the rest 45 experiments of the 15 elder individuals (Additional file 2: Fig. S2, Additional file 3: Fig. S3).

The results suggest that metabolites like Leucine and Isoleucine, which may play a distinct role in supporting skeletal muscle activity, are clustered together in both cases (Additional file 11: Table S8, Additional file 12: Table S9). Ergothioneine is clustered alone in both cases showing more fluctuations in the elder individuals. In both young and elder clusters, Adenosine Diphosphate (ADP) and NAD are clustered together while Adenosine Triphosphate (ATP) is clustered separately in both cases. L-Acetylcarnitine is clustered separately for the young people while for the elder people it is clustered together with metabolites that are involved in the glucose metabolism (2-phosphoglyceric acid, 3-phosphoglyceric acid, Guanosine Triphosphate (GTP), NADP and Uridine Diphosphate Glucose) and shows a decrease of its abundance in the elder people.

Cluster setup

MetHoS is using Apache Spark 3.0.1 in standalone mode and currenctly uses 16 worker nodes and 1 master node. Each Spark worker has one executor with two cores making it possible to parallelize two tasks per worker or 32 tasks in total. We provide 23 Gb of RAM to every executor and 115 Gb to the master node. Spark is able to access the Cassandra database through the Spark-Cassandra-Connector 3.0.0 while it is also authorized to access Openstack Object Storage for downloading the experiments to be processed every time.

In the same 16 computer nodes, we have also installed KNIME providing it with 2 Gb of RAM. Spark downloads an experiment locally on the node and activates KNIME so that it can be processed on the same node. Consequently, a number of 32 experiments can be processed in parallel on the cluster. For our Cassandra setup, we are using the same 16 nodes used as Spark workers.


Here we introduced MetHoS as a flexible and easy-to-use web-based platform, based on big data frameworks, that provides automated processing, distributed storage and distributed analysis in short processing and analysis times (Additional file 13). Our aim was to provide users a bioinformatics platform for the efficient and user-friendly handling of experimental data originating from different metabolomics studies allowing in that way the integration of metabolomics data. MetHoS is built on Apache Spark to enable metabolomics data processing and analysis using KNIME and SparkML but also to constitute the basis for future analysis functionalities. We evaluated the scalability of the platform using a variable number of Spark workers and demonstrated its capabilities by handling 1.1 Tb of data which were processed in only 12 h ending up to more than 2 billion metabolite features. MetHoS allows for automated processing of large numbers of chromatographic datasets in terms of untargeted metabolite profiling, quantification and de novo identification and by that reaches a time-efficient and target-oriented integration and interpretation of metabolomics data.

Availability of data and materials

The datasets used and analysed during the current study are publicly available in the MetaboLights repository at and their study identifiers can be found in the Additional file 7: Table S4. Source code is available at Project name: MetHoS; Project home page:; Operating system(s): Platform independent; Programming language: Java, Javascript, HTML, CQL; Other requirements: Apache Spark, Apache Cassandra, Knime, Spring Framework, OpenStack; License: GPLv3; Any restrictions to use by non-academics: None.



Mass spectrometry


Gas chromatography


Liquid chromatography


Konstanz Information Miner


Machine Learning library


Principal component analysis


Nicotinamide Adenine Dinucleotide


Nicotinamide Adenine Dinucleotide Phosphate


Adenosine Diphosphate


Adenosine Triphosphate


Guanosine Triphosphate


  1. Bino RJ, Hall RD, Fiehn O, Kopka J, Saito K, Draper J, Nikolau BJ, Mendes P, Roessner-Tunali U, Beale MH, Trethewey RN, Lange BM, Wurtele ES, Sumner LW. Potential of metabolomics as a functional genomics tool. Trends Plant Sci. 2004;9(9):418–25.

    CAS  Article  PubMed  Google Scholar 

  2. Fiehn O. Metabolomics—the link between genotypes and phenotypes. Plant Mol Biol. 2002;48(1):155–71.

    CAS  Article  PubMed  Google Scholar 

  3. Trivedi DK, Hollywood KA, Goodacre R. Metabolomics for the masses: the future of metabolomics in a personalized world. New Horiz Transl Med. 2017;3(4):294–305.

    Article  PubMed  PubMed Central  Google Scholar 

  4. Alonso A, Marsal S, Julià A. Analytical methods in untargeted metabolomics: state of the art in 2015. Front Bioeng Biotechnol. 2015;3:23.

    Article  PubMed  PubMed Central  Google Scholar 

  5. Villas-Bôas SG, Højer-Pedersen J, Akesson M, Smedsgaard J, Nielsen J. Global metabolite analysis of yeast: evaluation of sample preparation methods. Yeast. 2005;22(14):1155–69.

    CAS  Article  PubMed  Google Scholar 

  6. Dettmer K, Aronov PA, Hammock BD. Mass spectrometry-based metabolomics. Mass Spectrom Rev. 2007;26(1):51–78.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  7. El-Aneed A, Cohen A, Banoub J. Mass spectrometry, review of the basics: electrospray, MALDI, and commonly used mass analyzers. Appl Spectrosc Rev. 2009;44(3):210–30.

    CAS  Article  Google Scholar 

  8. Wang W, Zhou H, Lin H, Roy S, Shaler TA, Hill LR, Norton S, Kumar P, Anderle M, Becker CH. Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Anal Chem. 2003;75(18):4818–26.

    CAS  Article  PubMed  Google Scholar 

  9. Tautenhahn R, Böttcher C, Neumann S. Highly sensitive feature detection for high resolution LC/MS. BMC Bioinform. 2008;9(1):504.

    CAS  Article  Google Scholar 

  10. Du X, Zeisel SH. Spectral deconvolution for gas chromatography mass spectrometry-based metabolomics: current status and future perspectives. Comput Struct Biotechnol J. 2013;4(5):201301013.

    Article  Google Scholar 

  11. Zhou B, Xiao JF, Tuli L, Ressom HW. LC–MS-based metabolomics. Mol BioSyst. 2012;8:470–81.

    CAS  Article  PubMed  Google Scholar 

  12. Smith CA, O’Maille G, Want EJ, Qin C, Trauger SA, Brandon TR, Custodio DE, Abagyan R, Siuzdak G. METLIN: a metabolite mass spectral database. Ther Drug Monit. 2005;27(6):747–51.

    CAS  Article  PubMed  Google Scholar 

  13. Steinfath M, Groth D, Lisec J, Selbig J. Metabolite profile analysis: from raw data to regression and classification. Physiol Plant. 2008;132(2):150–61.

    CAS  Article  PubMed  Google Scholar 

  14. Xi B, Gu H, Baniasadi H, Raftery D. Statistical analysis and modeling of mass spectrometry-based metabolomics data. Methods Mol Biol (Clifton, NJ). 2014;1198:333–53.

    CAS  Article  Google Scholar 

  15. Vinaixa M, Samino S, Saez I, Duran J, Guinovart JJ, Yanes O. A guideline to univariate statistical analysis for LC/MS-based untargeted metabolomics-derived data. Metabolites. 2012;2(4):775–95.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  16. Worley B, Powers R. Multivariate analysis in metabolomics. Metabolomics. 2013;1(1):92–107.

    CAS  Google Scholar 

  17. Chang H-Y, Chen C-T, Lih TM, Lynn K-S, Juo C-G, Hsu W-L, Sung T-Y. iMet-Q: a user-friendly tool for label-free metabolomics quantitation using dynamic peak-width determination. PLoS ONE. 2016;11(1):1–18.

    CAS  Article  Google Scholar 

  18. Yu T, Park Y, Johnson JM, Jones DP. apLCMS—adaptive processing of high-resolution LC/MS data. Bioinformatics (Oxford, England). 2009;25(15):1930–6.

    CAS  Article  Google Scholar 

  19. Mak TD, Laiakis EC, Goudarzi M, Fornace AJ. Metabolyzer: a novel statistical workflow for analyzing postprocessed LC–MS metabolomics data. Anal Chem. 2014;86(1):506–13.

    CAS  Article  PubMed  Google Scholar 

  20. Wanichthanarak K, Fan S, Grapov D, Barupal DK, Fiehn O. Metabox: a toolbox for metabolomic data analysis, interpretation and integrative exploration. PLoS ONE. 2017;12(1):1–14.

    CAS  Article  Google Scholar 

  21. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem. 2006;78(3):779–87.

    CAS  Article  PubMed  Google Scholar 

  22. Lommen A, Kools HJ. Metalign 3.0: performance enhancement by efficient use of advances in computer hardware. Metabolomics Off J Metabolomic Soc. 2012;8(4):719–26.

    CAS  Article  Google Scholar 

  23. Pluskal T, Castillo S, Villar-Briones A, Orešič M. Mzmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinform. 2010;11(1):395.

    CAS  Article  Google Scholar 

  24. Clasquin MF, Melamud E, Rabinowitz JD. LC–MS data processing with maven: a metabolomic analysis and visualization engine. Curr Protoc Bioinform. 2012;Chapter 14:14–11.

    Article  Google Scholar 

  25. Scheltema RA, Jankevics A, Jansen RC, Swertz MA, Breitling R. PeakML/mzMatch: a file format, java library, r library, and tool-chain for mass spectrometry data analysis. Anal Chem. 2011;83(7):2786–93.

    CAS  Article  PubMed  Google Scholar 

  26. Tsugawa H, Cajka T, Kind T, Ma Y, Higgins B, Ikeda K, Kanazawa M, VanderGheynst J, Fiehn O, Arita M. MS-DIAL: data-independent MS/MS deconvolution for comprehensive metabolome analysis. Nat Methods. 2015;12(6):523–6.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  27. Kessler N, Neuweger H, Bonte A, Langenkämper G, Niehaus K, Nattkemper TW, Goesmann A. MeltDB 2.0—advances of the metabolomics software system. Bioinformatics. 2013;29(19):2452–9.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  28. Tautenhahn R, Patti GJ, Rinehart D, Siuzdak G. XCMS online: a web-based platform to process untargeted metabolomic data. Anal Chem. 2012;84(11):5035–9.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  29. Pang Z, Chong J, Zhou G, de Lima Morais DA, Chang L, Barrette M, Gauthier C, Jacques P-E, Li S, Xia J. MetaboAnalyst 5.0: narrowing the gap between raw spectra and functional insights. Nucleic Acids Res. 2021.

    Article  PubMed  PubMed Central  Google Scholar 

  30. Peters K, Bradbury J, Bergmann S, Capuccini M, Cascante M, de Atauri P, Ebbels TMD, Foguet C, Glen R, Gonzalez-Beltran A, Günther UL, Handakas E, Hankemeier T, Haug K, Herman S, Holub P, Izzo M, Jacob D, Johnson D, Jourdan F, Kale N, Karaman I, Khalili B, Emami Khonsari P, Kultima K, Lampa S, Larsson A, Ludwig C, Moreno P, Neumann S, Novella JA, O’Donovan C, Pearce JTM, Peluso A, Piras ME, Pireddu L, Reed MAC, Rocca-Serra P, Roger P, Rosato A, Rueedi R, Ruttkies C, Sadawi N, Salek RM, Sansone S-A, Selivanov V, Spjuth O, Schober D, Thévenot EA, Tomasoni M, van Rijswijk M, van Vliet M, Viant MR, Weber RJM, Zanetti G, Steinbeck C. Phenomenal: processing and analysis of metabolomics data in the cloud. GigaScience. 2018;8(2):giy149.

    Article  PubMed Central  Google Scholar 

  31. Giacomoni F, Le Corguillé G, Monsoor M, Landi M, Pericard P, Pétéra M, Duperier C, Tremblay-Franco M, Martin J-F, Jacob D, Goulitquer S, Thévenot EA, Caron C. Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics. Bioinformatics. 2014;31(9):1493–5.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  32. Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I, Zaharia M. A view of cloud computing. Commun ACM. 2010;53(4):50–8.

    Article  Google Scholar 

  33. Lakshman A, Malik P. Cassandra: a decentralized structured storage system. SIGOPS Oper Syst Rev. 2010;44(2):35–40.

    Article  Google Scholar 

  34. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B. KNIME: the konstanz information miner. In: Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R, editors. Data analysis, machine learning and applications. Berlin: Springer; 2008. p. 319–26.

    Chapter  Google Scholar 

  35. Röst HL, Sachsenberg T, Aiche S, Bielow C, Weisser H, Aicheler F, Andreotti S, Ehrlich H-C, Gutenbrunner P, Kenar E, Liang X, Nahnsen S, Nilse L, Pfeuffer J, Rosenberger G, Rurik M, Schmitt U, Veit J, Walzer M, Wojnar D, Wolski WE, Schilling O, Choudhary JS, Malmström L, Aebersold R, Reinert K, Kohlbacher O. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Methods. 2016;13(9):741–8.

    CAS  Article  PubMed  Google Scholar 

  36. Kenar E, Franken H, Forcisi S, Wörmann K, Häring H-U, Lehmann R, Schmitt-Kopplin P, Zell A, Kohlbacher O. Automated label-free quantification of metabolites from liquid chromatography–mass spectrometry data. Mol Cell Proteomics MCP. 2014;13(1):348–59.

    CAS  Article  PubMed  Google Scholar 

  37. Wishart DS, Jewison T, Guo AC, Wilson M, Knox C, Liu Y, Djoumbou Y, Mandal R, Aziat F, Dong E, Bouatra S, Sinelnikov I, Arndt D, Xia J, Liu P, Yallou F, Bjorndahl T, Perez-Pineiro R, Eisner R, Allen F, Neveu V, Greiner R, Scalbert A. HMDB 3.0—the human metabolome database in 2013. Nucleic Acids Res. 2013;41(Database issue):801–7.

    CAS  Article  Google Scholar 

  38. Horai H, Arita M, Kanaya S, Nihei Y, Ikeda T, Suwa K, Ojima Y, Tanaka K, Tanaka S, Aoshima K, Oda Y, Kakazu Y, Kusano M, Tohge T, Matsuda F, Sawada Y, Hirai MY, Nakanishi H, Ikeda K, Akimoto N, Maoka T, Takahashi H, Ara T, Sakurai N, Suzuki H, Shibata D, Neumann S, Iida T, Tanaka K, Funatsu K, Matsuura F, Soga T, Taguchi R, Saito K, Nishioka T. MassBank: a public repository for sharing mass spectral data for life sciences. J Mass Spectrom. 2010;45(7):703–14.

    CAS  Article  PubMed  Google Scholar 

  39. Kind T, Wohlgemuth G, Lee DY, Lu Y, Palazoglu M, Shahbaz S, Fiehn O. FiehnLib: mass spectral and retention index libraries for metabolomics based on quadrupole and time-of-flight gas chromatography/mass spectrometry. Anal Chem. 2009;81(24):10038–48.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  40. Fenyö D, Beavis RC. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal Chem. 2003;75(4):768–74.

    CAS  Article  PubMed  Google Scholar 

  41. Haug K, Cochrane K, Nainala VC, Williams M, Chang J, Jayaseelan KV, O’Donovan C. MetaboLights: a resource evolving in response to the needs of its scientific community. Nucleic Acids Res. 2019;48(D1):440–4.

    CAS  Article  Google Scholar 

  42. Capellades J, Navarro M, Samino S, Garcia-Ramirez M, Hernandez C, Simo R, Vinaixa M, Yanes O. geoRge: a computational tool to detect the presence of stable isotope labeling in LC/MS-based untargeted metabolomics. Anal Chem. 2016;88(1):621–8.

    CAS  Article  PubMed  Google Scholar 

  43. Cai X, Dong J, Liu J, Zheng H, Kaweeteerawat C, Wang F, Ji Z, Li R. Multi-hierarchical profiling the structure–activity relationships of engineered nanomaterials at nano-bio interfaces. Nat Commun. 2018;9(1):4416.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  44. Singh KD, Tancev G, Decrue F, Usemann J, Appenzeller R, Barreiro P, Jaumà G, Santiago MM, de Miguel GV, Frey U, Sinues P. Standardization procedures for real-time breath analysis by secondary electrospray ionization high-resolution mass spectrometry. Anal Bioanal Chem. 2019;411(19):4883–98.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  45. Qiong Z, Xiaofeng Y, Haifang W, Xing W, Xin L, Yao L, Xiaohe Z, Chen F, Haixia L, Yurong Q. Fecal metabolomics and potential biomarkers for systemic lupus erythematosus. Front Immunol. 2019;10:976.

    CAS  Article  Google Scholar 

  46. [dataset] Christina R, Rurik M, Kohlbacher O, Huber CG. Multi-omics toxicity profiling of engineered nanomaterials. metabolights_dataset, V1. 2006.

  47. Chaleckis R, Murakami I, Takada J, Kondoh H, Yanagida M. Individual variability in human blood metabolites identifies age-related differences. Proc Natl Acad Sci. 2016;113(16):4252–9.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  48. Ranninger C, Schmidt LE, Rurik M, Limonciel A, Jennings P, Kohlbacher O, Huber CG. Improving global feature detectabilities through scan range splitting for untargeted metabolomics by high-performance liquid chromatography–orbitrap mass spectrometry. Anal Chim Acta. 2016;930:13–22.

    CAS  Article  PubMed  Google Scholar 

  49. Ranninger C, Rurik M, Limonciel A, Jennings P, Kohlbacher O, Huber CG. Nephron toxicity profiling via untargeted metabolome analysis employing a high performance liquid chromatography–mass spectrometry-based experimental and computational pipeline. J Biol Chem. 2015;290(31):19121–32.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  50. Herman S, Khoonsari PE, Aftab O, Krishnan S, Strömbom E, Larsson R, Hammerling U, Spjuth O, Kultima K, Gustafsson M. Mass spectrometry based metabolomics for in vitro systems pharmacology: pitfalls, challenges, and computational solutions. Metabolomics. 2017;13(7):79.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  51. Vincent IM, Daly R, Courtioux B, Cattanach AM, Biéler S, Ndung’u JM, Bisser S, Barrett MP. Metabolomics identifies multiple candidate biomarkers to diagnose and stage human African trypanosomiasis. PLOS Negl Trop Dis. 2016;10(12):1–20.

    CAS  Article  Google Scholar 

  52. Venturini G, Malagrino PA, Padilha K, Tanaka LY, Laurindo FR, Dariolli R, Carvalho VM, Cardozo KHM, Krieger JE, Pereira ADC. Integrated proteomics and metabolomics analysis reveals differential lipid metabolism in human umbilical vein endothelial cells under high and low shear stress. Am J Physiol Cell Physiol. 2019;317(2):326–38.

    CAS  Article  Google Scholar 

  53. Murakami I, Chaleckis R, Pluskal T, Ito K, Hori K, Ebe M, Yanagida M, Kondoh H. Metabolism of skin-absorbed resveratrol into its glucuronized form in mouse skin. PLoS ONE. 2014;9(12):1–20.

    CAS  Article  Google Scholar 

  54. Herman S, Khoonsari PE, Tolf A, Steinmetz J, Zetterberg H, Åkerfeldt T, Jakobsson P-J, Larsson A, Spjuth O, Burman J, Kultima K. Integration of magnetic resonance imaging and protein and metabolite CSF measurements to enable early diagnosis of secondary progressive multiple sclerosis. Theranostics. 2018;8:4477–90.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  55. Herman S, Niemelä V, Khoonsari PE, Sundblom J, Burman J, Landtblom A-M, Spjuth O, Nyholm D, Kultima K. Alterations in the tyrosine and phenylalanine pathways revealed by biochemical profiling in cerebrospinal fluid of Huntington’s disease subjects. Sci Rep. 2019;9(1):4129.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  56. Ulaszewska MM, Trost K, Stanstrup J, Tuohy KM, Franceschi P, Chong MF-F, George T, Minihane AM, Lovegrove JA, Mattivi F. Urinary metabolomic profiling to identify biomarkers of a flavonoid-rich and flavonoid-poor fruits and vegetables diet in adults: the flavurs trial. Metabolomics. 2016;12(2):32.

    CAS  Article  Google Scholar 

  57. Mathé EA, Patterson AD, Haznadar M, Manna SK, Krausz KW, Bowman ED, Shields PG, Idle JR, Smith PB, Anami K, Kazandjian DG, Hatzakis E, Gonzalez FJ, Harris CC. Noninvasive urinary metabolomic profiling identifies diagnostic and prognostic markers in lung cancer. Cancer Res. 2014;74(12):3259–70.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  58. van der Hooft JJJ, Padmanabhan S, Burgess KEV, Barrett MP. Urinary antihypertensive drug metabolite screening using molecular networking coupled to high-resolution mass spectrometry fragmentation. Metabolomics. 2016;12:125.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  59. Thévenot EA, Roux A, Xu Y, Ezan E, Junot C. Analysis of the human adult urinary metabolome variations with age, body mass index, and gender by implementing a comprehensive workflow for univariate and OPLS statistical analyses. J Proteome Res. 2015;14(8):3322–35.

    CAS  Article  PubMed  Google Scholar 

  60. Trošt K, Ulaszewska MM, Stanstrup J, Albanese D, De Filippo C, Tuohy KM, Natella F, Scaccini C, Mattivi F. Host: microbiome co-metabolic processing of dietary polyphenols—an acute, single blinded, cross-over study with different doses of apple polyphenols in healthy subjects. Food Res Int. 2018;112:108–28.

    CAS  Article  PubMed  Google Scholar 

  61. Samino S, Vinaixa M, Díaz M, Beltran A, Rodríguez MA, Mallol R, Heras M, Cabre A, Garcia L, Canela N, de Zegher F, Correig X, Ibáñez L, Yanes O. Metabolomics reveals impaired maturation of HDL particles in adolescents with hyperinsulinaemic androgen excess. Sci Rep. 2015;5:11496.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  62. Chen G, Walmsley S, Cheung GCM, Chen L, Cheng C-Y, Beuerman RW, Wong TY, Zhou L, Choi H. Customized consensus spectral library building for untargeted quantitative metabolomics analysis with data independent acquisition mass spectrometry and metabodia workflow. Anal Chem. 2017;89(9):4897–906.

    CAS  Article  PubMed  Google Scholar 

  63. Leal-Witt MJ, Ramon-Krauel M, Samino S, Llobet M, Cuadras D, Jimenez-Chillaron JC, Yanes O, Lerin C. Untargeted metabolomics identifies a plasma sphingolipid-related signature associated with lifestyle intervention in prepubertal children with obesity. Int J Obesity. 2018;42(1):72–8.

    CAS  Article  Google Scholar 

  64. Rochat B, Mohamed R, Sottas P-E. LC-HRMS metabolomics for untargeted diagnostic screening in clinical laboratories: a feasibility study. Metabolites. 2018;8(2):39.

    CAS  Article  PubMed Central  Google Scholar 

  65. Teruya T, Chaleckis R, Takada J, Yanagida M, Kondoh H. Diverse metabolic reactions activated during 58-hr fasting are revealed by non-targeted metabolomic analysis of human blood. Sci Rep. 2019;9(1):854.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  66. Chaleckis R, Ebe M, Pluskal T, Murakami I, Kondoh H, Yanagida M. Unexpected similarities between the schizosaccharomyces and human blood metabolomes, and novel human metabolites. Mol BioSyst. 2014;10:2538–51.

    CAS  Article  PubMed  Google Scholar 

  67. Decuypere S, Maltha J, Deborggraeve S, Rattray NJW, Issa G, Bérenger K, Lompo P, Tahita MC, Ruspasinghe T, McConville M, Goodacre R, Tinto H, Jacobs J, Carapetis JR. Towards improving point-of-care diagnosis of non-malaria febrile illness: a metabolomics approach. PLOS Negl Trop Dis. 2016;10(3):0004480.

    CAS  Article  Google Scholar 

  68. Domingo-Almenara X, Brezmes J, Vinaixa M, Samino S, Ramirez N, Ramon-Krauel M, Lerin C, Díaz M, Ibáñez L, Correig X, Perera-Lluna A, Yanes O. eRah: A computational tool integrating spectral deconvolution and alignment with quantification and identification of metabolites in GC/MS-based metabolomics. Anal Chem. 2016;88(19):9821–9.

    CAS  Article  PubMed  Google Scholar 

  69. To KKW, Lee K-C, Wong SSY, Sze K-H, Ke Y-H, Lui Y-M, Tang BSF, Li IWS, Lau SKP, Hung IFN, Law C-Y, Lam C-W, Yuen K-Y. Lipid metabolites as potential diagnostic and prognostic biomarkers for acute community acquired pneumonia. Diagn Microbiol Infect Dis. 2016;85(2):249–54.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  70. Laursen MR, Hansen J, Elkjær C, Stavnager N, Nielsen CB, Pryds K, Johnsen J, Nielsen JM, Bøtker HE, Johannsen M. Untargeted metabolomics reveals a mild impact of remote ischemic conditioning on the plasma metabolome and α-hydroxybutyrate as a possible cardioprotective factor and biomarker of tissue ischemia. Metabolomics. 2017;13(6):67.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  71. Schoeman JC, Moutloatse GP, Harms AC, Vreeken RJ, Scherpbier HJ, Van Leeuwen L, Kuijpers TW, Reinecke CJ, Berger R, Hankemeier T, Bunders MJ. Fetal metabolic stress disrupts immune homeostasis and induces proinflammatory responses in human immunodeficiency virus type 1- and combination antiretroviral therapy-exposed infants. J Infect Dis. 2017;216(4):436–46.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  72. Lamichhane S, Ahonen L, Dyrlund TS, Kemppainen E, Siljander H, Hyöty H, Ilonen J, Toppari J, Veijola R, Hyötyläinen T, Knip M, Oresic M. Dynamics of plasma lipidome in progression to islet autoimmunity and type 1 diabetes–type 1 diabetes prediction and prevention study (DIPP). Sci Rep. 2018;8(1):10635.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  73. Sen P, Carlsson C, Virtanen SM, Simell S, Hyöty H, Ilonen J, Toppari J, Veijola R, Hyötyläinen T, Knip M, Orešič M. Persistent alterations in plasma lipid profiles before introduction of gluten in the diet associated with progression to celiac disease. Clin Transl Gastroenterol. 2019;10(5):00044.

    Article  Google Scholar 

Download references


The authors wish to thank the de.NBI cloud administrators for the expert technical support. This work was supported by grants of the German Federal Ministry of Education and Research (BMBF) of the project ‘Bielefeld-Gießen Center for Microbial Bioinformatics—BiGi’ (Grant number 031A533) within the German Network for Bioinformatics Infrastructure (de.NBI). We acknowledge support for the publication costs by the Open Access Publication Fund of Bielefeld University and the Deutsche Forschungsgemeinschaft (DFG).


Open Access funding enabled and organized by Projekt DEAL. KT received funding from the German-Canadian DFG international research training group ‘Computational Methods for the Analysis of the Diversity and Dynamics of Genomes’ (DiDy) GRK 1906/1. The funding body did not play any role in the design of the study, writing of the manuscript nor did they have any influence on the data collection, analysis or interpretation of the data and results. Contents are the authors’ sole responsibility and do not necessarily represent official DFG views.

Author information

Authors and Affiliations



KT designed and implemented the software. TWN, KN and SPA initiated, supervised, and directed the project. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Konstantinos Tzanakis.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Fig. S1.

K-means clustering on all 4827 experiments (not re-scaled).

Additional file 2: Fig. S2.

K-means clustering of 45 experiments originating of young individuals on 112 compounds, on metabolite level and replacing missing values with zero.

Additional file 3: Fig. S3.

K-means clustering of 45 experiments originating of elder individuals on 112 compounds, on metabolite level and replacing missing values with zero.

Additional file 4: Table S1. List of the parameters and their values that are used in the KNIME workflow for assembling the metabolite features.

Additional file 5: Table S2.

List of the parameters and their values that are used in the KNIME workflow for the map alignment (correcting retention time distortions between maps).

Additional file 6: Table S3.

List of the parameters and their values that are used in the KNIME workflow for grouping corresponding features from multiple maps.

Additional file 7: Table S4.

The study identifiers from 38 studies of the MetaboLights repository that were used for processing and analysis.

Additional file 8: Table S5.

List of clusters and number of experiments in each one.

Additional file 9: Table S6.

List of clusters and number of experiments in each one.

Additional file 10: Table S7.

List of the metabolites as they appear in Pearson Correlation heatmap starting from the top left corner.

Additional file 11: Table S8.

Clusters of the metabolites of 45 experiments originating of 15 young individuals.

Additional file 12: Table S9.

Clusters of the metabolites of 45 experiments originating of 15 elder individuals.

Additional file 13.

A walk through MetHoS.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Tzanakis, K., Nattkemper, T.W., Niehaus, K. et al. MetHoS: a platform for large-scale processing, storage and analysis of metabolomics data. BMC Bioinformatics 23, 267 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Mass spectrometry data
  • Large-scale metabolomics
  • Parallel processing
  • Distributed storage
  • Distributed analysis