### Implementation

A generalized version of PKM and MIX (where an arbitrary number of independent metabolite kinetics can be modeled) was implemented as a Python class. As input it requires the number of metabolites used for kinetic modeling (\(\ell\)), a vector of time points as well as the measured mass data (\(\widetilde{\mathbf {M}}\), matrix with time points in the rows and metabolites in the columns). MIX additionally takes a \(\mathbf {Q}^\text {PQN}= [Q^\text {PQN}(t_1),...,Q^\text {PQN}(t_{n_{\text {time points}}})]^\text {T}\) vector (calculated with the PQN method from all metabolites, \({n_{\text {metabolites}}}\)) for all time points of a time series. Upon optimization (carried out with self.optimize_monte_carlo, which is a wrapper for SciPy’s optimize.curve_fit [32]) the kinetic constants and sweat volumes are optimized to the measured data by minimizing the functions listed in Eqs. 9b and 9c for PKM and MIX respectively:

$$\begin{aligned} \text {min}(\mathcal {L}^\text {MIX})&= \text {min}(\mathcal {L}^\text {PKM}+ \mathcal {L}^\text {PQN}) \end{aligned}$$

(9a)

where

$$\begin{aligned} \mathcal {L}^\text {PKM}&= \sum ^{n_\text {time points}}_{i=1} \sum ^{n_\text {metabolites}}_{j=1} L\left[ \lambda \left( T(\widetilde{M}_{ij}) - T(C_{ij}~V^\text {MIX}_{i})\right) ^2\right] , \end{aligned}$$

(9b)

$$\begin{aligned} \mathcal {L}^\text {PQN}&= \sum ^{n_\text {time points}}_{i=1} L\left[ (1-\lambda )\left( ZT(\mathbf {V}^\text {MIX})_i-ZT(\mathbf {Q}^\text {PQN})_i\right) ^2 \text {Var}(T(\mathbf {V}^\text {MIX})) \right] , \end{aligned}$$

(9c)

\(\text {Var}(\mathbf {V})\) is the variance of \(\mathbf {V}\) (which is the vector of estimated *V* over all time points), *T* is a transformation function, *Z* is a scaling function, and *L* is the loss function. The key difference between PKM and MIX is that the fitted \(V\) in MIX are biased towards relative abundances as calculated by PQN. An important additional hyperparameter of the MIX model is \(\lambda\), which weights the error residuals of \(\mathcal {L}^\text {PKM}\) and \(\mathcal {L}^\text {PQN}\). Its calculation is discussed in section "Hyperparameters". If \(\lambda = 1\), the MIX model simplifies again to a pure PKM model.

To summarize, an overview of the differences between PKM and MIX models is given in Additional file 1: Table S1 and a flow chart of data processing for MIX normalization is given in Fig. 1.

#### Hyperparameters

Several hyperparameters can be set for the PKM and MIX Python classes.

*Kinetic function.* Firstly, it is possible to choose the kinetic function used to calculate \(\mathbf {C}\). In this study we focused on a modified Bateman function *F*(*t*) with 5 kinetic parameters (\(k_a,~k_e,~c_0,~lag,~d\)):

$$\begin{aligned} F(t) = \left\{ \begin{array}{ll} b(t) + d &{} \text{ if } b(t) \ge 0 \\ d &{} \text{ if } b(t) < 0 \end{array} \right. \end{aligned}$$

(10a)

with

$$\begin{aligned} b(t) = c_0~\frac{k_a}{k_e-k_a}~\left( e^{-k_a(t-lag)} - e^{-k_e(t-lag)}\right) . \end{aligned}$$

(10b)

This function was designed to be flexible and able to represent several different metabolite consumption and production kinetics, as exemplified by Fig. 2. Intuitively, \(k_a\) and \(k_e\) correspond to kinetic constants of absorption and elimination of a metabolite of interest with the unit h\(^{-1}\). \(c_0\) is the total amount of a metabolite absorbed over the volume of distribution with the unit mol L\(^{-1}\). Additionally to these parameters which are also part of the classical Batman function [33], we here introduce *lag* and *d*. The *lag* term with the unit h shifts the function along the X-axis, intuitively defining the starting time point of absorption of a metabolite of interest, whereas the *d* term with the unit mol L\(^{-1}\) shifts the function along the Y-axis.

*Loss function*, *L*. *L* calculates the loss value after estimation of the error residuals of the model (Eq. 9). It can be set via self.set_loss_function to either cauchy_loss or max_cauchy_loss (or max_linear_loss). In both cases the loss is calculated as a Cauchy distribution of error residuals according to SciPy [32]. The difference, however, is that cauchy_loss only uses the absolute error residuals, whereas max_cauchy_loss uses the maximum of relative and absolute error residuals (thus the word max is expressed in its name). The reason for its addition was that a good performance has been achieved in a previous study [20]. In this study we used the max_cauchy_loss loss function for PKM models and cauchy_loss for MIX models. The choice of *L* is intertwined with the choice of *T* which becomes clear in the following paragraph.

*Transformation function*, *T*. *T* transforms the measured data \(\widetilde{\mathbf {M}}\) as well as the calculated \(\mathbf {Q}^\text {PQN}\), \(\mathbf {C}V\), and \(\mathbf {V}\) before calculation of the loss (Eq. 9). Two different transformations, none and log10, can be set during initialization with the argument trans_fun. As originally reported [20] no transformation was done for PKM (i.e. trans_fun=’none’),

$$\begin{aligned} T(\widetilde{\mathbf {M}}) = \widetilde{\mathbf {M}}. \end{aligned}$$

(11a)

For MIX models, however, a log-transform was performed (i.e. trans_fun=’log10’),

$$\begin{aligned} T(\widetilde{\mathbf {M}}) = \log _{10}(\widetilde{\mathbf {M}}+ 10^{-8}) \end{aligned}$$

(11b)

as the error on measured data is considered multiplicative [34] and the sweat volume log-normally distributed (Additional file 1: Fig. S1). To avoid problems with concentrations of the size 0 a small number (i.e., the size of optimizer precision [32]) is added.

In a sensitivity analysis study, we tested the quality of normalization of MIX with different *L* and *T* hyperparameters and concluded that a combination of cauchy_loss for *L* and log10 for *T* performed best (Additional file 1: Fig. S2C, D). This is in agreement with literature where logarithmic transformations performed well in combination with PQN for size effect normalization of sweat measurements [35].

*Scaling function*, *Z*. *Z* describes a scaling function performed on \(T(\mathbf {Q}^\text {PQN})\) and \(T(\mathbf {V})\). Scaling is performed to correct for noisy data (see Results section "In fluence of noise on PQN"). Two strategies can be set with the scale_fun argument during initialization of the MIX model class, standard or mean. In this study, all MIX models employ standard scaling, i. e.

$$\begin{aligned} ZT(\mathbf {Q}^\text {PQN})&= \frac{T(\mathbf {Q}^\text {PQN}) - \text {mean}(T(\mathbf {Q}^\text {PQN}))}{\text {Std}(T(\mathbf {Q}^\text {PQN}))}. \end{aligned}$$

(12a)

We additionally implemented mean scaling which differs depending on the choice of *T* with

$$\begin{aligned} ZT(\mathbf {Q}^\text {PQN})&= \left\{ \begin{array}{ll} T(\mathbf {Q}^\text {PQN}) - \text {mean}(T(\mathbf {Q}^\text {PQN})) &{} \text{ if } \texttt {trans\_fun='log10'} \\ T(\mathbf {Q}^\text {PQN}) / \text {mean}(T(\mathbf {Q}^\text {PQN})) &{} \text{ if } \texttt {trans\_fun='none'}. \end{array} \right. \end{aligned}$$

(12b)

*Optimization strategy.* The optimization of both PKM and MIX models is done with a Monte Carlo strategy where the initial parameters are sampled randomly from a uniform distribution between their bounds. Performing a sensitivity analysis, we previously showed that this method is preferable to a single fitting procedure [20]. In this study, the number of Monte Carlo replicates for model fitting was set to 100.

*Weighting of MIX loss terms.* A weighting constant for every measured data point can be used by the model. In a sensitivity analysis study, we found that the choice of \(\lambda\) is not critical to the quality of normalization as long as it is not extremely tilted to one side (i.e., \(\lambda\) close to 0 or 1, Additional file 1: Fig. S2A, B). Thus we propose a method where the loss terms are weighted by the number of data points fitted for each of both loss terms but not by the number of metabolites used in the calculation of each term (Additional file 1: Equation S1). For such a method the solution for \(\lambda\) is given by Eq. 13.

$$\begin{aligned} \lambda = \frac{1}{\ell + 1} \end{aligned}$$

(13)

#### Full and minimal models

In this study, we differentiate between full and minimal models. With full models, we refer to pharmacokinetic normalization models (PKM or MIX) where all metabolites of a given data set are used for the pharmacokinetic normalization. This means that, for example, if \({n_{\text {metabolites}}}= 20\) all 20 metabolites were modeled with the modified Bateman function and thus in Eqs. 7a and 8a, \(\ell = {n_{\text {metabolites}}}\) and \(\widetilde{\mathbf {M}}_{\ell +}\) is an empty vector. On the other hand, minimal models are models where only the few known, better constrained metabolites were modeled with a kinetic function. This means that the information used for PKM\(_{\text {minimal}}\) does not change upon the addition of (synthetic) metabolites. Therefore, its goodness of fit measure should stay constant within statistical variability upon change of \({n_{\text {metabolites}}}\). This behaviour was used to verify if the simulations worked as intended and if no biases in the random number generation existed. On the other hand, the MIX\(_{\text {minimal}}\) model still gained information from the increase of \({n_{\text {metabolites}}}\) as the PQN part of this model was calculated with all \({n_{\text {metabolites}}}\). Therefore, changes in the goodness of fit measures for MIX\(_{\text {minimal}}\) are expected. We emphasize that the definition of full and minimal models is specific to this particular study. Here we explicitly set \(\ell = 4\), which originates from previous work where 4 targeted metabolites (caffeine, paraxanthine, theobromine, theophylline) with known kinetics were measured [20].

### Synthetic data creation

Three different types of synthetic data sets were investigated. The first two types of data sets (sampled from kinetics, section "Sampled kinetics" and sampled from means and standard deviations, section "Sampled mean and standard deviation") test the behaviour of normalization models in extreme cases (either all metabolites describable by pharmacokinetics or all metabolites completely random). Finally, the third type of data set (sampled from real data, section "Sampled from real data") aims to replicate measured finger sweat data as close as possible. In sum, the performance of normalization methods on all three types of data sets can show how they behave in different situations with different amounts of describable data.

In all three cases, data creation started with a simple toy model closely resembling the concentration time series of caffeine and its degradation products (paraxanthine, theobromine, and theophylline) in the finger sweat as described elsewhere [20]. The respective parameters are listed in Additional file 1: Table S2. With them, the concentration of metabolites #1 to #4 were calculated for 20 time points (between 0 and 15 h in equidistant intervals, Fig. 3). Subsequently, new synthetic metabolite concentration time series were sampled and appended to the toy model (i.e., to the concentration vector, \(\mathbf {C}(t)\)). Three different synthetic data sampling strategies were tested, and their specific details are explained in the following sections. Next, sweat volumes (\(V\)) were sampled from a log-normal distribution truncated at (\(0.05 \le V\le {4}\,\upmu \hbox {L}\)) closely resembling the distribution of sweat volumes estimated in our previous publication [20], Additional file 1: Fig. S1. Finally, an experimental error (\(\varvec{\upepsilon }\)) was sampled for every metabolite and time point from a normal distribution with a coefficient of variation of \(20\%\) and the synthetic data was calculated as

$$\begin{aligned} \widetilde{\mathbf {M}}(t) = \text {diag}\big (\mathbf {C}(t)~V(t)\big )~\varvec{\upepsilon }(t)\end{aligned}.$$

(14)

For every tested condition, 100 synthetic data replicates were generated, and the normalization models were fitted.

#### Sampled kinetics

In simulation v1, data was generated by sampling kinetic parameters for new metabolites from an uniform distribution. The distribution was constrained by the same bounds also used for the PKM and MIX model fitting: \((0,0,0,0)^\text {T} \le (k_a,k_e,c_0,lag,d)^\text {T} \le (3,3,5,15,3)^\text {T}\). Subsequently the concentration time series of the synthetic metabolites were calculated according to the modified Bateman function (Eq. 10).

#### Sampled mean and standard deviation

Means and standard deviations of the concentration time series of metabolites were calculated from untargeted real finger sweat data (for details, see section "Real finger sweat metabolome data"). The probability density function of both can be described by a log-normal distribution (Additional file 1: Fig. S3). For the data generation of simulation v2, per added metabolite, one mean and one standard deviation were sampled from the fitted distribution and used as an input for another log-normal distribution from which a random concentration time series was subsequently sampled. This results in synthetic concentration values that behave randomly and, therefore, cannot be easily described by our pharmacokinetic models.

#### Sampled from real data

To get an even better approximation to real data, in simulation v3, concentration time series were directly sampled from untargeted real finger sweat data (for details, see section "Real finger sweat metabolome data"). To do so, the untargeted metabolite \(\widetilde{\mathbf {M}}\) time series data set was normalized with PQN. As the number of metabolites in this data set was comparably large (\({n_{\text {metabolites}}}= 3446\)) we could assume that the relative error (or rRMSE, for more explanation, see section "Synthetic data simulations") was negligibly small. The resulting values are, strictly speaking, fractions of concentrations. However, this does not affect the results as these values are anyways considered untargeted (i.e., no calibration curve exists) and thus relative. Therefore, the PQ normalized data set could be used as ground truth for concentration time series sampling. Subsequently, a subset of the original ground truth data was sampled for synthetic data generation.

#### Sampling of noisy data

We investigated the influence of background (i.e. noisy) signal on the performance on \(\mathbf {Q}^\text {PQN}\) (and scaled and transformed variants thereof). To simulate such an environment we used data sampled from real data (section "Sampled from real data"), and applied \(V\) only to a fraction of the \(\mathbf {C}\) vector,

$$\begin{aligned} \begin{pmatrix} \widetilde{\mathbf {M}}\phantom{n}(t) \\ \widetilde{\mathbf {M}}_{n}(t) \end{pmatrix}&= \text {diag} \begin{pmatrix} \mathbf {C}\phantom{n}(t)V(t) \\ \mathbf {C}_{n}(t)\phantom{V(t)} \end{pmatrix} ~\varvec{\upepsilon }. \end{aligned}$$

(15)

The noise fraction is given by the number of elements of \(\widetilde{\mathbf {M}}\) and \(\widetilde{\mathbf {M}}_n\) vectors,

$$\begin{aligned} f_n = \frac{\text {length} (\widetilde{\mathbf {M}}_n)}{\text {length} (\widetilde{\mathbf {M}}) + \text {length}( \widetilde{\mathbf {M}}_n)}, \end{aligned}$$

(16)

where subscript *n* in \(\widetilde{\mathbf {M}}_n, \mathbf {C}_n,\) and \(f_n\) denotes them as part of the noise.

Simulations were carried out for 20 equidistant noise fractions between \(0 \le f_n \le 0.95\) with \({n_{\text {metabolites}}}= 100\) and \(n_{\text {time points}}= 20\) for 100 replicates. The error residuals of mean and standard scaled \(\mathbf {Q}^\text {PQN}\) are calculated as

$$\begin{aligned} \text {Mean Scaled Error}&= \sum _i^{n_{\text {time points}}} \left[ ZT(\mathbf {Q}^\text {PQN})_i - ZT(\mathbf {V})_i\right] \end{aligned}$$

(17a)

with *Z* defined as in Eq. 12b and

$$\begin{aligned} \text {Standard Scaled Error}&= \sum _i^{n_{\text {time points}}} \left[ ZT(\mathbf {Q}^\text {PQN})_i - ZT(\mathbf {V})_i \right] \text {Std}(T(\mathbf {V})) \end{aligned}$$

(17b)

with *Z* defined as in Eq. 12a. For both cases *T* is defined as the logarithm (Eq. 11b). We point out that the multiplication with \(\text {Std}(T(\mathbf {V}))\) for the standard scaled error is important to make the results comparable, as otherwise the error would be biased towards the method with smaller scaled standard deviation regardless of the performance of the scaling.

### Normalization model optimization

Normalizing for the sweat volume by fitting kinetics through the measured values only has a clear advantage over PQN if it is possible to infer absolute sweat volumes and concentration data. In order to be able to do that, some information about the kinetics and the starting concentrations of metabolites of interest need to be known. For example, when modeling the caffeine network in our previous publication [20], we knew that the *lag* parameter of all metabolites was 0 and that the total amount of caffeine ingested (which corresponds to \(c_0\)) was 200 mg. Moreover, we knew that caffeine and its metabolites are not synthesized by humans and implemented the same strategy into our toy model (corresponding to *d*). As the toy model was designed to resemble such a metabolism, we translated this information to the current study. Therefore, we assumed that the first 4 metabolites in our toy model had known \(c_0\), *lag*, and *d* parameters. For their corresponding \(k_a\) and \(k_e\) and the parameters of all other metabolites the bounds were set to the same \((0,0,0,0)^\text {T} \le (k_a,k_e,c_0,lag,d)^\text {T} \le (3,3,5,15,3)^\text {T}\) used in kinetic data generation. Fig. 2 shows examples of concentration time series that can be described with the modified Bateman function and parameters within the fitting bounds.

### Real finger sweat metabolome data

The real world finger sweat data was extracted from 37 time series measurements of Study C from ref. [20]. It was downloaded from MetaboLights (MTBLS2772 and MTBLS2776).

*Preprocessing.* The metabolome data set was split into two parts: targeted and untargeted. The targeted data (i.e., the mass time series data for caffeine, paraxanthine, theobromine, and theophylline) was directly adopted from the mathematical model developed by [36]. This data is available on GitHub (https://github.com/Gotsmy/finger_sweat).

For the untargeted metabolomics part, the raw data was converted to the mzML format with the msConvert tool of ProteoWizard (version 3.0.19228-a2fc6eda4) [37]. Subsequently, the untargeted detection of metabolites and compounds in the samples was carried out with MS-DIAL (version 4.70) [38]. A manual retention time correction was first applied with several compounds present in the majority (more than 90%) of the samples. These compounds were single chromatographic peaks with no isomeric compounds present at earlier or later retention times (*m*/*z* 697.755 at 5.57 min, *m*/*z* 564.359 at 5.10 min, *m*/*z* 520.330 at 4.85 min, *m*/*z* 476.307 at 4.58 min, *m*/*z* 415.253 at 4.28 min, *m*/*z* 371.227 at 3.95 min, *m*/*z* 327.201 at 3.56 min, *m*/*z* 283.175 at 3.13 min, *m*/*z* 239.149 at 3.63 min, *m*/*z* 166.080 at 1.69 min, *m*/*z* 159.113 at 1.19). After this, untargeted peak detection and automated alignment (after the manual alignment) were carried out with the following settings: Mass accuracy MS1 tolerance: 0.005 Da, Mass accuracy MS2 tolerance: 0.025 Da, Retention time begin: 0.5 min, Retention time end: 6 min, Execute retention time correction: yes, Minimum peak height: 1E5, Mass slice width: 0.01 Da, Smoothing method: Linear weighted moving average, Smoothing level: 3 scans, Minimum peak width: 5 scans, Alignment reference file: C_D1_I_o_pos_ms1_1.mzML, Retention time tolerance: 0.3 min, MS1 tolerance: 0.015 Da, Blank removal factor: 5 fold change). No blank-subtraction was carried out as the internal standard caffeine was spiked into each sample, including the blanks. Peak abundances and meta-information were exported with the Alignment results export functionality.

Subsequently, we excluded isomers within a *m/z* difference of less than \({0.001}\,\hbox {Da}\) and a retention time difference of less than \({0.5}\,\hbox {min}\). To further reduce features that are potentially background, features with retention times after \({5.5}\,\hbox {min}\) as well as features with minimal sample abundances of \(< 5\times \text {maximum blank abundance}\) (except for the internal standard, caffeine-D9) were excluded from the data set. This was done on a time series-wise basis. Thus the number of untargeted metabolites considered for normalization differs with a mean of \(343\pm 152\) for the 37 time series of interest.

*Size effect normalization.* In this finger sweat data set, time series of targeted as well as untargeted metabolomics, are listed. The kinetics of the four targeted metabolites (caffeine, paraxanthine, theobromine, and theophylline) are known. A reaction network of the metabolites is shown in the top panel of Fig. 4. Briefly, caffeine is first absorbed and then converted into three degradation metabolites. Additionally, all four metabolites are eliminated from the body. All kinetics can be described with first order mass action kinetics [39, 40].

In order to assess the performance of the sweat volume normalization methods, the full network was split up into three subnetworks that all contained caffeine and one degradation metabolite each (Fig. 4 bottom panel). The solution of the first order differential equations describing such network is given in Additional file 1: Eqs. S2a and S2b. Moreover, the \(343\pm 152\) untargeted metabolite time series were randomly split up into three (almost) equally sized batches, and each batch was assigned to one subnetwork. All three networks were subsequently separately normalized with PKM\(_{\text {minimal}}\) and MIX\(_{\text {minimal}}\) methods with kinetic parameters that were adjusted to the specific reaction network (Fig. 4 bottom panel). Subsequently, the kinetic constants (\(k_1'\), \(k_2'\), \(k_3'\), \(k_4'\)) were estimated for 37 measured concentration time series. Fitting bounds were not changed in comparison to the original publication [20].

As all three subnetwork data sets originate from the same finger sweat measurements, the underlying kinetic constants should be exactly identical. As the kinetic constants of absorption (\(k_a^\text {caf}= k_1'\)) and elimination (\(k_e^\text {caf}= k_2' + k_3'\)) of caffeine are estimated in all three subnetworks, we used their standard deviation to test the robustness of the tested normalization methods.

### Real blood plasma metabolome data

In the study of Panitchpakdi et al. [41] the mass time series of the metabolome was measured in different body fluids after the uptake of diphenhydramine (DPH). Here, we focus on data measured in the blood plasma, which includes the abundances of DPH (known kinetics, calibration curve, pharmacological constants) as well as three of its metabolization products (known kinetics) and the abundances of 13526 untargeted metabolites with unknown kinetics.

*Preprocessing.* The data of peak areas was downloaded from the GNPS platform [42]. To reduce the number of metabolites that are potentially background and/or noise in the data set, features with minimal sample abundances of \(< 5\times \text {maximum blank abundance}\) were excluded from the data set on a time series-wise basis. Thus, the number of untargeted metabolites considered for normalization differs with a mean of \(1017\pm 114\) for the 10 time series of interest.

*Size effect normalization.* We assume that the kinetics of four metabolites (DPH, N-desmethyl-DPH, DPH N-glucuronide, and DPH N-glucose) can be described by the modified Bateman (Eq. 10). A reaction network of the metabolites is shown in Additional file 1: Fig. S4. Briefly, DPH is first absorbed and then – with unknown intermediates – converted into three degradation metabolites, which are in turn metabolized further downstream or eliminated. \(c_0\) of DPH was calculated with pharmacological constants for bioavailability, volume of distribution, and dosage of DPH as reported in the original publication [41].

Analogously to the normalization performed on finger sweat data, the full network of four metabolites is split up into three subnetworks with only one, shared, targeted metabolite (DPH itself), one additional untargeted metabolite with known kinetic (either N-desmethyl-DPH, DPH N-glucuronide, or DPH N-glucose, Additional file 1: Fig. S5) and one third of \(1017\pm 114\) untargeted metabolites with unknown kinetics. To ensure better convergence during fitting of the models, the \(\widetilde{\mathbf {M}}\) data was first scaled to values between 0 and 1 by dividing by its metabolite-wise maximum. This factor can be multiplied again as part of \(c_0\) after the normalization is done. Thereafter, PKM\(_{\text {minimal}}\) and MIX\(_{\text {minimal}}\) models were fitted onto the scaled \(\widetilde{\mathbf {M}}\) data (with \(\ell = 2\)) for all ten measured time series. The bounds of parameters were chosen so that previously reported estimates [41] are well within range: \(0 \le k \le {5}\,\hbox {h}^{-1}\) for \(\{k_1', k_3'\}\), \(0 \le k \le {1}\,\hbox {h}^{-1}\) for \(\{k_2', k_4'\}\), \(c_0^\text {DPH}\) as reported in the original publication normalized by the maximum factor, \(0\le c_0^m\le 300\) for \(m \in \{\)N-desmethyl-DPH, DPH N-glucuronide, DPH N-glucose\(\}\) and \(lag = d = 0\) as well as \(0.01 \le V\le {0.03}\,\hbox {mL}\).

As all three subnetwork data sets originate from the same plasma time series measurements, the underlying kinetic constants of DPH should be exactly identical. As the kinetic constants of absorption (\(k_a^\text {DPH}= k_1'\)) and elimination (\(k_e^\text {DPH}= k_2'\)) of DPH are estimated in all three subnetworks we used their standard deviation to test the robustness of PKM\(_{\text {minimal}}\) and MIX\(_{\text {minimal}}\).

### Data analysis

*Goodness of normalization.* Two goodness of fit measures were calculated to analyze the performance of the tested methods. RMSE is the standard deviation of the residuals of a sampled sweat volume time series vector (\(\mathbf {V}^\text {true}\)) minus the fitted sweat volume vector (\(\mathbf {V}^\text {fit}\)), while rRMSE is the standard deviation of the ratio of sampled and fitted \(\mathbf {V}\) vectors normalized by its mean. Intuitively, RMSE is a measure of how much absolute difference there is between the fit and a true value, rRMSE, on the other hand, gives an estimate of how good the fitted sweat volumes are relative to each other. A visual depiction of RMSE and rRMSE is shown in Additional file 1: Fig. S6 and their exact definition is given in the equations in 3.3.

*Statistical analysis.* The significant differences in the mean of goodness of fit measures were investigated by calculating *p* values with the non-parametric pairwise Wilcoxon signed-rank test [43] (SciPy’s stats.wilcoxon function [32]). Significance levels are indicated by *, **, and *** for \(p \le 0.05\), 0.01, and 0.001 respectively.