multiMarker: software for modelling and prediction of continuous food intake using multiple biomarkers measurements

Background Metabolomic biomarkers offer potential for objective and reliable food intake assessment, and there is growing interest in using biomarkers in place of or with traditional self-reported approaches. Ongoing research suggests that multiple biomarkers are associated with single foods, offering great sensitivity and specificity. However, currently there is a dearth of methods to model the relationship between multiple biomarkers and single food intake measurements. Results Here, we introduce multiMarker, a web-based application based on the homonymous R package, that enables one to infer the relationship between food intake and two or more metabolomic biomarkers. Furthermore, multiMarker allows prediction of food intake from biomarker data alone. multiMarker differs from previous approaches by providing distributions of predicted intakes, directly accounting for uncertainty in food intake quantification. Usage of both the R package and the web application is demonstrated using real data concerning three biomarkers for orange intake. Further, example data is pre-loaded in the web application to enable users to examine multiMarker’s functionality. Conclusion The proposed software advance the field of Food Intake Biomarkers providing researchers with a novel tool to perform continuous food intake quantification, and to assess its associated uncertainty, from multiple biomarkers. To facilitate widespread use of the framework, multiMarker has been implemented as an R package and a Shiny web application.

is becoming increasingly evident that multiple biomarkers are needed for a single food to add specificity and sensitivity to the determination of food intake [9]. However, there is a dearth of statistical tools available for modelling multiple biomarkers with a single food. Consequently, the demonstration of the potential utility of multiple biomarkers for food intake determination is hampered. In an effort to address this gap, we have proposed a latent variable approach, multiMarker, to estimate the relationship between food intake and multiple biomarkers, and to subsequently use this relationship to predict intake, together with the associated uncertainty, when only biomarker data are available [12]. The proposed model draws from factor analytic models [13] and mixture of experts models [14], to flexibly model the relationships between biomarkers, discrete food quantities (administered in an intervention study) and latent continuous intakes. To encourage and facilitate usage of the framework by a broad variety of researchers, multiMarker is implemented both in the form of an R package and a Shiny web application. To the best of our knowledge, currently no software allows continuous food intake quantification, and its associated uncertainty, from multiple biomarkers. We believe our software implementations of multiMarker will be useful tools for the research community.

multiMarker
Here we briefly provide a description of the multiMarker model, to which we will refer when outlining the functionality of the R package and web application in the following Sections.
A factor analytic framework is developed to quantify food intake from multiple biomarkers, obtained from an intervention study, where exact portions of intake are known. For a sample of n observations, i = 1, . . . , n , the model expresses the relationship between P biomarkers ( {y ip } P p=1 ) and latent intake ( z i ) through a factor analytic model, where the latent variable is the unobserved intake. The latent variable is modelled with a mixture of Gaussian distributions, with components centered around the D food quantities administered in an intervention study ( {x d } D d=1 ). Further, in order to better refine intake quantification, observation-specific component weights are employed, by embedding the latent intake prior distribution in a mixture of experts context. Thus the multi-Marker model is: where (α p , β p ) are, respectively, biomarker-specific intercept and slope parameters, ǫ ip are homoscedastic observation-biomarker-specific error terms, with variance σ 2 p , and θ 2 d are scaled component-specific variance parameters. The weights π d (y i ) are modelled via an ordinal regression model, with Cauchit link function. The model is developed within a hierarchical Bayesian framework and inference is conducted through Markov chain Monte Carlo (MCMC) algorithms [12]. When only biomarker data are available for n * observations ( {y * jp } P p=1 , j = 1, . . . , n * ), after the model parameters have been inferred from the intervention study data, the framework can be employed to perform prediction for the latent intakes {z * j } n * j=1 .
In the R package, multiMarker and predict.multiMarker are the two main functions. The multiMarker function infers the relationship between multiple biomarkers and food intake; the main arguments of this function are: • y: a matrix storing P biomarker measurements on a set of n observations (dimension: n × P); • quantities: a vector storing the food quantities allocated to each of the n observations in the intervention study data (length: n); • niter: the number of MCMC iterations; • burnIn: the number of MCMC iterations to be discarded prior to computing posterior estimates.
Note that the multiMarker method is independent of both the unit of measure of the P biomarkers and of the types of biofluid from which these are measured. The user can freely set the units of measure, and further decide on the biofluids from which to derive the biomarker measurements. Furthermore, the biomarkers can be quantitative or expressed as relative quantities. Model hyperparameters are computed according to the observed data, as described in [12]. However, users can specify different values using additional arguments of the multiMarker function (see [18]). The output of this function is an object of class multiMarker, storing posterior estimates and MCMC chains, for model parameters and latent intakes. Function predict.multiMarker facilitates prediction of intake values from biomarker data alone; its main arguments are: • object: an object of class multiMarker; • y: a matrix storing P biomarker measurements on a set of n * observations (dimension: n * × P); • niter, burnIn.
Usage of predict.multiMarker is conditional on the prior estimation of a mul-tiMarker model using data from an intervention study. Moreover, biomarkers considered for prediction should correspond to those of the intervention study, and should be ordered in the same way.
Importantly, in both functions, distributions of parameter estimates and intake predictions are provided, as well as multiple summary statistics: posterior median, posterior standard deviation and 95% credible intervals. This directly provides informative quantification of the uncertainty associated with the different quantities of interest, often lacking from food intake predictions. Examples for the two functions are provided, as well as example code to produce synthetic data, diagnostic plots for the model parameters and plots of the inferred intake distributions.
The multiMarker Shiny web application builds on the R package to provide non-Rexpert researchers with easy access to multiMarker. The Shiny application can be accessed at https:// adiet. shiny apps. io/ multi Marker/. The first two pages of the application, "About" and "Instructions", contain a brief overview of the web application's scope and structure. The main pages of the application are "1. Model Estimation" and "2. Intake Prediction". Figure 1 reports two flowcharts, illustrating the overall structure of the Shiny web application.
In "1. Model Estimation", users can upload data from an intervention study. Two different data formats are supported: .csv and .txt. Such data should consist of a matrix with n rows and (1 + P) columns, with the following structure: • Column 1: Food quantity values consumed by the participants/observations in a study; • Columns 2 to (P + 1) : The P biomarker measurements.
Exploratory tools are provided, such as a table containing descriptive summary statistics for the biomarker data, and food quantity-biomarker boxplots (see 1.1 in Fig. 1). Further, intakes' unit of measure can be specified. The multiMarker model can be run easily using the "Run model estimation" button (see 1.2 in Fig. 1), after having specified the number of MCMC iterations and the desired percentage of iterations for burn-in.
Users are provided with tables containing summary statistics for the estimated intercept ( α p ), scale coefficient ( β p ) and errors standard deviation ( σ p ) parameters, for the P biomarkers. Further, histograms showing the estimated posterior distributions of these parameters can be produced. The estimated model can be downloaded as an R model object, in a .RData file format (see 1.4 in Fig. 1), for future usage. Last, diagnostic trace plots for the parameters can be produced (see 1.5 in Fig. 1).
In "2. Intake Prediction", users can upload biomarker data. As in "1. Model Estimation", only .csv and .txt are supported. Uploaded data should consist of a matrix with n * rows and P columns, storing the P biomarker measurements. Moreover, if "2. Intake Prediction" is run in a different session than "1. Model Estimation", users should upload Fig. 1 Flowcharts of the "1. Model Estimation" and "2. Intake Prediction" pages of the multiMarker Shiny web application

Results and discussion
To demonstrate usage of the proposed Shiny web application, we employ results from the analysis of an orange consumption intervention study. Note that analogous output can be retrieved from the R package.
The intervention study data used to estimate the multiMarker model consist of 24, 22 and 24 participants who consumed 80, 160 and 320 grams of orange, respectively. The total number of observations is n = 24 + 22 + 24 = 70 . Measurements for P = 3 biomarkers are available: citrate, hippurate, and proline betaine [19]. Full details on the study design and its participants are provided in [10]. While our model does not include study participants' demographics it is worth noting that future work could examine how to incorporate information such as age and sex, as needed. Intake prediction was performed on a test-set consisting of the same 3 biomarkers for n * = 3 observations. The dataset was loaded and visualized in the "1. Model Estimation" page, and the multiMarker model was estimated, setting the number of MCMC iterations and burn in values to 20000 and 15% , respectively. The same settings were used in the "2. Intake Prediction"  credible intervals, to account for uncertainty in the food intake quantification page. Computation time was 2.11 minutes in "1. Model Estimation", and 25 seconds in "2. Intake Prediction". All three biomarkers presented similar estimated slope values and associated 95% credible intervals. As an example of plot output of the application, Fig. 2 presents the histogram of the estimated posterior distribution for the slope parameter of the proline betaine biomarker. The estimated posterior median and 95% credible interval are also reported.
The estimated model from "1. Model Estimation" was then used to predict continuous orange intake for the test data ( n * = 3 ). Figure 3 presents an example of output from the "2. Intake Prediction" page, which mimics real world data where only biomarker data are available. The predicted intake distributions for the three observations are displayed, whose actual consumption here is known to be 80 grams (observation 1), 160 grams (observation 2) and 320 grams (observation 3) of orange. The median predicted intakes are in good agreement with the actual intake values: 79.91, 158.72 and 326.43 grams, respectively for observations 1, 2 and 3. Also, note that the estimated predictive distributions of observations 1 and 2's intake are bi-modal. Here, the model accurately predicts the median intake for this observation, but displays some uncertainty leaning towards either larger (observation 1) or smaller (observation 2) intake values. Credible intervals ( 95% ) are also reported in the output plots, to provide insight on the uncertainty associated with the predicted intakes. The ability to predict continuous intake with an associated uncertainty is an important step forward for the implementation of biomarkers in nutrition research. Further, as biomarkers have limitations, including uncertainty in the estimation of intake will be important for their future use. Last, diagnostic plots (trace plots) are available for all parameters and predicted intakes.

Conclusion
The multiMarker web application, built on the homonymous R package, allows estimation of the relationship between multiple biomarkers and food intake, and importantly enables users to predict continuous food intake from biomarker data alone. A valuable feature of the work is the estimation of the uncertainty of the food intake predictions, as distributions of both parameter estimates and intake predictions are provided, equipping researchers with more informative quantification of food intake. The web application supports the workflow from data import, analysis and visualisation, thus facilitating the use of multiMarker in the research community. Implementation is available in two different software permitting usage of multiMarker by a broad variety of researchers, from R experts to domain experts. In particular, the multiMarker web application has the potential to facilitate the use of biomarkers for food intake assessment and move the field towards more examples of the utility of such biomarkers. Ultimately, this will pave the way forward for improvement in dietary assessment approaches.