Potential predictors of type-2 diabetes risk: machine learning, synthetic data and wearable health devices
BMC Bioinformatics volume 21, Article number: 508 (2020)
Abstract
Background
The aim of a recent research project was the investigation of the mechanisms involved in the onset of type 2 diabetes in the absence of familiarity. This has led to the development of a computational model that recapitulates the aetiology of the disease and simulates the immunological and metabolic alterations linked to type 2 diabetes subject to clinical, physiological, and behavioural features of prototypical human individuals.
Results
We analysed the time course of 46,170 virtual subjects, experiencing different lifestyle conditions. We then set up a statistical model able to recapitulate the simulated outcomes.
Conclusions
The resulting machine learning model adequately predicts the synthetic dataset and can, therefore, be used as a computationally cheaper version of the detailed mathematical model, ready to be implemented on mobile devices to allow self-assessment by informed and aware individuals. The computational model used to generate the dataset of this work is available as a web service at the following address: http://kraken.iac.rm.cnr.it/T2DM.
Background
Type 2 diabetes (i.e. non-insulin-dependent, T2D) is a chronic, multifactorial, metabolic disorder typical of late adulthood, characterised by a reduced effectiveness of the hormone insulin at lowering blood sugar. The World Health Organization reports that type 2 diabetes accounts for 85–90% of all cases of diabetes worldwide [1].
There are many different mechanisms that contribute to the onset of T2D [2]; research is therefore focusing on the simultaneous observation of several factors such as metabolic, immunological, genetic, and nutritional drivers. A recent study pointed out a specific state of inflammation, unique in its characteristics and distinct from the classic inflammatory state, which manifests itself in the presence of a high-calorie diet and “susceptible” lifestyles [3]. The term metaflammation describes this kind of inflammation, which is caused by a high-calorie, sugar-rich diet and mainly originates in the visceral adipose tissue [4]. This inflammation-eliciting insult triggers a cellular response consisting of the release of several intracellular signals and of low levels of cytokines such as Tumour Necrosis Factor-α (TNF-α) and Interleukin-6 (IL-6) [5]. Moreover, experiments have shown a correlation of these triggers with the inhibition of the insulin signal by phosphorylation of a serine in the Insulin Receptor Substrate-1 (IRS-1) [6]. The result is a malfunctioning receptor unable to bind insulin, which renders the cells insulin-resistant. In summary, a prolonged pro-inflammatory response alters the metabolic functions of the adipocytes [7] and, in the long term, causes hyperglycemia and eventually full-blown type 2 diabetes [8].
The scenario just depicted calls for a predictive approach aimed at identifying the metabolic and inflammatory “driving factors”, possibly amenable to implementation on self-monitoring devices. This has been the main aim of the EU-funded project “Multiscale Immune System Simulator for the Onset of Type 2 Diabetes” (MISSION-T2D) [9], which has led to the development of a validated multi-level patient-specific model able to integrate metabolic, nutritional and lifestyle data for the prediction of the metabolic and inflammatory processes underlying the development of type 2 diabetes in the absence of familiarity.
Approach
The computational model mentioned above (herein referred to as MT2D) takes a set of user input data and provides an estimate of the risk of developing a T2D clinical picture.
Setting a definition for the risk of T2D has not been a trivial task. After a few attempts, we decided to combine the level of insulin resistance, the level of inflammatory cytokines, and the pro-inflammatory cell counts. These observables are, among others, used in the mathematical description of the complex interdependencies among metabolites and pancreatic control as well as among adipose tissue components and inflammation.
Once the user has set anthropometric parameters such as age, sex, body weight and height, and has provided nutritional habits, fitness status and physical activity patterns, MT2D calculates the risk of progressing toward a T2D-related state within a predefined time horizon.
Due to its high level of sophistication, MT2D is computationally expensive (a 6-month simulated period takes many hours to run on a current high-performance computing server) and is therefore not a viable solution for self-monitoring and assessment on mobile devices. Because of this limitation, we constructed an approximation, namely a surrogate model able to forecast the output of MT2D with a reduced computational effort. The need to reduce the computational burden of a simulation tool arises in many research fields. For instance, [10] proposed a statistical model for computer output, with an interest in assessing the computer code and identifying the most significant predictors in order to design experiments efficiently. The authors in [11] investigated the same issues with a Bayesian approach based on Gaussian processes. The study in [12] proposed a spatiotemporal neural network as a surrogate model for a particular type of chemical process, namely the polymerization reactor. Gaussian processes have been applied in [13] to approximate spatiotemporal processes, while [14] used a Gaussian process with a modified prior to approximate dynamic processes in hydrology. For an up-to-date review of approximated models and techniques for complex processes, the interested reader can refer to [15].
The aim of this work is twofold: (1) to provide an approximation of the final state of MT2D, via a surrogate model, for initial conditions outside the experimental design, and (2) to analyze simulated data to assess the parameter values of the simulator used to carry out the simulations. To this end, we apply Random Forest, a powerful Machine Learning (ML) algorithm with excellent fitting performance when dealing with complex data-generating processes.
ML is becoming a popular and efficient approach to evaluate multidimensional longitudinal health data in different fields of medical research. Examples of this kind of study include the diagnosis of asymptomatic liver disease [16], the prediction of opioid dependence [17], the evaluation of sociodemographic determinants of health status in aging [18], the prediction of the mobility of medical rescue vehicles [19], the forecasting of adverse perioperative outcomes [20], the measurement of caloric intake at the population level [21], the personalisation of oncological treatment in radiogenomics [22], the determination of features of systolic blood pressure variability [23], the identification of clinical variables in bipolar disorder [24] and, interestingly, the uncovering of potential predictors of diabetes (type 1 and 2) using large sets of data [25,26,27,28,29,30,31,32]. ML can also support global efforts against epidemic outbreaks of infectious diseases, through up-to-date text and data-mining techniques that assist COVID-19-related research, especially by developing drugs faster (screening and detecting antibody-virus interactions and viral antigens), understanding viruses better, mapping where viruses come from, and hopefully predicting the next pandemic [33, 34]. ML may offer accurate results with fewer requirements than traditional mathematical modeling, and it is often used to extract harder-to-detect knowledge from unstructured data. ML models are particularly useful in settings where the input is an enormous amount of diagnostic data and the output consists of predictive therapeutic options. In contrast to the classical application of ML methods, in the present work, which deals with the prediction of the risk of T2D, we use Random Forest to “approximate” MT2D. For this purpose, the training set consisted of a large number of virtual (i.e. simulated) subjects experiencing different lifestyle conditions.
The ML-derived surrogate model can recapitulate the simulated outcomes and thus compute the risk index with a significantly smaller computational effort, which allows it, as anticipated, to be computed in real time on mobile devices.
Advances in wearable devices, computational power, and safe communications are permitting the evolution of precision medicine, which could facilitate the development of personalized treatment of the diabetes risk of each patient on an individual basis [35]. The accomplishments presented here are thus better valued in light of the great development of self-monitoring systems nowadays embedded in portable communication devices, which open the way to the application of predictive tools in health care [35]. Such predictive tools, integrated with wearable devices, could feed their model-predictive alarm set and control systems with monitored signal data to adapt to the in vivo changes of the metabolic state of the user. The computational cheapness of the proposed surrogate model would then allow the use of data coming from wearable devices as soon as they are measured [36], therefore providing a real-time calculation of health indicators whose evaluation would otherwise be unfeasible.
Methods
In this section we first describe the computational model MT2D and then we detail the experimental design used to generate the data. This description is necessary to understand the data analysis that is carried out in the next section.
The computational model
The whole-body multiscale computational model for fuel homeostasis MT2D [37] describes the metabolic, hormonal and inflammatory changes due to exercise sessions and food ingestion [38]. It consists of the combination of many ordinary differential equations and an agent-based model unified into a multiscale simulation tool.
The metabolic physiology-based submodel of MT2D consists of an extended formulation of [39] to describe fuel homeostasis in response to a session of physical exercise. It incorporates the hormonal model inspired by [40], in which both glucagon and insulin are represented and glucose regulation is achieved by altering the balance between the two. With respect to the original model in [39], and with the aim of achieving greater generalization and user customization, MT2D provides an enhanced description of physical exercise similar to that in [41] and [42]. In particular, we used a “relative” (rather than fixed) exercise intensity as well as the estimation of functional capacity in relation to age, sex, anthropometric characteristics, and current fitness status [37]. Moreover, MT2D includes oxygen consumption and the dynamics of epinephrine as directly dependent on the relative exercise intensity, to modulate the responses of hormones and metabolites to different exercise modalities (e.g. cycling, walking, running, stepping). As for the description of the physiological changes due to food ingestion, stomach emptying, and absorption of macronutrient monomers in the gut [38], we follow the work in [43] and [44]. The description of the dynamics of alanine and triglycerides from the ingestion of proteins and fats, respectively, required setting appropriate parameters, since the model in [43] is limited to the description of glucose dynamics. Insulin-resistant or insulin-deficient states lead to a reduced response to insulin of tissues such as the skeletal muscle, liver, and adipose tissue; MT2D therefore also implements the effects of insulin resistance on glucose uptake by peripheral organs [45]. Besides that, in modeling fasting plasma glucose concentration we took into consideration factors depending on dietary habits, physical activity, and inflammation. These factors contribute in different ways to increasing or decreasing the blood sugar level. Glycemia (i.e. the presence of glucose in the blood) rises with unhealthy eating habits, which lead to inflammation, and decreases when the subject exercises.
Altogether, MT2D includes several models: (1) a model of energy balance and weight gain/loss, added in [45] and based on the equations provided by [46] and [47]; (2) a description of the emergence of inflammation as the result of an increase in adipose mass which, in turn, is a direct consequence of a prolonged excess of high-calorie intake [48]; (3) to better describe lifestyle, a previously published model of physical exercise [37]; (4) to counteract the inflammatory scenario, the anti-inflammatory mechanisms promoted during exercise by skeletal muscle, based on a previously published study [49]; (5) finally, to describe the inflammatory status of the subject, MT2D merges the metabolic model with a general-purpose simulator of the immune system [50], a modeling framework used to study different human pathologies [51,52,53], specific aspects of the immune response [54, 55] and also aspects of non-human immunity [56].
The generation of synthetic data
Simulated trajectories of the dynamic metabolic model MT2D starting from different initial conditions (i.e. anthropometric features, physical activity patterns and dietary habits) corresponding to different virtual subjects have been generated by varying the parameters in Table 1. The total number m = 46,170 is thus the product of the following terms (\(\cdot \) indicates the cardinality of the set):
Low/medium/high quantities of carbohydrates, proteins, and fats are computed taking into account the balance of calories between the meal and the total daily energy expenditure (TDEE) [45]. In detail, TDEE is the sum of Resting Energy Expenditure (REE), Activity Energy Expenditure (AEE) [57] and Thermic Effect of Food (TEF) [58]. We implemented the equations by Mifflin and coworkers in [46] to estimate the REE from weight, height, age, and sex. We determine the AEE based on the intensity, duration, volume of oxygen consumed, and the number of sessions of the exercise as in [45]. Finally, the TEF is the energy expenditure that occurs after eating, due to the cost of digesting and processing food, and represents about 10% of the calories due to meal ingestion [58]. The resulting TDEE represents the number of calories that have to be ingested to balance energy intake and expenditure. In our calculation, these calories are somewhat arbitrarily, yet realistically, split between breakfast (25% TDEE), lunch (45% TDEE) and dinner (30% TDEE). Furthermore, for each meal, we divided the caloric content into calories from carbohydrates, proteins, and fats equal to 50%, 20%, and 30%, respectively. Finally, to convert calories to grams we used the Atwater general factor system [59]. These “standard” or average values of grams of carbohydrates, proteins, and fats are used as reference values (median or ’med’ value). The ’low’ and ’high’ quantities of food intake are obtained by multiplying the reference values by the constants 0.8 and 1.5, respectively. The complete patient specification of the initial condition of the simulation is thus given as a string vector.
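The meal composition just described can be illustrated with a minimal Python sketch. It is not the MT2D code: all function and variable names are hypothetical, the Mifflin-St Jeor equation is used for the REE, the AEE is simply passed in as a number, and the TEF is assumed to be 10% of the intake with intake equal to TDEE at energy balance. The Atwater general factors are taken as 4, 4 and 9 kcal/g for carbohydrates, proteins and fats.

```python
# Illustrative sketch only: names and the energy-balance assumption are not from MT2D.

ATWATER_KCAL_PER_G = {"carbohydrates": 4.0, "proteins": 4.0, "fats": 9.0}
MEAL_SHARE = {"breakfast": 0.25, "lunch": 0.45, "dinner": 0.30}
MACRO_SHARE = {"carbohydrates": 0.50, "proteins": 0.20, "fats": 0.30}
LEVEL_FACTOR = {"low": 0.8, "med": 1.0, "high": 1.5}


def mifflin_ree(weight_kg, height_cm, age_years, sex):
    """Resting energy expenditure (kcal/day) from the Mifflin-St Jeor equation."""
    offset = 5.0 if sex == "male" else -161.0
    return 10.0 * weight_kg + 6.25 * height_cm - 5.0 * age_years + offset


def balanced_tdee(ree, aee, tef_fraction=0.10):
    """Solve TDEE = REE + AEE + tef_fraction * TDEE for TDEE.

    Assumption: at energy balance the calorie intake equals the TDEE,
    so the thermic effect of food is tef_fraction * TDEE.
    """
    return (ree + aee) / (1.0 - tef_fraction)


def meal_grams(tdee, levels):
    """Grams of each macronutrient per meal; `levels` maps macro -> low/med/high."""
    plan = {}
    for meal, meal_share in MEAL_SHARE.items():
        kcal = tdee * meal_share
        plan[meal] = {
            macro: LEVEL_FACTOR[levels[macro]] * kcal * share / ATWATER_KCAL_PER_G[macro]
            for macro, share in MACRO_SHARE.items()
        }
    return plan


# Example: reference ('med') diet for a hypothetical TDEE of 2000 kcal/day.
reference_plan = meal_grams(
    2000.0, {"carbohydrates": "med", "proteins": "med", "fats": "med"}
)
```

With a TDEE of 2000 kcal/day, the reference breakfast in this sketch contains 62.5 g of carbohydrates and 25 g of proteins; a ’low’ carbohydrate level scales the former to 50 g.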
For instance, the initial condition specified by the string female 28 obese tall 2 60/40 low/high/low corresponds to a 28-year-old female subject, tall and obese, who exercises twice a week (sixty minutes each time with an intensity of 40%\(\hbox {VO}_{2\max}\)) and who follows a diet consisting of a low amount of carbohydrates and fats but rich in proteins. So in general we indicate the vector corresponding to the initial condition as follows:
The simulation outputs were analyzed based on the following variables, which are deemed the most significant to calculate the risk of developing T2D: Glucose BaseLine (GBL, namely the fasting glucose concentration), Body Mass Index (BMI), and Tumor Necrosis Factor-\(\alpha \) (TNF) as measured in the adipose tissue compartment. The execution of MT2D starting from the initial condition \({\varvec{x}}\) generated a complete trajectory of these variables with a time resolution of ten seconds. However, since we are interested in analysing the condition of the virtual subject only at the end of a specified period of 6 months, these measures are taken after 6 months of routine and uninterrupted physical activity and diet patterns as specified (among other things) in \({\varvec{x}}\). Formally,
\[{\varvec{y}} = \left( GBL(t),\, BMI(t),\, TNF(t)\right) ,\]
where t is 6 months. The set \(\{ ( {\varvec{x}}^{(k)}, {\varvec{y}}^{(k)} ) : k=1,\dots ,m \}\) is used as a training set for the development of a statistical model able to recapitulate, given \({\varvec{x}}\), the dynamics of the computational model and to predict the risk of developing T2D over a time horizon of 6 months. In other words, our goal was to find a statistical/ML model (which should not be confused with the computational model MT2D) able to predict the dependent variables \({\varvec{y}}\) given a set of regressors/predictors \({\varvec{x}}\), that is, the initial conditions defining the virtual subject and her/his lifestyle. The new statistical model is, therefore, a surrogate model of MT2D whose role is only to forecast the T2D risk after 6 months for given initial conditions that were not considered in the construction of the synthetic data. The main reason for finding such an ML model is that the complexity of MT2D requires a significant computational effort to run, whereas a statistical model, once trained, computes \({\varvec{y}}^{(i)}\) given \({\varvec{x}}^{(i)}\) in real time, allowing a fast generalisation to cases other than those in the training set \( \left\{ {\varvec{x}}^{(k)}, {\varvec{y}}^{(k)} \right\} _{k=1,\dots ,m}\).
Results
In order to be a viable solution under the given time and computational restrictions, the model should have the following characteristics: good fitting performance in predicting the expected behaviour at time t of the output variables given the input variables at time \(t_{0}<t\), where \(t-t_{0}=6\) months; usability of the results in analysing the impact of each regressor on the output; and computational inexpensiveness, in order to be implemented on wearable devices. To this end, we adopted a data-driven approach over the simulated patterns. In particular, using the notation introduced in the previous section, the ML model has been constructed and validated by using the initial conditions \({\varvec{x}}\) of the regressors as input variables and the dependent variables \({\varvec{y}}\) as output variables.
In this section, we first carry out a preliminary analysis to understand the quantitative characteristics of the data and the reasons for choosing Random Forest as the ML algorithm among many others.
Preliminary analysis
Figure 1 shows the correlations among variables. In particular, the dots in the boxes represent the sample Pearson Correlation Coefficients \(\rho _{ij}\) between \(x_{i}\) and \(x_{j}\), namely,
\[\rho _{ij} = \frac{\sum _{k=1}^{m} \left( x_{i}^{(k)}-\mu _{i}\right) \left( x_{j}^{(k)}-\mu _{j}\right) }{(m-1)\, s_{i}\, s_{j}},\]
where \(s_{i},s_{j}\) are the standard deviations and \(\mu _{i}, \mu _{j}\) are the means of variables \(x_{i}\) and \(x_{j}\) respectively. Their significance is indicated by both the size of the dot (larger means higher significance) and the color (the actual value).
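As an illustration of the coefficient just defined, the following Python sketch computes the sample Pearson correlation between two variables. The function name is hypothetical and the normalisation by n − 1 (sample standard deviations) is an assumption; the choice does not affect the value of the coefficient as long as it is used consistently for covariance and standard deviations.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (normalised by n - 1).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Sample covariance with the same normalisation.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (s_x * s_y)


# Two variables that are both proportional to a common driver are perfectly correlated.
calories = [1800.0, 2000.0, 2200.0, 2600.0]
carbs = [0.50 * c / 4.0 for c in calories]
fats = [0.30 * c / 9.0 for c in calories]
rho = pearson(carbs, fats)
```

The toy example mirrors a situation in which two diet variables are both derived linearly from the calorie intake, so that their correlation is a by-product of the construction rather than an independent finding.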
In Fig. 2 we report the scatter plots of the regressors and the dependent variables together with the fit (in orange; note: a poor fit, that is, a lack of dependence between the two variables appears as a horizontal or vertical orange line).
Figures 1 and 2 allow us to identify key features of the dataset. We noticed that there are nonlinear dependencies between the output variables and the regressors (e.g. scatter plot of \(BMI_{0}=W_{0}/H^2\) in the third row). This observation was expected and, given the high level of complexity of the process generating the data, it suggests that a nonlinear ML model should be considered rather than a linear one. Moreover, the variables related to the diet (\(C_{\mathrm{ME}}, P_{\mathrm{ME}}, F_{\mathrm{ME}}\), cf. correlation plot in the middle boxes next to the diagonal) do appear strongly correlated. However, these correlations are “spurious” because the corresponding variables depend linearly on another variable indicating the amount of calorie intake (already discussed in the previous section and in [45]). Lastly, the correlation plot shows that the output variables BMI, GBL and TNF are correlated, see for instance the dot in position \(TNF\)-\(BMI\).
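All of the above observations strongly suggest that a multivariate model is the appropriate choice in the attempt to construct an ML model recapitulating the data. Specifically, we are looking for a statistical model defined as
\[{\varvec{y}} = \psi \left( {\varvec{x}}\right) + \varvec{\varepsilon }, \qquad \varvec{\varepsilon } \sim {\mathcal {N}}_{3}\left( {\varvec{0}}, \varvec{\Sigma }\right) , \tag{4}\]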
where \({\varvec{x}}\) and \({\varvec{y}}\) are the vectors of regressors and dependent variables respectively, \({\mathcal {N}}_{3}\) is a Gaussian in \({\mathbb {R}}^3\) with zero mean and covariance matrix \(\varvec{\Sigma }\), and \(\psi \left( \cdot \right) \) is a function to be determined. We tested several statistical models and compared their forecasting performance. We started from the simplest, namely, the linear regression model. Even though the preliminary results already show that it is unsuitable, it is interesting to quantify the error made by the linear model. Subsequently, we tested a few nonlinear models, specifically, polynomial regression models of orders 2, 3, and 4. Finally, we tested the random forest algorithm [60]. To investigate the performance of each of the above models, we divided the dataset into a train set consisting of 2m/3 data points, used to estimate the parameters of the models, and a test set with the remaining m/3 data points, used to assess the predictive performance of the model.
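The evaluation scheme above can be sketched in a few lines of Python. This is only an illustration on toy data, with hypothetical function names: a random 2/3 versus 1/3 split, a one-dimensional least-squares line standing in for the linear regression model, and the in-sample and out-of-sample mean square errors. The toy relation is deliberately quadratic, so the line fits poorly.

```python
import random


def split_train_test(data, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of `data` and return (train, test)."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(pairs, slope, intercept):
    """Mean square error of the fitted line on a set of (x, y) pairs."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in pairs) / len(pairs)


# Toy data with a quadratic dependence, standing in for the (x, y) pairs.
toy = [(x / 10, (x / 10) ** 2) for x in range(-30, 31)]
train, test = split_train_test(toy)
slope, intercept = fit_line(train)
mse_in = mse(train, slope, intercept)    # in sample
mse_out = mse(test, slope, intercept)    # out of sample
```

The same split would be reused for every candidate model, so that the errors are comparable across models.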
Results are shown in Fig. 3. Each row, corresponding to one of the models considered, shows the out-of-sample (i.e. computed on the test set) scatter plot of the true versus the predicted values. The linear regression model, obtained by defining \(\psi \) in Eq. (4) as a linear combination of the regressors, was not able to describe the behavior of any of the dependent variables. Indeed, all scatter plots in Fig. 3a are far from the \(y=x\) line. Interestingly, the scatter plots of BMI (leftmost panel) and of TNF (rightmost panel) suggest that the linear model does partially capture the dynamics of these variables: despite an unwanted, very large variability in the predicted value, a positive correlation between predicted and true values is observed. Conversely, the results shown in the middle panel of Fig. 3a for the GBL suggest that the linear regression model fails in this case, because there is no evident correlation between the true and the fitted values. The result confirms that there is a nonlinear structure between \({\varvec{x}}\) and \({\varvec{y}}\) and hints at the use of nonlinear models. Figure 3b–d relate to the polynomial regression models of degree \(d=2\), 3 and 4 respectively, obtained by defining \(\psi \) in Eq. (4) as a polynomial of order d. From the plots, it is clear that BMI (leftmost panel) and TNF (rightmost panel) are only partially described by these models: the scatter plots show large variation in the predicted value, and the fit does not improve significantly when increasing d. As with the linear model, the middle panels in Fig. 3b–d of true versus predicted GBL fail to show a clear correlation, leading to the conclusion that the polynomial structure is also not appropriate.
We then decided to assess other ML approaches, namely decision trees and random forest, based on a trade-off among forecasting performance, usability of the results and computational effort required.
Decision trees and random forest
In statistics/ML, decision trees are powerful tools when dealing with data coming from a complex process with a large number of degrees of freedom, both for regression and classification purposes. The main idea of such tools is to find binary splits, of the form \(X_{i}\le c\) and \(X_{i}>c\), called splitting rules, to divide the dataset into hyper-rectangles that are as homogeneous as possible in terms of the dependent variables. Homogeneity is measured by the mean square error in the case of regression trees or by the Gini index in the case of classification trees. The first node of a decision tree is called the root, the internal nodes generated by splits are simply called nodes, and the terminal, unsplit nodes are called leaves. Each node k, including the root, is associated with the splitting rule parameters \(\theta _{k}=\left( X_{i}, c\right) \), where \(X_{i}\) is the splitting predictor and c is the splitting value; each leaf l is instead associated with a value \(\mu _{l}\) summarising the dependent-variable data points it contains, namely their mean in the case of regression trees and the most frequently observed value in the case of classification trees [61]. The structure of the tree \({\mathcal {T}}\) is intended to be the whole set of parameters \(\theta _{k}\) and \(\mu _{l}\), for \(k=1,\dots ,K\) and \(l=1,\dots ,L\), where K and L are the total numbers of nodes and leaves respectively.
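The search for a single splitting rule in a regression tree can be sketched as follows. This is a minimal illustration with hypothetical names, not the implementation used in the study: every predictor and every midpoint between consecutive observed values is tried, and the rule minimising the total squared error around the two child means is kept.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets):
    """Return (predictor_index, threshold, total_sse) of the best rule X_i <= c.

    `rows` is a list of predictor lists, `targets` the dependent variable.
    Returns None if no predictor takes at least two distinct values.
    """
    best = None
    for i in range(len(rows[0])):
        candidates = sorted({row[i] for row in rows})
        # Candidate thresholds: midpoints between consecutive observed values.
        for low, high in zip(candidates, candidates[1:]):
            c = (low + high) / 2
            left = [t for row, t in zip(rows, targets) if row[i] <= c]
            right = [t for row, t in zip(rows, targets) if row[i] > c]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (i, c, score)
    return best


# Toy example: the target depends only on the first predictor.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
targets = [0.0, 0.0, 10.0, 10.0]
rule = best_split(rows, targets)
```

A full tree is obtained by applying the same search recursively to the two subsets until a stopping criterion is met; each resulting leaf then stores the mean of its targets.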
The major drawback of this regression/classification tool is the high variability characterizing the output, meaning that several trees constructed over the same dataset could produce significantly different outputs. Research has addressed this issue by considering ensemble methods. These are methods that generate multiple outputs using the same algorithm but starting from different random initializations.
Random forest, introduced in [60], is one of the most well-known and powerful regression/classification ensemble methods. The general idea of this algorithm is to construct a forest of decision trees and to define the output to be either the mean of all the outputs in the case of regression trees or the result of a majority rule on the output in the case of classification trees.
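In detail, applying the random forest algorithm to predict \({\varvec{y}}\) from \({\varvec{x}}\) with respect to Eq. (4) gives the following formula, in which \(\psi \left( {\varvec{x}}; {\mathcal {T}}_{i}\right) \) denotes the prediction of the i-th tree:
\[\hat{{\varvec{y}}} = \frac{1}{N}\sum _{i=1}^{N} \psi \left( {\varvec{x}}; {\mathcal {T}}_{i}\right) ,\]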
where N is the number of decision trees that have been built and \({\mathcal {T}}_{i}\) is the structure of the i-th tree, that is, the whole set of parameters \(\theta _{k,i}\) for \(k=1,\dots , K\) and \(\mu _{l,i}\) for \(l=1,\dots ,L\) as detailed above.
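The averaging over trees can be illustrated with a deliberately simplified Python sketch. It uses one predictor and one-split trees (stumps) fitted on bootstrap samples, so it is plain bagging rather than a full random forest, which would also draw a random subset of predictors at each split and grow deeper trees. All names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree on one predictor: (threshold, left_mean, right_mean)."""
    values = sorted(set(xs))
    best = None
    for low, high in zip(values, values[1:]):
        c = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - mean_l) ** 2 for y in left) + sum((y - mean_r) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, c, mean_l, mean_r)
    if best is None:  # the sample holds a single distinct x: no split is possible
        mean = sum(ys) / len(ys)
        return (values[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, mean_l, mean_r = stump
    return mean_l if x <= threshold else mean_r


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample (drawn with replacement) of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Regression output of the ensemble: the mean of the tree predictions."""
    return sum(predict_stump(tree, x) for tree in forest) / len(forest)


xs = [float(i) for i in range(10)]
ys = [0.0] * 5 + [10.0] * 5
forest = fit_forest(xs, ys)
```

Because each tree sees a different bootstrap sample, individual trees may place the threshold differently; averaging their outputs is what reduces the variability of a single tree.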
Learning the parameters of the random forest from synthetic data to predict the risk of T2D
The random forest algorithm has been trained and tested using the scheme adopted for the previous models; the results obtained are shown in Fig. 3e. As clearly shown by the three panels, the multivariate random forest outperforms the previous models in predicting \({\varvec{y}}\). Indeed, the scatter plots of all three variables are aligned on the \(y=x\) line, indicating a fairly good correlation. Some variability is still observed for small values of GBL and TNF. Looking in more detail, the virtual individuals that are poorly fitted for small values of GBL are those having extreme features. An example is given by the virtual individual defined by the initial conditions \({\varvec{x}}=\left[ \texttt {male, 28, tall, underweight, 1, 60/60, low/low/low} \right] \), which corresponds to a 28-year-old male subject, tall (1.91 m) and underweight (65.66 kg), who exercises once a week (sixty minutes each time with an intensity of 60%\(\hbox {VO}_{\mathrm{2max}}\)) and who follows a diet consisting of a low amount of carbohydrates, fats and proteins; this subject is bordering on anorexia. The lack of knowledge regarding metabolic processes in the case of anorexia generates higher variability in the simulation output, which is reflected in a poorer fit of the machine learning algorithm. We focused on the core distribution of the simulation output, which is clearly captured by the random forest algorithm; however, if the interest is in extreme events, quantile regression forests could be a valuable algorithm to analyse the tails of the distribution.
In Table 2 we quantify the error produced by each model as the mean square error (MSE), a measure of the goodness of the fit; in particular, \(\mathrm{MSE}_{\mathrm{In}}\) has been computed over the train set (i.e. in sample) while \(\mathrm{MSE}_{\mathrm{Out}}\) has been computed over the test set (i.e. out of sample). As expected, the highest error corresponds to the linear model, and it decreases slightly in the polynomial regression models as the degree of the polynomial increases. Finally, the multivariate random forest regression outperforms all other regression methods, bringing down the MSE by more than one order of magnitude compared to the polynomial regression. Note also that the small increase of \(\mathrm{MSE}_{\mathrm{Out}}\) compared to \(\mathrm{MSE}_{\mathrm{In}}\) denotes the absence of overfitting.
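For a three-dimensional output, the error can be reported separately for each dependent variable. The short Python sketch below, with hypothetical names and made-up numbers, computes the MSE per output column and the difference between out-of-sample and in-sample errors that is used above as an indication of overfitting.

```python
def mse_per_output(y_true, y_pred):
    """Mean square error of each output column (e.g. BMI, GBL, TNF)."""
    n = len(y_true)
    k = len(y_true[0])
    return [
        sum((t[j] - p[j]) ** 2 for t, p in zip(y_true, y_pred)) / n
        for j in range(k)
    ]


def generalisation_gap(mse_in, mse_out):
    """Out-of-sample minus in-sample error, per output; large values hint at overfitting."""
    return [out - inside for inside, out in zip(mse_in, mse_out)]


# Made-up values, only to show the shapes involved.
y_true = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
y_pred = [[1.0, 2.0, 3.0], [2.0, 2.0, 5.0]]
errors = mse_per_output(y_true, y_pred)
```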
Discussion
The random forest algorithm showed good fitting performance and provided a relatively easy interpretation of the results of the data analysis, allowing for interesting clinical hints. As a first result, we looked at the variables’ importance using a method already described in [62]. In short, we measured the impact of each variable on the predictive power of the model as the difference between the prediction error computed when some noise is added to the predictor and the prediction error computed on the original predictors. This impact is shown in Fig. 4, where the variables’ importance for each of the elements of \({\varvec{y}}\) is plotted. The impact of some variables appears to be the same for the three variables BMI, GBL, and TNF. Indeed, we observed that the variables related to physical activity (i.e. \(N_{\mathrm{PA}}, D_{\mathrm{PA}}\), and \(I_{\mathrm{PA}}\)) appear to be the least important. This fact points out that MT2D needs to better account for the effect of physical activity on anti-inflammatory factors, as well as on the reduction of the glucose baseline, already on time horizons shorter than 6 months. This task is already ongoing and will be reported in due time [45].
The most important variable for both the BMI (grey bar in Fig. 4) and the GBL (black bar) is the initial value of the body mass index (\(BMI_{0}\)). This means that weight plays an important role in determining the glucose baseline and thus in the determination of the risk of T2D. As for the remaining variables, we observed that they have a comparable impact on the BMI. This is not the case for the glucose baseline or GBL index, for which the second most important dependence is on the amount of carbohydrates in the diet (\(C_{\mathrm{ME}}\)). As for the inflammation represented by the level of TNF-\(\alpha \) (i.e. TNF index, white bar in Fig. 4), the most important dependence is, as expected, on age (A), followed in order of importance by \(C_{\mathrm{ME}}\) and \(BMI_{0}\). This is interesting as it is in line with the recently defined concept of inflammaging [63], which joins immune-metabolic processes with age-related diseases in a single, integrated, clinical framework.
To carry on with the analysis of the relative importance of each input variable, we calculated their influence when taken in pairs. Again, we measure the impact of the pair \(\left( x_{i}, x_{j}\right) \) as the difference between the prediction error when some noise is added to \(\left( x_{i}, x_{j}\right) \) and the prediction error calculated in the unmodified case [62]. Looking at the pairwise co-influence on \({\varvec{y}}\) in Fig. 5, we noted that the most common pairs involve \(BMI_0\). This is somewhat expected, since \(BMI_0\) is the most, or among the most, important variables for all the output variables and both importance analyses are computed using the same methodology.
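The single-variable and pairwise importance measures described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of [62]: the "noise" is implemented by randomly permuting the chosen predictor columns across subjects, the model is a hypothetical toy function, and all names are invented for the example.

```python
import random


def mse(model, rows, targets):
    """Mean square prediction error of `model` on the given data."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def noised_importance(model, rows, targets, features, seed=0):
    """Prediction error with the given predictor columns noised, minus the original error.

    `features` is a list of column indices: one index for the single-variable
    importance, two indices for the pairwise importance.
    """
    rng = random.Random(seed)
    noised = [list(r) for r in rows]
    for f in features:
        column = [r[f] for r in rows]
        rng.shuffle(column)  # noising step (assumption): permute the column across subjects
        for row, value in zip(noised, column):
            row[f] = value
    return mse(model, noised, targets) - mse(model, rows, targets)


def toy_model(row):
    """Hypothetical fitted model that only uses the first predictor."""
    return 3.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(20)]
targets = [toy_model(r) for r in rows]
importance_first = noised_importance(toy_model, rows, targets, [0])
importance_second = noised_importance(toy_model, rows, targets, [1])
importance_pair = noised_importance(toy_model, rows, targets, [0, 1])
```

In this toy case the second predictor is never used by the model, so noising it leaves the error unchanged, whereas noising the first predictor, alone or in a pair, increases the error.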
To overcome any bias coming from this procedure, we considered another method to investigate the variables’ coinfluence on \({\varvec{y}}\), namely the maximal subtree method [62, 64, 65]. This method is based on the idea that variables that split close to the root play an important role in prediction error, while variables that split next to the leaves do not influence that much the prediction error. To have quantitative method for the idea just explained we need to introduce two concepts [62]: the maximal \(\nu \)subtrees and the minimal depth. Given a tree \({\mathcal {T}}\), a \(\nu \)subtrees \({\mathcal {T}}_{\nu }\) is a subtree of \({\mathcal {T}}\) whose root is split using \(\nu \); a maximal \(\nu \)subtrees is a \(\nu \)subtrees that is not a subtree of a larger \(\nu \)subtrees. The minimal depth \(D_{\nu }\) is the distance from the root of \({\mathcal {T}}\) to the root of the closest maximal \(\nu \)subtrees, that is \(D_{\nu }\) measures the distance from the root of the first split on \(\nu \). The idea explained above can be expressed in terms of the minimal depth as follows: the smaller \(D_{\nu }\) the higher the impact of \(\nu \) on the prediction error.
The result of this method is shown in Table 3, which reports a matrix where the diagonal element \(\left( i,i\right) \) represents the minimal depth \(D_{i}\) of variable i, normalised to lie in the interval \(\left( 0,1\right) \), and the off-diagonal element \(\left( i,j\right) \) indicates the normalised minimal depth \(D_{j}^{i}\) of variable j with respect to the maximal i-subtree \({\mathcal {T}}_{i}\), that is, \(D_{j}^{i}\) measures the distance from the root of \({\mathcal {T}}_{i }\) to the first split on j. Variables with smaller values on the diagonal are more predictive. A small value of the diagonal element \(\left( i,i\right) \) together with a small value of the off-diagonal element \(\left( i,j\right) \) is a sign of significant co-influence on \({\varvec{y}}\) between variables \(x_{i}\) and \(x_{j}\). This method gives results similar to those based on the pairwise importance: the smaller values of both diagonal and off-diagonal elements correspond to \(BMI_0\), \(C_{\mathrm{ME}}\), A, \(P_{\mathrm{ME}}\), \(F_{\mathrm{ME}}\) and S, while the variables related to physical activity show higher values.
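A matrix of this kind can be sketched for one tree as follows. Several choices here are assumptions made for illustration only and are not taken from the study: trees are nested tuples `(variable, left, right)` with `None` as a leaf, depths are normalised by the height of the tree (or of the i-subtree), only the closest maximal i-subtree is used, and a variable that never splits is assigned the value 1.0.

```python
from collections import deque


def first_split(tree, v):
    """Breadth-first search for the shallowest split on v: (depth, node) or None."""
    queue = deque([(0, tree)])
    while queue:
        depth, node = queue.popleft()
        if node is None:
            continue
        var, left, right = node
        if var == v:
            return depth, node
        queue.append((depth + 1, left))
        queue.append((depth + 1, right))
    return None


def height(node):
    """Number of split levels below and including `node`; a leaf (None) has height 0."""
    if node is None:
        return 0
    _, left, right = node
    return 1 + max(height(left), height(right))


def depth_matrix(tree, variables):
    """M[i][i]: normalised D_i.  M[i][j]: normalised D_j^i inside the closest maximal i-subtree."""
    M = {}
    for i in variables:
        hit = first_split(tree, i)
        M[i] = {}
        for j in variables:
            if hit is None:
                M[i][j] = 1.0            # assumption: i is never used for a split
            elif i == j:
                M[i][j] = hit[0] / height(tree)
            else:
                sub = hit[1]
                inner = first_split(sub, j)
                M[i][j] = 1.0 if inner is None else inner[0] / height(sub)
    return M


# Hypothetical toy tree; the variable names only echo the paper's notation.
tree = ("BMI0",
        ("C_ME", ("A", None, None), None),
        ("A", ("N_PA", None, None), ("C_ME", None, None)))

M = depth_matrix(tree, ["BMI0", "C_ME", "A", "N_PA"])
for i, row in M.items():
    print(i, {j: round(value, 2) for j, value in row.items()})
```

Reading the output in the way described above, a pair (i, j) is a candidate for co-influence when both `M[i][i]` and `M[i][j]` are small. With this normalisation the values fall in the closed interval from 0 to 1.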
As shown in Table 3, age and diet, taken together, have a significant influence on the outcome \({\varvec{y}}\), that is, on the overall risk of progressing to T2D. The same can be said for gender and diet. The effect of physical activity cannot be appreciated, at least on a time horizon of 6 months, as we already pointed out when discussing the variables’ importance. Nevertheless, the co-influence of either gender or age with the number of physical activity sessions performed per week has a larger impact on the risk of T2D than that of the duration and the intensity of the bouts of exercise.
Conclusions
Effective prevention of type 2 diabetes onset in the population can be helped by close and regular checks for early detection of signs of progression into the disease. A tool that allows self-assessment based on lifestyle parameters, however approximate, remains a very valuable and beneficial means to increase awareness of the risk of T2D. Nowadays, tools of this kind are within technological reach thanks to the widespread use of monitoring devices able to keep track of exercise and dietary patterns and, at the same time, to the emergence of computational methods that estimate the risk of progressing from the healthy (i.e. prediabetic) to the disease condition.
The present study shows that it is possible to exploit these technologies positively. Smartphones, tablets, wristwatches and wearable devices are, and increasingly will be, used in everyday life as tools with the potential to foster a proper and healthy life, with a positive effect on the users’ quality of life. Today, the ability to estimate an individual patient’s risk trajectory in real-time remains poor. Knowledge of a patient’s dynamic risk profile may allow physicians to make targeted, step-by-step changes in the T2D care plan that will alter the patient’s outcome trajectory [20]. At present, computational tools that apply ML techniques to the massive data collected by personal assistant devices are the focus of a great deal of research. Considering recent improvements in healthcare delivery technologies such as smartphone applications, device connectivity, artificial intelligence and machine-learning technology, there is a strong opportunity to reach better efficiency in prediabetes and diabetes care and to improve patient involvement in diabetes self-management. This can curb the surge of diabetes-related healthcare expenditures, paving the way to the future scenario of patient-driven diabetes care in the technology era [66]. This new approach also has great potential as a low-cost monitoring tool for the nutritional habits and physical activity of different segments of the population, allowing its users to gain knowledge that would be hard to obtain even for the best expert.
In this work, we have shown how a computational model running very complex simulations of realistic multivariate scenarios can be used to feed a machine learning method, which proved to predict the risk of T2D satisfactorily using notably less time and computational resources. This makes it suitable for use on mobile devices and for customized and immediate responses to the users. Here we focused on the prediction of the final state of the simulator given some initial conditions; therefore, in the current implementation, the ML model provides the users with a 6-months-ahead risk of T2D. This time horizon will be extended to predict the whole dynamics of the simulator. This extension, which will be presented in due time [45], will provide the complete dynamics of the variables related to T2D risk, thus becoming a powerful instrument for users as a short- and mid-term assessment tool. In perspective, the ability to link the subject’s parameters with measuring devices such as those in portable communication systems (smartphones and wristwatches) enables the development of health care systems linked in real-time to issue alerts, warnings or simple recommendations to the patient [35]. In the near future, the “real-time” execution of the model, with completely customizable input parameters, can be envisaged as a dedicated bioinformatics service able to provide increasingly personalized healthcare and to facilitate self-monitoring.
We conclude by looking at the near future, where we envision at least two avenues of research. A new era of medicine is opening up by combining traditional data from randomised clinical trials (RCTs) with new real-world data collected from registries, electronic health records, social media, and wearable devices. These data produce real-world evidence, which can either uncover potential predictors of diabetes or challenge the data from several RCTs collected so far [32].
A final word should be spent on the need to open a bioethical debate (beyond, and with respect to, the EU General Data Protection Regulation or any other national regulations) on how to use and secure sensitive health data obtained from wearable devices. At stake are ethical questions about practices aimed at monetizing the patients’ data rather than at improving therapeutic quality [67].
Availability of data and materials
The datasets generated and/or analysed during the current study are available as a webservice at the following address: http://kraken.iac.rm.cnr.it/T2DM.
Abbreviations
T2D: Type 2 diabetes
TNF-α: Tumor Necrosis Factor-α
IL-6: Interleukin-6
IRS-1: Insulin Receptor Substrate-1
MT2D: Multiscale Immune System Simulator for the Onset of Type 2 Diabetes
ML: Machine Learning
TDEE: Total daily energy expenditure
REE: Resting energy expenditure
TEF: Thermic effect of food
AEE: Activity energy expenditure
S: Sex
A: Age
W: Weight
H: Height
\(N_{\mathrm{PA}}\): Number of physical activity sessions per week
\(D_{\mathrm{PA}}\): Duration of physical activity sessions in minutes
\(I_{\mathrm{PA}}\): Intensity of physical activity sessions
\(C_{\mathrm{ME}}\): Mean carbohydrates intake per day
\(F_{\mathrm{ME}}\): Mean fat intake per day
\(P_{\mathrm{ME}}\): Mean protein intake per day
GBL: Glucose Base Line
BMI: Body Mass Index
\(BMI_{0}\): Body Mass Index at time 0
MSE: Mean square error
\(MSE_{\mathrm{In}}\): In-sample mean square error
\(MSE_{\mathrm{Out}}\): Out-of-sample mean square error
References
World Health Organization. Media Centre. http://who.int/mediacentre/factsheets/fs312/en/. Accessed 27 Sept 2016
Donath MY, Schumann DM, Faulenbach M, Ellingsgaard H, Perren A, Ehses JA. Islet inflammation in type 2 diabetes. Diabetes Care. 2008;31(Supplement 2):161–4.
Donath MY, Shoelson SE. Type 2 diabetes as an inflammatory disease. Nat Rev Immunol. 2011;11(2):98–107.
Gregor MF, Hotamisligil GS. Inflammatory mechanisms in obesity. Annu Rev Immunol. 2011;29(1):415–45.
Akash MSH, Rehman K, Chen S. Role of inflammatory mechanisms in pathogenesis of type 2 diabetes mellitus. J Cell Biochem. 2013;114(3):525–31.
Hotamisligil GS. Inflammation and metabolic disorders. Nature. 2006;444(7121):860–7.
Hotamisligil GS, Erbay E. Nutrient sensing and inflammation in metabolic diseases. Nat Rev Immunol. 2008;8(12):923.
Donath MY, Dalmas É, Sauter NS, Böni-Schnetzler M. Inflammation in obesity and diabetes: islet dysfunction and therapeutic opportunity. Cell Metab. 2013;17(6):860–72.
Castiglione F, Tieri P, De Graaf A, Franceschi C, Liò P, Van Ommen B, Mazzà C, Tuchel A, Bernaschi M, Samson C, Colombo T, Castellani GC, Capri M, Garagnani P, Salvioli S, Nguyen VA, Bobeldijk-Pastorova I, Krishnan S, Cappozzo A, Sacchetti M, Morettini M, Ernst M. The onset of type 2 diabetes: proposal for a multiscale model. JMIR Res Protoc. 2013;2(2):44.
Sacks J, Welch WJ, Mitchell TJ, Wynn HP. Design and analysis of computer experiments. Stat Sci. 1989;4(4):409–23.
Currin C, Mitchell T, Morris M, Ylvisaker D. Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. J Am Stat Assoc. 1991;86(416):953–63.
Meert K, Rijckaert M. Intelligent modelling in the chemical process industry with neural networks: a case study. Comput Chem Eng. 1998;22:587–93.
Banerjee S, Gelfand AE, Finley AO, Sang H. Gaussian predictive process models for large spatial data sets. J R Stat Soc Ser B (Stat Methodol). 2008;70(4):825–48.
Reichert P, White G, Bayarri MJ, Pitman EB. Mechanism-based emulation of dynamic simulation models: concept and application in hydrology. Comput Stat Data Anal. 2011;55(4):1638–55.
Bhosekar A, Ierapetritou M. Advances in surrogate based modeling, feasibility analysis, and optimization: a review. Comput Chem Eng. 2018;108:250–67.
Babic A, Bodemar G, Mathiesen U, Ahlfeldt H, Franzen L, Wigertz O. Machine learning to support diagnostics in the domain of asymptomatic liver disease. Medinfo. MEDINFO. 1995;8:809–13.
Ellis RJ, Wang Z, Genes N, Ma’ayan A. Predicting opioid dependence from electronic health records with machine learning. BioData Min. 2019;12(1):3.
Engchuan W, Dimopoulos AC, Tyrovolas S, Caballero FF, Sanchez-Niubo A, Arndt H, Ayuso-Mateos JL, Haro JM, Chatterji S, Panagiotakos DB. Sociodemographic indicators of health status using a machine learning approach and data from the English longitudinal study of aging (ELSA). Med Sci Monit Int Med J Exp Clin Res. 2019;25:1994.
Fernandes R, GL RD. A new approach to predict user mobility using semantic analysis and machine learning. J Med Syst. 2017;41(12):188.
Fritz BA, Chen Y, Murray-Torres TM, Gregory S, Ben Abdallah A, Kronzer A, McKinnon SL, Budelier T, Helsten DL, Wildes TS, Sharma A, Avidan MS. Using machine learning techniques to develop forecasting algorithms for postoperative complications: protocol for a retrospective study. BMJ Open. 2018;8(4):e020124.
Fuscà E, Bolzon A, Buratin A, Ruffolo M, Berchialla P, Gregori D, Perissinotto E, Baldi I. Measuring caloric intake at the population level (notion): protocol for an experimental study. JMIR Res Protoc. 2019;8(3):12116.
Kang J, Rancati T, Lee S, Oh JH, Kerns SL, Scott JG, Schwartz R, Kim S, Rosenstein BS. Machine learning and radiogenomics: lessons learned and future directions. Front Oncol. 2018;8:228.
Lacson RC, Baker B, Suresh H, Andriole K, Szolovits P, Lacson E Jr. Use of machine-learning algorithms to determine features of systolic blood pressure variability that predict poor outcomes in hypertensive patients. Clin Kidney J. 2018;12(2):206–12.
Belizario GO, Junior RGB, Salvini R, Lafer B, da Silva Dias R. Predominant polarity classification and associated clinical variables in bipolar disorder: a machine learning approach. J Affect Disord. 2019;245:279–82.
Kurasawa H, Hayashi K, Fujino A, Takasugi K, Haga T, Waki K, Noguchi T, Ohe K. Machine-learning-based prediction of a missed scheduled clinical appointment by patients with diabetes. J Diabetes Sci Technol. 2016;10(3):730–6.
Casanova R, Saldana S, Simpson SL, Lacy ME, Subauste AR, Blackshear C, Wagenknecht L, Bertoni AG. Prediction of incident diabetes in the Jackson Heart Study using high-dimensional machine learning. PLoS ONE. 2016;11(10):e0163942.
Alghamdi M, Al-Mallah M, Keteyian S, Brawner C, Ehrman J, Sakr S. Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: the Henry Ford Exercise Testing (FIT) project. PLoS ONE. 2017;12(7):e0179805.
Choi BG, Rha SW, Kim SW, Kang JH, Park JY, Noh YK. Machine learning for the prediction of new-onset diabetes mellitus during 5-year follow-up in non-diabetic patients with cardiovascular risks. Yonsei Med J. 2019;60(2):191–9.
Cinar A. Multivariable adaptive artificial pancreas system in type 1 diabetes. Curr Diabetes Rep. 2017;17(10):88.
Basu S, Raghavan S, Wexler DJ, Berkowitz SA. Characteristics associated with decreased or increased mortality risk from glycemic therapy among patients with type 2 diabetes and high cardiovascular risk: machine learning analysis of the ACCORD trial. Diabetes Care. 2018;41(3):604–12.
Farran B, Al-Wotayan R, Alkandari H, Al-Abdulrazzaq D, Channanath A, Thanaraj TA. Use of non-invasive parameters and machine-learning algorithms for predicting future risk of type 2 diabetes: a retrospective cohort study of health data from Kuwait. Front Endocrinol. 2019;10:624.
Klonoff DC, Gutierrez A, Fleming A, Kerr D. Real-world evidence should be used in regulatory decisions about new pharmaceutical and medical device products for diabetes. Los Angeles: SAGE Publications; 2019.
Alimadadi A, Aryal S, Manandhar I, Munroe PB, Joe B, Cheng X. Artificial intelligence and machine learning to fight COVID-19. Physiol Genom. 2020;52(4):200–2.
Tárnok A. Machine learning, COVID-19 (2019-nCoV), and multi-omics. Cytometry Part A. 2020;97(3):215–6.
Castiglione F, Diaz V, Gaggioli A, Liò P, Mazzà C, Merelli E, Meskers CGM, Pappalardo F, von Ammon R. Physio-environmental sensing and live modeling. Interact J Med Res. 2013;2(1):3.
Vodovotz Y, Csete M, Bartels J, Chang S, An G. Translational systems biology of inflammation. PLoS Comput Biol. 2008;4(4):1–6.
Palumbo MC, Morettini M, Tieri P, Diele F, Sacchetti M, Castiglione F. Personalizing physical exercise in a computational model of fuel homeostasis. PLoS Comput Biol. 2018;14(4):e1006073.
Palumbo M, Morettini M, Tieri P, de Graaf A, Krishnan S, Castiglione F. Modeling meal consumption and physical exercise for fuel homeostasis (2020) (in preparation)
Kim J, Saidel GM, Cabrera ME. Multiscale computational model of fuel homeostasis during exercise: effect of hormonal control. Ann Biomed Eng. 2007;35(1):69–90.
Saunders PT, Koeslag JH, Wessels JA. Integral rein control in physiology. J Theor Biol. 1998;194(2):163–73.
Roy A, Parker RS. Dynamic modeling of exercise effects on plasma glucose and insulin levels. IFAC Proc Vol. 2006;39(2):509–14.
Kildegaard J, Christensen TF, Johansen MD, Randløv J, Hejlesen OK. Modeling the effect of blood glucose and physical exercise on plasma adrenaline in people with type 1 diabetes. Diabetes Technol Therapeut. 2007;9(6):501–8.
Dalla Man C, Camilleri M, Cobelli C. A system model of oral glucose absorption: validation on gold standard data. IEEE Trans Biomed Eng. 2006;53(12):2472–8.
Elashoff JD, Reedy TJ, Meyer JH. Analysis of gastric emptying data. Gastroenterology. 1982;83(6):1306–12.
Palumbo M, Morettini M, Tieri P, de Graaf A, Liò P, Diele F, Castiglione F. An integrated multiscale model for the simulation and prediction of metabolic and inflammatory processes in the onset and progress of type 2 diabetes (in preparation) (2020)
Mifflin MD, St Jeor ST, Hill LA, Scott BJ, Daugherty SA, Koh YO. A new predictive equation for resting energy expenditure in healthy individuals. Am J Clin Nutr. 1990;51(2):241–7.
Westerterp KR, Donkers JHHLM, Fredrix EWHM, Boekhoudt P. Energy intake, physical activity and body weight: a simulation model. Br J Nutr. 1995;73(3):337–47.
Prana V, Tieri P, Palumbo MC, Mancini E, Castiglione F. Modeling the effect of high calorie diet on the interplay between adipose tissue, inflammation, and diabetes. Comput Math Methods Med 2019;2019
Morettini M, Palumbo MC, Sacchetti M, Castiglione F, Mazza C. A system model of the effects of exercise on plasma interleukin-6 dynamics in healthy individuals: role of skeletal muscle and adipose tissue. PLoS ONE. 2017;12(7):e0181224.
Bernaschi M, Castiglione F. Design and implementation of an immune system simulator. Comput Biol Med. 2001;31(5):303–31.
Castiglione F, Duca K, Jarrah A, Laubenbacher R, Hochberg D, Thorley-Lawson D. Simulating Epstein-Barr virus infection with C-ImmSim. Bioinformatics. 2007;23(11):1371–7.
Pappalardo F, Lollini PL, Castiglione F, Motta S. Modeling and simulation of cancer immunoprevention vaccine. Bioinformatics. 2005;21(12):2891–7.
Mancini E, Quax R, De Luca A, Fidler S, Stohr W, Sloot PM. A study on the dynamics of temporary HIV treatment to assess the controversial outcomes of clinical trials: an in-silico approach. PLoS ONE. 2018;13(7):e0200892.
Baldazzi V, Paci P, Bernaschi M, Castiglione F. Modeling lymphocyte homing and encounters in lymph nodes. BMC Bioinform. 2009;10(1):387.
Castiglione F, Tieri P, Palma A, Jarrah AS. Statistical ensemble of gene regulatory networks of macrophage differentiation. BMC Bioinform. 2016;17(19):506.
Madonia A, Melchiorri C, Bonamano S, Marcelli M, Bulfon C, Castiglione F, Galeotti M, Volpatti D, Mosca F, Tiscar PG, Romano N. Computational modeling of immune system of the fish for a more effective vaccination in aquaculture. Bioinformatics. 2017;33(19):3065–71.
Melanson EL, Keadle SK, Donnelly JE, Braun B, King NA. Resistance to exercise-induced weight loss: compensatory behavioral adaptations. Med Sci Sports Exerc. 2013;45(8):1600.
Westerterp KR. Diet induced thermogenesis. Nutr Metab. 2004;1(1):5.
Atwater WO, Bryant AP. The chemical composition of American food materials, vol. 28. Washington: US Government Printing Office; 1906.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Friedman J, Hastie T, Tibshirani R. The elements of statistical learning, vol. 1. New York: Springer; 2001.
Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519–37.
Franceschi C, Garagnani P, Parini P, Giuliani C, Santoro A. Inflammaging: a new immune-metabolic viewpoint for age-related diseases. Nat Rev Endocrinol. 2018;14(10):576–90.
Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ, Lauer MS. High-dimensional variable selection for survival data. J Am Stat Assoc. 2010;105(489):205–17.
Ishwaran H, Kogalur UB, Chen X, Minn AJ. Random survival forests for high-dimensional data. Stat Anal Data Min ASA Data Sci J. 2011;4(1):115–32.
Ashrafzadeh S, Hamdy O. Patient-driven diabetes care of the future in the technology era. Cell Metab. 2019;29(3):564–75.
Basch E, Schrag D. The evolving uses of “real-world” data. JAMA. 2019;321:1359–60.
Stolfi P, Valentini I, Palumbo MC, Tieri P, Grignolio A, Castiglione F. Potential predictors of type-2 diabetes risk: machine learning, synthetic data and wearable health devices. In: 2019 IEEE international conference on bioinformatics and biomedicine (BIBM), pp 2214–2221 (2019)
Acknowledgements
A preliminary version of this paper has been presented at the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) [68].
About this supplement
This article has been published as part of BMC Bioinformatics Volume 21 Supplement 17, 2020: Selected papers from the 3rd International Workshop on Computational Methods for the Immune System Function (CMISF 2019). The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-21-supplement-17.
Funding
Publication costs are funded by the European Commission under the 7th Framework Programme (MISSION-T2D project, contract No. 600803) and under the Horizon 2020 research and innovation programme (iPC project, Grant Agreement No. 826121). The funding bodies had no role or involvement in the study.
Author information
Contributions
P.S. designed the work, performed the data analysis and helped in the writing; I.V. designed the work and helped in the writing; MC.P. provided the data and helped in the writing; P.T. helped in the writing; A.G. helped in the writing; and F.C. supervised the project, provided the data, helped in the data analysis and helped in writing the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not Applicable.
Consent for publication
Not Applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Stolfi, P., Valentini, I., Palumbo, M.C. et al. Potential predictors of type-2 diabetes risk: machine learning, synthetic data and wearable health devices. BMC Bioinformatics 21, 508 (2020). https://doi.org/10.1186/s12859-020-03763-4
DOI: https://doi.org/10.1186/s12859-020-03763-4
Keywords
 Machine learning
 Random forest
 Emulator
 T2D
 Computational modeling
 Synthetic data