Dynamic model updating (DMU) approach for statistical learning model building with missing data

Background Developing statistical and machine learning methods on studies with missing information is a ubiquitous challenge in real-world biological research. The strategy in literature relies on either removing the samples with missing values like complete case analysis (CCA) or imputing the information in the samples with missing values like predictive mean matching (PMM) such as MICE. Some limitations of these strategies are information loss and closeness of the imputed values with the missing values. Further, in scenarios with piecemeal medical data, these strategies have to wait to complete the data collection process to provide a complete dataset for statistical models. Method and results This study proposes a dynamic model updating (DMU) approach, a different strategy to develop statistical models with missing data. DMU uses only the information available in the dataset to prepare the statistical models. DMU segments the original dataset into small complete datasets. The study uses hierarchical clustering to segment the original dataset into small complete datasets followed by Bayesian regression on each of the small complete datasets. Predictor estimates are updated using the posterior estimates from each dataset. The performance of DMU is evaluated by using both simulated data and real studies and show better results or at par with other approaches like CCA and PMM. Conclusion DMU approach provides an alternative to the existing approaches of information elimination and imputation in processing the datasets with missing values. While the study applied the approach for continuous cross-sectional data, the approach can be applied to longitudinal, categorical and time-to-event biological data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04138-z.

scenarios with large samples of complete data. Further, CCA provides biased estimates in cases when data is not missing completely at random (MCAR) [1].
Imputation approach can handle samples without complete data by replacing missing data with either single values (Single imputation) or multiple values (Multiple imputation). Single imputation technique adds single plausible value to each missing value and creates a single imputed dataset. Single imputation approach treats imputed values as an actual value rather than an estimate with standard error value during the downstream analysis, which creates a potential bias in the results [2]. Mean imputation is one of the most straightforward imputation techniques. It replaces the missing values of a predictor with the mean value of the observed data of the predictor. The main disadvantages are that it underestimates the variance of the predictor and ignores the relationship between the predictors [3,4]. Regression imputation is another technique in which the predictor with missing value is regressed on the predictors with non-missing values. Finally, the missing values of the predictor are estimated from the regression model [5]. However, the technique relies on the linear relationship [6], which may affect the model quality.
Multiple imputation approach provides more unbiased estimates as compared to single imputation approach as it considers the uncertainty in estimates. Multiple imputation approach assigns multiple plausible values to every missing value, which creates multiple imputed datasets. Each dataset undergoes analysis and results are pooled using Rubin's rules [7]. The MICE package in R is one of the popular packages for performing multiple imputations [8]. It provides many multiple imputation approaches like predictive mean matching, Bayesian regression, and linear regression. However, the multiple imputation approach still cannot provide unbiased estimates for all scenarios [9].
In non-linear relationships among the predictors, various machine learning-based techniques are used to perform imputation. K-Nearest Neighbors (K-NN) is one of the machine learning technique used for imputation. For any predictor with missing values, K-NN tries to identify the k nearest neighbors for each missing value using the predictors with non-missing data. The missing value is imputed using the values of the k nearest neighbors [10]. K-means clustering segregates the complete dataset (including missing values) into k clusters. Then, K-NN algorithm is applied in each cluster to impute the missing values in the cluster [11]. However, in many cases, K-NN and K-means based approaches could perform poorly as compared to other approaches [12,13]. MissForest technique uses random forest for imputing the missing data to overcome the limitations of regression-based imputation methods [14].
In many real-world scenarios, data collection is not simultaneous. Instead, it happens over time. CCA and imputation-based approaches have to wait for the completion of the data collection process. In scenarios of high throughput data, data storage can be an issue [15]. This paper successfully proposes an alternative, i.e., dynamic model updating (DMU) approach of analyzing the dataset with missing values. DMU analyses multiple smaller datasets obtained from the original dataset rather than the original dataset and allowing estimate updating with every single analysis. The paper is organized as follows. 'Methodology' section describes the DMU algorithm; the model performance is evaluated and demonstrated using simulation and real dataset studies in the "Simulation Studies" and "Real Data Study" sections, respectively. Finally, the 'Conclusion and Discussion' section concludes the paper and discusses the limitations of the study.

Results
The performance of DMU is evaluated and compared with CCA and PMM approach for both the simulated datasets and real data studies.

Simulation studies
We use simulation studies to evaluate model performance. In the simulation studies, data is generated from the following regression model: where ε ~ N(0, 0.25) is noise in the model and x 1 ,…,x p are predictors. The values for predictors x 1 and x 2 is drawn from beta (~ Beta (7,2)) and uniform (~ U(0, 2)) distribution respectively, while the values for predictors x 3 ,…,x p is drawn from normal distribution(~ N(0, 1)). Coefficient values for x 1 , x 2 and x 3 are 0.2, 0.3 and 0.4 respectively. The remaining predictors have zero coefficient values. The correlation matrix is designed to add multicollinearity to the model. The predictors {x 1 ,…,x 5 } are randomly assigned correlation values between [− 0.5, 0.5] with replacement and zero correlation value is assigned to all other cases as shown below.
A multivariate normal distribution generates data for 20, 25, 30 and 100 predictors (p) in the simulated dataset, D. Two scenarios are created to test the performance of different methods. In the first scenario, the training dataset contains some complete rows (SCR). Training data comprises of 3150 samples with each predictor having 80% of its values MCAR and 50 samples of complete data across all predictors. Test data consisted of 1000 samples of complete data across all predictors. In the second setting, the training dataset has no complete row (NCR). Training data comprised of 3150 samples with each predictor having 80% of its values MCAR. Test data consisted of 1000 samples of complete data across all predictors.
DMU method is used to build the model and its performance is estimated. Prior distributions are defined as follows: Markov Chain Monte Carlo (MCMC) is used to generate the posterior distribution of parameters in the model using MCMC package in R [16]. Total 6000 iterations are . .
(4) β ∼ N (0, 100)|βε β 1 , . . . , β p performed and the first 1000 are used as burn-in iterations. A constraint is used to segment Dataset D such that d i with sample size to predictor set ratio greater than or equal to two is used for model building. Because optimal k is not known, hence the genetic algorithm is used for selecting k. The performance of the DMU method is compared with simple linear regression (SLR), k-Nearest Neighbors based imputation (kNN), simple linear regression combined with imputation (SLRM) and random forest based imputation (RF) using simulated data. In the case of SLRM, the predictive mean matching (PMM) based imputation method provided by the MICE package of R [8] is used to impute missing data in the dataset. VIM package and missForest packages of R are used for kNN imputation and RF imputation, respectively [14,17]. The performance of different methods is evaluated using mean square error (MSE) between the estimated outcome and the actual outcome in the test data. The reported performance is normalized with mean imputation MSE performance. The GA package in R [18] is used for the genetic algorithm.
Simulation datasets (S = 30) are created with the abovementioned settings and the overall performance of each method is measured. Table 1 and Additional file 1: Table S1 shows the performance results of the five methods. The study shows that the DMU has a lower or comparable MSE as compared to the MSE of SLR and SLRM. In SCR settings, SLRM gave the worst performance, and kNN gave the best performance, while DMU performance is either similar or better than the SLR and RF method. In NCR settings, RF gave the best performance, but DMU performance is better or similar to SLRM. Overall, these results suggest that the DMU can develop better or at par models as compared to SLR and SLRM based models for datasets with values missing completely at random. The results are validated in higher feature space (p = 100), where DMU MSE for SCR is 0.16 (1.27 MSE(DMU)/MSE(mean imputation)) but SLRM MSE for SCR is 0.23. KNN and RF imputation performed better with MSE of 0.13. In case of NCR, DMU MSE performance of 0.13 is comparable with RF imputation MSE performance of 0.13 and better than SLRM MSE performance of 0.47. Table 2 provides the computation time for different methods on a system with processor Intel ® Core(TM) i7-8750H CPU@2.20 GHz with 16 GB RAM on a Windows 10 64-bit operating system. We use the SCR scenario and three different feature spaces (p), i.e., 20, 25 and 30, to estimate time. It is found that DMU with optimized hyperparameters and SLRM takes a similar computation time which is less than random forest and kNN based imputation but more than SLR.

Real data studies
Furthermore, the study compares the proposed regression method with SLR and SLRM using two real-world datasets. Dataset I is Community Health Status Indicators dataset which contains USA county-level data on various demographics and health parameters to help in making informed decisions in combating obesity, heart disease and cancer [19]. The dataset contains data on 578 features for 3141 US counties. Dataset II is Study of Women's Health Across the Nation, 2006-2008 dataset which contains multi-site data for middle-aged women in the USA on various physical, biological, psychological and social parameters [20]. The dataset contains data on 887 features for 2245 respondents. These datasets are processed and cleaned to remove textual or categorical variables. One of the shortlisted variables is used as the outcome variable and remaining variables are used as predictors. Different scenarios are created using these two datasets, as shown in Table 3. The maximum correlation allowed between the predictors in each scenario is ± 0.52. Different predictors have a different percentage of missing values; thus, the maximum percentage of missing values is defined for each scenario. For example, in Scenario 1, predictors up to 10% of missing values is selected for model building. It is possible to have datasets in real-world settings where no single row has data for all the predictors. Hence, the study tried to recreate the settings by testing the performance of the methods in two different settings. In the first setting, some complete rows (SCR) are added into training dataset. In the second setting, no complete row (NCR) is added in the training dataset. In both settings, the test dataset only comprised of compete rows. Since SLR could be performed only on rows with complete data, so this method is not applied to scenarios with no complete rows in the training dataset. All three methods are compared based on their prediction performance in the test datasets. Table 4 and Additional file 2: Table S2 provides the performance of different methods on two real datasets. The results are like those obtained in the simulated datasets. The proposed DMU approach provides better or at par MSE performance as compared to other methods. The performance is consistent across different proportions of missing data, but increased sample size in training data improves the performance of all the approaches. NCR seems to increase the MSE of the methods.

Real data studies: genomic data
The study also compares the proposed regression method with SLR, SLRM, kNN and RF using a real-world genomic dataset. A Genomics of Drug Sensitivity in Cancer (GDSC) dataset containing copy number variations (CNV) in 24,503 genes and inhibitory concentrations (IC50) of cancer drugs for 946 cell line samples is used [21]. We selected Devimistat (CPI-613) drug IC50 as the clinical outcome and CNV as input feature space. The drug is known to reduce the aggressiveness of pancreatic cancer by inhibiting the tricarboxylic acid cycle and Is currently in Phase III clinical trial [22]. The dataset is processed and cleaned to remove input features with duplicated values, high correlation and no missing value. The reduced dataset contains 42 input features with 911 samples. The dataset is randomly split into 80% training data and 20% test data. Around 30% of the input feature values from each input features is randomly removed from the training data. The performance of the different method is compared for both SCR and NCR scenarios over five trials. In the case of SCR, 50  (Table 5).

Discussion
Handling missing data during model building is a challenge that this study addresses using a new perspective. DMU allows building the model from samples with partial information rather than removing samples with partial information or imputing information. DMU performance is better than complete case analysis and predictive mean matching based imputation when applied in linear regression. The proposed method has certain limitations. First, the comprehensiveness of the DMU testing is limited. The model is not tested on different datasets like datasets containing categorical outcome, time to event outcome, categorical predictors. Similarly, it did not consider high correlation variables, interaction terms and different continuous distributions like exponential and logarithmic. Thus, our approach could be considered for datasets with continuous marginal features and outcome with low correlation among the features. Future studies have scope to determine the robustness of the DMU in different data settings.  Another limitation of the study is the computational intensiveness, especially in cases where the number of subgroups is not pre-defined. In such cases, computational resources need to be spent identifying the best value of k by creating multiple models. The study uses a genetic algorithm to address the problem. Various other optimization algorithms like swarm optimization and simulated annealing can be explored in addressing the problem.

Conclusion
An innovative approach is proposed for building statistical models with missing data. DMU approach divides the dataset with missing values into smaller subsets of complete data followed by preparing and updating the Bayesian model from each of the smaller subsets. The approach provides a different perspective of building models with missing data using available data as compared to the existing perspective in the literature of either removing missing data or imputing missing data. The approach is more flexible as compared to existing approaches as it can update the old models with new data without a need to retain the old data. Secondly, DMU does not depend on the association among the predictors for imputing data. Hence, MU can update the models even when the new dataset contains an incomplete list of predictors.

Methodology
In this section, first CCA and Predictive Mean Matching (PMM) based imputation are described, followed with the dynamic model updating (DMU) method.

Complete case analysis (CCA)
Complete Case Analysis is a common approach used in handling the missing data. This approach omits all the samples with missing data. CCA builds a statistical model from the remaining samples with complete data (or, complete cases). The approach performance is affected when many samples are omitted [23], or data is not missing completely at random [1].

Predictive mean matching (PMM) based imputation
Predictive Mean Matching is a common approach for imputing missing data in MCAR cases. It is a robust approach which assigns an observed value to the missing case. In this approach, the predictor with missing values (X miss ) is regressed on the predictor/s with complete values (X obs ): where β = β o , β 1 ,… are estimates of regression coefficients and used to get estimated values of X miss . Once the estimated values of X miss are obtained, these values are replaced with the closest observed value of X miss in the dataset. Multiple imputed datasets are created by randomly sampling one of the k closest value, instead of the closest value, for each of estimated value of X miss in the dataset. k is usually in the range of 1-10. This approach is implemented in MICE package in R, where the default value of k is 5 [8].
One of the limitations of this approach is that it always imputed data from the observed (5) X miss = β 0 + β 1 X obs + . . .
values. Thus, in cases where the missing values are in the tail of a distribution, PMM may have biased imputation [24].

Dynamic model updating (DMU) approach
PMM based imputation is a popular and robust approach for handling MCAR and MAR types of missing data, but it has certain limitations. The DMU approach (Algorithm 1) proposes a different perspective of handling the missing data. While any imputation approach focuses on replacing the missing value with a predicted value to complete the information, DMU approach focuses on building the model on incomplete information rather than on imputed information. The basic framework is to divide the dataset into smaller datasets containing a smaller number of predictors but complete information, and sequentially build the model for each dataset followed by updating the estimates of the predictors after each model, as shown in Fig. 1. It is explained in more details below. where k is the number of subsets in which dataset D is divided and d is the set containing k smaller datasets. The d set is created such that any of its two elements are mutually exclusive to each other. Any dataset, d l will have maximum p predictors and n samples. a rc is an element in the dataset d l .

Hierarchical clustering
Different approaches can segment dataset D into smaller datasets. Literature provides different clustering approaches which can be broadly classified into four types, namely centroid-based, density-based, distribution-based or model-based and connectivitybased [25]. Centroid-based clustering focuses on partitioning samples into clusters with the nearest mean or median [26]. They provide local optima rather than global optima [27]. K-mean clustering is an example of the centroid-based clustering [27]. Densitybased clustering focuses on partitioning the samples into clusters with a higher density than the remainder of the samples [28]. Hence many samples may not be assigned any cluster. DBSCAN is an example of the density-based clustering [28]. Distribution-based clustering focuses on partitioning the samples into clusters with similar statistical distribution [29]. They suffer from convergence to local optima and overfitting [30]. Gaussian mixture models is an example of distributionbased clustering [29]. Connectivity-based clustering or hierarchical clustering partitions the samples based on the distance of a sample with other samples. The similar samples have lower distance among them as compared to dissimilar samples. It does not provide a single set of clusters rather a hierarchy of clusters based on the threshold distance value used to partition the data [31]. It is a computationally intensive approach [32]. The current study uses hierarchical clustering to partition dataset D. Hierarchical clustering does not have the issue of local optimum, avoids rejection of sparse samples and does not require the prior knowledge of statistical distribution model for samples.

Subgroup construction
The predictor space of dataset D is split into k subgroups using hierarchical clustering technique (Fig. 2). The clustering technique needs to classify the samples in D based on the similarity (or, dissimilarity) in the missingness pattern. The dataset D contains mixtures of missing values and non-missing values. The magnitude of non-missing values can influence the clustering computation since hierarchical clustering techniques rely upon the distance between the samples. The magnitude effect of non-missing values is eliminated by transforming the predictor space of dataset D into binary data where a value zero is assigned to a missing value and one is assigned to non-missing values as shown below: where D ij represents the original dataset with n samples and p predictors, M represents the missing values and Bin.D ij represents the binary transformation of D ij matrix. Hierarchical clustering of Bin.D ij is performed. The n rows are used as samples which are to be clustered with p-dimensional data.
The number of clusters selected from the hierarchical clustering represents the total number of subgroups, k in which dataset D is divided. k is the hyperparameter which determines the number of subgroups used in model building and is user-defined.

Model building
The model building step relies upon the Bayesian paradigm and the assumption that predictors are independent of each other. The Bayesian paradigm focuses on finding the distribution of the parameter estimate of a predictor [33]. The Bayesian paradigm takes the prior belief about the distribution of parameter estimate. This prior belief is updated to give the posterior distribution of parameter estimate of a predictor with likelihood estimate using the data. The posterior distribution of the parameter estimate of a predictor from the previous model can be used as a prior belief for the next model. If the consecutive model contains the same predictor, the prior distribution will then be updated; else it will not. Bayesian regression is used to create a model for each of the dataset, d l .   Dynamically, the posterior probability of one model is used as the prior probability for the next model. Only, for the first model, the prior probability for each predictor need to be pre-specified.