A diabetes prediction model based on Boruta feature selection and ensemble learning

Zhou, Hongfang; Xin, Yinbo; Li, Suli

doi:10.1186/s12859-023-05300-5

Research
Open access
Published: 01 June 2023

BMC Bioinformatics volume 24, Article number: 224 (2023)

Abstract

Background and objective

As a common chronic disease, diabetes is called the “second killer” among modern diseases. Currently, there is no medical cure for diabetes. We can only rely on medication for auxiliary treatment. However, many diabetic patients still die each year. In addition, a considerable number of people do not pay attention to their physical health or opt out of treatment due to lack of money, which eventually leads to various complications. Therefore, diagnosing diabetes at an early stage and intervening early is necessary; thus, developing an early detection method for diabetes is essential.

Methods

In this study, a diabetes prediction model based on Boruta feature selection and ensemble learning is proposed. The model uses Boruta feature selection to extract salient features from the dataset, the K-Means++ algorithm for unsupervised clustering of the data, and the stacking ensemble learning method for classification. It has been validated on a diabetes dataset.

Results

The experiments were performed on the PIMA Indian diabetes dataset. The model was evaluated by accuracy, precision and F1 index. The results show that the model reaches an accuracy of 98%.

Conclusion

The obtained results indicate that the proposed model outperforms other diabetes prediction models.

Introduction

With the rapid development of the social economy, people’s quality of life has constantly improved, and the diet structure has also significantly changed. Therefore, a variety of chronic diseases arise, and diabetes is one of the most common. Insulin is a hormone that regulates blood glucose homeostasis. When the pancreas does not produce enough insulin or the body does not use the produced insulin effectively, blood sugar rises, leading to hyperglycemia, which can lead to diabetes. With time, this can cause serious damage to the human body and result in blindness, amputation, heart disease, stroke, and kidney failure. Diabetes incidence is second only to cancer, and it is known as the “second killer” among modern diseases [1]. In the 2019 Global Leading Cause of Death Survey, diabetes was included in the top 10 causes of death [2]. According to the International Diabetes Federation, 700 million adults in the world will have diabetes by 2045. The cost of healthcare for diabetes is significant, at approximately $760 billion annually. The global growth curve of the number of people with diabetes is shown in Fig. 1 below (Image source: Statistics from the International Diabetes Federation) [3].

With the expansion of artificial intelligence applications, especially in disease diagnosis and medical image processing, it has become possible to use machine learning techniques to extract valid information from medical data for predicting chronic diseases. If the diabetic and nondiabetic populations can be predicted at an early stage, doctors can focus on people with a high probability of having diabetes, which greatly reduces the influence of human factors and gives doctors a general direction for diagnosis and for timely prevention and intervention. This will help reduce the incidence of diabetes, improve people’s quality of life and healthy life expectancy, and effectively reduce the burden of diabetes treatment. This is the fundamental motivation for this work.

Ensemble learning is mainly a combination of several single classifiers in different ways and is used to improve the accuracy and robustness of classification. There are three main types: bagging, boosting and stacking. The bagging method subsamples from the training set to form the required subtraining set for each base model and combines the results predicted by all base models to produce the final prediction results. The boosting method is trained in the order of the base models, and if the previous base model has incorrect classification results, then the next base model can be trained with a larger weight assigned to correct the classification results, and the results predicted by all base models are linearly combined to produce the final prediction results. The stacking method is mainly divided into a base model and a meta-model. By training the base model, the generated results are used as input to the meta-model, which is trained to produce the final classification results.

Research on combinatorial classifiers has been conducted in areas such as disease prediction and bioinformatics. Leyi Wei et al. [4] proposed a PIHS algorithm based on selective ensemble learning, which combines the prediction results of each basic model by voting and uses a partitioning strategy to achieve a high level of performance on several bioinformatics problems, showing high efficiency and robustness. Cheng Chen et al. [5] proposed a prediction framework called StackPPI, using XGBoost to reduce feature noise. An ensemble of random forest, random tree and logistic regression algorithms is used as the classifier, and the method mainly targets protein and drug design, with good classification performance. Jasmina Nalić et al. [6] proposed a hybrid data mining model based on a combination of multiple feature selection and ensemble learning classification algorithms, which used a soft voting approach to synthesize classifiers into eight different ensemble models. The GLM + DT model had the best hybrid performance; it was later tested on biological datasets and outperformed other ensemble learning models and single classifiers. Rajesh Yakkundimath et al. [7] proposed a new classifier combination model for the classification of cervical cancer cells, which uses Artificial Neural Network (ANN), Random Forest (RF) and Support Vector Machine (SVM) as basic classifiers. It classifies cervical cancer cells better than any single basic classifier. Tien Thanh Nguyen et al. [8] proposed a combinatorial classifier based on a Bayesian inference framework, which estimates a multivariate Gaussian distribution for each class of data using a variational inference approach. It was tested on 18 datasets and a medical imaging database and showed a clear advantage over several well-known ensemble methods.

Combined classifiers perform better than single classifiers. When performing classification, we always want to find a classification model with high robustness, high accuracy and balanced time and space complexity. However, this is a relatively ideal state. In practice, the classifier is influenced to some degree by the dataset: extremes, outliers and other noisy data can affect the classification results. Once noisy data are present, the performance of a single classifier is greatly degraded. In contrast, a combination classifier assigns different weights according to the votes, so misclassified data can be reclassified, and it adapts well to noisy data; thus, a combinatorial classifier was adopted in this study. The classification of medical datasets differs from the classification of other datasets because the diagnosis of a particular disease is a rigorous, long-term process that affects people economically, physically and psychologically, especially for chronic diseases such as diabetes. If a disease is misdiagnosed, it can be fatal for the patient. Therefore, disease diagnosis places higher requirements on the choice of classifier, and aspects such as accuracy are very important. For the classification of the diabetic population studied in this paper, experiments on single and combined classifiers have shown that the combined classifier gives better results.

In this paper, a diabetes prediction model based on Boruta feature selection and ensemble learning is proposed. The model uses the Boruta feature selection algorithm, K-Means++ unsupervised cluster learning algorithm and stacking ensemble learning method.

There are three main contributions in this paper.

(1) Feature selection. This paper focuses on the prediction of diabetes. For the diabetes dataset, it is necessary to determine the attributes that best match the diagnosis of diabetes. After a comprehensive comparison with other feature selection methods, such as Pearson’s correlation coefficient and PCA, we consider the attributes selected by the Boruta algorithm to be the most appropriate.
(2) A clustering algorithm was used on the data. To provide the correct number of clusters, we used the K-Means++ algorithm, which is an improved version of K-Means. The K-Means++ algorithm addresses some of the problems that exist in the K-Means algorithm [27].
(3) The most suitable base classifiers and meta-classifier were selected, and the ensemble learning stacking method was used and tested repeatedly to determine the most suitable parameter values for diabetes prediction. Most existing studies use single models for classification, such as Support Vector Machine (SVM), Logistic Regression (LR) or Softmax, but single models are highly susceptible to noisy data if they are not sufficiently trained, resulting in poor prediction accuracy. Additionally, most researchers did not tune the parameters of their models, which raises the question of whether the parameters they used are optimal for diabetes prediction. Based on this, we first selected the most suitable base classifiers and meta-classifier for the stacking method of ensemble learning. Second, the values of the key parameters of the meta-model and the base models were determined by repeated experiments.

The subsequent organization of this paper is as follows. An overview of the work conducted by other researchers in diabetes prediction is presented in Section II. The model proposed in this paper and the methods used in the model are described in Section III, including Boruta feature selection, K-Means++, and ensemble learning. The experimental procedure is described in detail in Section IV, including dataset description, steps, parameter settings, etc. Section V discusses the experimental results, including evaluation of the model, comparison with other models, and comparative experiments. Section VI summarizes the work and provides some suggestions for future work.

Literature survey

In this section, we review the work done by other researchers in diabetes prediction using machine learning and deep learning methods.

Machine learning

Chen et al. [9] analyzed the relationship between diabetes and the levels of several elements in hair/urine samples for the diagnosis of diabetes, and principal component analysis was used to perform preliminary processing work on the data. Both ensemble learning and Support Vector Machine (SVM) algorithms were used as classifiers with an average accuracy, sensitivity and specificity of 99%, 100%, and 99% and 97%, 89%, and 99% for hair and urine samples, respectively. Finally, it was shown by various model evaluation metrics that hair samples are superior to urine samples for the diagnosis and prevention of diabetes and that they provide more valuable information for the prevention, diagnosis, treatment and research of diabetes.

Perveen et al. [10] proposed an AdaBoost and bagging integration technique based on J48 (c4.5) using J48 (c4.5) as the base classifier and combining the standalone data mining technique J48 to classify diabetic patients. Tested on the Canadian primary care surveillance network dataset, the experimental results show that the AdaBoost integration method outperforms bagging and independent J48 decision trees. In addition, the researchers propose that Naive Bayesian (NB), Support Vector Machine (SVM), etc., can be used as the basic learning algorithms in the ensemble learning framework, and the method can be applied to other disease datasets such as hypertension, coronary heart disease, etc.

Wu, Yang et al. [11] proposed a new model for predicting type 2 diabetes mellitus (T2DM) based on data mining techniques, which consists of a modified K-Means algorithm and a logistic regression algorithm. The improved K-Means algorithm addresses the randomness of the seed values by inserting a procedure that records the “sum of squared errors within clusters” values and sorts them in ascending order; the smaller the value, the better the result. The model was evaluated on the PIMA Indian diabetes dataset as well as on two other diabetes datasets with good experimental results, and the prediction accuracy of the model was 3.04% higher than that reported by other researchers.

Zhu et al. [12] proposed an improved logistic regression model for diabetes prediction by integrating PCA and K-Means techniques, which provided adequate and efficient clustered datasets. The model consists of three components: principal component analysis, K-Means and logistic regression algorithms, and data normalization. The experimental results showed that PCA enhanced the accuracy of the K-Means clustering algorithm and logistic regression classifier compared to other published findings, with K-Means outputting 25 correctly classified data points and logistic regression accuracy improving by 1.98%.

Lukmanto et al. [13] used F-score feature selection and a fuzzy Support Vector Machine for the detection and classification of diabetes. Feature selection was used to extract valuable features from the dataset. Then, the dataset was trained using SVM, fuzzy rules were generated, and finally, the output was classified using the fuzzy inference method. The method achieves an accuracy of 89.02% on the PIMA Indian diabetes dataset. Moreover, the method yields an optimized fuzzy rule count while still maintaining sufficient accuracy.

Shankar G et al. [14] proposed a diabetes prediction model based on fuzzy logic with the gray wolf optimizer algorithm. The fuzzy rules are learned by the model and then optimized according to the GWO algorithm and validated on the dataset with an accuracy of 81%. The proposed model is based on the gray wolf optimization algorithm, which is able to globally optimize the features and gives higher accuracy than the ant colony algorithm.

Beschi Raja et al. [15] proposed a predictive model for type 2 diabetes based on a data mining strategy consisting of particle swarm optimization (PSO) and fuzzy clustering (FCM). It was evaluated on the PIMA Indian diabetes dataset using sensitivity, specificity and accuracy metrics. The obtained results showed that the accuracy of the model was 8.26% higher than that of other methods.

Howsalya Devi et al. [16] proposed a diabetes diagnosis method combining a farthest-first (FF) clustering algorithm and a sequential minimal optimization (SMO) classifier algorithm. The clustering algorithm first divides the data into different clusters, which reduces the size of the dataset and greatly shortens the computation time. Then, the clustering output is used as the input of the SVM to complete the classification. On the PIMA Indian diabetes dataset, the experimental results show that the ensemble method has 99.4% accuracy in predicting diabetes. The results suggest that a hybrid approach of data mining methods can help doctors make better clinical diagnosis decisions for diabetic patients.

Saloni et al. [17] proposed a binary classification method using an ensemble soft voting classifier that combines three machine learning algorithms (random forest, logistic regression and Naive Bayes). The proposed method was experimentally compared with basic classifiers (AdaBoost, Logistic, SVM, RF, Naive Bayes, Bagging, GradientBoost, XGBoost, CatBoost). Accuracy, precision, recall, and F1 index were used as evaluation criteria. The values of accuracy, precision, recall, and F1 index for the PIMA Indian diabetes dataset were 79.04%, 73.48%, 71.45%, and 80.6%, respectively.

Jobeda Jamal Khanam et al. [18] used seven machine learning and neural network algorithms to predict diabetes on the PIMA diabetes dataset. Neural network models with different numbers of hidden layers were also built and trained for different numbers of epochs. The experimental results showed that the models using Logistic Regression (LR) and Support Vector Machine (SVM) were beneficial for diabetes prediction, and the accuracy of the neural network with two hidden layers was 88.6%.

Rajendra et al. [19] compared logistic regression algorithms and ensemble learning techniques for diabetes prediction and conducted experiments on the PIMA diabetes dataset. The experimental results show that logistic regression is one of the effective algorithms for building predictive models. This study also found that the use of data pre-processing, feature selection and ensemble techniques could improve the accuracy of the model.

Rawat et al. [20] conducted comparative experiments on the PIMA diabetes dataset based on machine learning algorithms such as Naïve Bayesian (NB), Support Vector Machine (SVM), and Neural Network. The experimental results showed that the neural network was the best classifier with an accuracy of 98%. The authors therefore concluded that the neural network approach is the best way to detect diabetes at an early stage.

Su et al. [21] used XGBoost, LightGBM, Neural Network, and Logistic Regression algorithms for joint data modeling between different organizations. They conducted experiments on the PIMA diabetes dataset. The experimental results show that federated learning models make better use of the patient data held by different organizations and deliver a reliable and improved prediction of diabetes mellitus risk.

Deep learning

Edla et al. [22] proposed a deep neural network framework using stacked autoencoders for the classification of diabetes data, using stacked autoencoders to extract features from the dataset and using softmax to complete the classification. The method uses accuracy, recall and F1 index as evaluation metrics. The accuracy and recall for the PIMA Indian diabetes dataset were 90.66% and 87.92%, respectively.

Nguyen et al. [23] applied a wide and deep learning model that combines the strengths of generalized linear models with various features and deep feedforward neural networks to improve prediction of the onset of type 2 diabetes mellitus (T2DM). Their final ensemble model, which did not use SMOTE, obtained an accuracy of 84.28%, an area under the receiver operating characteristic curve (AUC) of 84.13%, a sensitivity of 31.17% and a specificity of 96.85%, further optimizing the prediction of diabetes onset.

Rahman et al. [24] proposed a novel diabetes classification model based on convolutional long short-term memory (CONV-LSTM). The method was tested on the PIMA Indian diabetes dataset and compared with three models: convolutional neural network (CNN), traditional LSTM (T-LSTM) and CNN-LSTM, and the obtained results showed an accuracy of up to 97.26%, outperforming the other three models and the state-of-the-art model.

Bala et al. [25] developed a deep neural network (DNN) classifier, an unsupervised learning approach, which is used for accurate prediction for the Pima Indian diabetes dataset, and a feature importance model that is bagged with extra trees and random forest is used for feature selection. The model achieved 98.16% accuracy with a random train-test split, and it was observed that the model obtained better performance than other state-of-the-art methods.

García-Ordás et al. [26] proposed a method based on deep learning techniques for predicting diabetic patients. The method includes data augmentation using a variational autoencoder (VAE), feature augmentation using a sparse autoencoder (SAE) and a convolutional neural network for classification. Feature extraction was performed on the PIMA Indian diabetes dataset, considering information such as the number of pregnancies, glucose levels, insulin levels, blood pressure, and age of the patients. The obtained results showed that the method achieved an accuracy of 92.31%, which was 3.17% more accurate than other methods.

Satish et al. [27] proposed a related technique for feature selection. The method applies AdaBoost to the selected features for classification, and a novel stacking technique based on multilayer perceptron, Support Vector Machine and logistic regression (MLP, SVM and LR) is designed and developed for the selected features. The proposed stacking technique integrates intelligent models, improves model performance, and overcomes the decision residual problem that occurs with AdaBoost. On the PIMA Indian diabetes dataset, the obtained results outperform other reported techniques.

Aghila et al. [28] proposed a custom hybrid model of an artificial neural network (ANN) and genetic algorithm for an efficient prediction framework of diabetic diseases. The method identifies the impact of each variable on the output, thus prioritizing the variables considered to be the most important. The model and its corresponding decision algorithm achieved a prediction accuracy of 80% on the PIMA Indian diabetes dataset.

Yalin Wu et al. [29] proposed a new and efficient binary logistic regression (BLR) to accurately predict the specific type of T2DM and make the model adaptive to multiple datasets. To improve the recognition rate of the database, a series of preprocessing steps was performed, including outlier removal, normalization and missing value processing. The generated high-dimensional features were modeled using a BLR application. Experiments were conducted using XGBoost-BLR on the PIMA Indian diabetes dataset and early diabetes dataset with diabetes prediction identification rates of 94% and 98%, respectively.

Roobini et al. [30] used the Convolutional Graph Long Short Term Memory (CGLSTM) classifier for classification. The weights of this deep neural network were optimised using the AdaGrad optimiser to improve the accuracy of the predictions. They conducted experiments on the PIMA diabetes dataset and compared them with existing methods to demonstrate the efficiency of the proposed system.

Rabhi et al. [31] developed a generic deep-learning-based framework for modeling irregular medical time series (IMTS). This framework facilitated comparative studies of sequential neural networks (transformers and long short-term memory) and irregular time representation techniques. This study highlighted the significance of modeling time gaps between medical records to improve prediction performance and the utility of a generic framework for conducting extensive comparative studies.

Qi et al. [32] proposed an ensemble learning framework, KFPredict, which combines multi-input models with key features and machine learning algorithms. They first proposed a multi-input neural network model (KF_NN) that fuses key features. Then, they combined KF_NN with three machine learning algorithms (i.e., Support Vector Machine, Random Forest and K-Nearest Neighbors) by soft voting to form their predictive classifier for diabetes prediction. Taking the PIMA diabetes dataset as the test data, the experiment shows that the framework gives good prediction results.

Proposed methodology

Building on the related work, this paper proposes a diabetes prediction model based on Boruta feature selection and ensemble learning.

The model first uses the Boruta feature selection algorithm to select the features in the dataset that are most relevant for diabetes diagnosis and to eliminate irrelevant features. An unlabeled dataset generally contains K potential patterns; thus, we used the K-Means++ algorithm for unsupervised cluster learning on the dataset, so that the K patterns present in the dataset are grouped into different clusters. Finally, data classification is performed using the stacking method in ensemble learning. Stacking in this paper uses Naive Bayesian (NB), K-Nearest Neighbor (KNN) and Decision Tree (DT) as the base models and Support Vector Machine (SVM) as the meta-model. The specific steps of the model are shown in Fig. 2 below. After the original dataset is input, the dataset is preprocessed, and the results of the preprocessing are put through a clustering algorithm to calculate the correctly clustered data. The correctly clustered data are input into the stacking algorithm for classification. The results of the base model classification are fed into the meta-model, which classifies patients as diabetic or nondiabetic. The algorithms used in the model proposed in this paper are described below.

Boruta feature selection

Boruta is a feature selection algorithm based on a random forest classifier. Unlike the goal of a general feature selection algorithm, the goal of the Boruta feature selection algorithm is to select the set of features that are most relevant to the dependent variable rather than to select the minimum compact set of features for which a particular model is best suited. The specific steps of the Boruta feature selection algorithm are as follows [33].

(1) Create a new feature matrix. Each feature column of the real feature matrix M is randomly shuffled to obtain the shadow feature matrix M_S. Then, the shadow feature matrix M_S is concatenated with the original feature matrix M to form a new feature matrix N, N = [M, M_S].
(2) Use the feature matrix N as input, train the model, and output the feature importances (Feature_Importances).
(3) Calculate the Z_Score of each feature in the real feature matrix M and the shadow feature matrix M_S. Find the largest Z_Score among the shadow features, denoted as ${\text{Z}}_{\max }$.
(4) Real features with Z_Score greater than ${\text{Z}}_{\max }$ are marked as “important”, and real features with Z_Score less than ${\text{Z}}_{\max }$ are marked as “insignificant” and removed from the feature set.
(5) Remove all shadow features.
(6) Repeat steps 1-5 until importance has been assigned to all features or the algorithm has reached the previously set number of random forest runs.
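
As a rough illustration of steps 1 to 4, the following Python sketch runs a single shadow-feature round. It is not the authors' implementation: it assumes scikit-learn's `RandomForestClassifier`, uses impurity-based per-tree importances to form the Z-score, and omits the repeated runs and statistical test of the full Boruta procedure. The function name `boruta_round` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def boruta_round(M, y, n_trees=200, seed=0):
    """One simplified Boruta round (illustrative sketch, not full Boruta).

    Returns a boolean mask over the real features and the shadow matrix M_S.
    """
    rng = np.random.default_rng(seed)
    # Step 1: shuffle every column independently to get the shadow matrix M_S,
    # then concatenate N = [M, M_S].
    M_S = np.column_stack([rng.permutation(M[:, j]) for j in range(M.shape[1])])
    N = np.hstack([M, M_S])
    # Step 2: train a random forest on N and collect per-tree importances.
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(N, y)
    per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
    # Step 3: Z-score = mean importance / standard deviation across trees
    # (assumption: impurity-based importance stands in for the importance measure).
    z = per_tree.mean(axis=0) / (per_tree.std(axis=0) + 1e-12)
    n_real = M.shape[1]
    z_max = z[n_real:].max()  # best-scoring shadow feature
    # Step 4: a real feature is kept only if it beats the best shadow feature.
    return z[:n_real] > z_max, M_S


# Synthetic demo data: only feature 0 carries signal about the label.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
mask, shadows = boruta_round(X_demo, y_demo)
print("kept real features:", np.flatnonzero(mask))
```

Because each shadow column is a permutation of a real column, it keeps the same marginal distribution but loses any relationship with the label, which is what makes the best shadow score a usable noise threshold.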

In this study, using the PIMA Indian diabetes dataset, the Boruta feature selection algorithm was used to select five features with high predictive relevance from eight features associated with diabetes prediction, namely, glucose, BMI, age, diabetes pedigree function, and insulin.

K-means++

There are generally K potential patterns in the dataset. K-Means is a classical unsupervised cluster learning algorithm that finds K patterns in a dataset and uses the Euclidean distance as a measure of similarity. Generally, the smaller the distance, the greater the similarity, and the larger the distance, the lower the similarity. However, the convergence of the K-Means algorithm depends heavily on the initialization of the cluster centers. If all (or most) cluster centers happen to be initialized within the same cluster, the K-Means algorithm is unlikely to converge to the global optimal solution. To solve this problem, the K-Means++ algorithm improves the initialization step of K-Means: when initializing the K cluster centers, the farther a sample is from the cluster centers already chosen, the more likely it is to be selected as the next cluster center, which mitigates this initialization problem [34].

For better cluster learning, in this study, we use a modified version of the K-Means++ algorithm for unsupervised clustering learning. The specific implementation steps are shown below.

(1) Create K points as the initial centroids (select the K data points with the greatest distance).
(2) For each data point, calculate the distance between it and each centroid, and assign the data point to the cluster with the closest centroid, as shown in Eq. 1.
$${\mathrm{D}}^{(\mathrm{i},\mathrm{j})}={\mathrm{argmin}}_{\mathrm{j}}{||{\mathrm{X}}^{\left(\mathrm{i}\right)}-{\upmu }_{(\mathrm{j})}||}^{2}\tag{1}$$
where ${\mathrm{X}}^{\left(\mathrm{i}\right)}$ is the ith sample data point, ${\upmu }_{(\mathrm{j})}$ is the jth centroid, and ${\mathrm{D}}^{(\mathrm{i},\mathrm{j})}$ is the minimum distance between the sample data point and the centroid.
(3) Determine whether the clusters to which the sample points belong are the same before and after clustering; if they are, the algorithm terminates. Otherwise, go to step 4.
(4) Calculate the centroid of each cluster (Eq. 2) from the sample points in that cluster, use the result as the new centroid for that cluster, and go to step 2. The algorithm ends when the sample points in each cluster no longer change, i.e., when the convergence state is reached.
$${{\upmu }^{{\prime}}}_{\left(\mathrm{j}\right)}=\frac{{\sum }_{\mathrm{i}=1}^{\mathrm{m}}({\mathrm{X}}^{(\mathrm{i})}\in {\upmu }_{(\mathrm{j})})}{\mathrm{count}\left[{\sum }_{\mathrm{i}=1}^{\mathrm{m}}({\mathrm{X}}^{(\mathrm{i})}\in {\upmu }_{(\mathrm{j})})\right]}\tag{2}$$
where ${{\upmu }^{^\prime}}_{\left(\mathrm{j}\right)}$ is the jth cluster’s new center point, ${\mathrm{X}}^{(\mathrm{i})}$ is the ith sample data point, and ${\upmu }_{(\mathrm{j})}$ is the jth centroid. The count function counts the number of sample points that belong to the current centroid.

In this study, the dataset is first preprocessed with operations such as removing extremes and outliers, filling in missing values, and normalizing the data, and unsupervised cluster learning is then performed using the K-Means++ algorithm. By comparison with the original dataset, the correctly clustered data account for approximately 74% of the total data. These correctly clustered data are used as the input for the ensemble learning stacking method.
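
The assignment and update steps in Eqs. 1 and 2 can be sketched in a few lines of NumPy. Note that this sketch uses the standard K-Means++ seeding (each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen), which is an assumption and not necessarily the modified initialization used in this study. The function name `kmeans_pp` and the demo data are hypothetical.

```python
import numpy as np


def kmeans_pp(X, k, n_iter=100, seed=0):
    """K-Means with standard K-Means++ seeding (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Seeding: first centroid uniformly at random; each further centroid is
    # sampled with probability proportional to the squared distance to the
    # nearest centroid chosen so far.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centroids = np.array(centroids)

    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Eq. 1: assign each point to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        new_labels = d2.argmin(axis=1)
        # Step 3: stop when no assignment changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Eq. 2: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Demo on two well-separated synthetic blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
labels, centroids = kmeans_pp(blobs, k=2)
print(centroids.round(2))
```

In practice, scikit-learn's `KMeans(init="k-means++")` provides the same seeding strategy with additional safeguards such as multiple restarts.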

Ensemble learning

Stacking is an ensemble learning method that combines multiple classification models with a single meta-classifier. Stacking first obtains several base models based on different algorithms by parallel training, then combines the output of each base model by training a meta-model, and finally takes the output of the meta-model as the final output. Stacking in this paper uses NB, KNN and DT as the base models and SVM as the meta-model. The code of the stacking method is shown in Algorithm 1 below.

Here, $\mathrm{D}$ is the dataset, ${\mathrm{x}}_{\mathrm{i}}$ is each sample, ${\mathrm{y}}_{\mathrm{i}}$ is the label corresponding to each sample, ${\mathrm{D}}^{\prime}_{\mathrm{test}}=\left\{\left({\mathrm{P}}_{\mathrm{test}},{\mathrm{y}}_{\mathrm{i}}\right)\right\}={\left\{\left({\mathrm{P}}_{1\mathrm{i}},{\mathrm{P}}_{2\mathrm{i}},{\mathrm{P}}_{3\mathrm{i}},\dots ,{\mathrm{P}}_{\mathrm{ni}},{\mathrm{y}}_{\mathrm{i}}\right)\right\}}_{\mathrm{i}=1}^{\mathrm{n}}$, and ${\mathrm{P}}_{\mathrm{test}}={\left({\mathrm{P}}_{\mathrm{i}1},{\mathrm{P}}_{\mathrm{i}2},{\mathrm{P}}_{\mathrm{i}3},\dots ,{\mathrm{P}}_{\mathrm{in}}\right)}^{\mathrm{T}}$ is the output of the jth $\left(\mathrm{j}=1,2,3,\dots ,\mathrm{t}\right)$ base model.
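
For readers who want a concrete starting point, the stacking configuration described above (NB, KNN and DT as base models, SVM as the meta-model) can be approximated with scikit-learn's `StackingClassifier`. This is a minimal sketch under assumptions, not the authors' code: the data are synthetic stand-ins for the preprocessed PIMA features, all hyperparameters are library defaults, and the 5-fold internal cross-validation used to build the meta-model's training inputs is a choice made here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 5 features, binary label.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, class_sep=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Base models: Naive Bayes, K-Nearest Neighbor and Decision Tree.
base_models = [
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

# Meta-model: an SVM trained on the cross-validated outputs of the base models.
stack = StackingClassifier(estimators=base_models, final_estimator=SVC(), cv=5)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy: {stack_accuracy:.3f}")
```

Training the meta-model on out-of-fold predictions rather than on predictions for data the base models have already seen reduces the risk that the meta-model simply learns to trust an overfitted base model.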

Grid search

In machine learning algorithms, the choice of parameters strongly affects the effectiveness of a model. Tuning parameters by manual trial and error can reach good values after a finite number of steps, but it is labor-intensive and inefficient. To improve efficiency, reduce human error and find the best parameters systematically, grid search is used to select the parameters in this study. Grid search is an exhaustive search over specified parameter values: it loops over every combination of candidate values, trains the model with each combination, and tests the model on the validation set. The combination that gives the best model performance is returned as the result of the grid search and is the optimal parameter setting within the candidate range. The grid search method therefore ensures that the best model parameters within the candidate range are found.

Because the grid search method is an exhaustive approach that requires traversal of all possible parameter combinations, it can be time-consuming for large datasets and models with multiple parameters. The PIMA Indian diabetes dataset used in this study is a small dataset, and the model has relatively few parameters; thus, it is appropriate to use the grid search method to find the optimal parameters of the model.
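The exhaustive loop described above can be sketched in a few lines of plain Python. The tiny one-dimensional KNN classifier, the toy training and validation points and the parameter names `k` and `weighted` are hypothetical and serve only to make the loop concrete; they are not the models or parameter ranges used in this study.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, int]  # (feature value, class label)


def knn_predict(train: List[Sample], x: float, k: int, weighted: bool) -> int:
    """Classify x by a vote among its k nearest training samples."""
    neighbors = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    votes = {0: 0.0, 1: 0.0}
    for xi, yi in neighbors:
        # Optionally weight each vote by inverse distance.
        votes[yi] += 1.0 / (abs(xi - x) + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


def validation_accuracy(
    train: List[Sample], valid: List[Sample], k: int, weighted: bool
) -> float:
    """Fraction of validation samples classified correctly."""
    hits = sum(knn_predict(train, x, k, weighted) == y for x, y in valid)
    return hits / len(valid)


def grid_search(
    param_grid: Dict[str, list], score_fn: Callable[..., float]
) -> Tuple[dict, float]:
    """Try every combination in param_grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = {}, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:  # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score


# Toy data (hypothetical), with some label noise.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 0), (9.0, 1)]
valid = [(2.5, 0), (3.6, 0), (6.5, 1), (8.6, 1)]

param_grid = {"k": [1, 3, 5], "weighted": [False, True]}
best_params, best_score = grid_search(
    param_grid, lambda **p: validation_accuracy(train, valid, **p)
)
print(best_params, best_score)
```

The number of evaluations is the product of the candidate list lengths (here 3 × 2 = 6), which is why the cost grows quickly as parameters are added.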

Experiment

Dataset

This experiment used the PIMA Indian diabetes dataset, a common dataset for diabetes prediction.

Dataset description

The PIMA Indian diabetes dataset was obtained from the UCI Machine Learning Repository. It contains records of 768 women from Arizona, USA, all at least 21 years of age, some with type 2 diabetes and some without. The dataset includes nine attributes: eight related to diabetes diagnosis (number of pregnancies, body mass index, insulin level, age, blood pressure, skin thickness, glucose and diabetes pedigree function) and one label attribute, which distinguishes diabetic from nondiabetic subjects. The dataset consists of 268 test-positive examples and 500 test-negative examples. The attribute values are described in Table 1.

Table 1 Dataset description
