Skip to main content

A voting-based machine learning approach for classifying biological and clinical datasets

Abstract

Background

Different machine learning techniques have been proposed to classify a wide range of biological/clinical data. Given the practicability of these approaches accordingly, various software packages have been also designed and developed. However, the existing methods suffer from several limitations such as overfitting on a specific dataset, ignoring the feature selection concept in the preprocessing step, and losing their performance on large-size datasets. To tackle the mentioned restrictions, in this study, we introduced a machine learning framework consisting of two main steps. First, our previously suggested optimization algorithm (Trader) was extended to select a near-optimal subset of features/genes. Second, a voting-based framework was proposed to classify the biological/clinical data with high accuracy. To evaluate the efficiency of the proposed method, it was applied to 13 biological/clinical datasets, and the outcomes were comprehensively compared with the prior methods.

Results

The results demonstrated that the Trader algorithm could select a near-optimal subset of features with a significant level of p-value < 0.01 relative to the compared algorithms. Additionally, on the large-sie datasets, the proposed machine learning framework improved prior studies by ~ 10% in terms of the mean values associated with fivefold cross-validation of accuracy, precision, recall, specificity, and F-measure.

Conclusion

Based on the obtained results, it can be concluded that a proper configuration of efficient algorithms and methods can increase the prediction power of machine learning approaches and help researchers in designing practical diagnosis health care systems and offering effective treatment plans.

Peer Review reports

Background

Classification is the process of dividing data samples into different groups using the machine learning (ML) approaches [1]. This technique has been extended to a wide range of computational and biological applications such as identifying potential gene/miRNA/protein biomarkers [2], repurposing drugs against different diseases [3], suggesting novel therapeutic modalities for curing illnesses [4], diagnosing heart and diabetes sicknesses [5], and better perceiving biological phenomena [6]. In this line, several ML strategies have been developed, resulting in the generation of computer-aided health decision support systems [7]. These strategies aimed to improve the ML and feature selection (FS) algorithms mainly because of their effects on the performance of a classification model [8]. For instance, to diagnose diabetes disease in its early stages, Patil et al. utilized C4.5 and k-means clustering ML algorithms and achieved ~ 92.38% value of tenfold cross-validation accuracy on the Pima Indian Diabetes (PID) dataset [9]. To this end, the researchers removed serum insulin and triceps skinfold features and reduced the total number of samples from 768 to 625. The researchers then determined the data patterns using the k-means algorithm and eliminated 192 other instances. Based on the obtained patterns, a decision tree was formed, and the produced model was evaluated. Although these researchers generated a proper prediction model with a high value of accuracy on the PID dataset, their methods suffered from overfitting because of removing a remarkable number of the data instances. To tackle such a limitation, Aslam et al. examined a three-step ML method [10]. In the first phase, based on different statistical methods (e.g., Kolmogorov–Smirnov test and t-test), the existing diabetes features were ranked, and some subsets of diabetes features were produced using a progressive FS manner. In the second phase, for every generated subset of features, a genetic programming technique was employed. In the third phase, the usefulness of the produced features was measured using the k-nearest neighbor (KNN) and support vector machine (SVM) classifiers. The results demonstrated that the Gaussian process-SVM (GP-SVM) technique resulted in about 87% of accuracy. In addition to the PID dataset, several studies targeted other biological/clinical datasets and suggested some real-world consistent prediction models [11,12,13]. For this purpose, the prior studies combined various computational techniques such as the teaching learning-based optimization algorithm (OA) with the fuzzy wavelet neural network [14], the rough set theory with the backpropagation neural network [15], and the fuzzy concept with the min–max neural networks [16]. The mentioned computational strategies have been applied to the Cleveland heart disease (CHD) [17], Statlog heart disease (SHD) [18], Wisconsin diagnostic breast cancer (WDBC) disease and mammogram datasets [19, 20], respectively. Some researchers also designed other types of hybrid ML techniques and applied them to different biological/clinical datasets [21,22,23,24]. The mentioned studies encounter several limitations, including low prediction power, inability in grouping multiclass data (more than two classes), overfitting, and filtering the samples with missing values. Therefore, Arabi et al. suggested a ML approach that creates a specific model for every class of existing data [25, 26]. For this purpose, the researchers acquired several regression and classification datasets from the ML repository of the University of California, Irvine (UCI). Then, after normalizing the obtained data, a distinct model was generated for every class of data. For example, if the data of interest included three classes, three individual machines were designed for each of them. In the next phase, a cascade-like artificial neural network was designed and trained using the world competitive contests (WCC) optimization algorithm [27]. The performance of the methods was investigated on the different partitioned train and test datasets (e.g., 70%-30% or 60%-40%), indicating that their methods outperformed other ML approaches in terms of classification criteria. Although the described technique by Arabi and coworkers yielded a model with a higher prediction ability, their ML method was not suitable for large-size datasets.

To address the above-mentioned constraints, in the present study, we extended the Trader optimization algorithm for selecting a near-optimal subset of features and generating an efficient prediction model in terms of classification criteria [28]. Additionally, to get better prediction results, a voting-based ML framework was proposed, labeling data samples based on the consensus of predictions obtained from different artificial machines. To evaluate the proposed machine learning approaches, in all the computational experiments, the fivefold cross-validation technique was used.

Methods

The UCI repository has collected various datasets from different scopes and provided a suitable resource for machine learning applications. From this repository, a total of 13 clinical/biological datasets, utilized in various research work as gold-standard input files, were obtained (Table 1). These datasets included different numbers of samples and features/genes, so they seemed to be qualified for evaluating the proposed method in different conditions. The proposed framework, shown in Fig. 1, was applied to these data, and the results were compared from a wide range of classification measurements.

Table 1 The properties of the datasets obtained from the UCI repository
Fig. 1
figure 1

The framework of the proposed voting-based machine learning method for classifying biological/clinical datasets. The final prediction is determined by aggregating the outputs of different models. \({\mathrm{F}}_{{\mathrm{p}}_{\mathrm{i}}}\) shows the total number of models that predicted a given sample belongs to the ith group

The acquired data were preprocessed in two phases that completed missing values and normalized the data, sequentially. To complete the missing values of a given sample, firstly, ten other samples (not including the missing values and sharing a high value of Pearson correlation coefficient (> 0.5) with the sample of interest) were chosen if plausible. Secondly, the missing value of the sample of interest was determined based on calculating the interpolation of the related values. Finally, for normalizing the data, Eq. 1 was used, which converts the data of a feature to range [0, 1].

$$x_{{i.j_{new} }} = \frac{{x_{{i.j_{current} - min_{j} }} }}{{max_{j} - min_{j} }}$$
(1)

where xi,j, minj, and maxj represent the value of the ith row and jth column of a given data, and the minimum and maximum values of the jth column, respectively.

Our previously suggested Trader optimization algorithm (OA) was modified and developed to select an optimal/near-optimal subset of features/genes [29, 30]. In this line, as shown in Fig. 2, the algorithm generated some potential candidate solutions (CS) randomly, each of which included a set of selected features/genes and was displayed using an array (Eq. 2).

$${\text{CS}} = \left[ {{\text{V}}_{{1}} ,{\text{V}}_{{2}} ,{\text{V}}_{{3}} , \ldots ,{\text{V}}_{{\text{n}}} } \right]$$
(2)

where CS and Vi stand for a candidate solution and its ith variable, respectively.

Fig. 2
figure 2

The flowchart of the suggested Trader optimization algorithm for selecting a near-optimal subset of features/genes. The algorithm generates some random candidate solutions (CS) and evaluates them using the value of accuracy obtained from the support vector machine (SVM) classifier. Next, the algorithm divides the CSs into several groups and modifies them with three operators. At the final step, the best acquired CS is introduced as a near-optimal subset of features/genes, which can enhance the prediction ability of the SVM classifier

The CSs were then divided into several groups in a random manner. Every group and the sum of its CSs’ scores represented a trader and its finances/benefits, respectively.

In the next step, the produced CSs were evaluated using the SVM classification technique (i.e., the accuracy obtained from SVM, was considered as the worthiness/fitness/score of a given subset of features). Since some of the datasets consisted of more than two classes, to calculate the value of accuracy, a mean-based method (Eq. 3) was used.

$$Accuracy = \frac{{\mathop \sum \nolimits_{i = 1}^{C} TP_{i} }}{N}$$
(3)

where TPi, C, and N show the total number of true positives of the ith class, the total number of data classes, and the total number of data samples, respectively.

Based on the calculated fitness values, the CSs of every group were separated into master and slave CSs (MCS and SCS, respectively), consequentially showing the best CS and the remaining CSs of a group. Next, the CSs went through a series of changes using the three operators of the Trader algorithm, and new CSs were produced. The first operator of Trader (the retailing operator (Eq. 4)) created minor modifications in the SCS. This operator usually plays an essential role in improving the CSs in the last steps of the algorithm.

$$\forall_{j \in S} \left( {SCS_{i,j} = SCS_{i,j} + k \times random\left( {0 , SCS_{i,j} } \right)} \right)$$
(4)

where SCSi,j and k indicate the jth variable of the ith slave CS and a random value (i.e., either -1 or 1), respectively. Additionally, S displays a set of randomly selected variables of the ith slave CS.

The second operator of Trader (the distributing operator (Eq. 5)) aimed to improve the SCSs by transferring data from their related MCSs. For this purpose, for a given SCS, some variables were randomly chosen from an MCS, and then, their values were transferred to that SCS.

$$\forall_{j \in S} \left( {SCS_{i,j} = MCS_{i,j} } \right)$$
(5)

where SCSi,j, MCSi,j, and S are the jth variable of the ith SCS, the jth variable of the ith group’s MCS, and a set of randomly chosen variables of the ith CS, respectively.

The third operator of Trader, the importing-exporting (IE) operator (Eq. 6), aimed to correct the MCSs and change CSs globally. For this purpose, an MCS was considered the importer while the others as exporters. Like two previous operators, the IE operator acted randomly and changed a given MCS in a similar method described for the second operator.

$$\forall_{j \in S, k \in M} \left( {MCS_{i,j} = MCS_{k,j} } \right)$$
(6)

where MCSi,j, MCSM,j, S, and M display the jth variable of the ith importer MCS, the jth variable of the Mth exporter MCS, a set of randomly chosen variables of the ith importer MCS, and a set of randomly chosen variables of the Mth exporter MCS, respectively.

The proposed OA was compared with other popular OAs in the same conditions. For instance, all the algorithms had a time order of O(n3) and called an identical number of the objective function (SVM) during a distinct run.

Results

The proposed method was implemented in the MATLAB programming language, and the outputs were examined in terms of various criteria associated with evaluating the modified optimization algorithm (Trader) and proposed voting-based classification system. To evaluate the usefulness of Trader in selecting informative features/genes, it was applied to the downloaded datasets (Table 1), and the outcomes were compared with four other public/effective optimization algorithms. These algorithms (i.e., WCC [27], LCA [31], PSO [32], and ICA [33]) were chosen because of their diversities and proper functionalities reported in the prior studies. Because the values of OAs’ parameters strongly affected their efficiencies, a trial–error method was employed to regulate them [34]. The initial population size of the algorithms was set to 100, and, their steps iteration parameter was regulated to 50. Besides, in every iteration, each of the algorithms changed 30% of candidate solutions. The results of this section were organized into three sections described as follows.

The first part of the results compared the performance of the mentioned OAs in terms of improving the prediction power of a learner. To this end, the data of features/genes, chosen by the algorithms, were passed to SVM [35], and the learner then created a model for classifying them. As mentioned in the materials and methods section, OAs generated some random potential answers and modified/improved them using their operators. Hence, it was usually expected to get better results in the jth iteration than in the ith iteration (j > i) (the convergence behavior of OAs) [36]. For the datasets consisting of > 10 features/genes, the convergence behavior of the algorithm was followed, and the mean outcomes of 50 individual executions, were depicted (Fig. 3). Since the performance of the algorithms on the SHD and CHD datasets were similar, only the convergence diagram of the algorithms on the SHD was displayed. Based on the acquired results, Trader was able to select more distinctive features and get higher values of accuracy than the other algorithms. Therefore, it can be articulated that the proposed OA had a better convergence behavior than other compared OAs.

Fig. 3
figure 3

The convergence behavior of the OAs on the a SHD, b HEP, c PAR, d WDBC, e LUNG, f ARRYTM, g PARKINSON, h ARCENE, and i GENEEXPR datasets

In every individual run, due to generating and changing CSs based on stochastic operators, the OAs selected a diverse subset of features and produced different values of accuracy. However, the final accuracy values must be similar to each other (the stability behavior of OAs) [37,38,39,40]. To examine this property of the OAs in detail, they were executed 50 times, and the distribution of their outputs was illustrated using boxplots (Fig. 4). The obtained outcomes indicated that Trader generally boosted the performance of SVM (via selecting the most informative features) and showed a more stable behavior than the other OAs.

Fig. 4
figure 4

The stability behavior of the OAs on the a SHD, b HEP, c PAR, d WDBC, e LUNG, f ARRYTM, g PARKINSON, h ARCENE, and i GENEEXPR datasets on 50 individual executions

Further, to investigate whether the performance of Trader and the other OAs was the same or not, the Wilcoxon rank-sum test was utilized. For this purpose, the results of Trader were considered as a test base and compared with the others’ outcomes (Table 2). It was observed that, for most cases, the p-value was less than 0.01 and about close to zero, resulting in rejecting the null hypothesis and validating that Trader outperformed the others in terms of enhancing the prediction power of SVM.

Table 2 The Wilcoxon rank-sum test of the OAs with Trader’s results as the test base

In the second part of the computational experiments, the proposed algorithm was embedded into a voting-based prediction system. To evaluate this system, different criteria were considered, such as accuracy (ACC), precision (PRE), sensitivity (SEN), specificity (SPC), and F-score (F). The outputs were then organized into several tables as well as receiver operating characteristic (ROC) and precision-recall (PR) curves. The compared machine learning approaches indicated different efficiencies on diverse datasets. Based on the outcomes, the proposed voting-based system generally outperformed the others in terms of the mentioned parameters (Table 3).

Table 3 Comparing the utilized methods based on the fivefold cross-validation technique

The classification power of the generated models, separating positive or negative data samples into their related classes, was examined using the ROC and PR curves shown in Figs. 5 and 6, respectively. Although most of previously performed studies utilized the ROC and PR diagrams for evaluating the performance of binary classifiers, the presented study extended this concept to multi-class classifiers. To this end, Eq. 3 was used for calculating the values of SEN, SPC, and PRE. The acquired diagrams expressed that the proposed voting-based approach was closer to a perfect classifier than the others were. To display the area under curve (AUC) of the classifiers, for every dataset, two bar diagrams were provided, showing the AUC of the ROC (Fig. 7) and PR (Fig. 8) curves, respectively.

Fig. 5
figure 5

The ROC curve of the algorithms on the a SHD, b HEP, c PAR, d WDBC, e LUNG, f ARRYTM, g PARKINSON, h ARCENE, and i GENEEXPR datasets

Fig. 6
figure 6

The PR curve of the algorithms on the a SHD, b HEP, c PAR, d WDBC, e LUNG, f ARRYTM, g PARKINSON, h ARCENE, and i GENEEXPR datasets

Fig. 7
figure 7

The AUC bar diagram for ROC curves on the a SHD, b HEP, c PAR, d WDBC, e LUNG, f ARRYTM, g PARKINSON, h ARCENE, and i GENEEXPR datasets

Fig. 8
figure 8

The AUC bar diagram for PR curves on the a SHD, b HEP, c PAR, d WDBC, e LUNG, f ARRYTM, g PARKINSON, h ARCENE, and i GENEEXPR datasets

In the last part of the computational experiments, a comprehensive comparison between the proposed ML method and those suggested in the previous studies was performed. Up to the time of designing the present study, much attention has already been paid to introducing novel ML techniques and generating powerful biological data classifiers. Among these studies, the artificial neural network-based ML approach, introduced by Arabi et al., outperformed the other previously introduced ML methods in terms of the classification benchmarks. Here, the outputs of these studies were collected and then extended with the results obtained from the current study (Table 4). The outcomes indicated that the accuracy of the proposed voting-based ML approach was higher than the accuracy of the previously introduced ML methods. All in all, the suggested voting-based system improved all the criteria by ~ 10%.

Table 4 A comprehensive comparison between the proposed voting-based method and the other previously introduced approaches

Discussion

To classify various types of clinical/biological datasets, the present study introduced a novel wrapper ML method that combined the Trader algorithm for selecting a near-optimal subset of features/genes and the SVM classifier for scoring them. Although the previous studies had suggested several ML methods and algorithms to stratify clinical/biological datasets [53,54,55], they encountered two critical limitations described as follows.

First, some literary works ignored the FS concepts in the data preprocessing step or utilized some heuristic filter-based FS techniques. For instance, several studies ranked and reduced the total number of the existing features/genes in a specific application such as introducing a limited number of genes as potential biomarkers for certain cancer [56]. To this end, some statistical-based FS algorithms have been utilized, such as the data Entropy-based FS method. As demonstrated in many recent bioinformatics-related pieces of research, wrapper-based FS approaches outperformed the filter and embedded-based FS techniques [57, 58], and two-step FS methods usually showed better functionality than single-step procedures [59,60,61]. Besides, in some cases, previous studies presented that filter-based FS techniques may reduce the prediction power of a learner [8, 62]. Hence, given the capabilities of the Trader algorithm in Np-hard problems, this study developed the algorithm for selecting the features and applied it not only to large-size datasets but also to small-size ones. The related outcomes (obtained from both the small and large-size datasets) indicated that the FS concept was a critical preprocessing step for biological applications, and the performance of the algorithms differed from each other on various datasets. In addition to gaining a suitable prediction model, the outputs of the FS phase may be essential for designing diagnosis/treatment plans, such as introducing the selected features/genes/proteins/miRNA as potential biomarkers for a wide range of diseases. The discovered biomarkers might be further investigated to determine their druggability properties and find candidate medicines to inhibit them.

Second, some previously carried out studies tested the usefulness of their methods on small-size datasets [63, 64]. Therefore, their proposed approaches could not be embedded into software packages due to their lower performances on large-size data. To address the mentioned restriction, a voting-based ML framework was introduced and applied to the different datasets having various properties. It was shown that the suggested framework could boost the prediction power of classification systems on both the small and large-size datasets whereas the previously introduced ML techniques lost their performances on large-size datasets. For example, Arabi et al. introduced and developed a perceptron-based artificial neural network for classifying 13 clinical/biological datasets and showed that their designed artificial neural network had a higher prediction ability than the other performed methods [25]. Arabi’s proposed method generated a distinct model for every class of a given data and categorized a data sample into a group whose related model represented the highest value of score. However, the outcomes of the present study exhibited that the mentioned ML method (introduce by Arabi et al.) suffered from overfitting on the small-size datasets. In other words, the proposed voting-based ML system yielded a more powerful prediction model on large-size datasets than the previously performed approaches. On the small-size datasets, the efficiency of the proposed ML framework was slightly lower than that of Arabi’s method in terms of the classification criteria. This issue was probably because of overfitting Arabi’s approach on the small-size datasets.

Like all the other previously carried out studies, the present work also might suffer from some limitations. Especially, the current study was organized based on the five OAs producing non-deterministic but acceptable outcomes. Hence, in designing a healthcare system, the deterministic rate may decrease. To deal with such a limitation, a possible solution can be identifying a proper configuration of algorithms that can correctly display a synergic effect. However, obtaining such a configuration seems to be a challenging task. Collectively, a combination of algorithms, such as various types of operators for changing CSs, might be an advantageous approach.

Conclusion

This study extended our previously introduced optimization algorithm, Trader, to select a near-optimal subset of features/genes and proposed a voting-based machine learning technique to classify large-size biological/clinical datasets. According to the acquired results, it was indicated that the suggested voting-based classification framework yielded better predictions than the other previously performed studies. As a result, this technique can be considered an effective diagnosis/treatment approach such as discovering potential biomarkers and drugs to combat different diseases. In addition, the outcomes indicated that the feature selection concept is an essential preprocessing phase not only for large-size biological/clinical datasets but also for small-size ones, whereas most of the prior studies neglected the effect of the feature selection concept in their computational methods.

Availability of data and materials

The data of interest were obtained from the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php).

Abbreviations

ACC:

Accuracy

CS:

Candidate solution

CHD:

Cleveland heart disease

FS:

Feature selection

IE:

Importing-exporting

KNN:

K-nearest neighbor

ML:

Machine learning

OA:

Optimization algorithm

PR:

Precision-recall

PRE:

Precision

ROC:

Receiver operating characteristic

SEN:

Sensitivity

SHD:

Statlog heart disease

SPC:

Specificity

SVM:

Support vector machine

UCI:

University of California, Irvine

WCC:

World competitive contests

WDBC:

Wisconsin diagnostic breast cancer

References

  1. Abbasi K, Razzaghi P, Poso A, Ghanbari-Ara S, Masoudi-Nejad A. Deep learning in drug target interaction prediction: current and future perspectives. Curr Med Chem. 2021;28(11):2100–13.

    Article  CAS  PubMed  Google Scholar 

  2. Sarkar JP, Saha I, Sarkar A, Maulik U. Machine learning integrated ensemble of feature selection methods followed by survival analysis for predicting breast cancer subtype specific miRNA biomarkers. Comput Biol Med. 2021;131:104244.

    Article  CAS  PubMed  Google Scholar 

  3. Masoudi-Sobhanzadeh Y, Salemi A, Pourseif MM, Jafari B, Omidi Y, Masoudi-Nejad A. Structure-based drug repurposing against COVID-19 and emerging infectious diseases: methods, resources and discoveries. Brief Bioinf. 2021;22(6):113.

    Article  Google Scholar 

  4. Tiulpin A, Klein S, Bierma-Zeinstra S, Thevenot J, Rahtu E, Meurs JV, Oei EH, Saarakkala S. Multimodal machine learning-based knee osteoarthritis progression prediction from plain radiographs and clinical data. Sci Rep. 2019;9(1):1–11.

    Article  Google Scholar 

  5. Ibrahim I, Abdulazeez A. The role of machine learning algorithms for diagnosing diseases. J Appl Sci Technol Trends. 2021;2(01):10–9.

    Article  Google Scholar 

  6. Walsh I, Fishman D, Garcia-Gasulla D, Titma T, Pollastri G, Harrow J, Psomopoulos FE, Tosatto SC. DOME: recommendations for supervised machine learning validation in biology. Nat Methods. 2021;18(10):1122–7.

    Article  CAS  PubMed  Google Scholar 

  7. Anbarasi A, Ravi S, Vaishnavi J, Matla S. Computer aided decision support system for mitral valve diagnosis and classification using depthwise separable convolution neural network. Multimedia Tools Appl. 2021;80(14):21409–24.

    Article  Google Scholar 

  8. Masoudi-Sobhanzadeh Y, Motieghader H, Omidi Y, Masoudi-Nejad A. A machine learning method based on the genetic and world competitive contests algorithms for selecting genes or features in biological applications. Sci Rep. 2021;11(1):1–19.

    Article  Google Scholar 

  9. Patil BM, Joshi RC, Toshniwal D. Hybrid prediction model for type-2 diabetic patients. Expert Syst Appl. 2010;37(12):8102–8.

    Article  Google Scholar 

  10. Aslam MW, Zhu Z, Nandi AK. Feature generation using genetic programming with comparative partner selection for diabetes classification. Expert Syst Appl. 2013;40(13):5402–12.

    Article  Google Scholar 

  11. Colombo F, Calesella F, Mazza MG, Melloni EMT, Morelli MJ, Scotti GM, Benedetti F, Bollettini I, Vai B. Machine learning approaches for prediction of bipolar disorder based on biological, clinical and neuropsychological markers: a systematic review and meta-analysis. Neurosci Biobehav Rev. 2022;5:104552.

    Article  Google Scholar 

  12. Abbasi K, Razzaghi P, Poso A, Amanlou M, Ghasemi JB, Masoudi-Nejad A. DeepCDA: deep cross-domain compound–protein affinity prediction through LSTM and convolutional neural networks. Bioinformatics. 2020;36(17):4633–42.

    Article  CAS  PubMed  Google Scholar 

  13. Abbasi K, Poso A, Ghasemi J, Amanlou M, Masoudi-Nejad A. Deep transferable compound representation across domains and tasks for low data drug discovery. J Chem Inf Model. 2019;59(11):4528–39.

    Article  CAS  PubMed  Google Scholar 

  14. Alneamy JSM, Alnaish ZAH, Hashim SM, Alnaish RAH. Utilizing hybrid functional fuzzy wavelet neural networks with a teaching learning-based optimization algorithm for medical disease diagnosis. Comput Biol Med. 2019;112:103348.

    Article  Google Scholar 

  15. Li X, Jiang Q, Hsu MK, Chen Q. Support or risk? software project risk assessment model based on rough set theory and backpropagation neural network. Sustainability. 2019;11(17):4513.

    Article  Google Scholar 

  16. Al Sayaydeh ON, Mohammed MF, Lim CP. Survey of fuzzy min–max neural network for pattern classification variants and applications. IEEE Trans Fuzzy Syst. 2018;27(4):635–45.

    Article  Google Scholar 

  17. Reddy GT, Reddy M, Lakshmanna K, Rajput DS, Kaluri R, Srivastava G. Hybrid genetic algorithm and a fuzzy logic classifier for heart disease diagnosis. Evol Intel. 2020;13(2):185–96.

    Article  Google Scholar 

  18. Arunprakash R, Bhavadharini R, Vinoth Kumar K, Jayasankar T: Intelligent Heart Disease Detection and Classification Method Using Optimal Neuro-Fuzzy with Stochastic Gradient Descent Model for Smart Societies. In: Artificial Intelligence Applications for Smart Societies. Springer; 2021: 179–193.

  19. Obaid OI, Mohammed MA, Ghani M, Mostafa A, Taha F. Evaluating the performance of machine learning techniques in the classification of Wisconsin Breast Cancer. Int J Eng Technol. 2018;7(436):160–6.

    Google Scholar 

  20. Alshayeji MH, Ellethy H, Gupta R. Computer-aided detection of breast cancer on the Wisconsin dataset: an artificial neural networks approach. Biomed Signal Process Control. 2022;71:103141.

    Article  Google Scholar 

  21. Selicato L, Esposito F, Gargano G, Vegliante MC, Opinto G, Zaccaria GM, Ciavarella S, Guarini A, Del Buono N. A new ensemble method for detecting anomalies in gene expression matrices. Mathematics. 2021;9(8):882.

    Article  Google Scholar 

  22. Loddo A, Buttau S, Di Ruberto C. Deep learning based pipelines for Alzheimer’s disease diagnosis: a comparative study and a novel deep-ensemble method. Comput Biol Med. 2022;141:105032.

    Article  PubMed  Google Scholar 

  23. Doppala BP, Bhattacharyya D, Janarthanan M, Baik N. A reliable machine intelligence model for accurate identification of cardiovascular diseases using ensemble techniques. J Healthcare Eng. 2022;5:489.

    Google Scholar 

  24. Wang X, Li F, Xu J, Rong J, Webb GI, Ge Z, Li J, Song J. ASPIRER: a new computational approach for identifying non-classical secreted proteins based on deep learning. Brief Bioinf. 2022;6:749.

    Google Scholar 

  25. Bulaghi ZA, Navin AH, Hosseinzadeh M, Rezaee A. World competitive contest-based artificial neural network: a new class-specific method for classification of clinical and biological datasets. Genomics. 2021;113(1):541–52.

    Article  Google Scholar 

  26. Bulaghi ZA, Navin AHZ, Hosseinzadeh M, Rezaee A. SENET: a novel architecture for IoT-based body sensor networks. Inf Med Unlocked. 2020;20:100365.

    Article  Google Scholar 

  27. Masoudi-Sobhanzadeh Y, Motieghader H. World Competitive Contests (WCC) algorithm: a novel intelligent optimization algorithm for biological and non-biological problems. Inf Med Unlocked. 2016;3:15–28.

    Article  Google Scholar 

  28. Masoudi-Sobhanzadeh Y, Omidi Y, Amanlou M, Masoudi-Nejad A. Trader as a new optimization algorithm predicts drug-target interactions efficiently. Sci Rep. 2019;9(1):1–14.

    Article  CAS  Google Scholar 

  29. Masoudi-Sobhanzadeh Y, Masoudi-Nejad A. Synthetic repurposing of drugs against hypertension: a datamining method based on association rules and a novel discrete algorithm. BMC Bioinf. 2020;21(1):1–21.

    Article  Google Scholar 

  30. Masoudi-Sobhanzadeh Y, Jafari B, Parvizpour S, Pourseif MM, Omidi Y. A novel multi-objective metaheuristic algorithm for protein-peptide docking and benchmarking on the LEADS-PEP dataset. Comput Biol Med. 2021;138:104896.

    Article  PubMed  Google Scholar 

  31. Kashan AH. An efficient algorithm for constrained global optimization and application to mechanical engineering design: league championship algorithm (LCA). Comput Aided Des. 2011;43(12):1769–92.

    Article  Google Scholar 

  32. Inbarani HH, Azar AT, Jothi G. Supervised hybrid feature selection based on PSO and rough sets for medical diagnosis. Comput Methods Programs Biomed. 2014;113(1):175–85.

    Article  PubMed  Google Scholar 

  33. Tang Z, Todo Y, Gao S. A hybrid discrete imperialist competition algorithm for gene selection for microarray data. Curr Proteom. 2018;15(2):99–110.

    Article  Google Scholar 

  34. Yang X-S. Nature-inspired optimization algorithms: challenges and open problems. J Comput Sci. 2020;46:101104.

    Article  Google Scholar 

  35. Masoudi-Sobhanzadeh Y, Motieghader H, Masoudi-Nejad A. FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinf. 2019;20(1):1–17.

    Article  Google Scholar 

  36. Li M-W, Wang Y-T, Geng J, Hong W-C. Chaos cloud quantum bat hybrid optimization algorithm. Nonlinear Dyn. 2021;103(1):1167–93.

    Article  Google Scholar 

  37. Tubishat M, Idris N, Shuib L, Abushariah MA, Mirjalili S. Improved salp swarm algorithm based on opposition based learning and novel local search algorithm for feature selection. Expert Syst Appl. 2020;145:113122.

    Article  Google Scholar 

  38. Tubishat M, Ja’afar S, Alswaitti M, Mirjalili S, Idris N, Ismail MA, Omar MS. Dynamic salp swarm algorithm for feature selection. Expert Syst Appl. 2021;164:113873.

    Article  Google Scholar 

  39. Nouri-Moghaddam B, Ghazanfari M, Fathian M. A novel multi-objective forest optimization algorithm for wrapper feature selection. Expert Syst Appl. 2021;175:114737.

    Article  Google Scholar 

  40. Arora S, Anand P. Binary butterfly optimization approaches for feature selection. Expert Syst Appl. 2019;116:147–60.

    Article  Google Scholar 

  41. Christo VE, Nehemiah HK, Brighty J, Kannan A. Feature selection and instance selection from clinical datasets using co-operative co-evolution and classification using random forest. IETE J Res. 2020;6:1–14.

    Google Scholar 

  42. Ouadfel S, Abd Elaziz M. Enhanced crow search algorithm for feature selection. Expert Syst Appl. 2020;159:113572.

    Article  Google Scholar 

  43. Sreejith S, Nehemiah HK, Kannan A. A classification framework using a diverse intensified strawberry optimized neural network (DISON) for clinical decision-making. Cogn Syst Res. 2020;64:98–116.

    Article  Google Scholar 

  44. Elgin Christo V, Khanna Nehemiah H, Minu B, Kannan A. Correlation-based ensemble feature selection using bioinspired algorithms and classification using backpropagation neural network. Comput Math Methods Med. 2019;7:148.

    Google Scholar 

  45. Dennis B, Muthukrishnan S. AGFS: adaptive genetic fuzzy system for medical data classification. Appl Soft Comput. 2014;25:242–52.

    Article  Google Scholar 

  46. Subbulakshmi C, Deepa S. Medical dataset classification: a machine learning paradigm integrating particle swarm optimization with extreme learning machine classifier. The Sci World J. 2015;14:795.

    Google Scholar 

  47. Rao S, Verma AK, Bhatia T. A review on social spam detection: challenges, open issues, and future directions. Expert Syst Appl. 2021;186:115742.

    Article  Google Scholar 

  48. Murugesan S, Bhuvaneswaran R, Khanna Nehemiah H, Keerthana Sankari S, Nancy Jane Y. Feature selection and classification of clinical datasets using bioinspired algorithms and super learner. Comput Math Methods Med. 2021;9:1478.

    Google Scholar 

  49. Anter AM, Ali M. Feature selection strategy based on hybrid crow search optimization algorithm integrated with chaos theory and fuzzy c-means algorithm for medical diagnosis problems. Soft Comput. 2020;24(3):1565–84.

    Article  Google Scholar 

  50. Al-Thanoon NA, Algamal ZY, Qasim OS. Feature selection based on a crow search algorithm for big data classification. Chemom Intell Lab Syst. 2021;212:104288.

    Article  CAS  Google Scholar 

  51. Nahato KB, Harichandran KN, Arputharaj K. Knowledge mining from clinical datasets using rough sets and backpropagation neural network. Comput Math Methods Med. 2015;3:1400.

    Google Scholar 

  52. Nahato KB, Nehemiah KH, Kannan A. Hybrid approach using fuzzy sets and extreme learning machine for classifying clinical datasets. Inform Med Unlocked. 2016;2:1–11.

    Article  Google Scholar 

  53. MotieGhader H, Masoudi-Sobhanzadeh Y, Ashtiani SH, Masoudi-Nejad A. mRNA and microRNA selection for breast cancer molecular subtype stratification using meta-heuristic based algorithms. Genomics. 2020;112(5):3207–17.

    Article  CAS  PubMed  Google Scholar 

  54. Wegner FK, Benesch Vidal ML, Niehues P, Willy K, Radke RM, Garthe PD, Eckardt L, Baumgartner H, Diller G-P, Orwat S. Accuracy of deep learning echocardiographic view classification in patients with congenital or structural heart disease: importance of specific datasets. J Clin Med. 2022;11(3):690.

    Article  PubMed  PubMed Central  Google Scholar 

  55. Elgin Christo V, Khanna Nehemiah H, Keerthana Sankari S, Jeyaraj S, Kannan A. Classification framework for clinical datasets using synergistic firefly optimization. IETE J Res. 2021;5:1–20.

    Google Scholar 

  56. Haghjoo N, Masoudi-Nejad A. Introducing a panel for early detection of lung adenocarcinoma by using data integration of genomics, epigenomics, transcriptomics and proteomics. Exp Mol Pathol. 2020;112:104360.

    Article  CAS  PubMed  Google Scholar 

  57. Remeseiro B, Bolon-Canedo V. A review of feature selection methods in medical applications. Comput Biol Med. 2019;112:103375.

    Article  CAS  PubMed  Google Scholar 

  58. Bommert A, Welchowski T, Schmid M, Rahnenführer J. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Brief Bioinform. 2022;23(1):354.

    Article  Google Scholar 

  59. Wei Z, Ding S, Duan M, Liu S, Huang L, Zhou F. FeSTwo, a two-step feature selection algorithm based on feature engineering and sampling for the chronological age regression problem. Comput Biol Med. 2020;125:104008.

    Article  PubMed  Google Scholar 

  60. Sun L, Kong X, Xu J, Xue ZA, Zhai R, Zhang S. A hybrid gene selection method based on ReliefF and ant colony optimization algorithm for tumor classification. Sci Rep. 2019;9(1):1–14.

    Google Scholar 

  61. Guo J, Jin M, Chen Y, Liu J. An embedded gene selection method using knockoffs optimizing neural network. BMC Bioinform. 2020;21(1):1–19.

    Article  CAS  Google Scholar 

  62. Alirezanejad M, Enayatifar R, Motameni H, Nematzadeh H. Heuristic filter feature selection methods for medical datasets. Genomics. 2020;112(2):1173–81.

    Article  CAS  PubMed  Google Scholar 

  63. Chang V, Bailey J, Xu QA, Sun Z. Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Comput Appl. 2022;5:1–17.

    Google Scholar 

  64. Rajni R, Amandeep A. RB-Bayes algorithm for the prediction of diabetic in Pima Indian dataset. Int J Electr Comput Eng. 2019;9(6):4866.

    Google Scholar 

Download references

Acknowledgements

Not applicable

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

YM-S and ND participated in the design, analysis of the datasets using the algorithms, interpretation of data, and drafting of the manuscript. YO participated in the study design, interpretation of the data, and drafting of the manuscript. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Yosef Masoudi-Sobhanzadeh or Yadollah Omidi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Daneshvar, N.HN., Masoudi-Sobhanzadeh, Y. & Omidi, Y. A voting-based machine learning approach for classifying biological and clinical datasets. BMC Bioinformatics 24, 140 (2023). https://doi.org/10.1186/s12859-023-05274-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-023-05274-4

Keywords