Artificial Intelligence based wrapper for high dimensional feature selection

Background: Feature selection is important in high dimensional data analysis. The wrapper approach is one of the ways to perform feature selection, but it is computationally intensive as it builds and evaluates models of multiple subsets of features. The existing wrapper algorithm primarily focuses on shortening the path to find an optimal feature set. However, it underutilizes the capability of feature subset models, which impacts feature selection and its predictive performance.
Method and Results: This study proposes a novel Artificial Intelligence based Wrapper (AIWrap) algorithm that integrates Artificial Intelligence (AI) with the existing wrapper algorithm. The algorithm develops a Performance Prediction Model using AI which predicts the model performance of any feature set and allows the wrapper algorithm to evaluate the feature subset performance in a model without building the model. The algorithm can make the wrapper algorithm more relevant for high-dimensional data. We evaluate the performance of this algorithm using simulated studies and real research studies. AIWrap shows better or at par feature selection and model prediction performance than standard penalized feature selection algorithms and wrapper algorithms.
Conclusion: The AIWrap approach provides an alternative algorithm to the existing algorithms for feature selection. The current study focuses on AIWrap application in continuous cross-sectional data. However, it could be applied to other datasets like longitudinal, categorical and time-to-event biological data.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-023-05502-x.

where $f$ represents the model function such that the error function $\phi$ is minimized, i.e., $\min \phi(y, f(q))$. The approaches adopted for feature selection can be categorized into two groups. The first and simpler approach uses expert opinion, where features are selected using domain knowledge [4,5], which allows feature selection before data evaluation. This approach has limited or no applicability when little or no domain information is available for a feature, when the feature space is high dimensional, and/or when the features interact.
The second approach uses the data to perform the feature selection. The algorithms under this approach are broadly classified into filter, embedded and wrapper algorithms [6-8] and could be used in supervised, semi-supervised or unsupervised learning frameworks [8-10]. Filter algorithms rely on the internal data structure of the features for selecting features. Commonly, information gain based algorithms are used for univariate filtering of features [8,11] and correlation based algorithms are used for multivariate filtering of features [12]. They are computationally efficient, but interactions between the features may hinder the feature selection performance. Embedded algorithms incorporate feature selection within the model building step by adding a penalization step in the model building process. They are efficient and can handle interactions between the features. Least Absolute Shrinkage and Selection Operator (LASSO) based algorithms [13,14] are commonly used for linear combination models, while tree-based algorithms [15] are used in non-linear combination models. Wrapper algorithms use an iterative approach of evaluating a feature subset for model performance on a given dataset. The process is repeated until the best performance is obtained [16,17]. Wrapper algorithms provide better performance than the other algorithms, but at a higher computational cost.
The key challenge in wrapper algorithms is that a model is built for every feature subset obtained at each iteration. One strategy for addressing this computational cost is to reduce the number of iterations needed to reach the target feature set $q$ by focusing on how feature subsets are sampled. The feature subset sampling step is commonly performed using random, sequential or evolutionary sampling. The random sampling approach arbitrarily generates the feature subset [18]. The sequential sampling approach adds or removes one feature at a time from a feature set, as in forward sampling and backward sampling [16,19]. The evolutionary sampling approach selects the feature subset based on the performance of features in the previous subset, as in genetic algorithms [20] and swarm optimization [21]. Another strategy is to use hybrid algorithms, in which model building at the iteration step is replaced with filter techniques to estimate the performance of many feature subsets at each iteration step [22-24]. However, the challenges of filter algorithms persist.
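The three sampling families described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in any of the cited works: the function names, the 0.5 inclusion probability, the mutation rate and the toy `score` function are all assumptions made for the example.

```python
import random

def random_subset(p, rng):
    """Random sampling: include each of the p features with probability 0.5 (assumed rate)."""
    subset = [i for i in range(p) if rng.random() < 0.5]
    return subset or [rng.randrange(p)]  # never return an empty subset

def forward_step(current, p, score):
    """Sequential forward sampling: add the single feature that gives the highest score."""
    candidates = [i for i in range(p) if i not in current]
    best = max(candidates, key=lambda i: score(current + [i]))
    return current + [best]

def mutate(subset, p, rng, rate=0.1):
    """Evolutionary-style sampling: derive a new subset from a previous one by
    flipping the membership of each feature with a small probability."""
    members = set(subset)
    for i in range(p):
        if rng.random() < rate:
            members ^= {i}
    return sorted(members) or [rng.randrange(p)]

def toy_score(subset):
    """Hypothetical score: rewards features 0, 1, 2 and penalizes subset size."""
    return len(set(subset) & {0, 1, 2}) - 0.1 * len(subset)
```

In a real wrapper, `score` would be the performance of a model built on the candidate subset, which is exactly the expensive step the strategies above try to call less often.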
We propose a unique strategy which uses an Artificial Intelligence (AI) model instead of filter techniques. Currently, the existing wrapper algorithms partially or entirely discard the unselected models of feature subset in selecting the next population of feature subsets. Individually, each model may only be useful in providing performance information, but in combination, these models could help in identifying hidden relationships that could predict the performance of unknown feature subset models. This eliminates the need for building models for every single feature subset obtained in the sampling step.
In this study, we propose a novel Artificial Intelligence based Wrapper (AIWrap) algorithm. The algorithm predicts the performance of an unknown feature subset using an AI model, referred to here as the Performance Prediction Model (PPM). To determine the performance of an unknown feature set, a standard wrapper builds a model of that feature set on the given dataset and calculates the actual model performance. AIWrap instead creates a PPM that uses the performances of known feature subsets to predict the performance of the unknown feature subset.
AIWrap contributes in several ways. Firstly, it is unique in its perspective: unlike the standard wrapper approach of building models for every feature subset provided by the feature subset sampling step, it builds models for only a fraction of the feature subsets. Secondly, it provides a unique application of AI models, which are used to replace the AI model-based performance estimation step with an AI model-based performance prediction step. Thirdly, AIWrap is versatile, which allows its integration with existing statistical and machine learning techniques. Fourthly, the algorithm allows the explicit identification of interaction terms.
This paper provides the "Results" section to evaluate and compare the algorithm performance against the existing feature selection algorithms for simulations and real studies. We summarize and provide future directions for research in the "Discussion" and "Conclusion" sections. We provide the "Conceptual framework" section to explain the basic framework of AIWrap. Finally, the "Methodology" section explains the AIWrap algorithm used in this paper.

Results
The performance of AIWrap is evaluated and compared with standard algorithms like LASSO, Adaptive LASSO (ALASSO), Group LASSO (GLASSO), Elastic net (Enet), Adaptive Elastic net (AEnet) and Sparse Partial Least Squares (SPLS) for both the simulated datasets and real data studies.

Simulation studies
We perform simulation studies to evaluate the proposed algorithm and compare its performance with other feature selection algorithms. The study uses multivariate normal distributions to generate high-dimensional datasets for marginal and interaction models. A regression model provides the outcome variable of the simulated data for the marginal and interaction models, respectively. The error term, $\varepsilon \sim N(0, \sigma^2)$, and the features, $x_i \sim N(0, 1)$, follow normal distributions. $\{x_{ij}\}$ represents the pairwise interactions between features $\{(x_1, x_2), \ldots, (x_2, x_3), \ldots, (x_{p-1}, x_p)\}$. In the current study, only two-way interactions are considered for demonstration purposes, but the approach could be easily extended to higher-order interactions. Correlation is added between the first 15 features out of the $p$ marginal features using a $15 \times 15$ covariance matrix.
Multiple scenarios are created with different numbers of noise features (Table 1). Non-zero $\beta$ values are assigned only to the true features. The AIWrap algorithm is implemented both with and without a performance-based filter step. The final predictive model from the selected features is prepared using either ridge regression (AIWrap-LR) or non-penalized linear regression (AIWrap-LLr). When no performance-based filter step is performed, the model obtained from the embedded feature selection stage is used as the final predictive model and is referred to as the AIWrap-L technique. All penalized regression models used in AIWrap are tuned with tenfold cross validation-based optimization of hyper-parameters to reduce overfitting.
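The simulation design can be illustrated with a small generator. This is a sketch under stated assumptions: the outcome is taken to be a linear combination of the true features plus noise, interaction terms are taken to be pairwise products, and the correlation among the first 15 features is induced with a single shared factor of strength `rho` rather than the study's own covariance matrix. All function and parameter names are hypothetical.

```python
import itertools
import math
import random

def simulate_marginal(n, p, true_idx, beta=1.0, rho=0.5, sigma=1.0, n_corr=15, seed=0):
    """Return (X, y): n rows of p standard-normal features and a linear outcome.

    The first n_corr features share a common factor, giving them pairwise
    correlation rho while keeping unit variance (an assumed structure).
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # shared factor for the correlated block
        row = []
        for i in range(p):
            e = rng.gauss(0, 1)
            if i < n_corr:
                row.append(math.sqrt(rho) * z + math.sqrt(1 - rho) * e)
            else:
                row.append(e)
        signal = sum(beta * row[i] for i in true_idx)  # non-zero beta only for true features
        y.append(signal + rng.gauss(0, sigma))         # epsilon ~ N(0, sigma^2)
        X.append(row)
    return X, y

def pairwise_interactions(row):
    """All two-way interaction terms x_i * x_j with i < j for one sample."""
    return [row[i] * row[j] for i, j in itertools.combinations(range(len(row)), 2)]
```

For `p` marginal features this yields `p * (p - 1) / 2` two-way terms, which is why the interaction scenarios reach very high dimensions quickly.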

Computation time estimation
The time complexity of the AIWrap algorithm depends on the techniques used for feature subset sampling, the PPM and feature subset model building. In the current paper, LASSO, random forest and a genetic algorithm are used, and the time complexity is $O(g \cdot pop \cdot (p^3 + k_w \log(k_w)))$, where $g$ is the number of generations in the genetic algorithm, $pop$ is the population size in each generation, $p$ is the number of features and $k_w$ is the number of LASSO models used to train the PPM (Additional file 1). Time complexity is also assessed empirically using the computation time of the AIWrap algorithm under different scenarios. The algorithm is run on a system with an Intel® Core™ i7-8750H CPU @ 2.20 GHz processor and 16 GB RAM on a Windows 10 64-bit operating system. The AIWrap algorithm is compared with a Standard Wrapper (StW) algorithm and a hybrid algorithm. StW uses the same sampling method and feature subset model building method as AIWrap but has neither the PPM nor the performance-based feature selection step. Further, an embedded feature selection step is added in StW. The AIWrap-L version of the AIWrap algorithm is used for comparison, so any performance difference would be due to the PPM. A genetic algorithm is used to generate samples in the feature subset sampling step, with the maximum number of iterations fixed at 100.
Two methods are used as hybrid algorithms. The Interaction Information-Guided Incremental Selection (IGIS) method is used as a sample hybrid algorithm for comparison [22]. This method uses sequential forward selection as the wrapper feature sampling technique. Since it is designed for classification problems, the filters used are mutual information and joint mutual information. Accordingly, the continuous outcome is converted into 10 equally spaced bins. A Modification of IGIS (mIGIS) is also used, in which the filters are correlation and ridge regression to allow the use of a continuous outcome. Since IGIS is not designed to provide explicit interaction terms, hybrid algorithms are not tested for scenarios containing interaction terms. Multiple scenarios are created for the comparative analysis of algorithms (Table 1). The training datasets vary from 50 to 100 samples, while the test datasets contain 500 samples. In each scenario, training samples and test samples are independent samples drawn from the same distribution. Along with computation time, we evaluate algorithms on their ability to select the target features and on the predictive performance of the selected features. The F1 score is used to determine the accuracy of selecting target features. Root Mean Square Error (RMSE) on the test data is used to determine the predictive performance of the model obtained from the embedded feature selection step. RMSE on test data also helps in comparing the overfitting of different algorithms. All the analysis is conducted using R 4.0.3 [25].
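The evaluation quantities named above are simple to state in code. The following sketch shows an F1 score computed over selected versus target feature sets, RMSE on held-out predictions, and equal-width binning of a continuous outcome. The function names are illustrative, and the exact binning rule used in the study is not given, so equal-width bins over the observed range are an assumption.

```python
import math

def f1_selection(selected, target):
    """F1 score of a selected feature set against the known target features."""
    selected, target = set(selected), set(target)
    tp = len(selected & target)
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(target)
    return 2 * precision * recall / (precision + recall)

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted outcomes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def equal_width_bins(y, n_bins=10):
    """Map each value of y to one of n_bins equally spaced bins (labels 0..n_bins-1)."""
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant outcome
    return [min(int((v - lo) / width), n_bins - 1) for v in y]
```

Selecting all target features plus as many noise features gives precision 0.5 and recall 1, hence an F1 of 2/3, which shows how the score penalizes noise features even when every target is found.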
In both the marginal and interaction models (Table 2), AIWrap has a better or at par ability to discriminate between the target and noise features compared to other algorithms, especially for interaction models. The similar performance of StW compared to hybrid methods could be due to the use of different search strategies. Similarly, the predictive performance of the features shortlisted by AIWrap is better or at par with other algorithms, especially for high dimensional data and interaction models. AIWrap performance suggests that this methodology framework can be used as an alternative to the standard wrapper and hybrid frameworks.
The number of iterations in the genetic algorithm is predefined for both StW and the proposed algorithm, which indicates that the proposed algorithm used fewer models to achieve the better outcome. However, AIWrap consumed more time than the standard wrapper approach, which is counterintuitive. The current approach uses random forest to update the PPM and uses LASSO to build the base model for the unknown feature subset in both StW and the proposed algorithm. The random forest used for the PPM took more time in each run than the LASSO used to build the base model because, with each update, the sample size used for training the PPM increases. LASSO needs to build the model on a sample size of 50 or 100, but random forest needs to build a PPM using at least 225 samples (Model 1_I), with the sample size increasing during the execution of the genetic algorithm.

AIWrap comparison with standard algorithms
AIWrap performance is compared with existing standard penalized regression algorithms, namely LASSO, ALASSO, GLASSO, Enet, AEnet and SPLS, in ten different trials. GLASSO is used only for interaction models. All the analysis is conducted using R 4.0.3 [25]. The standard algorithms are run using existing packages in the statistical language R. The glmnet package [26] is used for most algorithms except GLASSO and SPLS, for which the glinternet [27] and spls [28] packages are used. Adaptive weights are obtained from ridge regression [29] for adaptive models. In the case of interaction models, all possible two-way interaction terms are created and entered into the model.
Algorithms are evaluated on target feature selection and prediction performance. We evaluate their ability to discriminate between true and noise features by measuring the selection of true features and rejection of noise features. We use RMSE from the test data as the predictive performance metric.

Table 3 shows the feature selection performance of different algorithms for marginal models. All algorithms selected the ten target features, which means that they can identify the target features in the marginal dataset. However, in most cases, the number of selected features is much higher, indicating that the methods also select noise features. AIWrap selected a similar or smaller number of noise features than the other algorithms, which suggests that it discriminates better between noise and target features than standard methods (Fig. 1). The frequency of selecting a noise feature is consistently lower than that of the target features in all methods, but the maximum separation is found only for the AIWrap method. In addition, the area under the curve (AUC) of the features was higher for the AIWrap method than for standard methods. Thus, in the case of marginal datasets, while all methods can identify the target features, AIWrap outperforms all other methods by selecting fewer noise features.
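The separation between target and noise features can be quantified from per-trial selections. The sketch below computes each feature's selection frequency across trials and then an AUC that treats the frequency as a score for ranking target above noise features. How the AUC in Fig. 1 was computed is not stated here, so this rank-based (Mann-Whitney) formulation is one plausible reading, and the function names are hypothetical.

```python
def selection_frequency(selections, p):
    """Fraction of trials in which each feature index 0..p-1 was selected."""
    counts = [0] * p
    for sel in selections:
        for i in set(sel):
            counts[i] += 1
    return [c / len(selections) for c in counts]

def auc_target_vs_noise(freq, target):
    """AUC = P(target frequency > noise frequency) + 0.5 * P(tie)."""
    target = set(target)
    pos = [f for i, f in enumerate(freq) if i in target]
    neg = [f for i, f in enumerate(freq) if i not in target]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1 means every target feature was selected more often than every noise feature, while 0.5 means selection frequency carries no information about which features are targets.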
The results from the interaction models reiterate the results of the marginal scenario: the feature selection performance of AIWrap is better or at par with the standard algorithms. Similar to marginal models, Table 4 shows that the number of features selected by all algorithms in interaction models is more than the number of target features in most cases. They all selected noise features, but the number of noise features selected differs by algorithm. Figure 2 suggests that AIWrap may be selecting fewer noise features than the other methods. In low dimensional space, all algorithms can discriminate between the target and noise features by selecting the target features at a higher frequency than noise features. However, in very high dimensions, only AIWrap and GLASSO can do so. The AUC performance of the different methods also shows better or at par performance of AIWrap, as it can predict the target and noise features with greater or similar accuracy than other methods. AIWrap uses existing classic statistical techniques. The statistical techniques could influence the wrapper method performance [30]. However, a performance comparison between LASSO used within AIWrap and LASSO as a standalone feature selection algorithm clearly showed that AIWrap could improve the LASSO performance. The AIWrap performance suggests that the proposed algorithm could enhance the feature selection performance of the existing statistical methods by reducing the feature space and increasing the target feature percentage.
Table 5 shows the prediction performance of the algorithms. RMSE performance suggests that AIWrap performs consistently better or at par with the existing algorithms. In low dimensionality data (2_M, 4_M and 1_I), all algorithms are expected to give similar performance, as standard algorithms are primarily developed for handling low dimensionality data, and the results support this. AIWrap can provide better performance even in high dimensional settings (1_M and 3_M) and in the presence of interaction terms (2_I). However, for very high dimensional data (3_I), all methods perform poorly. These findings suggest that AIWrap may provide better or at par prediction performance than existing algorithms. Overall, the proposed algorithm could expand the capability of existing methods like non-penalized regression to operate in high-dimensional settings. However, computational intensiveness will be a significant limitation for the proposed algorithm compared to standard methods. In summary, the performance of all algorithms deteriorates with an increase in data dimensionality, but the performance of most standard methods decreases more drastically than that of AIWrap.

Real studies: population health data
Four real studies are analyzed to evaluate the performance of AIWrap and existing algorithms. The Community Health Status Indicators (CHSI) study (Study I) focuses on non-communicable diseases in US counties, with data (n = 3141) containing 578 features [31]. The National Social Life, Health and Aging Project (NSHAP), which focuses on the health and well-being of aged Americans, contains multiple datasets. We chose two datasets (Study II and Study III) containing data for 4377 residents on 1470 features [32] and 3005 residents on 820 features [33]. Study IV is the Study of Women's Health Across the Nation (SWAN), 2006-2008 dataset, focusing on 887 physical, biological, psychological and social features in middle-aged women in the USA (n = 2245) [34].
The raw data of the real studies are processed for ease of analysis to obtain final datasets (Table 6). Features and samples are filtered to remove highly correlated features, non-continuous features, and missing values. Then, each dataset is randomly split into training and testing datasets. As the sample size is large, only 20% of data is used for training while the remaining 80% of data is used for testing to create a high dimensional data setting. We compare the performance of different algorithms for marginal models and interaction models using mean RMSE of the test data in ten trials.
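The inverted split described above, with a small training share and a large test share, can be sketched as follows. The seed handling and function name are illustrative assumptions; the study's own splitting code is not shown in the text.

```python
import random

def train_test_split_indices(n, train_fraction=0.2, seed=0):
    """Randomly assign n sample indices to a small training set and a large test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train_fraction)
    return idx[:n_train], idx[n_train:]

# Ten trials would use ten different seeds, and the test RMSE would be averaged.
splits = [train_test_split_indices(1000, seed=s) for s in range(10)]
```

Using only 20% of a large sample for training shrinks n relative to the number of features, which is what creates the high dimensional setting in otherwise large datasets.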
Table 7 summarizes the feature selection results. Standard algorithms select fewer features than AIWrap. However, the results from the previous simulated data studies suggest that standard methods may struggle to discriminate between target and noise features (Figs. 1 and 2). Further, the predictive performance of AIWrap is better than that of the standard algorithms for both marginal and interaction models (Table 8). The better performance of the proposed algorithm suggests that it may be more reliable than standard algorithms in identifying the target features. The results show that in Study III, marginal models performed better than their interaction models for all algorithms. The better performance of the marginal model compared to the interaction model suggests that AIWrap cannot completely reject noise features and is sensitive to an increase in feature space. However, AIWrap is still more robust than standard algorithms and can perform in different dimensions and datasets.

Real studies: genomic data
The AIWrap-L algorithm is compared with StW in the genomic datasets to determine the biological relevance of the solutions obtained from AIWrap. Many cancer studies have found that smoking can be detrimental to the health of cancer patients [35,36]. Further, an association between gene expression levels and the smoking habits of cancer patients has been reported [37]. Thus, it would be relevant to identify the genes in cancer patients which are associated with smoking-related traits. In this study, The Cancer Genome Atlas (TCGA) program is used to get the data from nine cancer projects (Table 9) which maintained records related to the amount smoked and the gene expression profile of patients [38]. The sample size $n$ for these projects ranges from 89 to 592 samples, with a feature space $p$ of 56,602 genes. The gene expression profile is used as the input feature space and the number of cigarettes smoked per day (CPD) is used as the outcome. Preliminary processing of all datasets is performed to reduce the input feature space and remove samples with missing values. The input feature space is reduced from 56,602 to 50 features through multi-stage processing (Table 9). Step one involved removing the features which are not differentially expressed in cancer patients as compared to normal patients using the TCGAbiolinks package [39]. Step two involved supervised dimensionality reduction of the differentially expressed genes using the partial least squares technique and selecting the top 100 features with the highest absolute weights in the first latent feature. Step three involved removing correlations among the features. Thus, among any pair of features with more than 0.8 absolute correlation, one feature is randomly selected.
Step four involves selecting the top 50 features among the non-correlated features based on their absolute weight in the first latent feature obtained in step two. No interaction effects are considered for this analysis. The performance of AIWrap and StW in all datasets is compared on three metrics, namely predictive performance, computation time and number of genes selected. The results are based on tenfold cross-validation (Table 10). It is observed that in all the datasets the predictive performance of AIWrap based features is better or at par with StW based features. Further, AIWrap selects a smaller set of features than StW, which suggests that AIWrap could provide a more parsimonious set of features without compromising on the predictive performance of the features. In terms of computation time, the results are similar to those observed in simulation studies, with StW taking less time than AIWrap in most cases. The stability of AIWrap is similar to that of StW when compared using the standard deviation of predictive performance (Additional file 2: Table S1).
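Steps three and four of the preprocessing lend themselves to a short illustration. The sketch below drops one randomly chosen member of every feature pair whose absolute Pearson correlation exceeds 0.8, then keeps the top features by absolute weight. It assumes the weights from the partial least squares step are already available as a dictionary; the function names and the gene labels in any example data are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists (non-constant inputs assumed)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def drop_correlated(columns, threshold=0.8, seed=0):
    """columns: dict name -> values. For each pair with |r| > threshold, drop one at random."""
    rng = random.Random(seed)
    names = list(columns)
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(columns[a], columns[b])) > threshold:
                dropped.add(rng.choice([a, b]))
    return [n for n in names if n not in dropped]

def top_k_by_weight(names, weights, k=50):
    """Keep the k features with the largest absolute weight in the first latent feature."""
    return sorted(names, key=lambda n: abs(weights[n]), reverse=True)[:k]
```

Because the dropped member of each correlated pair is chosen at random, repeated runs with different seeds can yield different retained features, which is worth keeping in mind when comparing gene lists across runs.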
In order to assess the biological relevance of the genes selected by each method, the selected genes of each dataset are pooled together to create the final list of genes selected by each method. The results show that some genes are selected at a very high frequency in a dataset during the tenfold feature selection process. Genes need to fulfill one of two criteria: either having the highest selection frequency or having a selection frequency of more than 80%. Accordingly, across the nine datasets, AIWrap provided 13 genes while StW provided 40 genes. Eleven genes (VCX3A, WNT3A, CALHM5, ZMYND10, FOXE1, PLAT, BAAT, WFDC5, CGB5, FADD, APOE) are found to be common across the two methods. Based on the univariate analysis of the genes selected by the two algorithms, 9 out of 13 AIWrap genes and 19 out of 40 StW genes are statistically significant. In multivariate analysis, 8 out of 13 AIWrap genes and 18 out of 40 StW genes are statistically significant (Additional file 2: Table S2). Among the 13 genes from AIWrap, seven genes (WNT3A [40], TMEM45A [41], BAAT [41], WFDC5 [42], HS3ST5 [43], CGB5 and APOE [44]) have been reported in the literature to exert influence on tobacco or smoking-related traits. Further, AIWrap identified six new genes (VCX3A, CALHM5, ZMYND10, FOXE1, PLAT, FADD) which could be related to smoking in cancer patients, thus providing an opportunity for identifying previously unknown biological functions.
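The two-part inclusion rule for genes can be written as a small filter. This sketch assumes the per-dataset selection frequencies are available as a dictionary and that the final list is the union over datasets; the function names and the gene labels used in any call are placeholders, not results from the study.

```python
def stable_genes(freq, threshold=0.8):
    """Genes with the highest selection frequency in a dataset, or a frequency above threshold."""
    top = max(freq.values())
    return sorted(g for g, f in freq.items() if f == top or f > threshold)

def pool_genes(per_dataset_freq, threshold=0.8):
    """Union of the stable genes over all datasets (assumed pooling rule)."""
    pooled = set()
    for freq in per_dataset_freq:
        pooled.update(stable_genes(freq, threshold))
    return sorted(pooled)
```

The first criterion guarantees that every dataset contributes at least one gene even when no gene passes the 80% threshold.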

Discussion
Building models for each sample feature set obtained during the feature sampling stage of wrapper methods consumes computational resources and may not always provide the best results. AIWrap allows skipping the model building for many sample feature sets by training an AI model, i.e., the PPM, which can predict the performance of sample feature sets. AIWrap feature selection performance and predictive performance are better than or at par with both the standard wrapper method and standard penalized algorithms, namely LASSO, ALASSO, GLASSO, SPLS, Enet and AEnet.
The proposed algorithm has certain limitations. The current study primarily focuses on testing the concept; thus, the study performed testing on limited data types. Future research could focus on evaluating the robustness of the approach using different types of data, such as temporal data and categorical data, and outcomes, such as binary outcomes and time-to-event outcomes. Other than data types, the focus could also be directed towards the techniques used in the algorithm. Currently, the study uses a linear combination function for building actual models, but future studies could also explore non-linear combination functions for model building. Further, the current study reduced the need to build actual models in the wrapper approach but could not eliminate it. Therefore, future research could use other PPM building techniques, like artificial neural networks and support vector machines, to eliminate the need for actual models. Finally, time complexity is a challenge which could be explored in future research.

Conclusion
In this paper, we propose AIWrap, an innovative algorithm to perform wrapper based feature selection. The algorithm is flexible enough to work with both marginal and interaction terms. The algorithm could be easily embedded with any of the wrapper techniques as it does not alter existing methods, which allows users to integrate the algorithm into their existing wrapper pipelines. This approach could enhance the performance of existing wrapper techniques available in the literature for high dimensional datasets by reducing the number of models needed to search the feature space. AIWrap can identify both the marginal features and interaction terms without using interaction terms in the PPM, which could be critical in reducing the feature space any pipeline has to process.
The benefit of AIWrap comes from using AI to learn the dataset performance behavior and build the PPM, which replaces the actual model building process. The studies involving marginal effects with and without interaction effects in simulated data showed that AIWrap could outperform existing algorithms in feature selection and prediction performance. Similar performance in real datasets also demonstrates the practical relevance of AIWrap.

Conceptual framework
In a wrapper algorithm, given a dataset $D$ of sample size $n$ with feature space $p$ and outcome $y$, a feature subset $q$ is created from $p$. In the standard wrapper algorithm (Fig. 3a), a model is built on the subset of $D$ containing the $q$ features and its performance is estimated from that model. This performance is used to select the next subset of $p$. Our AIWrap algorithm targets this dependence of standard wrapper algorithms on a model building step for each feature subset to estimate its performance.
The conceptual framework used to design the AIWrap algorithm (Fig. 3b) aims at reducing (or removing) the dependence of the wrapper algorithm on the model building step for obtaining the performance value of $q$. The PPM computes the performance of an unknown subset based on the performance of known subsets. A user may not have a predefined list of $k$ known feature samples with their actual performance. The AIWrap algorithm therefore creates a random set $q^{AI} = \{q^{AI}_j \mid q^{AI}_j \in \{\{1\}, \ldots, \{1, \ldots, p\}\},\ j \in 1, \ldots, k\}$ of $k$ feature samples, where each feature sample is a subset of $p$. The algorithm builds a model for each of the $q^{AI}$ samples to estimate their performance $C = \{c_j\}$. The algorithm then creates the PPM with $q^{AI}$ as the input and $C$ as the outcome using a machine learning model to enable performance prediction for any subset of $p$. Finally, the algorithm executes the standard wrapper approach, but uses the PPM as a surrogate for the actual model building step, so that the performance of $q$ is predicted rather than estimated.
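The surrogate idea can be made concrete with a toy example. In this sketch, random feature subsets are scored by a stand-in "actual" performance function, and a very simple predictor (the average over the nearest known subsets in Hamming distance) plays the role of the PPM. The nearest-neighbour predictor is an assumption chosen to keep the example dependency-free and is not the machine learning model used in the study. `toy_performance` is hypothetical and replaces real model building.

```python
import random

def encode(subset, p):
    """Binary indicator vector of length p for a feature subset."""
    members = set(subset)
    return [1 if i in members else 0 for i in range(p)]

def fit_ppm(subsets, performances, p, n_neighbors=3):
    """Return a predictor that averages the performance of the nearest known subsets."""
    known = [(encode(s, p), c) for s, c in zip(subsets, performances)]

    def predict(subset):
        q = encode(subset, p)
        ranked = sorted(known, key=lambda kc: sum(a != b for a, b in zip(q, kc[0])))
        nearest = ranked[:n_neighbors]
        return sum(c for _, c in nearest) / len(nearest)

    return predict

def toy_performance(subset):
    """Hypothetical error (lower is better): features 0, 1, 2 help, extra features cost."""
    return 10 - 2 * len(set(subset) & {0, 1, 2}) + 0.1 * len(subset)

rng = random.Random(0)
p, k = 8, 40
known_subsets = [rng.sample(range(p), rng.randint(1, p)) for _ in range(k)]
known_perf = [toy_performance(s) for s in known_subsets]  # the only "model building"
ppm = fit_ppm(known_subsets, known_perf, p)
predicted = ppm([0, 1, 2])  # predicted, not estimated, performance
```

After the initial `k` evaluations, any number of candidate subsets can be scored through `ppm` without calling `toy_performance` again, which mirrors how the wrapper search avoids building a model for every subset.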

Methodology
This section explains the design of the AIWrap algorithm based on the conceptual framework. The algorithm is divided into four steps: PPM, wrapper based coarse feature selection, embedded feature selection and performance-based feature selection (Fig. 4).

PPM
The algorithm generates $k$ random sample datasets from $D$, each containing $q^{AI}_j$ features and sample size $n$. A set of models $M = \{m_j\}$ is created from the $k$ sample datasets for an outcome $y$, using any modeling technique, to obtain their performance $c$. $k$ is a hyperparameter that the user needs to optimize. In the current study, $k = 15 \cdot p$, which is determined heuristically.
A performance set $C = \{c_j\}$ contains the performance of the $M$ models. The algorithm creates a performance dataset $D_{perf}$, a matrix of the features used in each model of $M$ ($q^f$) and their performance $C$.
As shown in Eq. 3, the feature matrix ($q^f$) is a binary matrix that consists of $p$ columns and $k$ rows. The entry in the $i$th column and $j$th row is 0 if the $i$th feature is not used in model $m_j$, and 1 otherwise. The PPM is constructed from $D_{perf}$ using any machine learning technique to predict the performance $C_{pred}$ of any unknown feature subset.
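The performance dataset is easy to picture as a table with one row per model. The sketch below builds it from a list of feature subsets and their performances: the first `p` columns are the 0/1 indicators and the last column is the performance. The function name and the list-of-lists layout are illustrative choices.

```python
def build_performance_dataset(subsets, performances, p):
    """One row per model m_j: p binary indicators q^f_ij followed by the performance c_j."""
    rows = []
    for subset, c in zip(subsets, performances):
        members = set(subset)
        rows.append([1 if i in members else 0 for i in range(p)] + [c])
    return rows

p = 3
k = 15 * p  # number of initial models under the heuristic stated in the text
example = build_performance_dataset([[0, 2], [1]], [0.5, 0.7], p)
# example == [[1, 0, 1, 0.5], [0, 1, 0, 0.7]]
```

Any supervised learner that accepts a numeric matrix can then be trained on the indicator columns to predict the last column, which is all the PPM needs.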
In this study, we have used LASSO to prepare the $m_j$ models and random forest to build the PPM, with RMSE as the performance metric. During the preliminary analysis (Additional file 3), it is found that predicted performance and actual performance are strongly and positively correlated, but predicted performance may not match the actual performance. As a result, the subset corresponding to the best predicted performance may not be the best subset.

Wrapper based coarse feature selection
$$m_j : y_j = f(q^{AI}_j) \mid j \in 1, \ldots, k \tag{2}$$
$$D_{perf} = [\, q^f_{ij} \;\; c_j \,], \quad q^f_{ij} = \begin{cases} 0, & q^{AI}_{ij} \notin m_j \\ 1, & q^{AI}_{ij} \in m_j \end{cases}, \quad i \in \{1, \ldots, p\},\ j \in 1, \ldots, k \tag{3}$$
$$PPM : C_{pred} = f(q^f) \tag{4}$$
Fig. 4 AIWrap algorithm graphical flow chart. Dark background represents main steps and light background represents sub-steps
The standard wrapper algorithm, as shown in Fig. 3a, is an iterative process in which a subset of features is evaluated and the performance of the feature subset is used to select the next subset of features. In our work, we use a genetic algorithm to search through the feature space iteratively, as it is used in a wide range of datasets [45,46]. In the proposed algorithm, we use the PPM in all iterations to predict the performance $C_{pred}$ of a feature set $q$. Since we found that the best $C_{pred}$ may correspond to one of the high-performing feature sets but not the best feature set, we validate $C_{pred}$ values by building a model using the $q$ features to estimate the performance $C_{true}$ (Fig. 4). The algorithm uses a user-defined criterion $val_{crit}$ to select sample feature sets for validation of $C_{pred}$ values.
In this study, the top quartile of $C$ is used as the $val_{crit}$ criterion; thus, $q$ with $C_{pred}$ in the top quartile of $C$ are selected for model building. $D_{perf}$ is updated with each feature set $q$ whose $C_{true}$ value is available and, consequently, is used to update the PPM. The iteration stops when we get the $q_{wrap}$ features, which provide the best performance. RMSE is used as the performance metric.
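The validation rule can be sketched as a filter applied to each generation of candidates. The example assumes that "top quartile" means the best-performing quartile, which for RMSE is the lowest 25% of the known performances; `ppm_predict` and `build_and_score` are hypothetical callbacks standing in for the PPM and for actual model building.

```python
import statistics

def validate_candidates(candidates, ppm_predict, build_and_score, known_perf):
    """Build real models only for candidates whose predicted RMSE falls within
    the best (lowest) quartile of the performances observed so far."""
    cutoff = statistics.quantiles(known_perf, n=4)[0]  # first quartile boundary
    validated = []
    for q in candidates:
        if ppm_predict(q) <= cutoff:
            validated.append((q, build_and_score(q)))  # (feature set, C_true)
    return validated
```

Each returned pair would be appended to the performance dataset and the PPM refitted before the next generation, so the surrogate is corrected exactly where the search is concentrating.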

Embedded feature selection
The $q_{wrap}$ features obtained from the wrapper step are processed to obtain the final features because the prediction model does not explicitly provide the non-linear combinations of the $q_{wrap}$ features. Thus, an embedded feature selection model is used on the $q_{wrap}$ features for an outcome $y$, which allows additional features $\chi$, like interaction terms, to be incorporated. The LASSO framework is used as the embedded model in the proposed algorithm. LASSO hyper-parameters are optimized using tenfold cross validation of the training data to reduce overfitting (Additional file 3).

Performance-based feature selection
The features selected from the embedded model, $q_{embed}$, undergo the last stage of processing to provide the final features $q$. This step selects features based on their contribution to the model performance. Models $m^{perf}_l : y_j = f(q_{embed} - l) \mid l \in \{1, \ldots, q_{embed}\}$ are prepared, with each model containing $q_{embed} - 1$ features. The importance of feature $l$ is determined from the performance of $m^{perf}_l$.
To obtain a robust importance for feature $l$, we create multiple models using bootstrapping of samples, and their performance $c_j$ is pooled to get the overall model performance $c^{pool}_j$. In this study, we use ridge regression for model building, as we are focusing on high dimensional data and non-penalized linear regression could only work for cases with $q_{embed} < n$. Goodness of fit ($R^2$) on the out-of-bag (OOB) samples is used as the performance metric. Finally, the performance metric is pooled to provide a coefficient of variation of $R^2$ as the overall model performance for feature $l$.
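The leave-one-feature-out procedure with bootstrap pooling can be outlined as below. To stay self-contained, the sketch delegates model fitting to a hypothetical callback `oob_r2(features, in_bag, out_of_bag)` that is assumed to fit a model (ridge regression in the study) on the in-bag samples and return $R^2$ on the out-of-bag samples; the toy callback shown here only imitates such a function.

```python
import random
import statistics

def loo_feature_cv(features, n_samples, oob_r2, n_boot=20, seed=0):
    """For each feature l: drop it, score the remaining features on out-of-bag
    samples over n_boot bootstrap resamples, and pool the R^2 values into a
    coefficient of variation (standard deviation divided by mean)."""
    rng = random.Random(seed)
    result = {}
    for l in features:
        reduced = [f for f in features if f != l]
        scores = []
        for _ in range(n_boot):
            in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
            out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
            if not out_of_bag:
                continue  # skip the rare resample that leaves nothing out
            scores.append(oob_r2(reduced, in_bag, out_of_bag))
        result[l] = statistics.stdev(scores) / statistics.mean(scores)
    return result

def toy_oob_r2(reduced, in_bag, out_of_bag):
    """Hypothetical scorer: R^2 drops when the important feature 'x1' is removed."""
    base = 0.9 if "x1" in reduced else 0.5
    return base + 0.01 * (len(out_of_bag) % 5)
```

With a real scorer, features whose removal makes the pooled performance markedly worse or less stable are the ones that contribute most to the model.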
A performance threshold $c_{cutoff}$ needs to be defined to select the features. Rather than using an arbitrary threshold, our algorithm uses a dynamic cutoff. The algorithm tries different performance thresholds and selects the threshold which provides the best performance $c_{best}$ for the smallest feature space $q_{best}$. In the current study, we use a genetic algorithm to search through the performance threshold space. Two different techniques, namely non-penalized regression and adaptive ridge regression, are used for the model building. The Pseudo Algorithm summarizes the complete AIWrap algorithm.
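The dynamic cutoff can be illustrated with an exhaustive scan over a handful of candidate thresholds; this deliberately swaps the genetic algorithm mentioned above for a plain loop to keep the example short. It assumes that features with an importance at or above the threshold are kept and that `evaluate` returns a score where higher is better; both are assumptions, as are all names.

```python
def choose_cutoff(importance, evaluate, thresholds):
    """Return (threshold, kept_features) giving the best score, breaking ties
    in favour of the smaller feature set."""
    best = None
    for t in thresholds:
        kept = [f for f, v in importance.items() if v >= t]
        if not kept:
            continue
        key = (-evaluate(kept), len(kept))  # best score first, then fewest features
        if best is None or key < best[0]:
            best = (key, t, kept)
    return best[1], best[2]

importance = {"a": 0.9, "b": 0.5, "c": 0.1}
score = lambda kept: 1.0 if {"a", "b"} <= set(kept) else 0.5  # hypothetical evaluator
cutoff, selected = choose_cutoff(importance, score, [0.0, 0.3, 0.7])
# cutoff == 0.3, selected == ["a", "b"]
```

In the toy run, thresholds 0.0 and 0.3 reach the same score, and the tie-break picks 0.3 because it keeps two features instead of three.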

Fig. 1 Comparison of different methods' feature selection performance in marginal models. a Frequency of selection of target and noise features. b AUC for predicting the target and noise features

Fig. 2 Feature selection performance comparison of different methods in interaction models. a Frequency of selection of target and noise features. b AUC for predicting the target and noise features

Fig. 3 Flow chart of a Standard wrapper approach and b Proposed wrapper (AIWrap) conceptual approach

Table 1
Description of the simulation data

Table 2
Comparison of algorithms on computation time, target feature selection and predictive performance. Values in bold indicate the best results

Table 3
Feature selection performance of different approaches in simulated scenarios for marginal models

Table 4
Feature selection performance of different approaches in simulated scenarios for interaction models

Table 5
Outcome prediction performance of different approaches in simulated scenarios for the test dataset

Table 6
Summary of the real datasets

Table 7
Number of features selected by different wrapper methods on the real studies

Table 8
RMSE performance of different wrapper methods on the real studies for test data

Table 9
Summary of the genomic datasets

Table 10
Wrapper methods comparison of predictive performance, number of genes selected and computation time