Table 4 Features remaining in models after both feature-selection stages for the nested CV analysis (Stage 1)

From: A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator

| Models developed using nested CV | Average features remaining after initial feature-selection stage | Average features remaining after elastic net stage (Std. Error) |
| --- | --- | --- |
| Baseline (Elastic net) | - | 30 (16.3) |
| F-test (FDR: 0.01)/Elastic net | 23,685 | 4637 (283.5) |
| F-test (FDR: 0.05)/Elastic net | 55,750 | 6398 (672.2) |
| Gradient Boosting/Elastic net | 584 | 264 (8.7) |
| Pearson Correlation/Elastic net | 750 | 68 (22.1) |
| Mutual Information/Elastic net | 6500 | 394 (109.6) |
| Linear SVR/Elastic net | 8500 | 4453 (12.6) |
| Random Forest Regression/Elastic net | 3250 | 406 (10.6) |
| PCA/Elastic net | 1072* | 41 (9.5)* |

As described in the “Feature-selection methods” section, the optimal number of features for each of the four ranking feature-selection methods (Pearson correlation, mutual information, linear SVR and random forest) was selected from analysis of Figs. 2 and 3. For example, the minimum MAE corresponded to passing 3250 features from the random forest feature ranking to the elastic net regression stage, resulting in an average of 406 features being selected. The values shown are averages (rounded) over the five training sets of the nested CV process described in “Modelling overview”. Values in parentheses denote the standard error of the mean. * denotes principal components.
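As an illustration of how an entry such as "406 (10.6)" can be formed, the following minimal sketch computes a rounded average and a standard error of the mean from five per-fold feature counts. The per-fold counts are hypothetical and are not taken from the paper. The helper name `mean_and_sem` and the use of the sample standard deviation (n - 1 denominator) are assumptions, since the source does not state which estimator was used.

```python
import math
import statistics


def mean_and_sem(counts):
    """Return (rounded mean, standard error of the mean) for per-fold counts."""
    n = len(counts)
    mean = statistics.fmean(counts)
    # Assumption: SEM = sample standard deviation (n - 1 denominator) / sqrt(n).
    sem = statistics.stdev(counts) / math.sqrt(n)
    return round(mean), sem


# Hypothetical numbers of features kept by the elastic net in each of the
# five outer training sets; these are NOT the paper's per-fold values.
fold_counts = [380, 395, 410, 420, 425]

avg, sem = mean_and_sem(fold_counts)
# Format the result the same way as the table cells: "average (std. error)".
print(f"{avg} ({sem:.1f})")
```

With only five training sets the standard error is sensitive to a single unusual fold, which may explain why some rows in the table show a large error relative to their average.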