Table 2 Summary of feature-selection techniques used in this study

From: A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator

Type: Filter—univariate
Feature-selection method: Correlation-based (Pearson’s r)

Pearson’s correlation for two random variables a and b is given as \(\rho(a,b) = \mathrm{cov}(a,b)/(\sigma_{a}\sigma_{b})\), where cov(a,b) is the covariance or cross-correlation of a and b and σ denotes the standard deviation [60]. Pearson’s correlation coefficient (r) is a measure of linear association, and it is used in epigenetic studies to measure the strength of association between each feature and the response variable. A threshold value is usually applied, excluding all features whose absolute correlation falls below that value.
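
As an illustration only (not the authors' code), such a correlation filter might look as follows in Python; the names X (samples x CpGs methylation matrix), y (telomere length) and the 0.3 cut-off are assumptions:

import numpy as np

def pearson_filter(X, y, threshold=0.3):
    # Column-wise Pearson r between each CpG and the response y.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc / (len(y) - 1)                      # cov(a, b) per CpG
    r = cov / (X.std(axis=0, ddof=1) * y.std(ddof=1))   # rho = cov / (sigma_a * sigma_b)
    return np.where(np.abs(r) >= threshold)[0]          # indices of retained CpGs

# Toy stand-in data; the study used methylation beta values and measured TL.
rng = np.random.default_rng(0)
X = rng.random((100, 500))
y = rng.random(100)
kept = pearson_filter(X, y)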

Type: Filter—univariate
Feature-selection method: Multiple hypothesis testing (F-test with FDR)

The F-test statistic, or its associated p-value, can be used as a threshold score for the association between a feature and the response. The false discovery rate (FDR) can be used to detect true positives while controlling Type I errors at a designated level. An F-test between the methylation value of each CpG and TL is conducted, and those CpG sites with a Benjamini-Hochberg [36] FDR below a specified value are selected.
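
A minimal sketch of this step, assuming scikit-learn's f_regression for the per-CpG F-tests and statsmodels for the Benjamini-Hochberg adjustment (the toy data and the 0.05 level are illustrative, not taken from the study):

import numpy as np
from sklearn.feature_selection import f_regression
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
X = rng.random((100, 500))   # toy methylation matrix (samples x CpGs)
y = rng.random(100)          # toy telomere-length vector

f_stat, p_values = f_regression(X, y)                 # one F-test per CpG against TL
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
selected = np.where(reject)[0]                        # CpGs passing the FDR threshold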

Type: Filter—univariate
Feature-selection method: Mutual Information

Mutual information can be formulated as \(MI = H(x) - H(x|y)\), where H(x) is the entropy of feature x and H(x|y) denotes the entropy of feature x after observing feature y. The mutual-information values per feature are typically ranked, with a threshold utilised to remove the most redundant features (CpGs) [43].
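
Mutual information can be computed between pairs of features (to flag redundancy) or between each feature and the response. As an illustration of the latter, a hedged sketch using scikit-learn's mutual_info_regression to rank CpGs by estimated MI with TL; the toy data and the top-100 cut-off are assumptions:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.random((100, 500))   # toy methylation matrix (samples x CpGs)
y = rng.random(100)          # toy telomere-length vector

mi = mutual_info_regression(X, y, random_state=0)   # estimated MI between each CpG and TL
top_cpgs = np.argsort(mi)[::-1][:100]               # keep the 100 highest-MI CpGs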

Type: Filter—regression
Feature-selection method: Support Vector Regression

The absolute values of the weights (coefficients) yielded by the support vector regression (SVR) algorithm can be utilised to create a set of ranked features. In the case of a linear kernel, the SVR model can take the form \(\text{prediction}(x) = b + w^{T}x\), where \(w = \sum_{i} \alpha_{i} x_{i}\), so the vector of weights w is directly accessible. Features with higher absolute weights are considered more likely to be useful for model training and prediction; conversely, features with smaller weights are thought to have little influence on predictions [61].
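
A minimal sketch of ranking CpGs by |w| from a linear-kernel SVR in scikit-learn; the toy data and default hyperparameters are assumptions rather than the settings used in the study:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((100, 300))   # toy methylation matrix (samples x CpGs)
y = rng.random(100)          # toy telomere-length vector

svr = SVR(kernel="linear").fit(X, y)     # linear kernel exposes w = sum_i alpha_i x_i
abs_w = np.abs(svr.coef_).ravel()        # |w_j| per CpG
ranked = np.argsort(abs_w)[::-1]         # CpGs ordered from largest to smallest weight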

Type: Filter—ensemble
Feature-selection method: Random Forest Regression

Random forests [62] can handle correlated data and high dimensionality [63]. This ensemble method for classification and regression utilises bagging (bootstrap subsets of samples) and random feature subsampling (subsets of features at each split) to ensure diversity across constituent tree models. As each tree only uses a portion of the samples in its construction, the remaining out-of-bag samples can be used to generate feature importance scores via feature value shuffling, with the impact of this shuffling assessed over the whole ensemble [64].
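
As a rough sketch, permutation-based importances from a random forest can be obtained with scikit-learn; note that permutation_importance shuffles features on a held-out split rather than strictly on each tree's out-of-bag samples, and the toy data and hyperparameters are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 300))   # toy methylation matrix (samples x CpGs)
y = rng.random(200)          # toy telomere-length vector

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranked = np.argsort(perm.importances_mean)[::-1]   # CpGs ranked by mean score drop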

Type: Embedded
Feature-selection method: Elastic net

This is a regularised regression method and an embedded feature-selection approach. It includes both the l1 and l2 norms in the objective function and tunes the balance between the two norms using a hyperparameter [65].
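
A minimal sketch using scikit-learn's ElasticNet, where l1_ratio is the hyperparameter balancing the two norms; the toy data and the alpha/l1_ratio values are assumptions:

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.random((100, 500))   # toy methylation matrix (samples x CpGs)
y = rng.random(100)          # toy telomere-length vector

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio tunes the l1/l2 balance
selected = np.flatnonzero(enet.coef_)                  # CpGs with non-zero coefficients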

Type: Embedded
Feature-selection method: XGBoost

XGBoost [66] utilises gradient-boosted decision trees and can generate feature importance scores from the degree to which each feature’s split points improve performance, weighted by the number of observations associated with a node [67]. BoostARoota [68] is an embedded method which uses XGBoost as its base learner and returns a reduced feature set through regularisation.
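
For illustration (not the BoostARoota implementation itself), split-based importance scores can be read from a fitted XGBoost regressor; the exact importance type reported by feature_importances_ depends on the library version, and the toy data and hyperparameters are assumptions:

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 300))   # toy methylation matrix (samples x CpGs)
y = rng.random(100)          # toy telomere-length vector

model = XGBRegressor(n_estimators=200, max_depth=3).fit(X, y)
scores = model.feature_importances_        # importance per CpG from split improvements
ranked = np.argsort(scores)[::-1]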

Type: Transformative
Feature-selection method: Principal Component Analysis (PCA)

PCA is applied to a data set of variables that are, in general, inter-correlated. It finds new variables that are linear combinations of the original variables, maximise variance and are uncorrelated with each other [69].
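
A minimal sketch with scikit-learn's PCA; the number of components is an arbitrary choice for illustration, and the toy data stand in for the methylation matrix:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 500))   # toy methylation matrix (samples x CpGs)

pca = PCA(n_components=20)
Z = pca.fit_transform(X)                       # uncorrelated linear combinations of CpGs
var_explained = pca.explained_variance_ratio_  # variance captured by each component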