Table 2 Summary of feature-selection techniques used in this study

From: A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator

Type: Filter—univariate
Feature-selection method: Correlation-based (Pearson’s r)

Pearson’s correlation for two random variables a and b is given as \(\rho(a,b) = \mathrm{cov}(a,b)/(\sigma_{a}\sigma_{b})\), where cov(a,b) is the covariance or cross-correlation of a and b and σ denotes the standard deviation [60]. Pearson’s correlation coefficient (r) is a measure of linear association, and it is used in epigenetic studies to measure the strength of association between each feature and the response variable. A threshold value is usually applied, excluding all features whose absolute correlation falls below that value.
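
As an illustration only (not the authors' code), such a correlation filter might look as follows in Python; the names X (samples x CpGs methylation matrix), y (telomere length) and the 0.3 cut-off are assumptions:

import numpy as np

def pearson_filter(X, y, threshold=0.3):
    # Column-wise Pearson r between each CpG and the response y.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc / (len(y) - 1)                      # cov(a, b) per CpG
    r = cov / (X.std(axis=0, ddof=1) * y.std(ddof=1))   # rho = cov / (sigma_a * sigma_b)
    return np.where(np.abs(r) >= threshold)[0]          # indices of retained CpGs

# Toy stand-in data; the study used methylation beta values and measured TL.
rng = np.random.default_rng(0)
X = rng.random((100, 500))
y = rng.random(100)
kept = pearson_filter(X, y)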

Type: Filter—univariate
Feature-selection method: Multiple hypothesis testing (F-test with FDR)

The F-test statistic, or its associated p-value, can be used as a threshold score for the association between a feature and the response. The false discovery rate (FDR) can be used to detect true positives while controlling Type I errors at a designated level. An F-test between the methylation value of each CpG and TL is conducted, and those CpG sites with a Benjamini-Hochberg [36] FDR below a specified value are selected.
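
A minimal sketch of this step, assuming scikit-learn's f_regression for the per-CpG F-tests and statsmodels for the Benjamini-Hochberg adjustment (the toy data and the 0.05 level are illustrative, not taken from the study):

import numpy as np
from sklearn.feature_selection import f_regression
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
X = rng.random((100, 500))   # toy methylation matrix (samples x CpGs)
y = rng.random(100)          # toy telomere-length vector

f_stat, p_values = f_regression(X, y)                 # one F-test per CpG against TL
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
selected = np.where(reject)[0]                        # CpGs passing the FDR threshold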

Type: Filter—univariate
Feature-selection method: Mutual Information

Mutual information can be formulated as \(MI = H(x) - H(x|y)\), where H(x) is the entropy of feature x and H(x|y) denotes the entropy of feature x after observing feature y. The mutual-information values per feature are typically ranked, with a threshold utilised to remove the most redundant features (CpGs) [43].
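
Mutual information can be computed between pairs of features (to flag redundancy) or between each feature and the response. As an illustration of the latter, a hedged sketch using scikit-learn's mutual_info_regression to rank CpGs by estimated MI with TL; the toy data and the top-100 cut-off are assumptions:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.random((100, 500))   # toy methylation matrix (samples x CpGs)
y = rng.random(100)          # toy telomere-length vector

mi = mutual_info_regression(X, y, random_state=0)   # estimated MI between each CpG and TL
top_cpgs = np.argsort(mi)[::-1][:100]               # keep the 100 highest-MI CpGs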

Type: Filter—regression
Feature-selection method: Support Vector Regression

The absolute values of the weights (coefficients) yielded by the support vector regression (SVR) algorithm can be utilised to create a set of ranked features. In the case of a linear kernel, the SVR model can take the form \(\text{prediction}(x) = b + w^{T}x\), where \(w = \sum_{i} \alpha_{i} x_{i}\), so the vector of weights w is directly accessible. Features with higher absolute weights are considered more likely to be useful for model training and prediction; conversely, features with smaller weights are thought to have little influence on predictions [61].
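
A minimal sketch of ranking CpGs by |w| from a linear-kernel SVR in scikit-learn; the toy data and default hyperparameters are assumptions rather than the settings used in the study:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((100, 300))   # toy methylation matrix (samples x CpGs)
y = rng.random(100)          # toy telomere-length vector

svr = SVR(kernel="linear").fit(X, y)     # linear kernel exposes w = sum_i alpha_i x_i
abs_w = np.abs(svr.coef_).ravel()        # |w_j| per CpG
ranked = np.argsort(abs_w)[::-1]         # CpGs ordered from largest to smallest weight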

Type: Filter—ensemble
Feature-selection method: Random Forest Regression

Random forests [62] can handle correlated data and high dimensionality [63]. This ensemble method for classification and regression utilises bagging (bootstrap subsets of samples) and random feature subsampling (subsets of features at each split) to ensure diversity across constituent tree models. As each tree only uses a portion of the samples in its construction, the remaining out-of-bag samples can be used to generate feature importance scores via feature value shuffling, with the impact of this shuffling assessed over the whole ensemble [64].
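
As a rough sketch, permutation-based importances from a random forest can be obtained with scikit-learn; note that permutation_importance shuffles features on a held-out split rather than strictly on each tree's out-of-bag samples, and the toy data and hyperparameters are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 300))   # toy methylation matrix (samples x CpGs)
y = rng.random(200)          # toy telomere-length vector

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranked = np.argsort(perm.importances_mean)[::-1]   # CpGs ranked by mean score drop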

Type: Embedded
Feature-selection method: Elastic net

This is a regularised regression method and an embedded feature-selection approach. It includes both the l1 and l2 norms in the objective function and tunes the balance between the two norms using a hyperparameter [65].
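
A minimal sketch using scikit-learn's ElasticNet, where l1_ratio is the hyperparameter balancing the two norms; the toy data and the alpha/l1_ratio values are assumptions:

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.random((100, 500))   # toy methylation matrix (samples x CpGs)
y = rng.random(100)          # toy telomere-length vector

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio tunes the l1/l2 balance
selected = np.flatnonzero(enet.coef_)                  # CpGs with non-zero coefficients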

Type: Embedded
Feature-selection method: XGBoost

XGBoost [66] utilises gradient-boosted decision trees and can generate feature importance scores from the degree to which each feature’s split points improve performance, weighted by the number of observations associated with a node [67]. BoostARoota [68] is an embedded method which uses XGBoost as its base learner and returns a reduced feature set through regularisation.
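
For illustration (not the BoostARoota implementation itself), split-based importance scores can be read from a fitted XGBoost regressor; the exact importance type reported by feature_importances_ depends on the library version, and the toy data and hyperparameters are assumptions:

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 300))   # toy methylation matrix (samples x CpGs)
y = rng.random(100)          # toy telomere-length vector

model = XGBRegressor(n_estimators=200, max_depth=3).fit(X, y)
scores = model.feature_importances_        # importance per CpG from split improvements
ranked = np.argsort(scores)[::-1]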

Type: Transformative
Feature-selection method: Principal Component Analysis (PCA)

PCA is applied to a data set of variables that are, in general, inter-correlated. It finds new variables that are linear combinations of the original variables, maximise variance and are uncorrelated with each other [69].
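
A minimal sketch with scikit-learn's PCA; the number of components is an arbitrary choice for illustration, and the toy data stand in for the methylation matrix:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 500))   # toy methylation matrix (samples x CpGs)

pca = PCA(n_components=20)
Z = pca.fit_transform(X)                       # uncorrelated linear combinations of CpGs
var_explained = pca.explained_variance_ratio_  # variance captured by each component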