Table 1 Feature selection

From: VarSight: prioritizing clinically reported variants with binary classification algorithms

| Feature label | RF(sklearn) | BRF(imblearn) |
|---|---|---|
| HPO-cosine | 0.2895 | 0.2471 |
| PyxisMap | 0.2207 | 0.2079 |
| CADD Scaled | 0.1031 | 0.1007 |
| phylop100 conservation | 0.0712 | 0.0817 |
| phylop conservation | 0.0641 | 0.0810 |
| phastcon100 conservation | 0.0572 | 0.0628 |
| GERP rsScore | 0.0357 | 0.0416 |
| HGMD assessment type_DM | 0.0373 | 0.0344 |
| HGMD association confidence_High | 0.0309 | 0.0311 |
| Gnomad Genome total allele count | 0.0192 | 0.0322 |
| ClinVar Classification_Pathogenic | 0.0228 | 0.0200 |
| ADA Boost Splice Prediction | 0.0081 | 0.0109 |
| Random Forest Splice Prediction | 0.0077 | 0.0105 |
| Meta Svm Prediction_D | 0.0088 | 0.0092 |
| PolyPhen HV Prediction_D | 0.0075 | 0.0071 |
| Effects_Premature stop | 0.0049 | 0.0057 |
| SIFT Prediction_D | 0.0026 | 0.0056 |
| PolyPhen HD Prediction_D | 0.0025 | 0.0049 |
| Effects_Possible splicing modifier | 0.0029 | 0.0035 |
| ClinVar Classification_Likely Pathogenic | 0.0034 | 0.0020 |

  1. This table lists the top 20 features used to train the classifiers, ordered from most to least important. After training, each of the two random forest classifiers reports an importance value for every feature (the values sum to 1.00 per classifier). We average the two importance values and order the features by that average. Feature labels containing an ‘_’ represent a single category of a multi-category feature (e.g. “HGMD assessment type_DM” is the “DM” bin-count feature from the “HGMD assessment type” annotation in Codicem).
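
The ordering rule in the footnote can be checked directly against the table values. The following is a minimal sketch, not the authors' code: the importance values are copied from Table 1, while the names `TABLE_1` and `rank_by_mean_importance` are hypothetical and introduced only for illustration. In scikit-learn and imbalanced-learn, per-feature values of this kind are exposed by a fitted random forest through its `feature_importances_` attribute; here they are simply hard-coded.

```python
# Importance values copied from Table 1: (feature label, RF(sklearn), BRF(imblearn)).
# TABLE_1 and rank_by_mean_importance are illustrative names, not from the paper.
TABLE_1 = [
    ("HPO-cosine", 0.2895, 0.2471),
    ("PyxisMap", 0.2207, 0.2079),
    ("CADD Scaled", 0.1031, 0.1007),
    ("phylop100 conservation", 0.0712, 0.0817),
    ("phylop conservation", 0.0641, 0.0810),
    ("phastcon100 conservation", 0.0572, 0.0628),
    ("GERP rsScore", 0.0357, 0.0416),
    ("HGMD assessment type_DM", 0.0373, 0.0344),
    ("HGMD association confidence_High", 0.0309, 0.0311),
    ("Gnomad Genome total allele count", 0.0192, 0.0322),
    ("ClinVar Classification_Pathogenic", 0.0228, 0.0200),
    ("ADA Boost Splice Prediction", 0.0081, 0.0109),
    ("Random Forest Splice Prediction", 0.0077, 0.0105),
    ("Meta Svm Prediction_D", 0.0088, 0.0092),
    ("PolyPhen HV Prediction_D", 0.0075, 0.0071),
    ("Effects_Premature stop", 0.0049, 0.0057),
    ("SIFT Prediction_D", 0.0026, 0.0056),
    ("PolyPhen HD Prediction_D", 0.0025, 0.0049),
    ("Effects_Possible splicing modifier", 0.0029, 0.0035),
    ("ClinVar Classification_Likely Pathogenic", 0.0034, 0.0020),
]


def rank_by_mean_importance(rows):
    """Average the two importance values per feature and sort in descending order."""
    averaged = [(name, (rf + brf) / 2) for name, rf, brf in rows]
    return sorted(averaged, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_mean_importance(TABLE_1)
for name, mean_importance in ranked:
    print(f"{name:<42} {mean_importance:.4f}")
```

Sorting by the averaged value reproduces the row order of the table, which is why some rows appear out of order when only one classifier's column is read (for example, "Gnomad Genome total allele count" sits above "ClinVar Classification_Pathogenic" despite its lower RF(sklearn) value).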