Table 4 DOME Table consisting of essential information to assess the machine learning approach [22]

From: Faster and more accurate pathogenic combination predictions with VarCoPP2.0

| VarCoPP2.0 | Version | 2.0 |
| --- | --- | --- |
| Data | Provenance | OLIDA [21] and 1000 Genomes Project [12], with a 1:500 ratio of positive to negative instances |
| | Dataset splits | Training data: 301 positive and 150,500 negative instances. Validation set: 53 positive and 10,000 negative instances. Training uses stratified LOGO cross-validation |
| | Redundancy between data splits | No overlap |
| | Availability of data | Yes: olida.ibsquare.be (new curated data will be added) and www.internationalgenome.org |
| Optimization | Algorithm | Balanced Random Forest |
| | Meta-predictions | Yes: the CADD features and the ISPP features each stem from a predictive model |
| | Data encoding | Global features |
| | Parameters | 400 decision trees in the random forest |
| | Features | 15 features, selected with a wrapper approach on the training data only, using the mean F1 score of 5-fold cross-validation |
| | Fitting | Decision trees are pruned to avoid overfitting |
| | Regularization | No |
| | Availability of configuration | Yes: https://github.com/oligogenic/VarCoPP2.0 |
| Model | Interpretability | Transparent model, 400 decision trees |
| | Output | Probability, thresholded to obtain a classification |
| | Execution time | 10,000 samples in 0.2 seconds |
| | Availability of software | ORVAL: https://orval.ibsquare.be and GitHub: https://github.com/oligogenic/VarCoPP2.0 |
| Evaluation | Evaluation method | Both stratified LOGO cross-validation and independent validation data |
| | Performance measures | Average precision score, precision, recall, specificity, F1 and geometric mean |
| | Comparison | Confusion matrix and the aforementioned performance measures, computed for the previous version of the model and for a model retrained on the new data |
| | Confidence | Performance differences apparent |
| | Availability of evaluation | Yes: GitHub: https://github.com/oligogenic/VarCoPP2.0 |
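
The threshold-based performance measures listed in the table (precision, recall, specificity, F1 and geometric mean) can all be derived from a confusion matrix once the predicted probability has been thresholded. The following is a minimal sketch, not the VarCoPP2.0 evaluation code. The function names, the 0.5 threshold and the toy inputs are assumptions made for illustration, and the geometric mean is assumed here to be the square root of recall times specificity, a common definition for imbalanced binary problems that the table itself does not spell out. Average precision is omitted because it is computed over all thresholds rather than from a single confusion matrix.

```python
from math import sqrt


def confusion_counts(probabilities, labels, threshold=0.5):
    """Threshold predicted probabilities and count TP, FP, TN, FN.

    The threshold of 0.5 is an illustrative default, not a value
    taken from the source table.
    """
    tp = fp = tn = fn = 0
    for prob, label in zip(probabilities, labels):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Compute threshold-based measures from confusion-matrix counts.

    Any ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Assumed definition: geometric mean of recall and specificity.
    geometric_mean = sqrt(recall * specificity)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "geometric_mean": geometric_mean,
    }


if __name__ == "__main__":
    # Toy data for illustration only, not results from the paper.
    toy_probabilities = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
    toy_labels = [1, 1, 1, 0, 0, 0]
    counts = confusion_counts(toy_probabilities, toy_labels)
    for name, value in binary_metrics(*counts).items():
        print(f"{name}: {value:.3f}")
```

With a class ratio as skewed as 1:500, accuracy alone would be dominated by the negative class, which is one reason measures such as specificity and the geometric mean are reported alongside precision and recall.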