From: Faster and more accurate pathogenic combination predictions with VarCoPP2.0
VarCoPP2.0 | Version | 2.0 |
---|---|---|
Data | Provenance | |
Data | Dataset splits | Training: 301 positive and 150,500 negative instances. Validation: 53 positive and 10,000 negative instances. Training uses stratified leave-one-group-out (LOGO) cross-validation |
Data | Redundancy between data splits | No overlap |
Data | Availability of data | Yes: olida.ibsquare.be (new curated data will be added) and www.internationalgenome.org |
Optimization | Algorithm | Balanced Random Forest |
Optimization | Meta-predictions | Yes: the CADD and ISPP features are themselves outputs of predictive models |
Optimization | Data encoding | Global features |
Optimization | Parameters | 400 decision trees in the random forest |
Optimization | Features | 15 features, selected with a wrapper approach on the training data only, using the mean F1 score over 5-fold cross-validation |
Optimization | Fitting | Decision trees are pruned to avoid overfitting |
Optimization | Regularization | No |
Optimization | Availability of configuration | |
Model | Interpretability | Transparent model, 400 decision trees |
Model | Output | Probability, thresholded to a classification |
Model | Execution time | 10,000 samples in 0.2 seconds |
Model | Availability of software | ORVAL: https://orval.ibsquare.be and GitHub: https://github.com/oligogenic/VarCoPP2.0 |
Evaluation | Evaluation method | Both stratified LOGO cross-validation and independent validation data |
Evaluation | Performance measures | Average precision score, precision, recall, specificity, F1 score and geometric mean |
Evaluation | Comparison | Confusion matrix and the performance measures above, computed for the previous version of the model and for the model retrained on new data |
Evaluation | Confidence | Performance differences are apparent |
Evaluation | Availability of evaluation | Yes: GitHub: https://github.com/oligogenic/VarCoPP2.0 |
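
The Output and Performance measures rows describe a probability that is thresholded to a class label and then scored with confusion-matrix measures. The sketch below illustrates that arithmetic in plain Python and is not the authors' evaluation code. The 0.5 cutoff, the toy labels and probabilities, and the function names are all hypothetical. The geometric mean is assumed to be the usual G-mean, the square root of recall times specificity. Average precision is left out because it is computed from the ranked probabilities rather than from a single confusion matrix.

```python
import math


def threshold(probabilities, cutoff=0.5):
    """Turn predicted probabilities into 0/1 labels (cutoff is an assumed value)."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred):
    """Precision, recall, specificity, F1 and G-mean from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    f1 = _ratio(2 * precision * recall, precision + recall)
    # Assumed definition: G-mean of the two per-class recalls.
    g_mean = math.sqrt(recall * specificity)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "g_mean": g_mean,
    }


# Toy example (made-up values, not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = threshold(probs)
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With classes as imbalanced as those in the Dataset splits row, specificity and the G-mean show behaviour on the large negative class that precision and recall alone do not.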