Application of machine learning in SNP discovery

BMC Bioinformatics

Table 2 Comparison of ML and PolyBayes on test data set

Measure	Decision Tree	Production Rules	PolyBayes
TP	1153	1202	1435
TN	16,748	16,706	NA
FP	207	249	16,955
FN	282	233	NA
Accuracy (%)	97.3	97.4	7.8
Sensitivity (%)	80.3	83.8	100 (Set)
Specificity (%)	98.7	98.5	NA
Positive Predictive Value (%)	84.8	82.8	7.8
Negative Predictive Value (%)	98.3	98.6	NA

We define the following terms to contrast ML performance with PolyBayes. A SNP prediction program produces a true positive (TP) if it predicts a SNP that is judged true by the expert. Likewise, a false positive (FP) is a predicted SNP that is judged false by the expert, a true negative (TN) is a prediction of a non-SNP that concurs with the expert, and a false negative (FN) is a failure to identify a SNP that is identified by the expert. The following parameters were used to measure the performance of the ML output: accuracy (the fraction of candidate SNPs correctly classified), sensitivity (the fraction of expert-confirmed SNPs correctly identified), specificity (the fraction of expert-rejected candidates correctly identified), positive predictive value (the fraction of predicted SNPs that are true) and negative predictive value (the fraction of predicted non-SNPs that are truly non-SNPs):
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
Positive Predictive Value (PPV) = TP/(TP + FP)
Negative Predictive Value (NPV) = TN/(TN + FN)
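
The five measures can be recomputed directly from the four counts. The following is a minimal sketch, not the authors' code: the names `Counts` and `metrics` are hypothetical, and the input counts are copied from the Decision Tree column of Table 2. It assumes every denominator is nonzero, so it does not cover the PolyBayes column, where TN and FN are not available.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one classifier (illustrative container)."""
    tp: int  # predicted SNP, judged true by the expert
    tn: int  # predicted non-SNP, expert agrees
    fp: int  # predicted SNP, judged false by the expert
    fn: int  # expert-identified SNP that was not predicted


def metrics(c: Counts) -> Dict[str, float]:
    """Return the five measures as percentages.

    Assumes all denominators are nonzero.
    """
    total = c.tp + c.tn + c.fp + c.fn
    return {
        "accuracy": 100 * (c.tp + c.tn) / total,
        "sensitivity": 100 * c.tp / (c.tp + c.fn),
        "specificity": 100 * c.tn / (c.fp + c.tn),
        "ppv": 100 * c.tp / (c.tp + c.fp),
        "npv": 100 * c.tn / (c.tn + c.fn),
    }


# Decision Tree counts copied from Table 2.
decision_tree = Counts(tp=1153, tn=16748, fp=207, fn=282)
for name, value in metrics(decision_tree).items():
    print(f"{name}: {value:.1f}")
```

Run on the Decision Tree counts, this agrees with the tabulated accuracy, sensitivity, PPV and NPV. Specificity evaluates to about 98.8 rather than the tabulated 98.7, which may reflect truncation rather than rounding in the table.
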
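Application of the machine learning programs substantially reduces the number of false positives, from 16,955 with PolyBayes to 207 with the decision tree and 249 with the production rules. Other statistical measures also demonstrate a considerable advantage for machine learning: accuracy rises from 7.8% to above 97%, and positive predictive value from 7.8% to above 82%.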

ISSN: 1471-2105