Table 7 Classifier performance in the enzyme dataset

From: Improving classification in protein structure databases using text mining

CONDITIONS                                                     PERFORMANCE
                                                               Superfamily classification
                                                               (N = 352, 5 classes)
Condition  Test set     Reference set      Lucene analyser     AUC    MCC    F
1          Abstract     Dp20 – DX33 – Ann  Stop                0.75   0.51   0.56
2          Annotations  Dp20 – DX33 – Ann  Stop                0.77   0.53   0.58
3          Abstract     Dp20 – Ann         Stop                0.74   0.50   0.53
4          Abstract     Dp20 – Ann         Standard            0.70   0.33   0.44
5          Abstract     Dp20               Stop                0.74   0.49   0.52
6          Abstract     Dp1                Stop                0.64   0.31   0.40
Classifier performance was assessed using AUC, MCC and F-measure under six conditions of the reference set PDB145, defined by the inclusion of additional abstracts from related articles (Dp20), the inclusion of annotations (Ann), and filtering using the SVM model (DX33). Conditions 1 and 2: 20 SVM-filtered abstracts per enzyme, Stop analyser, inclusion of PDB/UniProt annotations. Condition 3: 20 abstracts per enzyme, Stop analyser, inclusion of PDB/UniProt annotations. Condition 4: 20 abstracts per enzyme, Standard analyser, inclusion of PDB/UniProt annotations. Condition 5: 20 abstracts per enzyme, Stop analyser. Condition 6: 1 abstract per enzyme, Stop analyser. For the Superfamily classification task, all 352 enzymes of REGS352 were classified into 5 superfamilies. Abstracts were used in the test set, except in condition 2, where annotations from UniProt and PDB fields were used for comparison.
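
For readers unfamiliar with the three measures reported in the table, the following is a minimal sketch of their standard binary definitions. It is not the authors' evaluation code: the source does not state how the scores were aggregated over the five superfamilies, so the one-class-versus-rest framing, the function names, and the toy counts below are all assumptions made for illustration.

```python
import math


def f_measure(tp, fp, fn):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denom


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical counts for one superfamily versus the rest (not from the paper).
    print("F   =", round(f_measure(tp=8, fp=2, fn=2), 2))
    print("MCC =", round(mcc(tp=8, fp=2, fn=2, tn=8), 2))
    # Hypothetical classifier scores for members and non-members of the class.
    print("AUC =", round(auc([0.9, 0.8], [0.7, 0.85]), 2))
```

MCC ranges from -1 to 1 and uses all four confusion counts, which helps explain why it can drop more sharply than AUC between conditions, as between conditions 3 and 4 in the table.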