Table 7 Classifier performance in the enzyme dataset

From: Improving classification in protein structure databases using text mining

Performance: Superfamily Classification (N = 352, 5 classes)

| Condition | Test Set | Reference Set | Lucene Analyser | AUC | MCC | F |
|---|---|---|---|---|---|---|
| 1 | Abstract | Dp20 – DX33 – Ann | Stop | 0.75 | 0.51 | 0.56 |
| 2 | Annotations | Dp20 – DX33 – Ann | Stop | 0.77 | 0.53 | 0.58 |
| 3 | Abstract | Dp20 – Ann | Stop | 0.74 | 0.50 | 0.53 |
| 4 | Abstract | Dp20 – Ann | Standard | 0.70 | 0.33 | 0.44 |
| 5 | Abstract | Dp20 | Stop | 0.74 | 0.49 | 0.52 |
| 6 | Abstract | Dp1 | Stop | 0.64 | 0.31 | 0.40 |

  1. Classifier performance was assessed using AUC, MCC and F-measure under six conditions of the reference set PDB145. Abbreviations: Dp20, inclusion of additional abstracts from related articles; Ann, inclusion of annotations; DX33, filtering using the SVM model. Conditions 1 and 2: 20 SVM-filtered abstracts per enzyme, Stop analyser, inclusion of PDB/UniProt annotations. Condition 3: 20 abstracts per enzyme, Stop analyser, inclusion of PDB/UniProt annotations. Condition 4: 20 abstracts per enzyme, Standard analyser, inclusion of PDB/UniProt annotations. Condition 5: 20 abstracts per enzyme, Stop analyser. Condition 6: 1 abstract per enzyme, Stop analyser. For the Superfamily classification task, all 352 enzymes of REGS352 were classified into 5 superfamilies. Abstracts were used in the test set, except in condition 2, where annotations from UniProt and PDB fields were used for comparison.
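
For readers unfamiliar with the three measures reported above, the following is a minimal sketch of their standard binary (one class versus the rest) definitions. It is not the authors' evaluation code: the source does not state how the measures were aggregated over the 5 superfamilies, and the function names (`f_measure`, `mcc`, `auc`) and all counts and scores in the usage lines are hypothetical, chosen only for illustration.

```python
import math


def f_measure(tp, fp, fn):
    """F-measure (F1): harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def auc(scores_pos, scores_neg):
    """Rank-based AUC: probability that a positive example scores
    higher than a negative one (ties count as half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical toy inputs, not data from the table.
toy_f = f_measure(tp=8, fp=2, fn=2)
toy_mcc = mcc(tp=8, tn=8, fp=2, fn=2)
toy_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3])
```

An MCC of 0 corresponds to chance-level prediction and an AUC of 0.5 to random ranking, which is a useful baseline when comparing the conditions in the table.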