Table 4 Parsing results on the test set with predicted POS tags and gold tokenization (except [\(\mathcal {G}\)], which denotes results when employing gold POS tags in both the training and testing phases)

From: From POS tagging to dependency parsing for biomedical event extraction

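| Corpus | Parser type | System | With punct., Overall LAS | With punct., Overall UAS | With punct., Exact match LAS | With punct., Exact match UAS | Without punct., Overall LAS | Without punct., Overall UAS | Without punct., Exact match LAS | Without punct., Exact match UAS |
|---|---|---|---|---|---|---|---|---|---|---|
| GENIA | Pre-trained | Stanford-NNdep [ ∙] | 86.66 | 88.22 | 25.15 | 29.26 | 87.31 | 89.02 | 25.88 | 30.22 |
| GENIA | Pre-trained | Stanford-Biaffine-v1 [ ∙] | 84.69 | 87.95 | 16.25 | 26.10 | 84.92 | 88.55 | 16.99 | 28.24 |
| GENIA | Pre-trained | Stanford-NNdep | 86.79 | 88.13 | 25.22 | 29.19 | 87.43 | 88.91 | 25.88 | 30.15 |
| GENIA | Pre-trained | Stanford-Biaffine-v1 | 84.72 | 87.89 | 16.47 | 25.81 | 84.94 | 88.45 | 17.06 | 27.79 |
| GENIA | Pre-trained | BLLIP+Bio | 88.38 | 89.92 | 28.82 | 35.96 | 88.76 | 90.49 | 29.93 | 37.43 |
| GENIA | Retrained | Stanford-NNdep | 87.02 | 88.34 | 25.74 | 30.07 | 87.56 | 89.02 | 26.03 | 30.59 |
| GENIA | Retrained | NLP4J-dep | 88.20 | 89.45 | 28.16 | 31.99 | 88.87 | 90.25 | 28.90 | 32.94 |
| GENIA | Retrained | jPTDP-v1 | 90.01 | 91.46 | 29.63 | 35.74 | 90.27 | 91.89 | 30.29 | 37.06 |
| GENIA | Retrained | Stanford-Biaffine-v2 | 91.04 | 92.31 | 33.38 | 39.56 | 91.23 | 92.64 | 34.41 | 41.10 |
| GENIA | Retrained | Stanford-Biaffine-v2 [\(\mathcal {G}\)] | 91.68 | 92.51 | 36.99 | 40.44 | 91.92 | 92.84 | 38.01 | 41.84 |
| CRAFT | Retrained | Stanford-NNdep | 84.76 | 86.64 | 25.31 | 30.40 | 85.59 | 87.81 | 25.48 | 30.96 |
| CRAFT | Retrained | NLP4J-dep | 86.98 | 88.85 | 27.60 | 33.71 | 87.62 | 89.80 | 28.16 | 34.60 |
| CRAFT | Retrained | jPTDP-v1 | 88.27 | 90.08 | 29.68 | 36.06 | 88.66 | 90.79 | 30.24 | 37.12 |
| CRAFT | Retrained | Stanford-Biaffine-v2 | 90.41 | 92.02 | 33.20 | 40.03 | 90.77 | 92.67 | 33.87 | 41.10 |
| CRAFT | Retrained | Stanford-Biaffine-v2 [\(\mathcal {G}\)] | 91.43 | 92.93 | 35.22 | 41.99 | 91.69 | 93.47 | 35.61 | 42.95 |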

“Without punctuation” refers to results that exclude punctuation and other symbols from evaluation. “Exact match” denotes the percentage of sentences whose predicted trees are entirely correct [25]. [ ∙] denotes the use of the pre-trained Stanford tagger, instead of the retrained NLP4J-POS model, for predicting POS tags on the test set. Score differences between the “retrained” parsers on both corpora are significant at p≤0.001 using McNemar’s test (except the UAS scores obtained by Stanford-Biaffine-v2 with gold and predicted POS tags on GENIA, i.e. 92.51 vs. 92.31 and 92.84 vs. 92.64, where p≤0.05).
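
To make the column definitions concrete, the following is a minimal sketch of how LAS, UAS, and exact-match percentages of this kind could be computed, with and without punctuation. It is not the evaluation script used for the table. The function names (`is_punct`, `parsing_scores`), the token tuple layout, the rule that a token with no alphanumeric character counts as punctuation or a symbol, and the reading of the LAS/UAS exact-match columns as labeled and unlabeled whole-tree matches are all assumptions made for illustration. The two toy sentences are invented.

```python
from typing import Dict, List, Tuple

# One token: (form, gold_head, gold_label, predicted_head, predicted_label)
Token = Tuple[str, int, str, int, str]


def is_punct(form: str) -> bool:
    # Assumption: a token is punctuation or a symbol if it contains no
    # alphanumeric character. Real evaluation scripts may rely on POS tags.
    return not any(ch.isalnum() for ch in form)


def parsing_scores(sentences: List[List[Token]],
                   include_punct: bool = True) -> Dict[str, float]:
    total = uas_hits = las_hits = 0
    n_sents = uas_exact = las_exact = 0
    for sent in sentences:
        # Optionally drop punctuation/symbol tokens before scoring.
        tokens = [t for t in sent if include_punct or not is_punct(t[0])]
        if not tokens:
            continue
        head_ok = [t[1] == t[3] for t in tokens]  # correct head (UAS)
        both_ok = [ok and t[2] == t[4]            # correct head and label (LAS)
                   for ok, t in zip(head_ok, tokens)]
        total += len(tokens)
        uas_hits += sum(head_ok)
        las_hits += sum(both_ok)
        n_sents += 1
        uas_exact += all(head_ok)  # whole tree correct, ignoring labels
        las_exact += all(both_ok)  # whole tree correct, including labels
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return {
        "LAS": 100.0 * las_hits / total,
        "UAS": 100.0 * uas_hits / total,
        "exact_LAS": 100.0 * las_exact / n_sents,
        "exact_UAS": 100.0 * uas_exact / n_sents,
    }


# Invented toy data: the first sentence is parsed perfectly; the second has
# one wrong label ("It") and one wrong head (the final period).
toy = [
    [("IL-2", 2, "nsubj", 2, "nsubj"), ("binds", 0, "root", 0, "root"),
     (".", 2, "punct", 2, "punct")],
    [("It", 2, "nsubj", 2, "dobj"), ("works", 0, "root", 0, "root"),
     (".", 2, "punct", 1, "punct")],
]

with_punct = parsing_scores(toy, include_punct=True)
without_punct = parsing_scores(toy, include_punct=False)
print(with_punct)
print(without_punct)
```

On the toy data, dropping punctuation removes the only head error, so UAS rises from 5/6 to 4/4, while the label error on "It" keeps LAS at 3/4. The same effect, on a much smaller scale, is why the "without punctuation" columns in the table are slightly higher than the "with punctuation" ones.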