Table 23 Results of constituent parsers using retrained CRAFT models, for each CRAFT fold and the development set, compared with untrained results on the development set: labeled bracket precision (LB-P), recall (LB-R), and F-score (LB-F)

From: A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

| Parser | Metric | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Training average | Dev set | Dev set (untrained) |
|---|---|---|---|---|---|---|---|---|---|
| Berkeley | LB-P | 82.75 | 92.02 | 84.63 | 83.70 | 83.85 | 85.39 | 83.98 | 61.60 |
| | LB-R | 82.64 | 90.82 | 84.01 | 83.29 | 82.88 | 84.73 | 83.20 | 64.50 |
| | LB-F | 82.70 | 91.41 | 84.32 | 83.49 | 83.36 | 85.06 | 83.59 | 63.02 |
| Bikel | LB-P | 80.49 | 81.10 | 81.18 | 80.77 | 91.43 | 82.99 | 80.86 | 63.97 |
| | LB-R | 79.68 | 79.77 | 80.10 | 80.46 | 91.06 | 82.21 | 80.44 | 65.82 |
| | LB-F | 80.08 | 80.43 | 80.64 | 80.62 | 91.24 | 82.60 | 80.65 | 64.89 |
| Stanford 1.6.6 | LB-P | 75.65 | 75.86 | 77.71 | 76.21 | 77.86 | 76.65 | 76.17 | 60.76 |
| | LB-R | 76.81 | 76.84 | 78.65 | 77.24 | 77.85 | 77.48 | 75.92 | 64.70 |
| | LB-F | 76.23 | 76.34 | 78.18 | 76.72 | 77.86 | 77.07 | 76.04 | 62.67 |
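The LB-F column is the harmonic mean of labeled bracket precision and recall, the standard F-score used in bracket-based parser evaluation. A minimal sketch (the function name is ours, not from the paper), checked against the Berkeley and Bikel development-set rows above:

```python
def labeled_bracket_f(precision: float, recall: float) -> float:
    """F-score as the harmonic mean of labeled bracket precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Berkeley dev set: LB-P 83.98, LB-R 83.20 -> LB-F 83.59
print(round(labeled_bracket_f(83.98, 83.20), 2))  # → 83.59

# Bikel dev set: LB-P 80.86, LB-R 80.44 -> LB-F 80.65
print(round(labeled_bracket_f(80.86, 80.44), 2))  # → 80.65
```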