Table 23 Results of constituent parsers using retrained CRAFT models, for each CRAFT fold and the development set, compared with untrained results on the development set: labeled bracket precision (LB-P), recall (LB-R), and F-score (LB-F)

From: A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

| Parser | Metric | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Training average | Dev set | Dev set (untrained) |
|---|---|---|---|---|---|---|---|---|---|
| Berkeley | LB-P | 82.75 | 92.02 | 84.63 | 83.70 | 83.85 | 85.39 | 83.98 | 61.60 |
| | LB-R | 82.64 | 90.82 | 84.01 | 83.29 | 82.88 | 84.73 | 83.20 | 64.50 |
| | LB-F | 82.70 | 91.41 | 84.32 | 83.49 | 83.36 | 85.06 | 83.59 | 63.02 |
| Bikel | LB-P | 80.49 | 81.10 | 81.18 | 80.77 | 91.43 | 82.99 | 80.86 | 63.97 |
| | LB-R | 79.68 | 79.77 | 80.10 | 80.46 | 91.06 | 82.21 | 80.44 | 65.82 |
| | LB-F | 80.08 | 80.43 | 80.64 | 80.62 | 91.24 | 82.60 | 80.65 | 64.89 |
| Stanford 1.6.6 | LB-P | 75.65 | 75.86 | 77.71 | 76.21 | 77.86 | 76.65 | 76.17 | 60.76 |
| | LB-R | 76.81 | 76.84 | 78.65 | 77.24 | 77.85 | 77.48 | 75.92 | 64.70 |
| | LB-F | 76.23 | 76.34 | 78.18 | 76.72 | 77.86 | 77.07 | 76.04 | 62.67 |
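The LB-F column is the harmonic mean of labeled bracket precision and recall, the standard F-score used in bracket-based parser evaluation. A minimal sketch (the function name is ours, not from the paper), checked against the Berkeley and Bikel development-set rows above:

```python
def labeled_bracket_f(precision: float, recall: float) -> float:
    """F-score as the harmonic mean of labeled bracket precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Berkeley dev set: LB-P 83.98, LB-R 83.20 -> LB-F 83.59
print(round(labeled_bracket_f(83.98, 83.20), 2))  # → 83.59

# Bikel dev set: LB-P 80.86, LB-R 80.44 -> LB-F 80.65
print(round(labeled_bracket_f(80.86, 80.44), 2))  # → 80.65
```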