
Table 7 Comparison of BioALBERT versus SOTA methods in BioNLP tasks

From: Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT

Columns Base1–Large4 are the eight BioALBERT variants.

| Dataset | SOTA | Base1 | Base2 | Large1 | Large2 | Base3 | Base4 | Large3 | Large4 | Difference over SOTA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Named entity recognition task** | | | | | | | | | | |
| Share/Clefe | 75.40 | 94.27 | 94.47 | 93.16 | 94.30 | **94.84*** | 94.82 | 94.70 | 94.66 | 19.44 \(\uparrow\) |
| BC5CDR (disease) | 87.15 | 97.66 | 97.62 | **97.78*** | 97.61 | 90.03 | 90.01 | 90.29 | 91.44 | 10.63 \(\uparrow\) |
| BC5CDR (chemical) | 93.47 | 97.90 | **98.08*** | 97.76 | 97.79 | 89.83 | 90.08 | 90.01 | 91.48 | 4.61 \(\uparrow\) |
| JNLPBA | 82.00 | 82.72 | 83.22 | 84.01 | 83.53 | **86.74*** | 86.56 | 86.20 | 85.72 | 4.74 \(\uparrow\) |
| Linnaeus | 93.54 | 99.71 | 99.72 | 99.73 | **99.73*** | 95.72 | 98.27 | 98.24 | 98.23 | 6.19 \(\uparrow\) |
| NCBI (disease) | 89.71 | 95.89 | 95.61 | **97.18*** | 95.85 | 85.82 | 85.93 | 85.86 | 85.83 | 7.47 \(\uparrow\) |
| S800 | 75.31 | 98.76 | 98.49 | **99.02*** | 98.72 | 93.53 | 93.63 | 93.63 | 93.63 | 23.71 \(\uparrow\) |
| BC2GM | 85.10 | 96.34 | 96.02 | **96.97*** | 96.33 | 83.35 | 83.38 | 83.44 | 84.72 | 11.87 \(\uparrow\) |
| BLURB | 84.61 | 95.41 | 95.41 | **95.70*** | 95.48 | 89.98 | 90.34 | 90.30 | 90.71 | 11.09 \(\uparrow\) |
| **Relation extraction task** | | | | | | | | | | |
| DDI | 82.36 | 82.32 | 79.98 | 83.76 | **84.05*** | 76.22 | 75.57 | 76.28 | 76.46 | 1.69 \(\uparrow\) |
| ChemProt | 77.50 | **78.32*** | 76.42 | 77.77 | 77.97 | 62.85 | 62.34 | 61.69 | 57.46 | 0.82 \(\uparrow\) |
| i2b2 | 76.40 | 76.49 | 76.54 | **76.86*** | 76.81 | 73.83 | 73.08 | 72.19 | 75.09 | 0.46 \(\uparrow\) |
| Euadr | **86.51** | 82.32 | 74.07 | 84.56 | 81.32 | 62.52 | 76.93 | 70.41 | 70.48 | −1.95 \(\downarrow\) |
| GAD | **84.30** | 73.82 | 66.32 | 76.74 | 69.65 | 72.68 | 69.14 | 71.81 | 68.17 | −7.56 \(\downarrow\) |
| BLURB | 79.14 | 78.66 | 74.67 | **79.94*** | 77.96 | 69.62 | 71.41 | 70.50 | 69.53 | 0.80 \(\uparrow\) |
| **Sentence similarity task** | | | | | | | | | | |
| BIOSSES | 92.30 | 82.27 | 73.14 | **92.80*** | 81.90 | 24.94 | 55.80 | 47.86 | 30.48 | 0.50 \(\uparrow\) |
| MedSTS | 84.80 | 85.70 | 85.00 | **85.70*** | 85.40 | 51.80 | 56.70 | 45.80 | 42.00 | 0.90 \(\uparrow\) |
| BLURB | 88.20 | 83.99 | 79.07 | **89.25*** | 83.65 | 38.37 | 56.25 | 46.83 | 36.24 | 1.05 \(\uparrow\) |
| **Inference task** | | | | | | | | | | |
| MedNLI | **84.00** | 77.69 | 76.35 | 79.38 | 79.52 | 78.25 | 77.20 | 76.34 | 75.51 | −4.48 \(\downarrow\) |
| **Document classification task** | | | | | | | | | | |
| HoC | 87.30 | 83.21 | 84.52 | **87.92*** | 84.32 | 64.20 | 75.20 | 61.00 | 81.70 | 0.62 \(\uparrow\) |
| **Question answering task** | | | | | | | | | | |
| BioASQ 4b | 47.82 | 47.90 | 48.34 | **48.90*** | 48.25 | 47.10 | 47.35 | 45.90 | 46.10 | 1.08 \(\uparrow\) |
| BioASQ 5b | 60.00 | 61.10 | 61.90 | **62.31*** | 61.57 | 58.54 | 59.21 | 58.98 | 58.50 | 2.31 \(\uparrow\) |
| BioASQ 6b | 57.77 | 59.80 | 62.00 | **62.88*** | 61.54 | 56.10 | 56.22 | 56.60 | 56.85 | 5.11 \(\uparrow\) |
| BLURB | 55.20 | 56.27 | 57.41 | **58.03*** | 57.12 | 53.91 | 54.26 | 53.83 | 53.82 | 2.83 \(\uparrow\) |

  1. The ‘difference over SOTA’ column indicates the absolute change (\(\uparrow\) for an increase, \(\downarrow\) for a decrease) in metric performance relative to SOTA; a worked example of this arithmetic follows these notes. Bold marks the best result. The SOTA performances are taken from the following studies: (1) JNLPBA, BC2GM, ChemProt, and GAD from Yuan et al. [15] (KeBioLM); (2) DDI and BIOSSES from Gu et al. [14] (PubMedBERT); (3) Share/Clefe, i2b2, MedSTS, MedNLI, and HoC from Peng et al. [12] (BLUE); (4) BC5CDR (disease), BC5CDR (chemical), NCBI (disease), S800, Euadr, BioASQ 4b, BioASQ 5b, and BioASQ 6b from Lee et al. [13] (BioBERT); and (5) LINNAEUS from Giorgi and Bader [33]. The biomedical language understanding and reasoning benchmark (BLURB) score is the average over all tasks used in previous studies [14, 15]
  2. *Indicates that BioALBERT (bold) achieved a statistically significant (\(p < 0.05\)) performance improvement over the SOTA model under a one-sample t-test (a short sketch of this test follows below)
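As a worked instance of the difference arithmetic described in note 1, the reported value is the gap between the best-scoring model and the SOTA baseline; on the Share/Clefe row, for example:

\[
\Delta = \max_i \mathrm{BioALBERT}_i - \mathrm{SOTA} = 94.84 - 75.40 = 19.44 \;(\uparrow)
\]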
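To make the significance test in note 2 concrete, the sketch below runs a one-sample t-test with SciPy, treating the published SOTA score as the reference mean. The per-run scores are hypothetical placeholders (the paper does not list individual run scores); only the SOTA value and the \(p < 0.05\) threshold come from the table.

```python
# Minimal sketch of the one-sample t-test described in note 2.
# The per-run F1 scores below are hypothetical; only the SOTA score
# (75.40, Share/Clefe row) and the p < 0.05 threshold come from the table.
from scipy import stats

bioalbert_runs = [94.84, 94.71, 94.90, 94.78, 94.88]  # hypothetical repeated runs
sota_score = 75.40  # published SOTA score used as the reference mean

# One-sided test: is the mean BioALBERT score greater than the SOTA reference?
t_stat, p_value = stats.ttest_1samp(
    bioalbert_runs, popmean=sota_score, alternative="greater"
)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
if p_value < 0.05:
    print("Improvement over SOTA is significant at p < 0.05 (marked * in the table).")
```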