
Table 7 Comparison of BioALBERT versus SOTA methods in BioNLP tasks

From: Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT

Columns Base1–Large4 are the eight BioALBERT variants.

| Dataset | SOTA | Base1 | Base2 | Large1 | Large2 | Base3 | Base4 | Large3 | Large4 | Difference over SOTA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Named entity recognition task** | | | | | | | | | | |
| Share/Clefe | 75.40 | 94.27 | 94.47 | 93.16 | 94.30 | **94.84*** | 94.82 | 94.70 | 94.66 | 19.44 \(\uparrow\) |
| BC5CDR (disease) | 87.15 | 97.66 | 97.62 | **97.78*** | 97.61 | 90.03 | 90.01 | 90.29 | 91.44 | 10.63 \(\uparrow\) |
| BC5CDR (chemical) | 93.47 | 97.90 | **98.08*** | 97.76 | 97.79 | 89.83 | 90.08 | 90.01 | 91.48 | 4.61 \(\uparrow\) |
| JNLPBA | 82.00 | 82.72 | 83.22 | 84.01 | 83.53 | **86.74*** | 86.56 | 86.20 | 85.72 | 4.74 \(\uparrow\) |
| Linnaeus | 93.54 | 99.71 | 99.72 | 99.73 | **99.73*** | 95.72 | 98.27 | 98.24 | 98.23 | 6.19 \(\uparrow\) |
| NCBI (disease) | 89.71 | 95.89 | 95.61 | **97.18*** | 95.85 | 85.82 | 85.93 | 85.86 | 85.83 | 7.47 \(\uparrow\) |
| S800 | 75.31 | 98.76 | 98.49 | **99.02*** | 98.72 | 93.53 | 93.63 | 93.63 | 93.63 | 23.71 \(\uparrow\) |
| BC2GM | 85.10 | 96.34 | 96.02 | **96.97*** | 96.33 | 83.35 | 83.38 | 83.44 | 84.72 | 11.87 \(\uparrow\) |
| BLURB | 84.61 | 95.41 | 95.41 | **95.70*** | 95.48 | 89.98 | 90.34 | 90.30 | 90.71 | 11.09 \(\uparrow\) |
| **Relation extraction task** | | | | | | | | | | |
| DDI | 82.36 | 82.32 | 79.98 | 83.76 | **84.05*** | 76.22 | 75.57 | 76.28 | 76.46 | 1.69 \(\uparrow\) |
| ChemProt | 77.50 | **78.32*** | 76.42 | 77.77 | 77.97 | 62.85 | 62.34 | 61.69 | 57.46 | 0.82 \(\uparrow\) |
| i2b2 | 76.40 | 76.49 | 76.54 | **76.86*** | 76.81 | 73.83 | 73.08 | 72.19 | 75.09 | 0.46 \(\uparrow\) |
| Euadr | **86.51** | 82.32 | 74.07 | 84.56 | 81.32 | 62.52 | 76.93 | 70.41 | 70.48 | −1.95 \(\downarrow\) |
| GAD | **84.30** | 73.82 | 66.32 | 76.74 | 69.65 | 72.68 | 69.14 | 71.81 | 68.17 | −7.56 \(\downarrow\) |
| BLURB | 79.14 | 78.66 | 74.67 | **79.94*** | 77.96 | 69.62 | 71.41 | 70.50 | 69.53 | 0.80 \(\uparrow\) |
| **Sentence similarity task** | | | | | | | | | | |
| BIOSSES | 92.30 | 82.27 | 73.14 | **92.80*** | 81.90 | 24.94 | 55.80 | 47.86 | 30.48 | 0.50 \(\uparrow\) |
| MedSTS | 84.80 | 85.70 | 85.00 | **85.70*** | 85.40 | 51.80 | 56.70 | 45.80 | 42.00 | 0.90 \(\uparrow\) |
| BLURB | 88.20 | 83.99 | 79.07 | **89.25*** | 83.65 | 38.37 | 56.25 | 46.83 | 36.24 | 1.05 \(\uparrow\) |
| **Inference task** | | | | | | | | | | |
| MedNLI | **84.00** | 77.69 | 76.35 | 79.38 | 79.52 | 78.25 | 77.20 | 76.34 | 75.51 | −4.48 \(\downarrow\) |
| **Document classification task** | | | | | | | | | | |
| HoC | 87.30 | 83.21 | 84.52 | **87.92*** | 84.32 | 64.20 | 75.20 | 61.00 | 81.70 | 0.62 \(\uparrow\) |
| **Question answering task** | | | | | | | | | | |
| BioASQ 4b | 47.82 | 47.90 | 48.34 | **48.90*** | 48.25 | 47.10 | 47.35 | 45.90 | 46.10 | 1.08 \(\uparrow\) |
| BioASQ 5b | 60.00 | 61.10 | 61.90 | **62.31*** | 61.57 | 58.54 | 59.21 | 58.98 | 58.50 | 2.31 \(\uparrow\) |
| BioASQ 6b | 57.77 | 59.80 | 62.00 | **62.88*** | 61.54 | 56.10 | 56.22 | 56.60 | 56.85 | 5.11 \(\uparrow\) |
| BLURB | 55.20 | 56.27 | 57.41 | **58.03*** | 57.12 | 53.91 | 54.26 | 53.83 | 53.82 | 2.83 \(\uparrow\) |

  1. The ‘difference over SOTA’ column indicates the absolute change (\(\uparrow\) for an increase, \(\downarrow\) for a decrease) in metric performance relative to SOTA; a worked example of this arithmetic follows these notes. Bold marks the best result. The SOTA performances are taken from the following studies: (1) JNLPBA, BC2GM, ChemProt, and GAD from Yuan et al. [15] (KeBioLM); (2) DDI and BIOSSES from Gu et al. [14] (PubMedBERT); (3) Share/Clefe, i2b2, MedSTS, MedNLI, and HoC from Peng et al. [12] (BLUE); (4) BC5CDR (disease), BC5CDR (chemical), NCBI (disease), S800, Euadr, BioASQ 4b, BioASQ 5b, and BioASQ 6b from Lee et al. [13] (BioBERT); and (5) LINNAEUS from Giorgi and Bader [33]. The biomedical language understanding and reasoning benchmark (BLURB) score is the average over all tasks used in previous studies [14, 15]
  2. *Indicates that BioALBERT (bold) achieved a statistically significant (\(p < 0.05\)) performance improvement over the SOTA model under a one-sample t-test (a short sketch of this test follows below)
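As a worked instance of the difference arithmetic described in note 1, the reported value is the gap between the best-scoring model and the SOTA baseline; on the Share/Clefe row, for example:

\[
\Delta = \max_i \mathrm{BioALBERT}_i - \mathrm{SOTA} = 94.84 - 75.40 = 19.44 \;(\uparrow)
\]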
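To make the significance test in note 2 concrete, the sketch below runs a one-sample t-test with SciPy, treating the published SOTA score as the reference mean. The per-run scores are hypothetical placeholders (the paper does not list individual run scores); only the SOTA value and the \(p < 0.05\) threshold come from the table.

```python
# Minimal sketch of the one-sample t-test described in note 2.
# The per-run F1 scores below are hypothetical; only the SOTA score
# (75.40, Share/Clefe row) and the p < 0.05 threshold come from the table.
from scipy import stats

bioalbert_runs = [94.84, 94.71, 94.90, 94.78, 94.88]  # hypothetical repeated runs
sota_score = 75.40  # published SOTA score used as the reference mean

# One-sided test: is the mean BioALBERT score greater than the SOTA reference?
t_stat, p_value = stats.ttest_1samp(
    bioalbert_runs, popmean=sota_score, alternative="greater"
)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
if p_value < 0.05:
    print("Improvement over SOTA is significant at p < 0.05 (marked * in the table).")
```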