Improving the recall of biomedical named entity recognition with label re-correction and knowledge distillation

BMC Bioinformatics

Table 1 Various statistics of the datasets

Annotation	Dataset	#Abstract	#Chemical	#Disease	$Chemical	$Disease
Weakly labeled	CDWC	70,026	706,593	514,964	34,696	58,985
	CDWA	70,026	503,700	283,293	17,939	24,600
	CDRC (BiLSTM-CRF)	70,026	770,159	541,235	40,135	38,715
	CDRA (BiLSTM-CRF)	70,026	781,039	532,198	38,858	42,420
	CDRC (BioBERT-CRF)	70,026	795,096	557,434	50,018	52,447
	CDRA (BioBERT-CRF)	70,026	812,516	542,353	51,458	47,687
	DRC (BiLSTM-CRF)	70,026	–	469,849	–	69,567
	DRA (BiLSTM-CRF)	70,026	–	473,728	–	69,342
	DRC (BioBERT-CRF)	70,026	–	546,515	–	83,436
	DRA (BioBERT-CRF)	70,026	–	487,636	–	66,582
Human annotated	CDR training data	500	5,203	4,182	991	1,384
	CDR development data	500	5,347	4,244	976	1,254
	CDR test data	500	5,385	4,424	1,239	1,474
	NCBI disease training data	593	–	5,145	–	1,495
	NCBI disease development data	100	–	787	–	334
	NCBI disease test data	100	–	960	–	382

ISSN: 1471-2105