Table 1 Various statistics of the datasets

From: Improving the recall of biomedical named entity recognition with label re-correction and knowledge distillation

| Dataset | #Abstract | #Chemical | #Disease | $Chemical | $Disease |
|---|---|---|---|---|---|
| Weakly labeled | | | | | |
| CDWC | 70,026 | 706,593 | 514,964 | 34,696 | 58,985 |
| CDWA | 70,026 | 503,700 | 283,293 | 17,939 | 24,600 |
| CDRC (BiLSTM-CRF) | 70,026 | 770,159 | 541,235 | 40,135 | 38,715 |
| CDRA (BiLSTM-CRF) | 70,026 | 781,039 | 532,198 | 38,858 | 42,420 |
| CDRC (BioBERT-CRF) | 70,026 | 795,096 | 557,434 | 50,018 | 52,447 |
| CDRA (BioBERT-CRF) | 70,026 | 812,516 | 542,353 | 51,458 | 47,687 |
| DRC (BiLSTM-CRF) | 70,026 | – | 469,849 | – | 69,567 |
| DRA (BiLSTM-CRF) | 70,026 | – | 473,728 | – | 69,342 |
| DRC (BioBERT-CRF) | 70,026 | – | 546,515 | – | 83,436 |
| DRA (BioBERT-CRF) | 70,026 | – | 487,636 | – | 66,582 |
| Human annotated | | | | | |
| CDR training data | 500 | 5203 | 4182 | 991 | 1384 |
| CDR development data | 500 | 5347 | 4244 | 976 | 1254 |
| CDR test data | 500 | 5385 | 4424 | 1239 | 1474 |
| NCBI disease training data | 593 | – | 5145 | – | 1495 |
| NCBI disease development data | 100 | – | 787 | – | 334 |
| NCBI disease test data | 100 | – | 960 | – | 382 |

  1. #Abstract: the number of abstracts
  2. #Chemical: the number of chemical mentions
  3. #Disease: the number of disease mentions
  4. $Chemical: the number of unique chemical mentions
  5. $Disease: the number of unique disease mentions
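
The difference between the "#" columns (all mentions) and the "$" columns (unique mentions) can be illustrated with a minimal Python sketch. The tuple layout `(abstract_id, entity_type, mention_text)` and the function name `mention_statistics` are hypothetical, and the sketch assumes two mentions count as the same when their surface strings match exactly; the source does not say whether uniqueness was decided by string, case-folded string, or concept identifier.

```python
from collections import defaultdict


def mention_statistics(annotations):
    """Return {entity_type: (total_mentions, unique_mentions)}.

    `annotations` is an iterable of (abstract_id, entity_type, mention_text)
    tuples. This layout is an illustrative assumption, not the paper's format.
    """
    totals = defaultdict(int)   # "#" columns: every mention counts
    uniques = defaultdict(set)  # "$" columns: distinct mention strings
    for _abstract_id, entity_type, mention_text in annotations:
        totals[entity_type] += 1
        # Assumption: uniqueness is judged on the exact surface string.
        uniques[entity_type].add(mention_text)
    return {t: (totals[t], len(uniques[t])) for t in totals}


# Toy data, invented for illustration only.
toy = [
    ("a1", "Chemical", "aspirin"),
    ("a1", "Chemical", "aspirin"),
    ("a1", "Disease", "headache"),
    ("a2", "Chemical", "ibuprofen"),
    ("a2", "Disease", "headache"),
    ("a2", "Disease", "nausea"),
]

stats = mention_statistics(toy)
num_abstracts = len({abstract_id for abstract_id, _, _ in toy})  # "#Abstract"
```

On the toy data, each entity type has three mentions in total but only two unique ones, which mirrors how the "$" columns in the table are much smaller than the "#" columns.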