Table 1 The statistics of the six benchmark datasets

From: Improving biomedical named entity recognition with syntactic information

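| Dataset | Entity type | Split | Token # | Sent. # | Entity # |
|---|---|---|---|---|---|
| BC2GM | Gene/protein | Train | 355.4k | 12.5k | 15.1k |
| | | Dev | 71.0k | 2.5k | 3.0k |
| | | Test | 143.4k | 5.0k | 6.3k |
| JNLPBA | | Train | 443.6k | 14.6k | 32.1k |
| | | Dev | 117.2k | 3.8k | 8.5k |
| | | Test | 114.7k | 3.8k | 6.2k |
| BC5CDR-chemical | Chemical | Train | 118.1k | 4.5k | 5.2k |
| | | Dev | 117.4k | 4.5k | 5.3k |
| | | Test | 124.7k | 4.7k | 5.3k |
| NCBI-disease | Disease | Train | 135.7k | 5.4k | 5.1k |
| | | Dev | 23.9k | 923 | 787 |
| | | Test | 24.4k | 940 | 960 |
| LINNAEUS | Species | Train | 281.2k | 11.9k | 2.1k |
| | | Dev | 93.8k | 4.0k | 711 |
| | | Test | 165k | 7.1k | 1.4k |
| Species-800 | | Train | 147.2k | 5.7k | 2.5k |
| | | Dev | 22.2k | 830 | 384 |
| | | Test | 42.2k | 1.6k | 767 |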

“Token #”, “Sent. #”, and “Entity #” denote the numbers of tokens, sentences, and entities, respectively.
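
As an illustration of how counts like these could be obtained, the sketch below tallies tokens, sentences, and entities from CoNLL-style data. It is not the authors' script. It assumes one token per line with the tag in the last whitespace-separated column, blank lines between sentences, and strict BIO tagging in which every entity begins with a `B-` tag. The function name `corpus_stats` and the sample tags are hypothetical.

```python
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (tokens, sentences, entities) for CoNLL-style BIO data.

    Assumptions (not taken from the source): one token per line, the tag
    is the last whitespace-separated field, sentences are separated by
    blank lines, and every entity starts with a "B-" tag.
    """
    tokens = sentences = entities = 0
    in_sentence = False
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the sentence currently being read, if any.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        tokens += 1
        in_sentence = True
        tag = line.split()[-1]
        # Under strict BIO, each "B-" tag marks the start of one entity.
        if tag.startswith("B-"):
            entities += 1
    # Count a final sentence that is not followed by a blank line.
    if in_sentence:
        sentences += 1
    return tokens, sentences, entities


# Hypothetical two-sentence sample with made-up tags.
sample = """IL-2 B-Gene
gene I-Gene
expression O

Aspirin B-Chemical
helps O
""".splitlines()

print(corpus_stats(sample))  # prints (5, 2, 2)
```

Counting only `B-` tags keeps the sketch short. Data that uses a different scheme (for example IOB1 or BIOES) would need a different entity rule.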