Table 1. GENETAG corpus statistics. The 20K sentences were split into four subsets called Train, Test, Round1 and Round2.

From: GENETAG: a tagged corpus for gene/protein named entity recognition

 

|  | Train | Test | Round1 | Round2 | Total |
| --- | ---: | ---: | ---: | ---: | ---: |
| Number of Sentences | 7,500 | 2,500 | 5,000 | 5,000 | 20,000 |
| Number of Words | 204,195 | 68,043 | 137,586 | 137,977 | 547,801 |
| Number of Tagged Genes = G | 8,935 | 2,987 | 5,949 | 6,125 | 23,996 |
| Total Number of Alternative Forms of Gene Names in G | 6,583 | 2,158 | 4,275 | 4,505 | 17,531 |
| Number of Gene Names in G with Alternative Forms = N | 4,675 | 1,522 | 3,057 | 3,186 | 12,440 |
| Average Number of Alternatives per Gene Name in N | 1.66 | 1.67 | 1.62 | 1.65 | 1.65 |