Datasets | Train/Test data | Sentences | Tokens |
---|---|---|---|
NCBI-disease | Train | 6347 | 159,670 |
NCBI-disease | Test | 940 | 24,497 |
SPRs | Train | 39,443 | 2,050,125 |
SPRs | Test | 1000 | 49,668 |
- The two datasets were divided into training and test data to evaluate the performance of typo correction and NER. The NCBI-disease dataset contained 7287 sentences and 184,167 tokens in total, and the SPR dataset contained 40,443 sentences and 2,099,793 tokens in total. Training data accounted for 87% of the NCBI-disease dataset and 98% of the SPR dataset.
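The totals and training proportions quoted here can be rechecked from the per-split counts in the table. The following is a minimal sketch, not part of the original study: the names `counts`, `summarize`, and `summary` are hypothetical, and it assumes the proportions are computed over sentence counts (computing them over tokens gives the same rounded percentages).

```python
# Sentence and token counts per split, copied from the table.
# Each value is a (sentences, tokens) pair.
counts = {
    "NCBI-disease": {"train": (6347, 159_670), "test": (940, 24_497)},
    "SPRs": {"train": (39_443, 2_050_125), "test": (1000, 49_668)},
}


def summarize(split_counts):
    """Return (total sentences, total tokens, training share of sentences)."""
    train_sents, train_toks = split_counts["train"]
    test_sents, test_toks = split_counts["test"]
    total_sents = train_sents + test_sents
    total_toks = train_toks + test_toks
    # Assumption: the training share is measured in sentences, not tokens.
    return total_sents, total_toks, train_sents / total_sents


summary = {name: summarize(splits) for name, splits in counts.items()}

for name, (sents, toks, share) in summary.items():
    print(f"{name}: {sents} sentences, {toks} tokens, {share:.0%} training")
```

Running the sketch reproduces the reported totals (7287 and 184,167 for NCBI-disease; 40,443 and 2,099,793 for SPRs) and the rounded training shares of 87% and 98%.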