
Table 3 Data specification

From: MLM-based typographical error correction of unstructured medical texts for named entity recognition

| Datasets | Train/Test data | Sentences | Tokens |
| --- | --- | --- | --- |
| NCBI-disease | Train | 6347 | 159,670 |
| NCBI-disease | Test | 940 | 24,497 |
| SPRs | Train | 39,443 | 2,050,125 |
| SPRs | Test | 1000 | 49,668 |

  1. The two datasets were each divided into training and test data to evaluate the performance of typo correction and NER. The NCBI-disease dataset contained 7287 sentences and 184,167 tokens in total, and the SPRs dataset contained 40,443 sentences and 2,099,793 tokens in total. Training data made up approximately 87% of the NCBI-disease dataset and 98% of the SPRs dataset.
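
The totals and training proportions stated in the note can be rechecked from the table counts. The following is a minimal sketch for that arithmetic only; the dictionary layout and the names `TABLE_3` and `summarize` are illustrative assumptions and do not come from the paper. The share is computed here from sentence counts; computing it from token counts rounds to the same whole percentages.

```python
# Counts copied from Table 3 as (sentences, tokens) per split.
# The structure and names below are illustrative, not the authors' code.
TABLE_3 = {
    "NCBI-disease": {"train": (6347, 159_670), "test": (940, 24_497)},
    "SPRs": {"train": (39_443, 2_050_125), "test": (1000, 49_668)},
}


def summarize(splits):
    """Return (total sentences, total tokens, training share of sentences in percent)."""
    train_sents, train_toks = splits["train"]
    test_sents, test_toks = splits["test"]
    total_sents = train_sents + test_sents
    total_toks = train_toks + test_toks
    train_share = 100 * train_sents / total_sents
    return total_sents, total_toks, train_share


for name, splits in TABLE_3.items():
    sents, toks, share = summarize(splits)
    print(f"{name}: {sents} sentences, {toks} tokens, {share:.1f}% training")
```

Rounding the printed shares to whole percentages gives the 87% and 98% figures quoted in the note.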