MLM-based typographical error correction of unstructured medical texts for named entity recognition

BMC Bioinformatics

Table 3 Data specification

Dataset	Split	Sentences	Tokens
NCBI-disease	Train	6347	159,670
NCBI-disease	Test	940	24,497
SPRs	Train	39,443	2,050,125
SPRs	Test	1000	49,668

Both datasets were divided into training and test data to evaluate the performance of typo correction and NER. The NCBI-disease dataset contained 7287 sentences and 184,167 tokens in total, and the SPRs dataset contained 40,443 sentences and 2,099,793 tokens in total. Training data accounted for approximately 87% of the NCBI-disease dataset and 98% of the SPRs dataset.
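The totals and training proportions above follow directly from the counts in Table 3. The short Python sketch below reproduces that arithmetic. The names `TABLE_3`, `summarize`, and `SUMMARY` are illustrative and not from the paper, and the training share is computed over sentences as an assumption, since the source does not state the unit (the token-based shares round to the same percentages).

```python
# Counts copied from Table 3: (sentences, tokens) for each split.
TABLE_3 = {
    "NCBI-disease": {"Train": (6347, 159_670), "Test": (940, 24_497)},
    "SPRs": {"Train": (39_443, 2_050_125), "Test": (1000, 49_668)},
}


def summarize(splits):
    """Return total sentences, total tokens, and the training share of sentences."""
    total_sentences = sum(sentences for sentences, _ in splits.values())
    total_tokens = sum(tokens for _, tokens in splits.values())
    # Assumption: the share is measured in sentences, not tokens.
    train_share = splits["Train"][0] / total_sentences
    return total_sentences, total_tokens, train_share


SUMMARY = {name: summarize(splits) for name, splits in TABLE_3.items()}

for name, (sentences, tokens, share) in SUMMARY.items():
    print(f"{name}: {sentences} sentences, {tokens} tokens, {share:.0%} training")
```

Running it should print totals of 7287 sentences and 184167 tokens for NCBI-disease and 40443 sentences and 2099793 tokens for SPRs, with training shares of 87% and 98%.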

ISSN: 1471-2105