Table 6 Error type and number per dataset

From: MLM-based typographical error correction of unstructured medical texts for named entity recognition

| Datasets | Typo: Replace | Typo: Delete | Typo: Transpose | Typo: Insert | No Typo | Total |
|---|---|---|---|---|---|---|
| NCBI-disease | 1014 | 1073 | 911 | 946 | 20,553 | 24,497 |
| SPRs | 965 | 948 | 877 | 827 | 46,051 | 49,668 |

  1. The NCBI-disease dataset contained 24,497 tokens in total. Tokens with one of the four error types numbered 3944, accounting for 16.10% of words, and tokens without error (No typo) numbered 20,553, accounting for 83.90% of words. The SPR dataset contained 49,668 tokens in total. Tokens with one of the four error types numbered 3617, accounting for 7.28% of words, and tokens without error numbered 46,051, accounting for 92.72% of words.
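
The totals and percentages in the note follow directly from the per-type counts in the table. As a minimal sketch for checking that arithmetic, the Python below recomputes them; the counts are copied from Table 6, while the dictionary layout and the `summarize` helper are illustrative assumptions and not part of the original study.

```python
# Counts copied from Table 6. The data layout and function name are
# hypothetical, chosen only for this illustration.
TABLE_6 = {
    "NCBI-disease": {"Replace": 1014, "Delete": 1073, "Transpose": 911,
                     "Insert": 946, "No typo": 20553},
    "SPRs": {"Replace": 965, "Delete": 948, "Transpose": 877,
             "Insert": 827, "No typo": 46051},
}


def summarize(counts):
    """Return typo-token count, total tokens, and percentage shares."""
    # Sum the four error types (everything except the "No typo" column).
    typo_tokens = sum(n for kind, n in counts.items() if kind != "No typo")
    total = typo_tokens + counts["No typo"]
    return {
        "typo_tokens": typo_tokens,
        "total": total,
        # Shares rounded to two decimals, as in the table note.
        "typo_pct": round(100 * typo_tokens / total, 2),
        "no_typo_pct": round(100 * counts["No typo"] / total, 2),
    }


if __name__ == "__main__":
    for name, counts in TABLE_6.items():
        s = summarize(counts)
        print(f"{name}: {s['typo_tokens']} typo tokens of {s['total']} "
              f"({s['typo_pct']:.2f}%), no typo {s['no_typo_pct']:.2f}%")
```

Because the two shares are computed from the same total, they sum to 100% up to rounding, which gives a quick consistency check on the reported figures.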