
Table 2 Rules for Tokenization before Lucene Indexing

From: Soft tagging of overlapping high confidence gene mention variants for cross-species full-text gene normalization

Rule  Regular Expression      Replacement
1     ([A-Z]{2,})([a-z]{2,})  $1␣$2
2     ([a-z]{2,})([A-Z]{2,})  $1␣$2
3     [\w\_&&[^\.]]           ␣
4     ([\d\.]+)               $1
5     \s+                     ␣

␣ represents a single space character.
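
The rules read as an ordered chain of search-and-replace passes, which can be sketched in a few lines. The sketch below is an illustration, not the authors' implementation. It translates the Java-style patterns to Python's `re` module and rests on three assumptions the table as printed does not confirm: every replacement inserts a space where the table's space marker stood; rule 3 is meant to match non-word characters and underscores other than periods (the class as printed would match word characters instead); and rule 4 pads each number with a space on both sides. The names `RULES` and `normalize` are hypothetical.

```python
import re

# Table 2 rules translated from Java regex syntax to Python's `re` module.
# ASSUMPTIONS (not confirmed by the table as printed):
#   - each replacement inserts a space where the table's space marker stood;
#   - rule 3 targets non-word characters and underscores, but not periods;
#   - rule 4 pads each number with a space on both sides.
RULES = [
    (r"([A-Z]{2,})([a-z]{2,})", r"\1 \2"),  # rule 1: split UPPER|lower runs
    (r"([a-z]{2,})([A-Z]{2,})", r"\1 \2"),  # rule 2: split lower|UPPER runs
    (r"[^\w.]|_", " "),                     # rule 3: punctuation -> space (assumed reading)
    (r"([\d.]+)", r" \1 "),                 # rule 4: isolate numbers (assumed padding)
    (r"\s+", " "),                          # rule 5: collapse whitespace
]

# re.ASCII keeps \w, \d and \s ASCII-only, matching Java's default behaviour.
_COMPILED = [(re.compile(pattern, re.ASCII), repl) for pattern, repl in RULES]


def normalize(text: str) -> str:
    """Apply the five rules in table order and trim the result."""
    for pattern, repl in _COMPILED:
        text = pattern.sub(repl, text)
    return text.strip()


if __name__ == "__main__":
    for mention in ["HIVgag", "hsaMIR", "IL2_receptor", "BRCA1-associated"]:
        print(mention, "->", normalize(mention).split(" "))
```

Under these assumptions, `normalize("BRCA1-associated")` yields `BRCA 1 associated`: rule 3 turns the hyphen into a space, rule 4 separates the digit, and rule 5 collapses the doubled space. Splitting the result on spaces gives the tokens that would be passed on for indexing.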