Table 2 Rules for Tokenization before Lucene Indexing

From: Soft tagging of overlapping high confidence gene mention variants for cross-species full-text gene normalization

Rule  Regular Expression       Replacement
1     ([A-Z]{2,})([a-z]{2,})   $1⌴$2
2     ([a-z]{2,})([A-Z]{2,})   $1⌴$2
3     [\w\_&&[^\.]]            ⌴
4     ([\d\.]+)                ⌴$1⌴
5     \s+                      ⌴

  1. ⌴ represents a single space character.
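
The rules in the table can be read as a chain of search-and-replace passes run over the text before it is indexed. The sketch below is one possible reading, not the authors' implementation. It rests on three assumptions: the rules are applied in the order listed; the Java-style patterns can be translated to Python's re module ($1 becomes \1, and the && class intersection is rewritten by hand); and rule 3 is meant to match non-word characters and underscores other than the period. As printed, [\w\_&&[^\.]] would replace every word character with a space, so that reading is a guess at the intent. The names RULES and pretokenize are hypothetical.

```python
import re

# Rules from Table 2, translated from Java regex syntax to Python's `re`.
# ASSUMPTION: rule 3 is read as "non-word characters and underscores,
# except the period"; the table as printed uses \w, which would blank
# every word character.
RULES = [
    # 1: split an upper-case run followed by a lower-case run
    (re.compile(r"([A-Z]{2,})([a-z]{2,})"), r"\1 \2"),
    # 2: split a lower-case run followed by an upper-case run
    (re.compile(r"([a-z]{2,})([A-Z]{2,})"), r"\1 \2"),
    # 3: replace punctuation (except '.') and underscores with a space
    (re.compile(r"[^\w.]|_", re.ASCII), " "),
    # 4: isolate runs of digits and periods with surrounding spaces
    (re.compile(r"([\d.]+)", re.ASCII), r" \1 "),
    # 5: collapse any run of whitespace into a single space
    (re.compile(r"\s+", re.ASCII), " "),
]


def pretokenize(text: str) -> str:
    """Apply the five rules in table order (an assumption) and return the result."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    for mention in ["ABCtransporter", "IL-2", "cdc2_kinase"]:
        print(repr(pretokenize(mention)))
```

Rule 4 pads numbers on both sides, so a mention ending in a digit keeps one trailing space after rule 5. Splitting the result on whitespace gives the tokens without that padding.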