BMC Bioinformatics

Table 2 Rules for Tokenization before Lucene Indexing

From: Soft tagging of overlapping high confidence gene mention variants for cross-species full-text gene normalization

Rule	Regular Expression	Replacement
1	([A-Z]{2,})([a-z]{2,})	$1_⌴$2
2	([a-z]{2,})([A-Z]{2,})	$1_⌴$2
3	[\w\_&&[^\.]]	_⌴
4	([\d\.]+)	_⌴$1_⌴
5	\s+	_⌴

_⌴ represents a single space character.
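
As an illustration only, the five rules can be sketched as a chain of regular-expression substitutions. The patterns in the table use Java-style syntax (`$1` group references and `&&` character-class intersection), so the sketch below translates them for Python's `re` module. Several points are assumptions rather than statements from the table: the rules are applied once each, in the order listed; the space placeholder is written as a literal space; and rule 3 is read as "any non-word character or underscore, except a period", since the class as printed would match ordinary letters and digits. The name `apply_rules` is hypothetical.

```python
import re

# Rules 1-5 from the table, translated from Java-style regex syntax.
# ASSUMPTION: rule 3 is interpreted as "non-word character or underscore,
# excluding the period"; the printed class [\w\_&&[^\.]] is not used verbatim.
RULES = [
    (re.compile(r"([A-Z]{2,})([a-z]{2,})"), r"\1 \2"),  # rule 1: UPPER|lower boundary
    (re.compile(r"([a-z]{2,})([A-Z]{2,})"), r"\1 \2"),  # rule 2: lower|UPPER boundary
    (re.compile(r"[^\w.]|_", re.ASCII), " "),           # rule 3: punctuation to space
    (re.compile(r"([\d.]+)", re.ASCII), r" \1 "),       # rule 4: isolate numbers
    (re.compile(r"\s+"), " "),                          # rule 5: collapse whitespace
]


def apply_rules(text: str) -> str:
    """Apply the substitutions in table order (ASSUMED order: 1 through 5)."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    for mention in ["NFkappaB-p65", "ERKalpha/betaGAL", "hTERT_v2.1"]:
        # Splitting on whitespace shows the tokens an indexer would see.
        print(mention, "->", apply_rules(mention).split())
```

Under these assumptions, `NFkappaB-p65` is split into `NF`, `kappaB`, `p`, and `65`: rule 1 separates the upper-case run from the lower-case run, rule 3 turns the hyphen into a space, and rule 4 isolates the number. A single capital such as the `B` in `kappaB` is not split off, because rules 1 and 2 require runs of at least two letters on each side.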


ISSN: 1471-2105
