
Table 1 Orthographic features.

From: Identifying gene and protein mentions in text using conditional random fields

Orthographic Feature    Regular Expression
Init Caps               [A-Z].*
Init Caps Alpha         [A-Z][a-z]*
All Caps                [A-Z]+
Caps Mix                [A-Za-z]+
Has Digit               .*[0-9].*
Single Digit            [0-9]
Double Digit            [0-9][0-9]
Natural Number          [0-9]+
Real Number             [-0-9]+[.,]+[0-9.,]+
Alpha-Num               [A-Za-z0-9]+
Roman                   [ivxdlcm]+ or [IVXDLCM]+
Has Dash                .*-.*
Init Dash               -.*
End Dash                .*-
Punctuation             [,.;:?!-+'"']

  1. This defines the complete set of orthographic predicates used by the system. The observation list for each token includes one predicate for every regular expression that the token matches.
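
As a rough illustration of how a token's observation list could be built from this table, the sketch below checks a token against each pattern and collects the names of those that match. It is not the authors' implementation. The predicate names, the function name orthographic_predicates, and the decision to require each expression to match the whole token are assumptions. The Real Number pattern is read as [-0-9]+[.,]+[0-9.,]+, and the dash in the Punctuation class is escaped so that it is treated as a literal character rather than a range; both readings are also assumptions.

```python
import re

# Patterns transcribed from Table 1. The dictionary keys are hypothetical
# predicate names chosen for this sketch, not names from the paper.
ORTHOGRAPHIC_PATTERNS = {
    "InitCaps": r"[A-Z].*",
    "InitCapsAlpha": r"[A-Z][a-z]*",
    "AllCaps": r"[A-Z]+",
    "CapsMix": r"[A-Za-z]+",
    "HasDigit": r".*[0-9].*",
    "SingleDigit": r"[0-9]",
    "DoubleDigit": r"[0-9][0-9]",
    "NaturalNumber": r"[0-9]+",
    # Assumed reading of the Real Number entry.
    "RealNumber": r"[-0-9]+[.,]+[0-9.,]+",
    "AlphaNum": r"[A-Za-z0-9]+",
    "Roman": r"[ivxdlcm]+|[IVXDLCM]+",
    "HasDash": r".*-.*",
    "InitDash": r"-.*",
    "EndDash": r".*-",
    # Assumption: the dash is meant literally, so it is escaped here.
    "Punctuation": r"[,.;:?!\-+'\"]",
}

# Compile once so repeated lookups over a corpus stay cheap.
COMPILED = {name: re.compile(p) for name, p in ORTHOGRAPHIC_PATTERNS.items()}


def orthographic_predicates(token):
    """Return the names of all patterns that match the entire token.

    Whole-token matching (re.fullmatch) is an assumption of this sketch.
    """
    return [name for name, rx in COMPILED.items() if rx.fullmatch(token)]


if __name__ == "__main__":
    for tok in ["IL-2", "42", "IV", "-"]:
        print(tok, orthographic_predicates(tok))
```

Under these assumptions, a token such as IL-2 receives the predicates InitCaps, HasDigit, and HasDash, so a single token can contribute several overlapping orthographic observations.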