Table 2 Overview of features used by our system

From: A system for de-identifying medical message board text

Feature                              Example

MMB non-structure features
token                                Kathy
token lower-cased                    kathy
length                               5
case                                 isLower=True, isCapitalized=False, …
suffix/prefix                        suffix2=hy, prefix2=ka, suffix3=thy, …
distance from beginning/end          w/in1FromEdge=True, w/in2FromEdge=True, …
in word list                         isProperName=True, isCommon=False, isUsername=False, …
possibly in word list                editDist1ProperName=True, editDist2ProperName=True, …
(The features of the two previous and two following tokens are also included.)

MMB structure features
tf-idf over message boards           inTop10=False, inTop1%=False, …
tf-idf over user posts               inTop10=False, inTop1%=True, …
border of paragraph likelihood       inTop5=True, inTop10%=True, …
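
To make the non-structure rows of the table concrete, the following is a minimal Python sketch of how such per-token features might be computed. It is an illustration under stated assumptions, not the system's actual code: the function names (`edit_distance`, `token_features`, `features_with_context`), the `tokenLower` key, the `-1:`/`+1:` prefixes for neighbouring tokens, and the reading of "distance from beginning/end" as the distance in tokens from either end of the token sequence are all hypothetical. Only a proper-name list is modelled; the common-word and username lists and the structure features (tf-idf, paragraph borders) are left out.

```python
from typing import Dict, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def token_features(tokens: List[str], i: int,
                   proper_names: Set[str]) -> Dict[str, object]:
    """Features of tokens[i] alone (feature names follow the table's examples)."""
    tok = tokens[i]
    low = tok.lower()
    feats: Dict[str, object] = {
        "token": tok,
        "tokenLower": low,  # hypothetical key for "token lower-cased"
        "length": len(tok),
        "isLower": tok.islower(),
        "isCapitalized": tok[:1].isupper(),
    }
    for n in (2, 3):
        if len(tok) >= n:
            feats[f"prefix{n}"] = low[:n]
            feats[f"suffix{n}"] = low[-n:]
    # Assumption: "edge" means either end of the token sequence passed in.
    edge = min(i, len(tokens) - 1 - i)
    for k in (1, 2):
        feats[f"w/in{k}FromEdge"] = edge <= k
    # Assumption: proper_names holds lower-cased entries.
    feats["isProperName"] = low in proper_names
    for d in (1, 2):
        feats[f"editDist{d}ProperName"] = any(
            edit_distance(low, name) <= d for name in proper_names)
    return feats


def features_with_context(tokens: List[str], i: int,
                          proper_names: Set[str],
                          window: int = 2) -> Dict[str, object]:
    """Features of tokens[i] plus those of up to `window` neighbours per side."""
    feats = dict(token_features(tokens, i, proper_names))
    for off in range(-window, window + 1):
        j = i + off
        if off == 0 or not 0 <= j < len(tokens):
            continue
        for name, value in token_features(tokens, j, proper_names).items():
            feats[f"{off:+d}:{name}"] = value  # e.g. "-1:token", "+2:length"
    return feats
```

For example, `features_with_context(["Thanks", "Kathy", "for", "sharing"], 1, {"kathy", "john"})` returns a dictionary containing `suffix2` set to `"hy"` and `prefix2` set to `"ka"` for the token itself, together with prefixed copies of the same features for "Thanks", "for" and "sharing". A dictionary of this shape is a common input format for sequence labellers, though the table does not say which representation the system uses.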