Table 1 Feature classes and their impact on prediction quality. Table of all feature classes. *: classes used in the BioCreAtIvE submission; ◦: classes implemented afterwards, partly adopted from other participants of the contest. The fourth column gives the impact of each single feature class compared to the baseline (only tokens). These figures include post-processing. The fifth column shows how precision and recall are affected. Letter surface clues (last rows) refer to the following features: {special, allCaps, initCap, capMix, lowMix, Idl, ddd}.

From: Systematic feature evaluation for gene name recognition

| Feature | Example | Short name | Impact | Setting; effect on precision (P) and recall (R) |
| --- | --- | --- | --- | --- |
| Token* | Sro7 | Token | = 54% | - baseline - |
| Unseen token* | | UToken | | |
| n-grams of token* | | 1G, 2G, .. | +15%; +14% | 1..4-grams, P+, R++; 1..3-grams |
| Previous & next tokens | | P/NToken | -5%; -6% | [1,1]-window, P+, R-; [2,2]-window |
| n-grams of tokens in window | | 2PG/2NG/.. | | |
| Prefixes, suffixes | | 1P, 2P, 3P, 1S.. | ±0 | |
| Stop word | the, or | Stop | -5%; -1%; -.5% | 10,000 words, P+, R-; 1000 words, P+, R-; 100 words, P+, R- |
| POS tag | NN, DT | POS | -50% | P-, R- |
| Initial upper case* | Msp | initCap | +.5% | P=, R+ |
| All chars are upper case* | MMTV | allCaps | +.5% | P-, R+ |
| Upper case letters* | InlC, GUS | Upper | | |
| Upper case (skip first)* | MsPRP2 | Upper2 | | |
| Single capital | A | singleCap | +.5% | P+, R+ |
| Two capitals | RalGDS | twoCaps | +.5% | P+, R+ |
| Capital, then mixed letters ◦ | IgM | capMix | | |
| Lower case, then mixed ◦ | kDa | lowMix | +1% | P-, R+ |
| Special symbols* | ICAM-1 | special | ±0 | P-, R+ |
| Characters and numbers* | p50 | CharNum | | |
| Numbers* | p50, HSF1 | Number | | |
| Letters, digits, letters ◦ | H2kd | Idl | ±0 | |
| Digit, dot, digit ◦ | 5.78 | ddd | -.1% | P-, R- |
| Greek letter ◦ | alpha | greek | +.5% | P+, R- |
| Roman numeral ◦ | II, xii | roman | ±0 | R+, R- |
| Number followed by '%' ◦ | 75.0% | percentage | -.1% | P-, R- |
| DNA, RNA sequences ◦ | ACCGT | DNA, RNA | -.1% | P-, R- |
| Longest consonant chain* | Sro7 → 2 | LCC | -2% | P-, R- |
| Keyword distance* | | keyDist | -20% | P+, R- |
| Gazetteer* | | Gaz | -3% | P-, R- |
| Prev./next token is NEWGENE | | PTG, NTG | -18% | prev. only, P+, R- |
| Tokens + letter surface clues | | | +2% | P+, R- |
| Tokens + 1,2,3-grams + greek + roman + letter surface clues | | | +14% | P+, R++ |
| Tokens + 1,2,3-grams + keyDist + Gaz + LCC + special + combi + allCaps + initCap * | | | +16% | P+, R++ |
| Tokens + 1,2,3,4-grams + keyDist + Gaz + LCC + special + combi + allCaps + initCap* + lowMix ◦ | | | +18% | P+, R++ |
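
As an illustration of how per-token features of this kind might be computed, the following Python sketch derives the seven letter surface clues named in the caption, plus the longest consonant chain (LCC). The regular expressions are assumptions inferred from the table's example column, not the authors' definitions; the function names `surface_features` and `longest_consonant_chain` are hypothetical, and treating "y" as a consonant is likewise an assumption.

```python
import re

# Hypothetical patterns, guessed from the examples in the table
# (Msp, MMTV, IgM, kDa, ICAM-1, H2kd, 5.78). Not the authors' definitions.
SURFACE_PATTERNS = {
    "initCap": re.compile(r"[A-Z].*"),                      # starts with a capital
    "allCaps": re.compile(r"[A-Z]+"),                       # only capital letters
    "capMix": re.compile(r"[A-Z][a-z]+[A-Z][A-Za-z]*"),     # capital, lower case, then mixed
    "lowMix": re.compile(r"[a-z]+[A-Z][A-Za-z]*"),          # lower case, then mixed
    "special": re.compile(r".*[^A-Za-z0-9].*"),             # any non-alphanumeric symbol
    "Idl": re.compile(r"[A-Za-z]+[0-9]+[A-Za-z]+"),         # letters, digits, letters
    "ddd": re.compile(r"[0-9]+\.[0-9]+"),                   # digit, dot, digit
}

VOWELS = set("aeiou")  # assumption: "y" counts as a consonant


def surface_features(token: str) -> set[str]:
    """Return the names of all surface patterns that match the whole token."""
    return {name for name, pat in SURFACE_PATTERNS.items() if pat.fullmatch(token)}


def longest_consonant_chain(token: str) -> int:
    """Length of the longest run of consecutive consonant letters."""
    best = run = 0
    for ch in token.lower():
        if ch.isalpha() and ch not in VOWELS:
            run += 1
            best = max(best, run)
        else:
            run = 0  # vowels, digits and symbols break the chain
    return best


if __name__ == "__main__":
    for tok in ["Sro7", "MMTV", "IgM", "kDa", "ICAM-1", "H2kd", "5.78"]:
        print(tok, sorted(surface_features(tok)), longest_consonant_chain(tok))
```

With these assumed definitions a token can carry several clues at once (for example, "5.78" matches both `ddd` and `special`), and `longest_consonant_chain("Sro7")` returns 2, matching the table's example.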