From: Systematic feature evaluation for gene name recognition
Feature | Example | Short name | Impact | Notes (P = precision, R = recall) |
---|---|---|---|---|
Token* | Sro7 | Token | = 54% | - baseline - |
Unseen token* | | UToken | | |
n-grams of token* | | 1G, 2G, .. | +15%; +14% | 1..4-grams (P+, R++); 1..3-grams |
Previous & next tokens | | P/NToken | -5%; -6% | [1,1]-window (P+, R-); [2,2]-window |
n-grams of tokens in window | | 2PG/2NG/.. | | |
Prefixes, suffixes | | 1P, 2P, 3P, 1S.. | ±0 | |
Stop word | the, or | Stop | -5%; -1%; -.5% | 10,000 words (P+, R-); 1000 words (P+, R-); 100 words (P+, R-) |
POS tag | NN, DT | POS | -50% | P-, R- |
Initial upper case* | Msp | initCap | +.5% | P=, R+ |
All chars are upper case* | MMTV | allCaps | +.5% | P-, R+ |
Upper case letters* | InlC, GUS | Upper | ||
Upper case (skip first)* | MsPRP2 | Upper2 | ||
Single capital | A | singleCap | +.5% | P+, R+ |
Two capitals | RalGDS | twoCaps | +.5% | P+, R+ |
Capital, then mixed letters ◦ | IgM | capMix | ||
Lower case, then mixed ◦ | kDa | lowMix | +1% | P-, R+ |
Special symbols* | ICAM-1 | special | ±0 | P-, R+ |
Characters and numbers* | p50 | CharNum | ||
Numbers* | p50, HSF1 | Number | ||
Letters, digits, letters ◦ | H2kd | Idl | ±0 | |
Digit, dot, digit ◦ | 5.78 | ddd | -.1% | P-, R- |
Greek letter ◦ | alpha | greek | +.5% | P+, R- |
Roman numeral ◦ | II, xii | roman | ±0 | R+, R- |
Number followed by '%' ◦ | 75.0% | percentage | -.1% | P-, R- |
DNA, RNA sequences ◦ | ACCGT | DNA, RNA | -.1% | P-, R- |
Longest consonant chain * | Sro7 → 2 | LCC | -2% | P-, R- |
Keyword distance* | | keyDist | -20% | P+, R- |
Gazetteer* | | Gaz | -3% | P-, R- |
Prev./next token is NEWGENE | | PTG, NTG | -18% | prev. only, P+, R- |
Tokens + letter surface clues | | | +2% | P+, R- |
Tokens + 1,2,3-grams + greek + roman + letter surface clues | | | +14% | P+, R++ |
Tokens + 1,2,3-grams + keyDist + Gaz + LCC + special + combi + allCaps + initCap* | | | +16% | P+, R++ |
Tokens + 1,2,3,4-grams + keyDist + Gaz + LCC + special + combi + allCaps + initCap* + lowMix ◦ | | | +18% | P+, R++ |
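
To make some of the surface features in the table concrete, the following is a minimal Python sketch of how a few of them might be computed for a single token. It is not the authors' implementation: the function names, the partial Greek-letter list, the regular expressions, and the reading of "n-grams of token" as character n-grams are all assumptions inferred from the example column. Only the dictionary keys reuse the short names from the table.

```python
import re

# Assumption: a small, incomplete list of spelled-out Greek letters.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa", "sigma"}
VOWELS = set("aeiou")


def longest_consonant_chain(token):
    """LCC: length of the longest run of consecutive consonant letters."""
    best = run = 0
    for ch in token.lower():
        if ch.isalpha() and ch not in VOWELS:
            run += 1
            best = max(best, run)
        else:
            run = 0  # vowels, digits and symbols break the chain
    return best


def char_ngrams(token, n):
    """Character n-grams of a token (assumed meaning of 1G, 2G, ..)."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def surface_features(token):
    """Return a dict of surface clues keyed by the table's short names."""
    lower = token.lower()
    return {
        "initCap": token[:1].isupper(),
        "allCaps": token.isalpha() and token.isupper(),
        "singleCap": len(token) == 1 and token.isupper(),
        "special": any(not ch.isalnum() for ch in token),
        "Number": any(ch.isdigit() for ch in token),
        "ddd": re.fullmatch(r"\d+\.\d+", token) is not None,
        "percentage": re.fullmatch(r"\d+(\.\d+)?%", token) is not None,
        "greek": lower in GREEK,
        # Deliberately loose: any string of Roman-numeral letters matches,
        # so ordinary words such as "mix" are accepted as well.
        "roman": re.fullmatch(r"[ivxlcdm]+", lower) is not None,
        "LCC": longest_consonant_chain(token),
    }


print(longest_consonant_chain("Sro7"))  # 2, as in the table's LCC example
print(char_ngrams("Sro7", 2))           # ['Sr', 'ro', 'o7']
print(surface_features("ICAM-1")["special"])  # True
```

In a setup like the one the table evaluates, each feature would be switched on or off individually and the classifier retrained, so that its effect on precision and recall can be read off against the token-only baseline.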