Table 2 Feature function generators description.

From: Sieve-based relation extraction of gene regulatory networks from biological literature

| Name | Description | Options | Observable data |
| --- | --- | --- | --- |
| Prefix value | Value of the prefix of the mention at the given offset distance from the current mention. | string length: {2, 3}; offset: [−5, 5] | text |
| Suffix value | Value of the suffix of the mention at the given offset distance from the current mention. | string length: {2, 3}; offset: [−5, 5] | text |
| Consequent value | A combination of the values of two consecutive mentions at the given offset distance from the current mention, e.g., PDT/NNS. | offset: [−4, 4] | text, part-of-speech, lemma, entity type, coreference |
| Current value | The value of the mention at the given offset distance from the current mention, e.g., NNS. | offset: [−4, 4] | text, part-of-speech, lemma, entity type, coreference |
| Context value | Matches character-based n-grams of the specified length, taken from the selected range of words around the current and previous mentions, using the Jaccard coefficient. Depending on the match result, feature function values are discretized into eight levels. Different feature functions are generated for the context to the left/right of both mentions, between the two, outside the two, and for the union of all. | range: 5, ngram: 3 | text |
| Previous / next value combination | A combination of token values at the selected distance from the current and the previous mentions. | distance: {−2, 2} | text, part-of-speech, lemma |
| Left / right / between value | Token values to the left/right of, or in between, the two mentions at the selected distance. | distance: [1, 5] | text, part-of-speech, lemma |
| Split to values | Splits the current mention into tokens by the selected delimiter and outputs the first N tokens. | N: 2, delimiter: ' | text, lemma |
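
The "Context value" generator is the most involved entry in the table, so a minimal sketch may help. It compares the character trigrams of two word windows with the Jaccard coefficient and maps the score to one of eight levels. The function names, the per-word n-gram extraction, and the equal-width binning are assumptions made for illustration; the table does not specify how the eight levels are chosen.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a single word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def window_ngrams(words, n=3):
    """Union of the character n-grams of all words in a context window."""
    grams = set()
    for word in words:
        grams |= char_ngrams(word, n)
    return grams


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def discretize(score, levels=8):
    """Map a score in [0, 1] to an integer level in 0..levels-1.

    Equal-width bins are an assumption; a score of 1.0 falls into the top bin.
    """
    return min(int(score * levels), levels - 1)


def context_level(words_a, words_b, n=3, levels=8):
    """Discretized n-gram overlap between two context windows."""
    score = jaccard(window_ngrams(words_a, n), window_ngrams(words_b, n))
    return discretize(score, levels)


# Hypothetical windows of words around two mentions.
level = context_level(["sigA", "gene"], ["sigB", "genes"])
print(f"context_left={level}")  # prints "context_left=4"
```

In this sketch the two windows share three of six distinct trigrams, giving a Jaccard score of 0.5 and level 4. Separate features for the left, right, between, outside, and union contexts would call `context_level` once per window pair.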

  1. Depending on the implementation, the options, and the observable values, the generators produce specific feature functions in a single scan over the training data. The feature functions are used by all CRF-based sieves for all selected skip-mention CRF models. All extracted features are modeled as both unigram and bigram features, except prefix and suffix, which are unigram only. Unigram features are used for the current-label factor, and bigram features are used for the transition factor between two labels.
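
To make the single-scan generation and the unigram/bigram distinction concrete, here is a minimal sketch. It assumes that a feature function can be identified by a tuple of feature name, observed value, and label(s), and that each training sequence is a list of `(mention_text, label)` pairs. The tuple layout, the feature-name format, and the labels `O` and `REL` are hypothetical and not taken from the paper's implementation.

```python
def generate_feature_functions(sequences,
                               lengths=(2, 3),
                               affix_offsets=range(-5, 6),
                               value_offsets=range(-4, 5)):
    """Collect unigram and bigram feature-function keys in one pass.

    Unigram keys pair an observation with the current label; bigram keys
    pair it with the (previous label, current label) transition.
    """
    unigram, bigram = set(), set()
    for seq in sequences:                      # single scan over training data
        for i, (_, label) in enumerate(seq):
            prev_label = seq[i - 1][1] if i > 0 else None

            # Prefix/suffix generators: unigram features only.
            for off in affix_offsets:
                j = i + off
                if 0 <= j < len(seq):
                    text = seq[j][0]
                    for n in lengths:
                        unigram.add((f"prefix{n}[{off}]", text[:n], label))
                        unigram.add((f"suffix{n}[{off}]", text[-n:], label))

            # Current-value generator on text: unigram and bigram features.
            for off in value_offsets:
                j = i + off
                if 0 <= j < len(seq):
                    text = seq[j][0]
                    unigram.add((f"text[{off}]", text, label))
                    if prev_label is not None:
                        bigram.add((f"text[{off}]", text, prev_label, label))
    return unigram, bigram


# Hypothetical two-mention training sequence.
train = [[("sigA", "O"), ("spoIIG", "REL")]]
unigram, bigram = generate_feature_functions(train)
print(len(unigram), len(bigram))
```

Collecting keys into sets means that each distinct feature function is created once, no matter how often its pattern recurs in the training data.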