Sieve-based relation extraction of gene regulatory networks from biological literature

BMC Bioinformatics

Table 1 Feature functions description.

Name	Description	Options
Target label distribution	Distribution of target labels.	--
Starts upper	Does a mention start with an upper-case letter?	current, previous mention
Starts upper twice	Do two consecutive mentions start with an upper-case letter?	current, previous mention
Hearst co-occurrence [58]	Does the text between the two mentions follow some predefined rules, e.g., m_i such as m_j?	--
Mention token distance	Distance between the two mentions in number of mentions.	--
Parse tree mention depth	Depth of the mention within the parse tree.	--
Parse tree parent value	Parse tree value of the mention's parent at length l.	l ∈ {1, 2, 3}
Parse tree path	Path values between the two mentions in a parse tree, e.g., DT/NP/NNS/.../NP/NP/VBG.	up to three tokens from every mention
BSubtilis	If the two mentions are known B. subtilis genes, the probability of a protein-protein interaction between them according to STRING data [29], i.e., very low, low, medium, high, or very high.	--
IsBSubtilis	Is the current mention a known B. subtilis gene?	--
IsBsubtilisPair	Which of the two consecutive mentions are known B. subtilis genes, i.e., left, right, both, or none.	--

The feature functions are used by all CRF-based sieves for all selected skip-mention CRF models. All extracted features are modeled as both unigram and bigram features. Unigram features are used for the current label factor, and bigram features are used for the transition factor between two labels.
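To make the unigram/bigram distinction concrete, the following is a minimal Python sketch of how a few of the feature functions in Table 1 could be computed and then attached to the two kinds of factors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `known` gene set, the feature-tuple layout, and the labels `"O"` and `"Interaction"` are all hypothetical, and the mention list is assumed to already be the sequence produced for a given skip-mention model.

```python
from typing import Dict, List, Set, Tuple


def starts_upper(mention: str) -> bool:
    """'Starts upper': does the mention begin with an upper-case letter?"""
    return bool(mention) and mention[0].isupper()


def bsubtilis_pair(left: str, right: str, known: Set[str]) -> str:
    """'IsBsubtilisPair': which of two consecutive mentions are known genes."""
    l, r = left in known, right in known
    if l and r:
        return "both"
    if l:
        return "left"
    if r:
        return "right"
    return "none"


def observations(mentions: List[str], i: int, known: Set[str]) -> Dict[str, str]:
    """Observed feature values for position i (hypothetical layout).

    `known` stands in for a B. subtilis gene lexicon; the first mention
    has no predecessor, so an empty string is used in its place.
    """
    cur = mentions[i]
    prev = mentions[i - 1] if i > 0 else ""
    return {
        "starts_upper": str(starts_upper(cur)),
        "starts_upper_twice": str(starts_upper(prev) and starts_upper(cur)),
        "is_bsubtilis": str(cur in known),
        "is_bsubtilis_pair": bsubtilis_pair(prev, cur, known),
    }


def factor_features(
    obs: Dict[str, str], prev_label: str, label: str
) -> Tuple[List[tuple], List[tuple]]:
    """Pair every observation with the label(s) of the factor it feeds.

    Unigram features depend on the current label only; bigram features
    depend on the (previous label, current label) transition.
    """
    unigram = [("U", name, value, label) for name, value in obs.items()]
    bigram = [("B", name, value, prev_label, label) for name, value in obs.items()]
    return unigram, bigram


if __name__ == "__main__":
    mentions = ["sigK", "GerE"]        # toy mention sequence
    known = {"sigK", "GerE"}           # toy gene lexicon (assumption)
    obs = observations(mentions, 1, known)
    uni, bi = factor_features(obs, "O", "Interaction")  # hypothetical labels
    for feature in uni + bi:
        print(feature)
```

In this sketch each observed value yields one unigram and one bigram feature, which mirrors the statement that all extracted features are modeled in both forms; a real CRF toolkit would typically express the same pairing through its own feature templates.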

ISSN: 1471-2105