
Table 2 Word context features obtained directly from sentences

From: Protein-protein interaction extraction with feature selection by evaluating contribution levels of groups consisting of related features

Features

Definitions/Remarks

Values

Examples

Distance_KP1

The number of words appearing between keyword K and protein name P1 in the sentence.

Integer value

In sentence LLL.d33.s1 of the LLL corpus, “GerE binds to a site on one of these promoters, cotX, that overlaps its -35 region,” the keyword is ‘bind’ and Distance_KP1 is 0.

Distance_KP2

The number of words appearing between keyword K and protein name P2 in the sentence.

Integer value

In sentence LLL.d33.s1 above, Distance_KP2 is 8.

Distance_P1P2

The number of words appearing between the two protein names P1 and P2 in the sentence.

Integer value

In sentence LLL.d33.s1 above, Distance_P1P2 is 9.

Position_P1

The word distance between the beginning of the sentence and protein name P1, plus one (that is, the 1-based word position of P1).

Integer value

In sentence LLL.d33.s1 above, Position_P1 is 1.

Position_P2

The word distance between the beginning of the sentence and protein name P2, plus one (that is, the 1-based word position of P2).

Integer value

In sentence LLL.d33.s1 above, Position_P2 is 11.

Position of keyword

The order of keyword K relative to the protein pair P1 and P2: ‘infix’ if the word order is [P1-K-P2], ‘prefix’ if it is [K-P1-P2], or ‘postfix’ if it is [P1-P2-K].

‘Infix’, ‘prefix’, or ‘postfix’

In sentence LLL.d33.s1 above, feature value is ‘infix’.

Comma between keyword and protein pair

Because the topic of a sentence frequently changes before and after a comma, we record whether a comma lies between the protein pair and the keyword. The value has the form ‘x1x2’: x1 is ‘t’ if a comma exists between A and B and ‘f’ otherwise, and x2 is ‘t’ if a comma exists between B and C and ‘f’ otherwise, where A, B, and C are the keyword and the two protein names in order of their appearance in the sentence.

‘tt’, ‘ff’, ‘tf’, or ‘ft’

In sentence LLL.d33.s1 above, feature value is ‘ft’.

Multiple occurrences of keywords

Whether more than one keyword occurs in the sentence.

‘true’ or ‘false’

In sentence LLL.d33.s1 above, feature value is ‘false’.

Parallel expression of a protein pair

Whether the two protein names of the pair are contiguous in the word order of the sentence containing them (they are still considered contiguous if only ‘-’, ‘/’, ‘and’, ‘or’, or ‘(’ appears between them). If two protein names are described in parallel in a sentence, an interaction between them is unlikely.

‘true’ or ‘false’

In sentence LLL.d30.s0, “In vitro, both sigma(A) and sigma(X) holoenzymes recognize promoter elements within the sigX-ypuN control region,” the feature values of protein pairs (sigma(A), sigma(X)) and (sigX, ypuN) are ‘true’ (PPIs occur only among the remaining protein pairs).

  1. Word context features extracted from sentences. P1, P2, and K denote the protein name appearing first, the protein name appearing later, and the keyword in a sentence, respectively. ‘t’ and ‘f’ are abbreviations of ‘true’ and ‘false’
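As an illustration of the definitions in the table, the following is a minimal Python sketch of how the distance, position, keyword-order, and comma features might be computed. It is not the authors' implementation. The function names (`tokenize`, `word_context_features`), the whitespace-and-comma tokenization, and the choice to pass the keyword as its surface form (‘binds’ rather than the stem ‘bind’) are all assumptions made for this example. It also assumes each of the three terms occurs exactly once in the sentence.

```python
import re


def tokenize(sentence):
    """Split on whitespace, keeping each comma as its own token (assumed scheme)."""
    return re.findall(r"[^\s,]+|,", sentence)


def word_context_features(sentence, p1, p2, keyword_token):
    """Compute a subset of the Table 2 features for one (P1, P2, K) triple.

    Hypothetical helper: assumes p1, p2 and keyword_token each occur exactly
    once as a token, and that p1 appears before p2 in the sentence.
    """
    tokens = tokenize(sentence)
    tok = {name: tokens.index(name) for name in (p1, p2, keyword_token)}

    def word_index(token_index):
        # 0-based index among words only; commas are not counted as words.
        return sum(1 for t in tokens[:token_index] if t != ",")

    w_p1 = word_index(tok[p1])
    w_p2 = word_index(tok[p2])
    w_k = word_index(tok[keyword_token])

    def words_between(a, b):
        # Number of words strictly between two word indices.
        return abs(a - b) - 1

    if w_k < w_p1:
        keyword_position = "prefix"   # [K-P1-P2]
    elif w_k > w_p2:
        keyword_position = "postfix"  # [P1-P2-K]
    else:
        keyword_position = "infix"    # [P1-K-P2]

    # A, B, C: the keyword and the two protein names in order of appearance.
    a, b, c = sorted(tok.values())

    def comma_flag(lo, hi):
        return "t" if "," in tokens[lo + 1:hi] else "f"

    return {
        "Distance_KP1": words_between(w_k, w_p1),
        "Distance_KP2": words_between(w_k, w_p2),
        "Distance_P1P2": words_between(w_p1, w_p2),
        "Position_P1": w_p1 + 1,
        "Position_P2": w_p2 + 1,
        "Position of keyword": keyword_position,
        "Comma": comma_flag(a, b) + comma_flag(b, c),
    }


SENTENCE = ("GerE binds to a site on one of these promoters, cotX, "
            "that overlaps its -35 region.")
features = word_context_features(SENTENCE, "GerE", "cotX", "binds")
```

Under these assumptions, `features` for sentence LLL.d33.s1 matches the Examples column: distances 0, 8, and 9, positions 1 and 11, ‘infix’, and ‘ft’. The keyword-multiplicity and parallel-expression features are omitted because they depend on a keyword list and on protein-name tokenization that the table does not specify.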