
Table 3 Syntactic features obtained from parse trees

From: Protein-protein interaction extraction with feature selection by evaluating contribution levels of groups consisting of related features

| Features | Definitions/Remarks | Values | Examples |
| --- | --- | --- | --- |
| Height_P1 | The height of the first protein name P1 of the instance in the constituent parse tree. This height differs from the features Distance_KP1, Distance_KP2, and Distance_P1P2, i.e., the distances between the protein pair and the keyword described in Table 2. | Integer value | In Fig. 1, Height_P1 is 2. |
| Height_P2 | The height of the second protein name P2 of the instance in the constituent parse tree. This height differs from the features Distance_KP1, Distance_KP2, and Distance_P1P2. | Integer value | In Fig. 1, Height_P2 is 4. |
| Height_K | The height of the keyword K in the constituent parse tree. This height differs from the features Distance_KP1, Distance_KP2, and Distance_P1P2. | Integer value | In Fig. 1, Height_K is 2. |
| POS_P1 | The part-of-speech information along the path from the root of the constituent parse tree is taken into account for the two protein names constituting the instance and for the keyword. This information makes it possible to represent the syntactic structure and to train classifiers to learn the pseudo-grammar structure. POS_P1 denotes the part-of-speech information along the path from the root to the leaf representing the first protein P1 of the instance. | The list of part-of-speech information along the path from the root of the constituent parse tree. | In Fig. 1, POS_P1 is ‘NP, NNP’. |
| POS_P2 | POS_P2 denotes the part-of-speech information along the path from the root to the leaf representing the second protein P2 of the instance. | The list of part-of-speech information along the path from the root of the constituent parse tree. | In Fig. 1, POS_P2 is ‘VP, NP, NP, CD’. |
| POS_K | POS_K denotes the part-of-speech information along the path from the root to the leaf representing the keyword K. | The list of part-of-speech information along the path from the root of the constituent parse tree. | In Fig. 1, POS_K is ‘VP, VBZ’. |

  1. All sentences were transformed into representations called constituent parse trees, produced by the Stanford parser [7]. Syntactic features were extracted from these constituent parse trees. P1, P2, and K denote the protein name appearing first, the protein name appearing later, and the keyword in a sentence, respectively.
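
As an illustration of how the six features in this table could be read off a parse tree, the following is a minimal Python sketch, not the authors' implementation. It assumes the parse is available as a Penn Treebank style bracketed string, and it assumes that the height equals the number of labels on the path from the root to the leaf, with the root label itself excluded; that convention is inferred only from the Fig. 1 values quoted above (for example, Height_P2 is 4 and POS_P2 has four labels). The function names, the `TOY_PARSE` string, and the tokens `PROT1`, `PROT2`, and `binds` are hypothetical and were chosen to reproduce those values; they are not the sentence shown in Fig. 1.

```python
import re
from typing import Dict, List, Optional, Union

Node = Union[str, list]  # a leaf token, or [label, child, child, ...]


def parse_brackets(text: str) -> Node:
    """Parse a bracketed constituent tree such as '(S (NP (NNP x)))'."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Node:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # leaf token
        node = [tokens[pos]]  # constituent or part-of-speech label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume ")"
        return node

    return read()


def path_from_root(tree: Node, token: str) -> List[str]:
    """Labels on the path from the root down to `token`, root label excluded.

    Excluding the root label is an assumption made so that the toy tree
    reproduces the values reported for Fig. 1.
    """

    def walk(node: Node, path: List[str]) -> Optional[List[str]]:
        if isinstance(node, str):
            return path if node == token else None
        label, *children = node
        for child in children:
            found = walk(child, path + [label])
            if found is not None:
                return found
        return None

    full = walk(tree, [])
    if full is None:
        raise ValueError(f"token {token!r} not found in tree")
    return full[1:]  # drop the root label


def syntactic_features(parse: str, p1: str, p2: str, k: str) -> Dict[str, object]:
    """Build Height_* and POS_* features for P1, P2, and the keyword K."""
    tree = parse_brackets(parse)
    features: Dict[str, object] = {}
    for name, token in (("P1", p1), ("P2", p2), ("K", k)):
        path = path_from_root(tree, token)
        features[f"Height_{name}"] = len(path)
        features[f"POS_{name}"] = ", ".join(path)
    return features


# Hypothetical toy parse, not the sentence from Fig. 1.
TOY_PARSE = "(S (NP (NNP PROT1)) (VP (VBZ binds) (NP (NP (CD PROT2)))))"

if __name__ == "__main__":
    for key, value in syntactic_features(TOY_PARSE, "PROT1", "PROT2", "binds").items():
        print(key, "=", value)
```

The sketch looks up each token by its first occurrence in the tree, so a real pipeline would more likely identify P1, P2, and K by token position; that detail is left out here to keep the example short.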