Protein-protein interaction extraction with feature selection by evaluating contribution levels of groups consisting of related features

Thuy Phan, Thi Thanh; Ohkawa, Takenao

doi:10.1186/s12859-016-1100-z

Table 3 Syntactic features obtained from parse trees

Features	Definitions/Remarks	Values	Examples
Height_P1	The height of the first protein name P1 of the instance in the constituent parse tree. This height differs from the features Distance_KP1, Distance_KP2, and Distance_P1P2, i.e., the distances among the protein pair and the keyword described in Table 2.	Integer value	In Fig. 1, Height_P1 is 2.
Height_P2	The height of the second protein name P2 of the instance in the constituent parse tree. This height differs from the features Distance_KP1, Distance_KP2, and Distance_P1P2.	Integer value	In Fig. 1, Height_P2 is 4.
Height_K	The height of the keyword K in the constituent parse tree. This height differs from the features Distance_KP1, Distance_KP2, and Distance_P1P2.	Integer value	In Fig. 1, Height_K is 2.
POS_P1	We take into account the part-of-speech information along the path from the root of the constituent parse tree to the two protein names constituting the instance and to the keyword. This information makes it possible to represent syntactic structure and to train classifiers to learn the pseudo-grammar structure. POS_P1 denotes the part-of-speech information along the path from the root to the leaf representing the first protein P1 of the instance.	The list of part-of-speech information along the path from the root of the constituent parse tree.	In Fig. 1, POS_P1 is ‘NP, NNP’.
POS_P2	POS_P2 denotes the part-of-speech information along the path from the root to the leaf representing the second protein P2 of the instance.	The list of part-of-speech information along the path from the root of the constituent parse tree.	In Fig. 1, POS_P2 is ‘VP, NP, NP, CD’.
POS_K	POS_K denotes the part-of-speech information along the path from the root to the leaf representing the keyword K.	The list of part-of-speech information along the path from the root of the constituent parse tree.	In Fig. 1, POS_K is ‘VP, VBZ’.

All sentences were transformed into constituent parse trees produced by the Stanford parser [7], and the syntactic features were extracted from these trees. P1, P2, and K denote the protein name appearing first, the protein name appearing later, and the keyword in a sentence, respectively.
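
The six features in Table 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the parse tree is available as a Penn-Treebank-style bracketed string, and it assumes that the height of a word equals the number of labels on its path below the root, which is consistent with the Fig. 1 examples (for instance, POS_P1 is ‘NP, NNP’ and Height_P1 is 2). The sentence, the tokens PROT1 and PROT2, and the function names parse_bracketed, path_from_root, and syntactic_features are hypothetical and are not taken from Fig. 1 or from the article.

```python
def parse_bracketed(text):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def path_from_root(tree, word):
    """Return the labels below the root on the path to the first leaf equal to `word`.

    The root label itself is excluded (assumption based on the Fig. 1 examples).
    Returns None if the word does not occur in the tree.
    """
    def walk(node, path):
        _, children = node
        for child in children:
            if isinstance(child, str):
                if child == word:
                    return path
            else:
                found = walk(child, path + [child[0]])
                if found is not None:
                    return found
        return None

    return walk(tree, [])


def syntactic_features(tree, p1, p2, keyword):
    """Compute Height_* and POS_* for the two protein names and the keyword."""
    feats = {}
    for name, word in (("P1", p1), ("P2", p2), ("K", keyword)):
        path = path_from_root(tree, word)
        feats["POS_" + name] = path
        # Assumed definition: height = number of labels on the path below the root.
        feats["Height_" + name] = len(path) if path is not None else None
    return feats


# Hypothetical toy sentence: "PROT1 binds PROT2 in cells"
toy = ("(S (NP (NNP PROT1)) "
       "(VP (VBZ binds) (NP (NP (NN PROT2)) (PP (IN in) (NP (NNS cells))))))")
features = syntactic_features(parse_bracketed(toy), "PROT1", "PROT2", "binds")
print(features)
```

For this toy tree, POS_P1 is ['NP', 'NNP'] with Height_P1 2, POS_K is ['VP', 'VBZ'] with Height_K 2, and POS_P2 is ['VP', 'NP', 'NP', 'NN'] with Height_P2 4. The sketch matches words by surface form and takes the first occurrence, so a real pipeline would need token positions to disambiguate repeated words.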

ISSN: 1471-2105

BMC Bioinformatics
