Protein Function Prediction using Text-based Features extracted from the Biomedical Literature: The CAFA Challenge

Table 7 Prediction performance for molecular function classes over the CAFA evaluation dataset. (The number of proteins in each class is shown in parentheses after each function name.)

Function	Text-KNN (confidence = 0.95)			CAFA-Prior (confidence = 0.01)			CAFA-Seq (confidence = 0.95)			GOtcha (confidence = 0.95)
	P	R	S	P	R	S	P	R	S	P	R	S
binding (212 proteins)	0.643	0.17	0.87	0.579	1	0.00	0.9	0.085	0.987	0.723	0.16	0.916
transporter activity (28 proteins)	0.00	0.00	0.97	0.077	1	0.00	0.5	0.036	0.997	0.714	0.179	0.994
catalytic activity (165 proteins)	0.312	0.03	0.95	0.451	1	0.00	0.714	0.03	0.990	0.917	0.067	0.995

The text-based classifier, Text-KNN, is compared with baseline results provided by the CAFA challenge: CAFA-Prior, CAFA-Seq, and GOtcha. The confidence threshold used for each classifier is shown in parentheses after its name. A confidence threshold of 0.01 is used for CAFA-Prior because that classifier makes no predictions for the 'transporter activity' class at higher confidence thresholds.
The columns P, R, and S refer, respectively, to the Precision, Recall, and Specificity of the classifiers over individual classes. Precision and recall values of 0 for a class indicate that none of the proteins belonging to that class is correctly assigned to it at the confidence threshold of 0.95. CAFA-Prior always has a specificity of 0 because it assigns every protein to each class, so the number of true negatives is always 0.
A specificity close to 1 for a class whose precision and recall are both 0 indicates that most proteins outside the class are correctly left unassigned to it (true negatives). A few proteins from other classes are misclassified into the class (false positives), so the specificity is slightly less than 1.
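
The two situations described in these notes can be reproduced with a minimal sketch of the per-class metrics. The function name `class_metrics`, the convention of returning 0 when a denominator is 0, and all counts below are assumptions chosen for illustration; they are not taken from the CAFA evaluation data.

```python
def class_metrics(tp, fp, tn, fn):
    """Return (precision, recall, specificity) for one class.

    tp, fp, tn, fn are the confusion counts for that class.
    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return precision, recall, specificity


# Hypothetical dataset: 100 proteins, 10 of which belong to the class.

# Case 1: every protein is assigned to the class (the CAFA-Prior behaviour
# described above). There are no true negatives, so specificity is 0,
# recall is 1, and precision equals the fraction of proteins in the class.
assign_all = class_metrics(tp=10, fp=90, tn=0, fn=0)

# Case 2: no class member is recovered, and 3 proteins from other classes
# are wrongly assigned. Precision and recall are 0, while specificity stays
# close to 1 because 87 of the 90 non-members are correctly left out.
miss_all = class_metrics(tp=0, fp=3, tn=87, fn=10)

print(assign_all)
print(miss_all)
```

In the first case, precision reduces to the class prevalence, which is why a classifier of this kind can show a non-trivial precision for a large class despite a specificity of 0.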

ISSN: 1471-2105