Protein Function Prediction using Text-based Features extracted from the Biomedical Literature: The CAFA Challenge

Table 8. Prediction performance for biological process classes over the CAFA evaluation dataset. The number of proteins in each class is shown in parentheses after the class name.

Function	Text-KNN (confidence = 0.75)			CAFA-Prior (confidence = 0.01)			CAFA-Seq (confidence = 0.95)			GOtcha (confidence = 0.14)
	P	R	S	P	R	S	P	R	S	P	R	S
biological regulation (114 proteins)	0.5	0.009	0.997	0.261	1	0	0.632	0.105	0.978	0.404	0.351	0.817
multi-organism process (29 proteins)	0.00	0.00	0.939	0.067	1	0	0.00	0.00	0.99	0.286	0.069	0.988
localization (60 proteins)	0.2	0.017	0.989	0.138	1	0	0.44	0.067	0.976	0.297	0.317	0.88
establishment of localization (38 proteins)	0.25	0.026	0.992	0.087	1	0	0.5	0.105	0.99	0.263	0.395	0.894
response to stimulus (106 proteins)	0.125	0.009	0.979	0.243	1	0	0.5	0.047	0.985	0.39	0.302	0.848
developmental process (83 proteins)	0.00	0.00	0.997	0.19	1	0	0.556	0.06	0.989	0.263	0.181	0.881
multicellular organismal process (87 proteins)	0.069	0.023	0.923	0.2	1	0	0.625	0.115	0.983	0.343	0.264	0.874
signalling (33 proteins)	0.5	0.03	0.998	0.076	1	0	0.25	0.061	0.985	0.077	0.061	0.94
biological adhesion (52 proteins)	0.00	0.00	0.971	0.06	1	0	0.00	0.00	0.998	0.00	0.00	0.993
cellular component organization (64 proteins)	0.00	0.00	0.997	0.147	1	0	0.286	0.031	0.987	0.192	0.156	0.887
cellular process (368 proteins)	0.857	0.016	0.985	0.844	1	0	0.867	0.071	0.941	0.866	0.829	0.309
metabolic process (213 proteins)	0.00	0.00	0.991	0.489	1	0	0.588	0.047	0.969	0.633	0.559	0.691
reproduction (25 proteins)	0.083	0.08	0.946	0.057	1	0	0.00	0.00	0.995	0.214	0.12	0.973
reproductive process (25 proteins)	0.083	0.08	0.946	0.057	1	0	0.00	0.00	0.995	0.273	0.12	0.981

The text-based classifier, Text-KNN, is compared with the baseline results provided by the CAFA challenge: CAFA-Prior, CAFA-Seq, and GOtcha. The confidence threshold used for each classifier is shown in parentheses after its name in the column header. The confidence thresholds for Text-KNN, GOtcha, and CAFA-Prior are set at 0.75, 0.14, and 0.01, respectively, since these classifiers make no predictions for over 75% of the classes at higher confidence thresholds.
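The role of the confidence threshold can be illustrated with a minimal sketch. It assumes that each classifier emits a confidence score per (protein, class) pair and that a pair counts as a prediction when its score is at least the threshold; the function names, the toy scores, and the `>=` convention are hypothetical and are not taken from the CAFA evaluation code.

```python
def predicted_pairs(scores, threshold):
    """Return the (protein, class) pairs whose confidence is >= threshold.

    The >= comparison is an assumed convention, not one stated in the source.
    """
    return {pair for pair, conf in scores.items() if conf >= threshold}


def fraction_unpredicted_classes(scores, threshold, classes):
    """Fraction of classes that receive no prediction at the given threshold."""
    covered = {cls for _, cls in predicted_pairs(scores, threshold)}
    return sum(1 for cls in classes if cls not in covered) / len(classes)


# Hypothetical confidence scores for three proteins over four classes.
scores = {
    ("P1", "localization"): 0.80,
    ("P2", "signalling"): 0.40,
    ("P3", "reproduction"): 0.76,
    ("P1", "cellular process"): 0.20,
}
classes = ["localization", "signalling", "reproduction", "cellular process"]

at_075 = fraction_unpredicted_classes(scores, 0.75, classes)
at_080 = fraction_unpredicted_classes(scores, 0.80, classes)
```

In this toy data, raising the threshold from 0.75 to 0.80 leaves more classes without any prediction, which is the effect that motivates the lower thresholds described above.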
The columns P, R, and S refer, respectively, to the Precision, Recall, and Specificity of the classifier over individual classes. Precision and recall values of 0 for a class indicate that none of the proteins belonging to that class is assigned to it, that is, there are no true positives at the respective confidence level. CAFA-Prior always has a specificity of 0 because it assigns every protein to every class, so the number of true negatives is always 0.
A specificity value close to 1 for a class whose precision and recall are both 0 indicates that most proteins in the dataset are not in the class and are indeed not assigned to it (true negatives). A few proteins from other classes are misclassified into the class (false positives), hence the specificity is slightly less than 1.
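The relationship between the three measures and the cases discussed in these notes can be sketched from confusion-matrix counts. The counts below are hypothetical and chosen only for illustration; reporting a ratio with a zero denominator as 0.0 is an assumed convention, not one stated in the source.

```python
def precision_recall_specificity(tp, fp, tn, fn):
    """Return (precision, recall, specificity) from confusion-matrix counts.

    precision   = TP / (TP + FP)
    recall      = TP / (TP + FN)
    specificity = TN / (TN + FP)
    A ratio with a zero denominator is reported as 0.0 (assumed convention).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return (
        ratio(tp, tp + fp),
        ratio(tp, tp + fn),
        ratio(tn, tn + fp),
    )


# Hypothetical class: 20 of 100 proteins belong to it.

# A prior-style baseline that assigns every protein to the class:
# no negatives are predicted, so TN = 0 and FN = 0.
all_assigned = precision_recall_specificity(tp=20, fp=80, tn=0, fn=0)

# A classifier that finds none of the 20 members and wrongly assigns
# 2 of the 80 non-members to the class.
none_found = precision_recall_specificity(tp=0, fp=2, tn=78, fn=20)
```

With these counts, the first case yields a recall of 1 and a specificity of 0, with precision equal to the fraction of proteins in the class, matching the pattern of the CAFA-Prior columns. The second case yields precision and recall of 0 with a specificity just below 1, the situation described in the last note.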

ISSN: 1471-2105