Automatic discovery of cross-family sequence features associated with protein function

BMC Bioinformatics

Table 1 Sequence-to-function correlations.

A. All keywords
	Experiment		Control I		Control II
	mean	SE	mean	SE	mean	SE
CC on training set	0.265	0.00461	0.162	0.00226	0.209	0.00288
CC on testing set	0.112	0.00453	0.00451	0.00294	0.0664	0.00406
Top 10 keywords	secreted, nuclear, membrane, cytoplasmic, DNA, biosynthesis, RNA, integral, meiosis, catalyzes
B. Subcellular location keywords excluded
	Experiment		Control I		Control II
	mean	SE	mean	SE	mean	SE
CC on training set	0.231	0.00325	0.179	0.00240	0.213	0.00303
CC on testing set	0.0603	0.00619	0.00619	0.00276	0.0402	0.00350
Top 10 keywords	inhibits, biosynthesis, transcription, catalyzes, DNA, atp, bacteria, stimulates, transcriptional, gram-negative

This table presents the results of two experiments performed with the self-supervised learning approach. The values shown are the means and standard errors (SE) of correlation coefficients (CC) calculated over 250 separate runs. The CC measures the agreement, over a set of proteins, between the functional class predicted from sequence and the functional class assigned according to the presence or absence of a particular combination of words in a protein's annotation. In part A, a standard run and two controls are performed using the full vocabulary of 150 keywords. In Control I, annotations are randomly reallocated to sequences prior to training. In Control II, sequences and annotations are correctly allocated, but the amino acid sequences are randomly shuffled prior to training. In part B, the vocabulary is stripped of subcellular location keywords. For each part, the 10 most common keywords appearing in the evolved predictors are listed. In all cases, a two-tailed unpaired Student's t-test shows a significant difference (P < 0.001) between the mean CCs of experiments and controls for both training and testing data. (These statistics can be recalculated from the given means and standard errors, using n = 250.) From this one can conclude that at least some of the sequence-to-function associations learned from the training data also apply to the unrelated testing data, and therefore to proteins in general.
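
The caption's remark that the t-test can be recalculated from the tabulated means and standard errors can be sketched as follows. This is a minimal illustration, not the authors' code. The function names are hypothetical, and both groups are assumed to contain n = 250 runs. With equal group sizes, the pooled-variance standard error of the difference reduces to the square root of the sum of the squared SEs. The P value uses a normal approximation to the t distribution, an assumption made here to avoid external libraries and reasonable only because the degrees of freedom are large (498).

```python
import math


def t_from_summary(mean_a, se_a, mean_b, se_b, n=250):
    """Unpaired Student's t statistic from two means and their standard errors.

    Assumes both groups have the same size n. In that case the pooled
    standard error of the difference is sqrt(se_a**2 + se_b**2).
    Returns the t statistic and the degrees of freedom (2n - 2).
    """
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    t = (mean_a - mean_b) / se_diff
    df = 2 * n - 2
    return t, df


def two_tailed_p_normal(t):
    """Approximate two-tailed P value, treating t as a standard normal deviate.

    This is an approximation (hypothetical helper): for small degrees of
    freedom the exact t distribution should be used instead.
    """
    return math.erfc(abs(t) / math.sqrt(2))


# Example: part A, CC on testing set, Experiment versus Control I
# (means and SEs copied from the table).
t, df = t_from_summary(0.112, 0.00453, 0.00451, 0.00294)
p = two_tailed_p_normal(t)
print(f"t = {t:.2f}, df = {df}, approximate two-tailed P = {p:.3g}")
```

The same call can be repeated for any other experiment/control pair in parts A and B by substituting the corresponding means and standard errors.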

ISSN: 1471-2105