
Table 1 Sequence-to-function correlations.

From: Automatic discovery of cross-family sequence features associated with protein function

A. All keywords

| | Experiment (mean) | Experiment (SE) | Control I (mean) | Control I (SE) | Control II (mean) | Control II (SE) |
|---|---|---|---|---|---|---|
| CC on training set | 0.265 | 0.00461 | 0.162 | 0.00226 | 0.209 | 0.00288 |
| CC on testing set | 0.112 | 0.00453 | 0.00451 | 0.00294 | 0.0664 | 0.00406 |

Top 10 keywords: secreted, nuclear, membrane, cytoplasmic, DNA, biosynthesis, RNA, integral, meiosis, catalyzes

B. Subcellular location keywords excluded

| | Experiment (mean) | Experiment (SE) | Control I (mean) | Control I (SE) | Control II (mean) | Control II (SE) |
|---|---|---|---|---|---|---|
| CC on training set | 0.231 | 0.00325 | 0.179 | 0.00240 | 0.213 | 0.00303 |
| CC on testing set | 0.0603 | 0.00619 | 0.00619 | 0.00276 | 0.0402 | 0.00350 |

Top 10 keywords: inhibits, biosynthesis, transcription, catalyzes, DNA, atp, bacteria, stimulates, transcriptional, gram-negative

  1. This table presents the results of two experiments performed with the self-supervised learning approach. The numbers shown are the means and standard errors (SE) of correlation coefficients (CC) calculated over 250 separate runs. The CC measures the agreement, over a set of proteins, between the functional class predicted from sequence and the functional class assigned on the basis of the presence or absence of a certain combination of words in a protein's annotation. In experiment A, a standard run and two controls are performed using the full vocabulary of 150 keywords. In Control I, annotations are randomly reallocated to sequences prior to training. In Control II, sequences and annotations are correctly allocated, but the amino acid sequences are randomly shuffled prior to training. In experiment B, the vocabulary is stripped of subcellular location keywords. For each experiment, the 10 most common keywords appearing in the evolved predictors are listed. In all cases the two-tailed unpaired Student's t-test shows a significant difference (P < 0.001) between the mean CCs of experiments and controls for both training and testing data. (These statistics may be calculated from the given means and standard errors, using n = 250.) From this one can conclude that at least some of the sequence-to-function associations learned in the training data also apply to the unrelated testing data, and therefore to proteins in general.
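The note states that the test statistics can be recomputed from the reported means and standard errors with n = 250. A minimal sketch of that calculation follows, using the panel A testing-set values. The function names are hypothetical and do not come from the source. The P value uses a normal approximation to the t distribution, an assumption that should be reasonable at 498 degrees of freedom but is not an exact Student's t calculation.

```python
import math

N_RUNS = 250  # runs per condition, as stated in the table note


def t_from_summary(mean_a, se_a, mean_b, se_b, n=N_RUNS):
    """Unpaired Student's t statistic and degrees of freedom from means and SEs.

    With equal group sizes, the pooled-variance standard error of the
    difference between means reduces to sqrt(se_a**2 + se_b**2).
    """
    t = (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    df = 2 * n - 2
    return t, df


def two_tailed_p_normal_approx(t):
    """Two-tailed P value from a normal approximation (an assumption here).

    For large degrees of freedom the t distribution is close to the
    standard normal, so this stands in for the exact t-distribution tail.
    """
    return math.erfc(abs(t) / math.sqrt(2))


# Panel A, CC on testing set: (mean, SE) pairs copied from the table.
experiment = (0.112, 0.00453)
controls = {
    "Control I": (0.00451, 0.00294),
    "Control II": (0.0664, 0.00406),
}

for name, (mean_c, se_c) in controls.items():
    t, df = t_from_summary(*experiment, mean_c, se_c)
    p = two_tailed_p_normal_approx(t)
    print(f"{name}: t = {t:.2f}, df = {df}, approx. P = {p:.2e}")
```

If SciPy is available, `scipy.stats.ttest_ind_from_stats` gives an exact t-distribution P value. It expects standard deviations rather than standard errors, so each SE would first be multiplied by the square root of 250.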