Table 1 Sequence-to-function correlations.

From: Automatic discovery of cross-family sequence features associated with protein function

A. All keywords

                      Experiment           Control I            Control II
                      mean     SE          mean     SE          mean     SE
CC on training set    0.265    0.00461     0.162    0.00226     0.209    0.00288
CC on testing set     0.112    0.00453     0.00451  0.00294     0.0664   0.00406
Top 10 keywords       secreted, nuclear, membrane, cytoplasmic, DNA, biosynthesis, RNA, integral, meiosis, catalyzes

B. Subcellular location keywords excluded

                      Experiment           Control I            Control II
                      mean     SE          mean     SE          mean     SE
CC on training set    0.231    0.00325     0.179    0.00240     0.213    0.00303
CC on testing set     0.0603   0.00619     0.00619  0.00276     0.0402   0.00350
Top 10 keywords       inhibits, biosynthesis, transcription, catalyzes, DNA, atp, bacteria, stimulates, transcriptional, gram-negative
  1. This table presents the results of two experiments performed with the self-supervised learning approach. The numbers shown are the means and standard errors (SE) of correlation coefficients (CC) calculated over 250 separate runs. The CC measures the agreement, over a set of proteins, between the functional class predicted from sequence and the functional class assigned on the basis of the presence or absence of a certain combination of words in a protein's annotation. In part A, a standard run and two controls are performed using the full vocabulary of 150 keywords. In Control I, annotations are randomly reallocated to sequences prior to training. In Control II, sequences and annotations are correctly allocated, but the amino acid sequences are randomly shuffled prior to training. In part B, the vocabulary is stripped of subcellular location keywords. For each part, the 10 most common keywords appearing in the evolved predictors are listed. In all cases the two-tailed unpaired Student's t-test shows a significant difference (P < 0.001) between the mean CCs of experiments and controls for both training and testing data. (These statistics may be calculated from the given means and standard errors, using n = 250.) From this one can conclude that at least some of the sequence-to-function associations learned in the training data also apply to the unrelated testing data, and therefore to proteins in general.
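The footnote remarks that the test statistics can be recomputed from the tabulated means and standard errors with n = 250. A minimal sketch of that calculation follows, under two assumptions that are not stated in the table: both groups contain the same number of runs (so the pooled-variance standard error of the difference reduces to the root sum of squares of the two SEs), and the p-value is approximated with the normal distribution, which is close to the t distribution at 498 degrees of freedom. The function name `t_from_summary` is hypothetical.

```python
import math


def t_from_summary(mean1, se1, mean2, se2, n):
    """Unpaired t statistic from two means and their standard errors.

    Assumes both groups have the same number of runs n. In that case the
    pooled-variance (Student) standard error of the difference equals
    sqrt(se1**2 + se2**2).
    """
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    t = (mean1 - mean2) / se_diff
    df = 2 * n - 2
    # Assumption: normal approximation to the two-tailed p-value.
    # An exact value would use the t distribution with df degrees of freedom.
    p_approx = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p_approx


# Example: part A, testing set, Experiment versus Control I.
t, df, p = t_from_summary(0.112, 0.00453, 0.00451, 0.00294, n=250)
print(f"t = {t:.2f}, df = {df}, approximate two-tailed p = {p:.3g}")
```

The same call can be repeated for each experiment and control pair in parts A and B by substituting the corresponding mean and SE values from the table.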