Table 1 Summary of ProteinNet features relative to other databases and repositories

From: ProteinNet: a standardized data set for machine learning of protein structure

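| Database | Structure | Sequence | PSSM/MSA | Clustering | Train/Val splits | Historical CASP reset | ML framework file format |
|---|---|---|---|---|---|---|---|
| PDB | Raw | ✓ | ✗ | Sequence | ✗ | ✗ | ✗ |
| CulledPDB | Processed | ✓ | ✗ | MSA | ✗ | ✗ | ✗ |
| HSSP | ✗ | ✓ | HSSP | ✗ | ✗ | ✗ | ✗ |
| ProteinNet | Processed | ✓ | JackHMMer | MSA | ✓ | ✓ | TensorFlow |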

  1. Three existing databases are compared with ProteinNet in terms of available sequence, structure, and evolutionary profile information (PSSM: position-specific scoring matrix; MSA: multiple sequence alignment), as well as the standardized splits and tooling they provide to facilitate machine learning (ML) applications. A ✓ indicates inclusion of a feature, while a ✗ indicates exclusion. Structures can be raw or processed, with the latter indicating structure selection based on experimental quality metrics (e.g. R-factor) and annotation of structural pathologies (e.g. missing residues). The PSSM/MSA column indicates the method used to derive evolutionary profiles. Note that HSSP is no longer widely used by the protein structure prediction community, while JackHMMer is one of the standard methods. Clustering can be performed either by sequence alignment or by exploiting MSAs to detect low sequence homology. The MSA approach used in ProteinNet can detect homology down to 10% sequence identity, which CulledPDB does not do. Data splits that segregate training and validation sets, and a reset of the historical record to reflect the state of prior CASPs, are also unique to ProteinNet.
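
To make the idea of clustering at a sequence-identity threshold concrete, the following is a minimal Python sketch. It is not the ProteinNet pipeline, which relies on MSA-based comparison to reach low identity levels; it only illustrates the simpler sequence-alignment case on sequences that are assumed to be already aligned. The function names `percent_identity` and `greedy_cluster`, the toy sequences, and the 30% threshold are all hypothetical choices made for illustration.

```python
from __future__ import annotations


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where both sequences have a residue.

    Assumes the two strings are rows of the same alignment, with '-' as gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def greedy_cluster(aligned: dict[str, str], threshold: float) -> list[list[str]]:
    """Assign each sequence to the first cluster whose representative
    (its first member) is at least `threshold` percent identical to it;
    otherwise start a new cluster."""
    clusters: list[list[str]] = []
    for name, seq in aligned.items():
        for cluster in clusters:
            representative = aligned[cluster[0]]
            if percent_identity(seq, representative) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Toy, made-up aligned sequences (hypothetical data, not from ProteinNet).
toy = {"a": "MKTAYIAK", "b": "MKTAYLAK", "c": "GSHW-PQR"}
clusters = greedy_cluster(toy, threshold=30.0)
```

With a split built this way, all members of a cluster would go to the same side of a training/validation split, so that no validation sequence is closer than the threshold to a cluster representative in the training set. Direct pairwise identity of this kind becomes unreliable at low identity levels, which is why profile- or MSA-based comparison is used when homology must be detected at levels such as 10%.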