ProteinNet: a standardized data set for machine learning of protein structure

BMC Bioinformatics

Table 1 Summary of ProteinNet features relative to other databases and repositories

Database	Structure	Sequence	PSSM/MSA	Clustering	Train/Val splits	Historical CASP reset	ML framework file format
PDB	Raw	✓	✗	Sequence	✗	✗	✗
CulledPDB	Processed	✓	✗	Sequence	✗	✗	✗
HSSP	✗	✓	HSSP	✗	✗	✗	✗
ProteinNet	Processed	✓	JackHMMer	MSA	✓	✓	TensorFlow

Three existing databases are compared with ProteinNet in terms of available sequence, structure, and evolutionary profile information (PSSM: position-specific scoring matrix; MSA: multiple sequence alignment), as well as standardized splits and tooling to facilitate machine learning (ML) applications. A ✓ indicates inclusion of a feature, while a ✗ indicates exclusion. Structures can be raw or processed, with the latter indicating structure selection based on experimental quality metrics (e.g. R-factor) and annotation of structural pathologies (e.g. missing residues). PSSM/MSA indicates the method used to derive evolutionary profiles. Note that HSSP is no longer widely used by the protein structure prediction community, while JackHMMer is one of the standard methods. Clustering can be performed either by sequence alignment or by exploiting MSAs to detect low sequence homology. The MSA approach used in ProteinNet can detect homology down to 10% sequence identity, which CulledPDB does not do. Data splits that segregate training and validation sets, and the resetting of the historical record to reflect the state of prior CASPs, are also unique to ProteinNet.
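
The role of clustering in the train/validation splits can be illustrated with a minimal sketch. It assumes that cluster assignments are already available (for example, from an MSA-based homology search) and simply holds out whole clusters, so that no validation protein shares a cluster with a training protein. The function name `split_by_cluster`, the toy protein IDs, and the `val_fraction` parameter are hypothetical and chosen for illustration only; this is not the procedure used to build ProteinNet.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, val_fraction=0.1, seed=0):
    """Assign whole clusters to either the training or the validation set.

    cluster_of: dict mapping a protein ID to a cluster ID (hypothetical
    input; in practice clusters would come from a homology-detection tool).
    """
    # Group protein IDs by their cluster.
    clusters = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(protein_id)

    # Shuffle cluster IDs reproducibly, then hold out a fraction of them.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_val = max(1, round(val_fraction * len(cluster_ids)))
    val_clusters = set(cluster_ids[:n_val])

    # Every member of a held-out cluster goes to validation.
    train, val = [], []
    for cluster_id in cluster_ids:
        target = val if cluster_id in val_clusters else train
        target.extend(clusters[cluster_id])
    return sorted(train), sorted(val)


# Toy example: six made-up protein IDs in four made-up clusters.
toy = {"1abc": 0, "2def": 0, "3ghi": 1, "4jkl": 2, "5mno": 2, "6pqr": 3}
train_ids, val_ids = split_by_cluster(toy, val_fraction=0.25)
```

Splitting at the level of clusters rather than individual proteins is what keeps related sequences from leaking across the split. The stricter the clustering threshold, the more distant the validation proteins are from anything seen in training.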

ISSN: 1471-2105