Table 1 A catalog of protein feature datasets that can be used in ML

From: Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data

| Database/dataset | Wikidata entry | Domain level | Residue level | Atom level | Residue–residue graph | \(2^{\circ}\) structure | Electrostatics and charge | Surface and curvature | Protein interaction sites | Train/validation splits | Clusters | Evolutionary info | File format |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PDB [9] | Q766195 |  | \(\checkmark\) | \(\checkmark ^{\dagger }\) |  | \(\checkmark\) |  | \(\checkmark\) |  |  | \(\checkmark\) |  | Web, mmCIF, MMTF |
| UniProt [13] | Q905695 |  | \(\checkmark\) |  |  | \(\checkmark\) |  |  |  |  | \(\checkmark\) |  | Web, ReST |
| CATH [14] | Q5008897 | \(\checkmark\) | \(\checkmark\) | \(\checkmark ^{\dagger }\) |  |  |  |  |  |  | \(\checkmark\) | \(\checkmark\) | PDB, ReST |
| FEATURE [15] | Q114878648 |  | \(\checkmark\) | \(\checkmark\) |  | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |  |  |  |  | ASCII |
| PredictProtein [16] | Q7239681 |  | \(\checkmark\) |  |  | \(\checkmark\) |  | \(\checkmark\) | \(\checkmark\) |  |  | \(\checkmark\) | Web, ReST, JSON |
| DescribePROT [17] | Q111288739 |  | \(\checkmark\) |  |  |  |  | \(\checkmark\) | \(\checkmark\) |  |  | \(\checkmark\) | Web, JSON |
| ATOM3D/DIPS [18] | Q114878673 |  | \(\checkmark\) | \(\checkmark ^{\dagger }\) |  |  |  |  | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |  | JSON, PyTorch |
| ProteinNet [19] | Q114878717 |  | \(\checkmark\) | \(\checkmark ^{\dagger }\) |  | \(\checkmark\) |  |  |  | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | TensorFlow |
| SidechainNet [20] | Q114878822 |  | \(\checkmark\) | \(\checkmark ^{\dagger }\) |  | \(\checkmark\) |  |  |  | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | PyTorch, Pickle |
| Prop3D [this work] | Q108040542 | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |  | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | HDF, HSDS, PyTorch |

  1. Many different datasets of sequences, structures, and biophysical properties exist. They differ in the amount of data they contain and in the levels/scales they cover (chain, domain, residue, atom), and some attach biophysical properties to each atom and/or residue. Databases that provide atomic coordinates, but without biophysical properties associated with those coordinates, are denoted by daggers (\(^{\dagger }\)).
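
To make the catalog easier to query, the feature columns of Table 1 can be encoded as a small lookup structure. The following is a minimal sketch, not part of Prop3D: the names `FEATURES`, `ROWS`, and `datasets_with` are hypothetical, and the per-dataset flags are transcribed from the table as it appears here, so they should be checked against the published table before being relied on.

```python
# Feature columns of Table 1, in header order (short names are illustrative).
FEATURES = [
    "domain", "residue", "atom", "residue_graph", "secondary_structure",
    "electrostatics", "surface", "interaction_sites", "splits",
    "clusters", "evolutionary",
]

# One character per feature column, transcribed from Table 1:
#   "x" = check mark
#   "d" = dagger-marked check (atomic coordinates without attached
#         biophysical properties)
#   "." = empty cell
ROWS = {
    "PDB":            ".xd.x.x..x.",
    "UniProt":        ".x..x....x.",
    "CATH":           "xxd......xx",
    "FEATURE":        ".xx.xxx....",
    "PredictProtein": ".x..x.xx..x",
    "DescribePROT":   ".x....xx..x",
    "ATOM3D/DIPS":    ".xd....xxx.",
    "ProteinNet":     ".xd.x...xxx",
    "SidechainNet":   ".xd.x...xxx",
    "Prop3D":         "xxxxxxx.xxx",
}

# Guard against transcription slips: every row must cover every column.
assert all(len(row) == len(FEATURES) for row in ROWS.values())


def datasets_with(*features, require_properties=False):
    """Return the datasets (in table order) that mark every requested feature.

    With require_properties=True, dagger-marked cells do not count.
    """
    allowed = "x" if require_properties else "xd"
    columns = [FEATURES.index(name) for name in features]
    return [
        dataset
        for dataset, row in ROWS.items()
        if all(row[col] in allowed for col in columns)
    ]


if __name__ == "__main__":
    print(datasets_with("electrostatics"))
    print(datasets_with("atom", require_properties=True))
    print(datasets_with("splits", "evolutionary"))
```

Encoding the dagger as its own state keeps the footnote's distinction queryable: asking for `"atom"` with `require_properties=True` excludes the datasets that provide coordinates only.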