Fig. 2 | BMC Bioinformatics

Fig. 2

From: Protein language models can capture protein quaternary state

Overall data statistics and distribution of the quaternary state (qs) dataset used in this study. A. Distribution of qs: for each qs (x-axis), the number of entries is shown (y-axis, log scale). Left and right bars show the training set (in color, after down-sampling; dark gray: samples removed by down-sampling) and the hold-out set (light gray), respectively. Down-sampling was necessary because the data are unevenly distributed: they are strongly skewed towards monomers, followed by dimers, trimers, tetramers and hexamers. B. Number of ECOD families (y-axis) by the number of different qs found within a family (x-axis). While most families show the same qs for all their entries, a substantial fraction contains a diverse set of qs (exact numbers are given in the boxes). C. Composition of qs in different ECOD families, shown as a network in which the nodes represent the different qs (colored as in A) and are sized according to their abundance (percentages indicated). The edges represent families containing two different qs, with width proportional to the number of such families (numbers indicated). Note that B and C show numbers for the training set, after down-sampling and removal of small groups.
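
The per-family statistics described for panels B and C can be illustrated with a minimal sketch. This is not the authors' code: the function name `family_qs_statistics`, the `(family_id, qs)` input format, and the toy data are all hypothetical, and the sketch assumes that an edge is counted for every pair of qs that co-occur in a family.

```python
from collections import Counter, defaultdict
from itertools import combinations


def family_qs_statistics(entries):
    """Summarize qs diversity per family.

    entries: iterable of (family_id, qs) pairs (hypothetical input format).
    Returns (families_per_qs_count, edge_weights).
    """
    # Collect the set of distinct qs observed in each family.
    qs_by_family = defaultdict(set)
    for family_id, qs in entries:
        qs_by_family[family_id].add(qs)

    # Panel B analogue: how many families contain k distinct qs.
    families_per_qs_count = Counter(len(qs_set) for qs_set in qs_by_family.values())

    # Panel C analogue (assumption): one edge count per family for every
    # pair of distinct qs that co-occur in that family.
    edge_weights = Counter()
    for qs_set in qs_by_family.values():
        for pair in combinations(sorted(qs_set), 2):
            edge_weights[pair] += 1

    return families_per_qs_count, edge_weights


# Toy data, invented purely for illustration.
toy_entries = [
    ("fam_a", 1), ("fam_a", 1), ("fam_a", 2),
    ("fam_b", 1),
    ("fam_c", 2), ("fam_c", 4), ("fam_c", 1),
]
diversity, edges = family_qs_statistics(toy_entries)
```

Here `diversity` maps the number of distinct qs to the number of families with that many, and `edges` maps each qs pair to the number of families in which both occur; the latter would set the edge widths in a network drawing.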
