Table 5 Data statistics of the training and testing datasets after removal of homologous sequences with the CD-HIT program

From: iDPGK: characterization and identification of lysine phosphoglycerylation sites based on sequence-based features

Sequence identity cut-off    Number of phosphoglycerylation sites    Number of non-phosphoglycerylation sites
Raw data                     150                                     3997
90%                          107                                     3031
80%                          104                                     2610
70%                          98                                      2319
60%                          96                                      2040
50%                          93                                      1845
40%                          89                                      1318
Training data                89                                      178
Independent testing data     37                                      74
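
The class imbalance behind these counts is easier to see as a ratio of non-phosphoglycerylation to phosphoglycerylation sites. The short Python sketch below recomputes that ratio for each row. The counts are copied from the table; the names `TABLE5` and `negative_to_positive_ratios` are illustrative assumptions and are not taken from the paper, and the sketch does not reproduce how the datasets were built.

```python
# Counts copied from Table 5: (row label, positive sites, negative sites).
# The variable and function names are hypothetical, chosen for illustration.
TABLE5 = [
    ("Raw data", 150, 3997),
    ("90%", 107, 3031),
    ("80%", 104, 2610),
    ("70%", 98, 2319),
    ("60%", 96, 2040),
    ("50%", 93, 1845),
    ("40%", 89, 1318),
    ("Training data", 89, 178),
    ("Independent testing data", 37, 74),
]


def negative_to_positive_ratios(rows):
    """Return {row label: negatives / positives} for each table row."""
    return {label: neg / pos for label, pos, neg in rows}


if __name__ == "__main__":
    for label, ratio in negative_to_positive_ratios(TABLE5).items():
        print(f"{label:<26} {ratio:6.2f}")
```

In the table's own numbers, the training and independent testing rows both have exactly two non-phosphoglycerylation sites per phosphoglycerylation site (178/89 and 74/37), whereas every row from raw data down to the 40% cut-off has far more negatives than positives.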