Table 15 Classification datasets

From: Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology

Original (unfiltered) data, statistical properties

| Repair pathway | GO ID | No. Sequences | Ave. Length | Median Length | Std. Dev. |
|---|---|---|---|---|---|
| Base excision repair | 0006284 | 2624 | 276 | 251 | 134.7 |
| DNA dealkylation | 0006307 | 25 | 203 | 174 | 106.8 |
| DNA synthesis during DNA repair | 0000731 | 28 | 996 | 1103 | 581.5 |
| Double-strand break repair | 0006302 | 364 | 616 | 609 | 306.4 |
| Error-prone DNA repair | 0045020 | 46 | 1075 | 1077 | 46.8 |
| Mismatch repair | 0006298 | 1777 | 617 | 653 | 317.3 |
| Nucleotide-excision repair | 0006289 | 2106 | 732 | 685 | 261.6 |
| Postreplication repair | 0006301 | 28 | 449 | 350 | 350.7 |
| Regulation of DNA repair | 0006282 | 264 | 211 | 172 | 161.6 |
| Single strand break repair | 0000012 | 40 | 476 | 614 | 297.8 |
| Other pathways | N/A | 45 | 592 | 373 | 486.9 |
| Total | | 7347 | 515 | 415 | 321.2 |

Maximum 90% similarity, statistical properties

| Repair pathway | No. Sequences | Ave. Length | Median Length | Std. Dev. |
|---|---|---|---|---|
| Base excision repair | 1721 | 284 | 260 | 135.9 |
| Double-strand break repair | 266 | 603 | 611 | 309.8 |
| Error-prone DNA repair | 36 | 1077 | 1082 | 51.4 |
| Mismatch repair | 1020 | 710 | 768 | 274.9 |
| Nucleotide-excision repair | 1325 | 737 | 689 | 268.9 |
| Regulation of DNA repair | 174 | 205 | 168 | 153.4 |
| Single strand break repair | 25 | 490 | 403 | 316.9 |
| Other pathways | 78 | 579.8 | 373 | 539.1 |
| Total | 4645 | 534 | 516 | 322.7 |

Maximum 50% similarity, statistical properties

| Repair pathway | No. Sequences | Ave. Length | Median Length | Std. Dev. |
|---|---|---|---|---|
| Base excision repair | 630 | 321 | 278 | 185.7 |
| Double-strand break repair | 174 | 656 | 655 | 343.6 |
| Mismatch repair | 468 | 718 | 710 | 302.6 |
| Nucleotide-excision repair | 363 | 684 | 633 | 360.6 |
| Regulation of DNA repair | 114 | 213 | 168 | 185.5 |
| Other pathways | 81 | 659 | 582 | 496.2 |
| Total | 1830 | 535 | 434 | 349.8 |

  1. The number of proteins extracted from the UniProt database for each DNA repair pathway used in the classification experiments is shown, along with statistical properties of the sequence lengths. Only repair pathways with sufficient data (minimum 25 proteins) for experiments are listed individually; pathways without enough data are combined into a single dataset ("Other pathways").
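
The grouping rule in the note above can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `summarize`, the input layout (a mapping from pathway name to a list of sequence lengths), and the use of the sample standard deviation are all assumptions made for illustration. The note does not say which standard deviation estimator was used.

```python
from statistics import mean, median, stdev

MIN_PROTEINS = 25  # threshold stated in the table note


def summarize(lengths_by_pathway, min_proteins=MIN_PROTEINS):
    """Summarize sequence lengths per pathway (hypothetical helper).

    Pathways with fewer than `min_proteins` sequences are pooled into a
    single "Other pathways" group before statistics are computed.
    """
    kept = {}
    other = []
    for pathway, lengths in lengths_by_pathway.items():
        if len(lengths) >= min_proteins:
            kept[pathway] = list(lengths)
        else:
            other.extend(lengths)  # too few proteins: pool them
    if other:
        kept["Other pathways"] = other

    rows = {}
    for pathway, lengths in kept.items():
        rows[pathway] = {
            "n": len(lengths),
            "mean": mean(lengths),
            "median": median(lengths),
            # Assumption: sample standard deviation (n - 1 denominator).
            "std": stdev(lengths) if len(lengths) > 1 else 0.0,
        }
    return rows


if __name__ == "__main__":
    # Toy lengths for illustration only; these are not values from the table.
    toy = {"A": [100] * 25, "B": [200, 300], "C": [400]}
    for name, row in summarize(toy).items():
        print(name, row)
```

Under this rule, a pathway that falls below 25 sequences after similarity filtering would move into the pooled group. That is consistent with the "Other pathways" count rising from 45 in the unfiltered data to 78 at 90% similarity.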