Table 15 Classification datasets

From: Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology

 

Original (unfiltered) data: Statistical Properties

| Repair pathway | GO ID | No. Sequences | Ave. Length | Median Length | Std. Dev. |
|---|---|---|---|---|---|
| Base excision repair | 0006284 | 2624 | 276 | 251 | 134.7 |
| DNA dealkylation | 0006307 | 25 | 203 | 174 | 106.8 |
| DNA synthesis during DNA repair | 0000731 | 28 | 996 | 1103 | 581.5 |
| Double-strand break repair | 0006302 | 364 | 616 | 609 | 306.4 |
| Error-prone DNA repair | 0045020 | 46 | 1075 | 1077 | 46.8 |
| Mismatch repair | 0006298 | 1777 | 617 | 653 | 317.3 |
| Nucleotide-excision repair | 0006289 | 2106 | 732 | 685 | 261.6 |
| Postreplication repair | 0006301 | 28 | 449 | 350 | 350.7 |
| Regulation of DNA repair | 0006282 | 264 | 211 | 172 | 161.6 |
| Single strand break repair | 0000012 | 40 | 476 | 614 | 297.8 |
| Other pathways | N/A | 45 | 592 | 373 | 486.9 |
| Total | | 7347 | 515 | 415 | 321.2 |

 

Maximum 90% similarity: Statistical Properties

| Repair pathway | No. Sequences | Ave. Length | Median Length | Std. Dev. |
|---|---|---|---|---|
| Base excision repair | 1721 | 284 | 260 | 135.9 |
| Double-strand break repair | 266 | 603 | 611 | 309.8 |
| Error-prone DNA repair | 36 | 1077 | 1082 | 51.4 |
| Mismatch repair | 1020 | 710 | 768 | 274.9 |
| Nucleotide-excision repair | 1325 | 737 | 689 | 268.9 |
| Regulation of DNA repair | 174 | 205 | 168 | 153.4 |
| Single strand break repair | 25 | 490 | 403 | 316.9 |
| Other pathways | 78 | 579.8 | 373 | 539.1 |
| Total | 4645 | 534 | 516 | 322.7 |

 

Maximum 50% similarity: Statistical Properties

| Repair pathway | No. Sequences | Ave. Length | Median Length | Std. Dev. |
|---|---|---|---|---|
| Base excision repair | 630 | 321 | 278 | 185.7 |
| Double-strand break repair | 174 | 656 | 655 | 343.6 |
| Mismatch repair | 468 | 718 | 710 | 302.6 |
| Nucleotide-excision repair | 363 | 684 | 633 | 360.6 |
| Regulation of DNA repair | 114 | 213 | 168 | 185.5 |
| Other pathways | 81 | 659 | 582 | 496.2 |
| Total | 1830 | 535 | 434 | 349.8 |

  1. The number of proteins extracted from the UniProt database for each DNA repair pathway used in the classification experiments is shown, along with summary statistics of the sequence lengths. Only repair pathways with sufficient data (a minimum of 25 proteins) are listed individually; the pathways without enough data are combined into a single dataset ("Other pathways").
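
The grouping rule in the note above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `summarize_by_pathway`, the `(pathway, sequence)` input format, and the use of the sample standard deviation are all assumptions made for illustration. The only detail taken from the note is the threshold of 25 proteins, below which a pathway is pooled into "Other pathways".

```python
from statistics import mean, median, stdev

MIN_PROTEINS = 25  # threshold stated in the table note


def summarize_by_pathway(records, min_proteins=MIN_PROTEINS):
    """Summarize sequence lengths per pathway.

    `records` is an iterable of (pathway_name, sequence) pairs; this input
    format is hypothetical and chosen only for the example.
    """
    # Collect the sequence lengths for each pathway.
    lengths = {}
    for pathway, sequence in records:
        lengths.setdefault(pathway, []).append(len(sequence))

    # Pool every pathway below the threshold into one combined group.
    grouped = {}
    for pathway, values in lengths.items():
        key = pathway if len(values) >= min_proteins else "Other pathways"
        grouped.setdefault(key, []).extend(values)

    summary = {}
    for pathway, values in grouped.items():
        summary[pathway] = {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            # Sample standard deviation (assumption: the source does not say
            # whether sample or population deviation was used).
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```

Under this rule, a pathway whose count drops below 25 after similarity filtering would move into the pooled group, which is consistent with the "Other pathways" row growing from 45 sequences in the unfiltered data to 78 at 90% similarity.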