
Table 2 Number of protein sequences in the benchmark dataset 4802 created by Wei et al.

From: Predicting subcellular location of protein with evolution information and sequence-based deep learning

| Subcellular location | Number of protein sequences |
| --- | --- |
| Nucleus | 1720 |
| Cytoplasm | 1050 |
| Plasma membrane | 836 |
| Extracellular | 487 |
| Mitochondria | 407 |
| Endosome | 342 |
| Golgi apparatus | 272 |
| Nucleolus | 268 |
| Lysosomes | 125 |
| Endoplasmic reticulum | 120 |
| Cytoskeleton | 89 |
| Centrosome | 81 |
| Peroxisome | 67 |
| Early endosomes | 52 |
| Nuclear envelope | 47 |
| Cytoplasmic vesicles | 46 |
| Basolateral plasma membrane | 29 |
| Synaptic vesicles | 28 |
| Microtubule | 26 |
| Apical plasma membrane | 16 |
| Late endosomes | 16 |
| Golgi trans face | 11 |
| Secretory granule | 10 |
| Tight junction | 9 |
| Golgi cis cisterna | 7 |
| Medial-golgi | 7 |
| Melanosome | 6 |
| Secretory vesicles | 5 |
| Cellular component | 4 |
| ERGIC | 4 |
| Inner mitochondrial membrane | 4 |
| Transport vesicle | 4 |
| Golgi trans cisterna | 3 |
| Total | 6198 |

  1. In this dataset, 4802 protein sequences are annotated with 33 subcellular locations. The first column gives the name of each subcellular location covered by the dataset, and the second column gives the number of proteins found at that location. The counts sum to 6198 rather than 4802 because a sequence can be found in more than one subcellular location. The sequences are distributed unevenly across the 33 locations: 35.8% of the sequences are located in the nucleus and 21.9% in the cytoplasm, while only 3 sequences are assigned to the Golgi trans cisterna. The dataset therefore contains 6198 positive cases, giving a positive case rate of 3.9% (6198/(4802 × 33)).
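
The figures quoted in the note can be checked directly from the table. The following is a minimal sketch rather than code from the paper: the variable names (`counts`, `n_sequences`, `share`, and so on) are illustrative assumptions, and the counts are copied from the table above. It treats the dataset as a multi-label problem in which every (sequence, location) pair is one binary case.

```python
# Per-location counts copied from Table 2 (illustrative variable names).
counts = {
    "Nucleus": 1720, "Cytoplasm": 1050, "Plasma membrane": 836,
    "Extracellular": 487, "Mitochondria": 407, "Endosome": 342,
    "Golgi apparatus": 272, "Nucleolus": 268, "Lysosomes": 125,
    "Endoplasmic reticulum": 120, "Cytoskeleton": 89, "Centrosome": 81,
    "Peroxisome": 67, "Early endosomes": 52, "Nuclear envelope": 47,
    "Cytoplasmic vesicles": 46, "Basolateral plasma membrane": 29,
    "Synaptic vesicles": 28, "Microtubule": 26,
    "Apical plasma membrane": 16, "Late endosomes": 16,
    "Golgi trans face": 11, "Secretory granule": 10, "Tight junction": 9,
    "Golgi cis cisterna": 7, "Medial-golgi": 7, "Melanosome": 6,
    "Secretory vesicles": 5, "Cellular component": 4, "ERGIC": 4,
    "Inner mitochondrial membrane": 4, "Transport vesicle": 4,
    "Golgi trans cisterna": 3,
}

n_sequences = 4802                    # distinct protein sequences (from the caption)
n_locations = len(counts)             # number of subcellular locations (labels)
n_annotations = sum(counts.values())  # positive (sequence, location) pairs

# Each sequence contributes one binary case per location, so the number of
# cases is n_sequences * n_locations; the positives are the annotations.
positive_rate = n_annotations / (n_sequences * n_locations)

# Fraction of sequences annotated with each location. These fractions sum to
# more than 1 because a sequence may carry several location labels.
share = {name: c / n_sequences for name, c in counts.items()}

print(f"locations:          {n_locations}")
print(f"annotations:        {n_annotations}")
print(f"positive case rate: {100 * positive_rate:.1f}%")
print(f"Nucleus share:      {100 * share['Nucleus']:.1f}%")
print(f"Cytoplasm share:    {100 * share['Cytoplasm']:.1f}%")
print(f"labels per seq:     {n_annotations / n_sequences:.2f}")
```

Under these assumptions the script reports 33 locations, 6198 annotations, a positive case rate of 3.9%, and shares of 35.8% and 21.9% for Nucleus and Cytoplasm, matching the note. The last line, about 1.29 labels per sequence on average, is a derived quantity not stated in the source.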