
Table 1 Number of protein sequences in benchmark dataset D3106 created by Shen et al.

From: Predicting subcellular location of protein with evolution information and sequence-based deep learning

| Subcellular location | Number of protein sequences |
|---|---|
| Nucleus | 1021 |
| Cytoplasm | 817 |
| Extracellular | 385 |
| Mitochondrion | 364 |
| Plasma membrane | 354 |
| Endoplasmic reticulum | 229 |
| Golgi apparatus | 161 |
| Cytoskeleton | 79 |
| Centriole | 77 |
| Lysosome | 77 |
| Peroxisome | 47 |
| Endosome | 24 |
| Microsome | 24 |
| Synapse | 22 |
| Total | 3681 |

  1. The dataset D3106 covers 14 subcellular locations, which are listed in the first column of this table. The number of proteins found at each location is listed in the second column. The dataset contains 3106 protein sequences, but the location counts sum to 3681 because some sequences are found in more than one location. The sequences are distributed unevenly across the 14 locations: 32.9% of the sequences are located in the Nucleus and 26.3% in the Cytoplasm, while fewer than 1% are located in the Synapse. The dataset is therefore unbalanced. The 3681 positive cases account for only 8.47% of all 3106 × 14 = 43,484 sequence-location cases.
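
The percentages quoted in the footnote can be rechecked from the table with a few lines of Python. This is only a sketch of the arithmetic, not code from the paper: the names `counts`, `N_SEQUENCES`, and `share_of_sequences` are illustrative assumptions, and the counts are copied by hand from the table.

```python
# Per-location counts copied from Table 1 (a sequence with several
# locations is counted once per location).
counts = {
    "Nucleus": 1021,
    "Cytoplasm": 817,
    "Extracellular": 385,
    "Mitochondrion": 364,
    "Plasma membrane": 354,
    "Endoplasmic reticulum": 229,
    "Golgi apparatus": 161,
    "Cytoskeleton": 79,
    "Centriole": 77,
    "Lysosome": 77,
    "Peroxisome": 47,
    "Endosome": 24,
    "Microsome": 24,
    "Synapse": 22,
}

# Number of distinct protein sequences in D3106, as stated in the footnote.
N_SEQUENCES = 3106

n_locations = len(counts)            # 14 subcellular locations
total_labels = sum(counts.values())  # sum of the second column


def share_of_sequences(location: str) -> float:
    """Percentage of the 3106 sequences annotated with `location`."""
    return 100.0 * counts[location] / N_SEQUENCES


# Fraction of positive entries in the 3106 x 14 sequence-location matrix.
positive_rate = 100.0 * total_labels / (N_SEQUENCES * n_locations)

if __name__ == "__main__":
    print(f"total location labels: {total_labels}")
    for name in ("Nucleus", "Cytoplasm", "Synapse"):
        print(f"{name}: {share_of_sequences(name):.1f}% of sequences")
    print(f"positive cases: {positive_rate:.2f}% of all cases")
```

Because the percentages are taken relative to the 3106 sequences rather than the 3681 labels, the per-location shares add up to more than 100%, which reflects the multi-location sequences mentioned in the footnote.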