A comprehensive assessment of N-terminal signal peptides prediction methods

BMC Bioinformatics

Table 1 Description of the three datasets developed for benchmarking the thirteen SP prediction tools. Only the first 70 amino acids (aa) of each sequence are retained as input. All negative datasets are subjected to redundancy reduction. T denotes the sequence identity threshold set for redundancy reduction. ¹ From a first-pass-filtered set of 9,851, reduced to 4,989 after redundancy reduction (T = 40%) and removal of atypical/spurious sequences; ² From a first-pass-filtered set of 427, reduced to 230 (T = 40%); ³ From a first-pass-filtered set of 370, reduced to 307 (T = 65%); ⁴ From a first-pass-filtered set of 8,930, reduced to 4,445 (T = 40%); ⁵ From a first-pass-filtered set of 110, reduced to 61 (T = 40%); ⁶ From a first-pass-filtered set of 290, reduced to 150 (T = 40%).

	Dataset for Experiment #1: Zhang and Henzel [20] (Experimentally verified SPs)	Dataset for Experiment #2: SPdb 5.1 [33] (SPdb 5.1 is derived from Swiss-Prot Release 55.0)	Dataset for Experiment #3: UniProtKB/Swiss-Prot Release 57.0 (excludes datasets used in Experiments #1 and #2)
Positive	270 human secreted recombinant proteins	2,349 secretory proteins consisting of:	228 secretory proteins consisting of:
		- Euk: 1,874	- Euk: 199
		- Gpos: 168	- Gpos: 17
		- Gneg: 307	- Gneg: 12
Negative	270 human non-secretory proteins extracted from the SigHMM [26] dataset, which is in turn derived from Swiss-Prot Release 40.0	2,349 non-secretory proteins	228 non-secretory proteins
		- Euk: 1,874 (Cytoplasmic and nuclear)¹	- Euk: 199 (Cytoplasmic and nuclear)⁴
		- Gpos: 168 (all cytoplasmic)²	- Gpos: 17 (all cytoplasmic)⁵
		- Gneg: 307 (all cytoplasmic)³	- Gneg: 12 (all cytoplasmic)⁶
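
The caption states that only the first 70 aa of each sequence are retained as input. A minimal sketch of that truncation step is given here, assuming the sequences are available as FASTA-formatted text. The function names `parse_fasta` and `truncate_records` are hypothetical and chosen for illustration; they are not taken from the benchmarked tools or from the original study.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Taken from the table caption: only the first 70 residues are retained.
PREFIX_LENGTH = 70


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    Hypothetical helper: a deliberately small parser, not a full FASTA reader.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def truncate_records(
    records: Iterable[Tuple[str, str]], length: int = PREFIX_LENGTH
) -> Dict[str, str]:
    """Keep only the N-terminal `length` residues of each sequence.

    Sequences shorter than `length` are returned unchanged.
    """
    return {header: seq[:length] for header, seq in records}


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    example = [">p1", "M" * 50, "A" * 50, ">p2", "MKT"]
    for name, seq in truncate_records(parse_fasta(example)).items():
        print(name, len(seq))
```

Slicing keeps shorter sequences intact rather than padding them; whether the original study padded or discarded such sequences is not stated in the table.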

ISSN: 1471-2105