Table 4 Composition of the 8 datasets

From: Spliceator: multi-species splice site prediction using convolutional neural networks

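| Dataset | Quality of sequences | No. of positive sequences (Donor) | No. of positive sequences (Acceptor) | No. of negative sequences | Type of negative sequences | Ratio |
|---|---|---|---|---|---|---|
| AS_0 | Unconfirmed and confirmed | 12,000 | 12,000 | 12,000 | FP only | 1:1 |
| AS_1 | Unconfirmed and confirmed | 12,000 | 12,000 | 12,000 | 4000 exons, 4000 introns and 4000 FP | 1:1 |
| AS_2 | Unconfirmed and confirmed | 12,000 | 12,000 | 24,000 | FP only | 1:2 |
| AS_10 | Unconfirmed and confirmed | 12,000 | 12,000 | 120,000 | FP only | 1:10 |
| GS_0 | Confirmed | 10,973 | 11,179 | 11,000 | FP only | 1:1 |
| GS_1 | Confirmed | 10,973 | 11,179 | 11,000 | 3650 exons, 3650 introns and 3700 FP | 1:1 |
| GS_2 | Confirmed | 10,973 | 11,179 | 22,000 | FP only | 1:2 |
| GS_10 | Confirmed | 10,973 | 11,179 | 110,000 | FP only | 1:10 |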
  1. Composition of the 8 datasets used to study the impact of (i) the type of negative examples (only FP sequences vs. heterogeneous data with exons, introns and FP sequences), (ii) the ratio of positive to negative examples (1:1, 1:2 and 1:10), and (iii) data quality (‘Confirmed’ and ‘Unconfirmed’ sequences in the AS datasets vs. only ‘Confirmed’ sequences in the GS datasets).
  2. FP, False Positive; GS, Gold Standard; AS, All Sequences
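
The nominal ratios in Table 4 can be checked against the raw counts with a short sketch. The counts below are transcribed from the table; the dictionary layout and the helper name `negatives_per_positive` are illustrative assumptions, not part of Spliceator. The sketch also assumes that the negative count is compared with the donor and acceptor positive counts separately, which the table does not state explicitly.

```python
# Counts transcribed from Table 4. The data layout is hypothetical.
POSITIVES = {
    "AS": {"donor": 12_000, "acceptor": 12_000},
    "GS": {"donor": 10_973, "acceptor": 11_179},
}

NEGATIVES = {
    "AS_0": 12_000, "AS_1": 12_000, "AS_2": 24_000, "AS_10": 120_000,
    "GS_0": 11_000, "GS_1": 11_000, "GS_2": 22_000, "GS_10": 110_000,
}

# Breakdown of the heterogeneous negative sets (exons, introns, FP).
MIXED_NEGATIVES = {
    "AS_1": {"exons": 4_000, "introns": 4_000, "FP": 4_000},
    "GS_1": {"exons": 3_650, "introns": 3_650, "FP": 3_700},
}


def negatives_per_positive(dataset: str, site: str) -> float:
    """Return the number of negatives per positive sequence.

    Assumption: the ratio is taken against one site type (donor or
    acceptor) at a time.
    """
    family = dataset.split("_")[0]  # "AS" or "GS"
    return NEGATIVES[dataset] / POSITIVES[family][site]


if __name__ == "__main__":
    for name in NEGATIVES:
        for site in ("donor", "acceptor"):
            ratio = negatives_per_positive(name, site)
            print(f"{name:6s} {site:8s} 1:{ratio:.2f}")
```

Under these assumptions the AS ratios are exact, while the GS ratios only approximate 1:1, 1:2 and 1:10 because the GS positive counts (10,973 and 11,179) are not round numbers.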