Table 3 C4.5 Classification Tree Results

From: Bioinformatic analyses of mammalian 5'-UTR sequence properties of mRNAs predicts alternative translation initiation sites

| Data Sets | # of mRNAs from positive set with aTIS | Training Set Size, positive + negative | Correctly Classified | Incorrectly Classified | Mean Absolute Error | False Negatives | False Positives |
|---|---|---|---|---|---|---|---|
| Training | 41 | 82 | 81 | 1 | 0.0122 | 0 | 1 |
| Independ. Testing | 4 | 8 | 7 | 1 | 0.1250 | 0 | 1 |
| Cross-Validation | 45 | 90 | 87 | 3 | 0.0333 | 0 | 3 |
| Full Negative Set | 0 | 500 | 469 | 31 | 0.062 | 0 | 31 |
| Provisional | 43 | 86 | 83 | 3 | 0.0349 | 0 | 3 |

  1. Results of the C4.5 classification tree on the training, independent testing, cross-validation, full negative and provisional data sets. The classification tree effectively distinguished sequences that contain aTIS sites from those that do not. The 45 RefSeq sequences validated as containing alternative start sites were combined with 45 randomly selected sequences without alternative start sites, giving 90 sequences for training and testing. Of these 90 sequences, 82 formed the 'Training' set and the remaining 8 formed the 'Independent Testing' set. 'Cross-Validation' of the C4.5 classification tree was performed as 10-fold cross-validation on the full set of 90 sequences (the training and independent testing sets combined). The 'Full Negative Set' was the full set of 500 non-aTIS sequences compiled for the generation of testing sets; performance on this set measures how well the classification tree generalizes to larger data sets. The 'Provisional' set consisted of sequences predicted to contain at least one aTIS; the classification tree performed well on this data set. The Mean Absolute Error is the fraction of incorrectly classified sequences out of the total number of mRNA sequences in the given data set (the 'Training Set Size' column).
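
The Mean Absolute Error column can be checked directly from the table's counts. The following is a minimal sketch, assuming the error is simply the number of incorrectly classified sequences divided by the set size, as the footnote describes. The names `TABLE_ROWS` and `error_fraction` are illustrative and do not come from the original study.

```python
# Counts copied from Table 3: data set -> (set size, incorrectly classified).
TABLE_ROWS = {
    "Training": (82, 1),
    "Independ. Testing": (8, 1),
    "Cross-Validation": (90, 3),
    "Full Negative Set": (500, 31),
    "Provisional": (86, 3),
}


def error_fraction(incorrect: int, total: int) -> float:
    """Fraction of sequences that were incorrectly classified."""
    if total <= 0:
        raise ValueError("total must be positive")
    return incorrect / total


if __name__ == "__main__":
    for name, (total, incorrect) in TABLE_ROWS.items():
        # Four decimal places, matching the precision used in the table.
        print(f"{name}: {error_fraction(incorrect, total):.4f}")
```

Rounded to four decimal places, these ratios agree with the Mean Absolute Error values reported in each row.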