Table 3 C4.5 Classification Tree Results

From: Bioinformatic analyses of mammalian 5'-UTR sequence properties of mRNAs predicts alternative translation initiation sites

| Data set | # of mRNAs from positive set with aTIS | Data set size (positive + negative) | Correctly classified | Incorrectly classified | Mean Absolute Error | False negatives | False positives |
|---|---|---|---|---|---|---|---|
| Training | 41 | 82 | 81 | 1 | 0.0122 | 0 | 1 |
| Independent Testing | 4 | 8 | 7 | 1 | 0.1250 | 0 | 1 |
| Cross-Validation | 45 | 90 | 87 | 3 | 0.0333 | 0 | 3 |
| Full Negative Set | 0 | 500 | 469 | 31 | 0.062 | 0 | 31 |
| Provisional | 43 | 86 | 83 | 3 | 0.0349 | 0 | 3 |
  1. Results of the C4.5 classification tree on the training, independent testing, cross-validation, full negative, and provisional data sets. The classification tree effectively distinguished sequences that contain an aTIS from those that do not. The 45 RefSeq sequences validated as containing alternative start sites were combined with 45 randomly selected sequences without alternative start sites, giving 90 sequences in total. Of these 90 sequences, 82 were used for the 'training' set and the remaining 8 as an 'independent' testing set. 'Cross-validation' of the C4.5 classification tree was performed using 10-fold cross-validation on the full set of 90 sequences (the training and independent testing sets combined). The 'Full Negative Set' is the full set of 500 non-aTIS sequences compiled for generating testing sets; performance on this set provides a measure of how well the classification tree generalizes to larger data sets. The 'provisional' set consisted of sequences predicted to contain at least one aTIS; 83 of the 86 sequences in this data set were classified correctly. The Mean Absolute Error is calculated as the fraction of incorrectly classified sequences relative to the total number of mRNA sequences in the designated data set.
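
The error column can be checked directly from the counts in the table. The following is a minimal sketch, assuming the Mean Absolute Error is simply the number of incorrectly classified sequences divided by the data set size, as the note above describes. The names `ROWS`, `error_fraction`, and `summarize` are illustrative and do not come from the original analysis.

```python
# Counts copied from Table 3:
# (data set, total sequences, correctly classified, incorrectly classified)
ROWS = [
    ("Training", 82, 81, 1),
    ("Independent Testing", 8, 7, 1),
    ("Cross-Validation", 90, 87, 3),
    ("Full Negative Set", 500, 469, 31),
    ("Provisional", 86, 83, 3),
]


def error_fraction(total, incorrect):
    """Fraction of sequences in a data set that were misclassified."""
    return incorrect / total


def summarize(rows):
    """Map each data set name to its error fraction, rounded to 4 places."""
    result = {}
    for name, total, correct, incorrect in rows:
        # Sanity check: the two classification counts must cover the set.
        if correct + incorrect != total:
            raise ValueError(f"counts do not add up for {name}")
        result[name] = round(error_fraction(total, incorrect), 4)
    return result


if __name__ == "__main__":
    for name, err in summarize(ROWS).items():
        print(f"{name}: {err:.4f}")
```

Because every data set reports zero false negatives, each misclassification counted here is a false positive, so the same fraction also gives the false-positive share of each data set.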