Automated Alphabet Reduction for Protein Datasets

BMC Bioinformatics

Table 4 Performance of BioHEL in the CN datasets.

Strategy	Alphabet size	Accuracy (%)	#Rules	#Expressed attributes
Orig	20+1	74.0 ± 0.6	34.4 ± 1.7	9.0 ± 0.1
MI	2	72.3 ± 0.6•	21.4 ± 1.0	6.2 ± 0.7
	3	73.2 ± 0.6	30.2 ± 1.7	6.7 ± 1.0
	4	72.4 ± 0.8•	26.4 ± 2.1	7.1 ± 1.1
	5	71.8 ± 0.9•	23.4 ± 4.8	7.8 ± 1.0
RMI	2	72.3 ± 0.6•	21.4 ± 1.0	6.2 ± 0.7
	3	73.2 ± 0.6	30.2 ± 1.7	6.7 ± 1.0
	4	73.3 ± 0.5	30.2 ± 1.5	6.1 ± 1.1
	5	--	--	--
DualRMI	2	72.4 ± 0.5•	24.0 ± 1.3	7.0 ± 1.0
	3	73.0 ± 0.6•	29.1 ± 1.6	6.5 ± 1.1
	4	73.3 ± 0.6	29.7 ± 1.3	6.3 ± 1.0
	5	73.3 ± 0.5	30.4 ± 1.1	6.2 ± 1.1

Accuracy is the average test accuracy over the ten cross-validation folds. A • marks reduced datasets whose performance is significantly worse than that of the original full amino acid (AA) representation according to t-tests at the 99% confidence level.
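
The significance marking described in the note can be illustrated with a minimal sketch. It assumes a paired t-test over per-fold accuracies, which the note does not specify (it may equally have been an unpaired test). The function names and the per-fold values below are hypothetical and are not taken from the paper. The only fixed constant is the standard two-sided critical value of Student's t for 9 degrees of freedom (ten folds) at the 99% level.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided critical value of Student's t for 9 degrees of freedom at the
# 99% confidence level (standard t-table value). Using it for a check in
# one direction only is conservative.
T_CRIT_DF9_99 = 3.250


def paired_t_statistic(baseline, reduced):
    """Return the paired t statistic of the per-fold differences baseline - reduced."""
    if len(baseline) != len(reduced):
        raise ValueError("both accuracy lists need one value per fold")
    diffs = [b - r for b, r in zip(baseline, reduced)]
    avg = mean(diffs)
    sd = stdev(diffs)
    if sd == 0:
        # Degenerate case: every fold differs by the same amount.
        if avg == 0:
            return 0.0
        return float("inf") if avg > 0 else float("-inf")
    return avg / (sd / sqrt(len(diffs)))


def significantly_worse(baseline, reduced, t_crit=T_CRIT_DF9_99):
    """True if the reduced alphabet scores significantly below the baseline."""
    return paired_t_statistic(baseline, reduced) > t_crit


# Hypothetical per-fold test accuracies (illustrative only, NOT from the paper).
orig_folds = [74.1, 73.5, 74.8, 73.9, 74.4, 73.2, 74.6, 74.0, 73.7, 74.3]
reduced_folds = [72.0, 72.4, 72.9, 71.8, 72.6, 71.9, 72.5, 72.3, 71.7, 72.8]

marked = significantly_worse(orig_folds, reduced_folds)
```

Under these assumptions, a row would receive a • when `significantly_worse` returns `True` for that reduced alphabet against the original representation. The default critical value only applies to ten folds; a different number of folds needs the value for the matching degrees of freedom.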

ISSN: 1471-2105