Automated Alphabet Reduction for Protein Datasets

BMC Bioinformatics

Table 6. Performance of BioHEL on the RSA datasets.

Strategy	Alphabet size	Accuracy (%)	# Rules	# Expressed attributes
Orig.	20+1	70.7 ± 0.4	58.6 ± 2.3	9.0 ± 0.2
MI	2	67.6 ± 0.3•	52.9 ± 4.2	5.8 ± 1.3
	3	69.4 ± 0.3•	54.9 ± 1.1	5.4 ± 1.2
	4	68.9 ± 0.6•	54.5 ± 1.3	5.9 ± 1.2
	5	67.9 ± 0.9•	53.1 ± 3.8	6.8 ± 1.2
RMI	2	67.6 ± 0.3•	52.9 ± 4.2	5.8 ± 1.3
	3	69.7 ± 0.4•	56.5 ± 1.3	5.5 ± 1.2
	4	69.9 ± 0.4•	57.5 ± 1.2	6.3 ± 1.4
	5	--	--	--
DualRMI	2	66.6 ± 0.4•	33.4 ± 4.8	3.7 ± 0.8
	3	69.9 ± 0.4•	56.7 ± 1.3	5.3 ± 1.1
	4	70.1 ± 0.4	58.0 ± 1.2	6.0 ± 1.4
	5	70.3 ± 0.4	58.2 ± 1.1	6.5 ± 1.6

Accuracy is the average test accuracy over the ten cross-validation folds. A • marks reduced datasets whose performance is significantly worse than that of the original full amino acid (AA) representation, according to t-tests at the 99% confidence level.
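
The significance marking described in the note can be illustrated with a minimal sketch. It assumes a paired t-test over the ten per-fold accuracies; the note does not say whether the tests were paired or unpaired, so that choice is an assumption. The function names and the per-fold values are hypothetical and invented purely for illustration; they are not the paper's data. The critical value 3.250 is the standard two-sided Student's t value for 9 degrees of freedom at the 99% level.

```python
import math
from statistics import mean, stdev

# Two-sided critical value of Student's t for 9 degrees of freedom
# (ten folds) at the 99% confidence level, from standard t tables.
T_CRIT_DF9_99 = 3.250


def paired_t_statistic(baseline, reduced):
    """Return the paired t statistic of the per-fold accuracy differences."""
    if len(baseline) != len(reduced):
        raise ValueError("both strategies need one accuracy per fold")
    diffs = [b - r for b, r in zip(baseline, reduced)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def significantly_worse(baseline, reduced, t_crit=T_CRIT_DF9_99):
    """True if `reduced` is significantly worse than `baseline`.

    The default critical value is only valid for ten folds (df = 9).
    """
    return paired_t_statistic(baseline, reduced) > t_crit


# Hypothetical per-fold test accuracies (%), NOT taken from the paper.
orig_folds = [70.9, 70.3, 71.1, 70.5, 70.8, 70.2, 71.0, 70.6, 70.4, 71.2]
reduced_folds = [67.8, 67.3, 67.9, 67.4, 67.7, 67.2, 67.6, 67.9, 67.1, 68.1]
similar_folds = [70.5, 70.6, 70.8, 70.7, 70.4, 70.5, 70.9, 70.3, 70.6, 71.0]
```

Under these assumptions, `significantly_worse(orig_folds, reduced_folds)` would correspond to a row marked with •, while `significantly_worse(orig_folds, similar_folds)` would correspond to an unmarked row.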

ISSN: 1471-2105