Using the information embedded in the testing sample to break the limits caused by the small sample size in microarray-based classification

Table 1 Comparison of the results obtained with different classifiers in a variety of data-sets.

Data-set	Genes	Samples	DP	kNN (k)	WV	LDA	SVM	ML-s	ML-d
BRCA1	3226	7 BRCA1-positive	21/22	18/22 (1)	18/22	18/22	18/22	19/22	16/22
		15 BRCA1-negative
BRCA2	3226	8 BRCA2-positive	21/22	21/22 (1)	17/22	19/22	18/22	17/22	17/22
		14 BRCA2-negative
PROS	12600	52 tumor tissue	93/102	90/102 (5)	61/102	92/102	93/102	64/102	50/102
		50 normal tissue
PROS-OUT	12625	8 non-recurrence	15/21	12/21 (1)	12/21	13/21	14/21	13/21	13/21
		13 recurrence
DLBCL-FL	6817	52 DLBCL	74/77	71/77 (7)	63/77	74/77	74/77	65/77	58/77
		25 FL
ALL-AML	6817	27 AML	38/38	37/38 (3)	38/38	38/38	38/38	30/38	27/38
		11 ALL
I-2000	2000	40 tumor colon tissue	61/62	59/62 (3)	58/62	61/62	61/62	59/62	58/62
		22 normal colon tissue

Columns indicate the algorithm used and rows the data-set. In each cell, the numerator is the number of left-out samples correctly classified by the corresponding algorithm, and the denominator is the total number of samples n. The kNN algorithm has a free parameter that must be determined: the number of neighbors k. To allow for a fair comparison, we optimized this value for each data-set using cross-validation [12]. The resulting optimal value is given in parentheses. For the ML classifier, we consider two cases: one in which the two classes are assumed to have the same variance, and one in which the variances are assumed to differ. These are referred to as ML-s (same) and ML-d (different), respectively.
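
The way a kNN cell of the table is read (correct left-out samples over n, with the chosen k in parentheses) can be illustrated with a minimal sketch. It is not the authors' code: the function names, the Euclidean distance, the candidate values of k, and the toy two-dimensional data are all assumptions made for illustration, and k is simply picked as the candidate with the highest leave-one-out count, which may differ from the cross-validation procedure of [12].

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training samples closest to `query`.

    Euclidean distance is an assumption; the source does not name the metric.
    """
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_correct(xs, ys, k):
    """Count left-out samples that are classified correctly.

    This count plays the role of the numerator in a table cell;
    len(xs) plays the role of the denominator n.
    """
    correct = 0
    for i in range(len(xs)):
        # Leave sample i out and train on the remaining n - 1 samples.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct


def best_k(xs, ys, candidates=(1, 3, 5, 7)):
    """Return (k, correct) for the candidate k with the highest count.

    Ties go to the earliest candidate, i.e. the smallest k when the
    candidates are listed in increasing order.
    """
    scores = {k: loo_correct(xs, ys, k) for k in candidates}
    k = max(candidates, key=lambda c: scores[c])
    return k, scores[k]


# Hypothetical toy data: two well-separated classes, three samples each.
toy_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
         (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
toy_y = ["A", "A", "A", "B", "B", "B"]

k, correct = best_k(toy_x, toy_y, candidates=(1, 3, 5))
print(f"{correct}/{len(toy_x)} ({k})")  # same layout as a kNN cell in Table 1
```

On real expression data each sample would be a vector with one entry per gene rather than a two-dimensional point. With odd k and two classes the vote cannot tie, which is why only odd candidates are used here.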

ISSN: 1471-2105