Dataset   Genes   Samples                   DP       k-NN         WV       LDA      SVM      MLs      MLd

BRCA1     3226    7 BRCA1-positive,         21/22    18/22 (1)    18/22    18/22    18/22    19/22    16/22
                  15 BRCA1-negative

BRCA2     3226    8 BRCA2-positive,         21/22    21/22 (1)    17/22    19/22    18/22    17/22    17/22
                  14 BRCA2-negative

PROS      12600   52 tumor tissue,          93/102   90/102 (5)   61/102   92/102   93/102   64/102   50/102
                  50 normal tissue

PROSOUT   12625   8 nonrecurrence,          15/21    12/21 (1)    12/21    13/21    14/21    13/21    13/21
                  13 recurrence

DLBCLFL   6817    52 DLBCL,                 74/77    71/77 (7)    63/77    74/77    74/77    65/77    58/77
                  25 FL

ALLAML    6817    27 AML,                   38/38    37/38 (3)    38/38    38/38    38/38    30/38    27/38
                  11 ALL

I2000     2000    40 tumor colon tissue,    61/62    59/62 (3)    58/62    61/62    61/62    59/62    58/62
                  22 normal colon tissue

Columns indicate the algorithm used, rows the dataset. In each cell, the numerator is the number of left-out samples correctly classified by the corresponding algorithm, and the denominator is the total number of samples n. The k-NN algorithm has one free parameter that must be determined: the number of neighbors k. To allow for a fair comparison, we optimized this value for each dataset using cross-validation [12]. The resulting optimal value is given in parentheses. For the ML classifier, we consider two cases: one in which the two classes are assumed to have the same variance, and one in which the variances are assumed to differ. These are referred to as MLs (same) and MLd (different), respectively.
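To make the counting in each cell concrete, the following is a minimal sketch of a leave-one-out evaluation of a k-NN classifier, with k chosen from a small candidate set. It is an illustration only and not the code used to produce the table: the function names (`knn_predict`, `loo_correct`, `best_k`), the Euclidean distance, the candidate values of k, and the toy two-dimensional data are all assumptions. In particular, the sketch picks k on the same leave-one-out folds it reports, which may differ from the cross-validation procedure of [12].

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    # Rank training samples by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_correct(xs, ys, k):
    """Number of left-out samples classified correctly (the numerator of a cell)."""
    correct = 0
    for i in range(len(xs)):
        # Hold out sample i and train on the remaining n - 1 samples.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct


def best_k(xs, ys, candidates=(1, 3, 5)):
    """Return the candidate k with the most correct left-out predictions.

    Ties are broken in favor of the first (smallest) candidate listed.
    """
    scores = {k: loo_correct(xs, ys, k) for k in candidates}
    return max(candidates, key=lambda k: scores[k]), scores


# Hypothetical toy data: two well-separated classes with three samples each.
xs = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
ys = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

k, scores = best_k(xs, ys)
# Report in the same "correct/n (k)" form used in the table.
print(f"{scores[k]}/{len(xs)} ({k})")  # prints 6/6 (1)
```

On the toy data, k = 5 classifies every left-out sample incorrectly, because after removing one sample its own class is in the minority among the five remaining neighbors. This shows why k has to be tuned separately for small datasets such as those in the table.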