Possible bias in error estimates
11 January 2005
The authors use 40 samples to find genes that differentiate between samples with different survival times. However, the selected genes are then tested on the same 40 samples. The fact that the network is trained on only 20 samples and validated on the remaining 20 at the end of the flow schematic does not mean that a correct validation has been done, since the genes used were selected using all 40 samples, including the 20 held out for validation. The error obtained can therefore be very optimistic compared to the true error, which one would get if both gene selection and classifier training were done on a training set and the testing were done on a completely independent test set. Please see:
1) Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst. 2003; 95:14-18.
2) Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS. 2002 May 14; 99(10):6562-6566.
3) Reunanen J. Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research. 2003; 3:1371-1382.
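The size of this bias can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it assumes pure-noise expression data (so the true error is 50%), ranks genes by a simple difference in class means, and uses a nearest-centroid classifier in place of the network. The function names, the number of genes (500) and the number of selected genes (10) are all hypothetical choices made for illustration. It compares selecting genes on all 40 samples against selecting them on the 20 training samples only, with the same 20/20 split used for training and testing in both cases.

```python
import random


def make_noise_data(n_samples, n_genes, rng):
    """Pure-noise expression matrix: no gene truly predicts the label."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # balanced, arbitrary class labels
    return X, y


def select_genes(X, y, idx, k):
    """Return the k genes with the largest absolute class-mean difference,
    computed using only the samples listed in idx."""
    idx0 = [i for i in idx if y[i] == 0]
    idx1 = [i for i in idx if y[i] == 1]
    scores = []
    for g in range(len(X[0])):
        m0 = sum(X[i][g] for i in idx0) / len(idx0)
        m1 = sum(X[i][g] for i in idx1) / len(idx1)
        scores.append((abs(m0 - m1), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:k]]


def centroid_error(X, y, train, test, genes):
    """Fit a nearest-centroid classifier on train, return error rate on test."""
    centroids = {}
    for label in (0, 1):
        rows = [X[i] for i in train if y[i] == label]
        centroids[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    wrong = 0
    for i in test:
        dist = {
            label: sum((X[i][g] - c) ** 2 for g, c in zip(genes, centroids[label]))
            for label in (0, 1)
        }
        pred = 0 if dist[0] <= dist[1] else 1
        wrong += int(pred != y[i])
    return wrong / len(test)


def compare_protocols(n_trials=20, n_samples=40, n_genes=500, k=10, seed=0):
    """Average test error when genes are selected on all samples (biased)
    versus on the training half only (unbiased)."""
    rng = random.Random(seed)
    biased, unbiased = [], []
    for _ in range(n_trials):
        X, y = make_noise_data(n_samples, n_genes, rng)
        # Stratified 20/20 split: half of each class goes to training.
        class0 = [i for i in range(n_samples) if y[i] == 0]
        class1 = [i for i in range(n_samples) if y[i] == 1]
        rng.shuffle(class0)
        rng.shuffle(class1)
        half = n_samples // 4
        train = class0[:half] + class1[:half]
        test = class0[half:] + class1[half:]
        # Biased: the test samples take part in gene selection.
        genes_all = select_genes(X, y, list(range(n_samples)), k)
        biased.append(centroid_error(X, y, train, test, genes_all))
        # Unbiased: gene selection sees the training samples only.
        genes_train = select_genes(X, y, train, k)
        unbiased.append(centroid_error(X, y, train, test, genes_train))
    return sum(biased) / n_trials, sum(unbiased) / n_trials


if __name__ == "__main__":
    biased_err, unbiased_err = compare_protocols()
    print("genes selected on all 40 samples, test error:", biased_err)
    print("genes selected on 20 training samples, test error:", unbiased_err)
```

Because the data contain no real signal, any test error clearly below 50% in the first protocol is due to the selection step alone. The second protocol should stay near chance.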
Sudhir
___________________
Sudhir Varma, Ph.D.
Biometric Research Branch,
National Cancer Institute, NIH.
6130 Executive Blvd. EPN/8142
Rockville, MD-20852, USA
(301)443-1723
varmas@mail.nih.gov
Competing interests
None declared