Variable selection for binary classification using error rate p-values applied to metabolomics data

BMC Bioinformatics

Table 1 Algorithm to simulate the null cumulative distribution functions

• Generate N iid U[0,1] variates \( u_n \), \( n=1,\dots,N \)
• Assign the first \( N_0 \) of the \( y_n \)'s as 0 and the remaining \( N_1 = N - N_0 \) as 1
• Minimize \( \frac{w_0}{N_0}\sum_{n=1}^{N}\left(1-y_n\right)I\left(u_n>b\right)+\frac{w_1}{N_1}\sum_{n=1}^{N}y_n I\left(u_n\le b\right) \) by varying b over the midpoints of the increasingly ordered \( u_n \)'s to obtain \( \widehat{er}_{up}^{*} \)
• Repeat these steps M times to build up a file of iid copies of \( \widehat{er}_{up}^{*} \), say \( \widehat{er}_{up}^{*}(m),\;m=1,\dots,M \), whose empirical distribution function provides a simulation approximation of the null CDF
• If T of the \( \widehat{er}_{up}^{*}(m) \)'s fall below an actually observed \( \widehat{er}_{up} \), its associated p-value is approximately T/M. The approximation becomes more accurate as M increases.
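
The steps of Table 1 can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`min_weighted_error`, `simulate_null_errors`, `p_value`) are hypothetical, the default weights w_0 = w_1 = 0.5 are an assumed choice since the table does not fix them, and "fall below" is read as strictly less than.

```python
import numpy as np


def min_weighted_error(u, y, w0=0.5, w1=0.5):
    """Minimum of the weighted error rate over thresholds b placed at the
    midpoints of the increasingly ordered u values (assumes no ties in u)."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=int)
    n0 = int(np.sum(y == 0))
    n1 = int(np.sum(y == 1))
    y_sorted = y[np.argsort(u)]
    # For the k-th midpoint (k = 1..N-1) exactly k observations have u <= b.
    ones_below = np.cumsum(y_sorted)[:-1]            # y = 1 with u <= b
    zeros_below = np.arange(1, len(u)) - ones_below  # y = 0 with u <= b
    zeros_above = n0 - zeros_below                   # y = 0 with u > b
    errors = w0 / n0 * zeros_above + w1 / n1 * ones_below
    return float(errors.min())


def simulate_null_errors(n, n0, m, w0=0.5, w1=0.5, seed=None):
    """Return m simulated copies of the minimized error rate under the null."""
    rng = np.random.default_rng(seed)
    # First n0 labels are 0, the remaining n - n0 labels are 1.
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n - n0, dtype=int)])
    return np.array(
        [min_weighted_error(rng.uniform(size=n), y, w0, w1) for _ in range(m)]
    )


def p_value(observed, null_errors):
    """Approximate p-value T/M, where T counts simulated values below observed."""
    return float(np.mean(np.asarray(null_errors) < observed))
```

For example, `p_value(er_obs, simulate_null_errors(n=40, n0=20, m=10000, seed=1))` would approximate the p-value of an observed error rate `er_obs` for a sample of 40 with 20 observations in each class; these sample sizes and the value of M are illustrative only.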

ISSN: 1471-2105