Let $x_{ij}$ be the value of variable $j$ ($j=1,\dots,p$) for sample $i$ ($i=1,\dots,n$). Each sample belongs to one of $K$ classes ($1,2,\dots,K$) and $y_i$ is the class of the $i$th sample. Let $z_{ik}$ be a class membership indicator variable ($z_{ik}=1$ if $y_i=k$, $z_{ik}=0$ otherwise), and let $n_k=\sum_{i=1}^{n} z_{ik}$ be the number of samples in class $k$. The $j$th component of the centroid in class $k$ is $\overline{x}_{kj}=\sum_{i=1}^{n} z_{ik}x_{ij}/n_k$ and the $j$th component of the overall centroid is $\overline{x}_{j}=\sum_{i=1}^{n} x_{ij}/n$. The shrunken centroid is defined as

$$\overline{x}'_{kj} = \overline{x}_{j} + \hat{d}_{kj} \cdot m_k \cdot (s_j + s_0),$$

(1)

where $s_j$ is the pooled within-class standard deviation of the $j$th variable, $s_0$ is a constant (set at the median of the $s_j$), $m_k=\sqrt{1/n_k - 1/n}$, and in PAM $\hat{d}_{kj}$ is defined as

$$\hat{d}_{kj} = \operatorname{sgn}(d_{kj})\left(|d_{kj}| - \lambda\right)_{+},$$

(2)

where $d_{kj}=\dfrac{\overline{x}_{kj}-\overline{x}_{j}}{m_k(s_j+s_0)}$, $\lambda \ge 0$ is a threshold parameter that needs to be tuned, and $(\cdot)_{+}$ denotes the positive part of $(\cdot)$.
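The soft-thresholding operator in (2) takes only a couple of lines to compute; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def soft_threshold(d, lam):
    """Soft-thresholding of eq. (2): sgn(d) * (|d| - lam)_+ , applied element-wise."""
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)
```

Components of $d_{kj}$ whose absolute value falls below the threshold are shrunken exactly to zero; the others are moved towards zero by $\lambda$.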

The classification rule of PAM for a new sample $\mathbf{x}^{*}$ is

$$\mathcal{C}(\mathbf{x}^{*}) = \operatorname{argmin}_{k}\, \delta_{k}(\mathbf{x}^{*}),$$

(3)

where $\delta_k(\mathbf{x}^{*})$ is the discriminant score for class $k$, defined as

$$\delta_k(\mathbf{x}^{*}) = \sum_{j=1}^{p}\frac{(x_j^{*}-\overline{x}'_{kj})^2}{(s_j+s_0)^2} - 2\log(\pi_k) = \sum_{j=1}^{p} L_{kj} - 2\log(\pi_k);$$

(4)

$\pi_k$ is the proportion of class $k$ samples in the population ($\sum_{k=1}^{K}\pi_k=1$), $-2\log(\pi_k)$ is a class prior correction, and $L_k=\sum_{j=1}^{p} L_{kj}$.

Variable $j$ is effectively not considered in the classification rule (inactive variable) when all the $\overline{x}'_{kj}$ are shrunken to $\overline{x}_{j}$, since then $L_{1j}=\cdots=L_{Kj}$; we call the other variables active.
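Equations (1)–(4) can be combined into a compact NumPy sketch. This is an illustration under the definitions above (the names `pam_fit` and `pam_predict` are ours), not the pamr implementation:

```python
import numpy as np

def pam_fit(X, y, lam, K):
    """Nearest shrunken centroids, eqs. (1)-(2), for an (n, p) matrix X and labels y in 0..K-1."""
    n, p = X.shape
    overall = X.mean(axis=0)                      # overall centroid, x-bar_j
    centroids = np.empty((K, p))
    s2 = np.zeros(p)
    m = np.empty(K)
    for k in range(K):
        Xk = X[y == k]
        nk = Xk.shape[0]
        centroids[k] = Xk.mean(axis=0)            # class centroid, x-bar_kj
        s2 += ((Xk - centroids[k]) ** 2).sum(axis=0)
        m[k] = np.sqrt(1.0 / nk - 1.0 / n)        # m_k
    s = np.sqrt(s2 / (n - K))                     # pooled within-class sd, s_j
    s0 = np.median(s)                             # offset constant, s_0
    d = (centroids - overall) / (m[:, None] * (s + s0))          # d_kj
    d_hat = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)        # soft threshold, eq. (2)
    shrunk = overall + d_hat * m[:, None] * (s + s0)             # shrunken centroids, eq. (1)
    return shrunk, s, s0

def pam_predict(x_new, shrunk, s, s0, priors):
    """Discriminant scores of eq. (4) and the argmin rule of eq. (3)."""
    scores = (((x_new - shrunk) ** 2) / (s + s0) ** 2).sum(axis=1) - 2 * np.log(priors)
    return int(np.argmin(scores))
```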

Wang and Zhu [12] showed that if the observations $x_i=(x_{i1},\dots,x_{ip})$ from class $k$ follow a multivariate normal distribution ($MVN(\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$) and the covariance matrices are the same across the classes and diagonal ($\boldsymbol{\Sigma}_k=\operatorname{diag}(\sigma_1^2,\dots,\sigma_p^2)$), then (2) is a solution to

$$\hat{d}_{kj} = \operatorname{argmin}_{\tilde{d}_{kj}}\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{p}\sum_{k=1}^{K}\frac{z_{ik}}{n_k}\left(\tilde{x}_{ij}-\tilde{d}_{kj}\right)^2 + \lambda \sum_{j=1}^{p}\sum_{k=1}^{K}\left|\tilde{d}_{kj}\right|,$$

(5)

where $\tilde{x}_{ij}=\dfrac{x_{ij}-\overline{x}_{j}}{m_k(s_j+s_0)}$, $\tilde{d}_{kj}=\dfrac{\mu_{kj}-\overline{x}_{j}}{m_k(s_j+s_0)}$ and $\sum_{j=1}^{p}\sum_{k=1}^{K}|\tilde{d}_{kj}|$ is a penalty function. Based on the observation that (5) is a LASSO-type estimator for $\hat{d}_{kj}$, Wang and Zhu [12] proposed two different penalty functions,

$$\lambda \sum_{j=1}^{p} w_j \cdot \max\left(\left|\tilde{d}_{1j}\right|,\dots,\left|\tilde{d}_{Kj}\right|\right), \text{ and}$$

(6)

$$\lambda_{\gamma}\sum_{j=1}^{p} w_{j}^{\gamma}\,\gamma_{j} + \lambda_{\theta}\sum_{j=1}^{p}\sum_{k=1}^{K} w_{kj}^{\theta}\left|\theta_{kj}\right|,$$

(7)

where $w_j$, $w_j^{\gamma}$ and $w_{kj}^{\theta}$ are pre-specified weights and $\lambda$, $\lambda_{\gamma}$ and $\lambda_{\theta}$ are threshold parameters (see Additional file 1 for the definition of $\gamma_j$ and $\theta_{kj}$).
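As a side note (this short check is our addition, not part of the original derivation), the equivalence between (5) and the soft-thresholding rule (2) can be made explicit. The objective in (5) separates over the pairs $(k,j)$ and, since $\sum_{i=1}^{n} z_{ik}/n_k = 1$,

$$\frac{1}{2}\sum_{i=1}^{n}\frac{z_{ik}}{n_k}\left(\tilde{x}_{ij}-\tilde{d}_{kj}\right)^2 = \frac{1}{2}\left(\tilde{d}_{kj}-d_{kj}\right)^2 + \text{const},$$

because $\sum_{i=1}^{n} z_{ik}\tilde{x}_{ij}/n_k = \frac{\overline{x}_{kj}-\overline{x}_{j}}{m_k(s_j+s_0)} = d_{kj}$. Minimizing $\frac{1}{2}(\tilde{d}_{kj}-d_{kj})^2+\lambda|\tilde{d}_{kj}|$ over $\tilde{d}_{kj}$ then yields exactly the soft-thresholding estimate $\hat{d}_{kj}=\operatorname{sgn}(d_{kj})(|d_{kj}|-\lambda)_{+}$ of (2).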

The shrunken centroids, discriminant scores and classification rules are the same as in PAM; the classification rules that use (6) and (7) are denoted ALP and AHP, respectively.

PAM, ALP and AHP require the estimation of the threshold parameters $\lambda$, $\lambda_{\gamma}$ and $\lambda_{\theta}$. A standard procedure is to use the training data to estimate a cross-validated (CV) error rate for different values of the threshold and to select the threshold that produces the lowest overall error [5]. Note that when the threshold is zero, the classification rules of PAM, ALP and AHP are essentially the same as the classification rule of DLDA (with the exception of the added constant $s_0$, which is not used in DLDA), defined as

$$\delta_k(\mathbf{x}^{*}) = \sum_{j=1}^{p}\frac{(x_j^{*}-\overline{x}_{kj})^2}{s_j^{2}} - 2\log(\pi_k) = L_k - 2\log(\pi_k),$$

(8)

where $L_k$ is the discriminant score omitting the class prior correction.
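As a sketch, the DLDA rule (8) amounts to a few lines of NumPy (the function name is ours; the class centroids and pooled standard deviations are assumed to be precomputed):

```python
import numpy as np

def dlda_scores(x_new, centroids, s, priors):
    """DLDA discriminant scores of eq. (8); the predicted class minimizes the score.

    centroids : (K, p) class centroids
    s         : (p,) pooled within-class standard deviations
    priors    : (K,) class proportions summing to one
    """
    Lk = (((x_new - centroids) ** 2) / s ** 2).sum(axis=1)   # L_k
    return Lk - 2 * np.log(priors)                           # delta_k
```

With equal priors the $-2\log(\pi_k)$ term is the same constant for every class, and the rule reduces to picking the closest centroid in the variance-standardized metric.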

In practice, for high-dimensional data the class prior correction contributes little to the discriminant scores ($|L_k| \gg -2\log(\pi_k)$ and $\delta_k \approx L_k$ for large $p$), while it can bias the NSC classification towards the majority class if all or most of the variables are inactive ($L_k \approx 0$ and $\delta_k \approx -2\log(\pi_k)$). For these reasons we used equal class priors for all the classes ($-2\log(1/K)$), similarly to Huang *et al.* [14]. Moreover, in case of ties the class membership was assigned at random to one of the classes with the smallest discriminant score.
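The equal-prior scoring and random tie-breaking described above can be sketched as follows (illustrative only; the function name is ours):

```python
import numpy as np

def assign_class(L, K, rng=None):
    """Assign a class from per-class scores L (length K), using equal priors.

    With equal priors the correction -2*log(1/K) is the same for every class,
    so delta_k = L_k + const and the ranking depends on L alone; ties are
    broken uniformly at random, as described in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = np.asarray(L) - 2 * np.log(1.0 / K)              # equal class priors
    winners = np.flatnonzero(np.isclose(delta, delta.min())) # all tied minima
    return int(rng.choice(winners))
```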