Let $x_{ij}$ be the value of variable $j$ ($j = 1, \ldots, p$) for sample $i$ ($i = 1, \ldots, n$). Each sample belongs to one of $K$ classes ($1, 2, \ldots, K$) and $y_i$ is the class of the $i$th sample. Let $z_{ik}$ be a class membership indicator variable ($z_{ik} = 1$ if $y_i = k$, $z_{ik} = 0$ otherwise), and let $n_k = \sum_{i=1}^{n} z_{ik}$ be the number of samples in class $k$. The $j$th component of the centroid in class $k$ is $\bar{x}_{kj} = \sum_{i=1}^{n} z_{ik} x_{ij} / n_k$ and the $j$th component of the overall centroid is $\bar{x}_{j} = \sum_{i=1}^{n} x_{ij} / n$. The shrunken centroid is defined as
$$ \bar{x}'_{kj} = \bar{x}_{j} + m_k (s_j + s_0)\, d'_{kj}, \qquad (1) $$
where $s_j$ is the pooled within-class standard deviation for the $j$th variable, $s_0$ is a constant (set at the median of the $s_j$), and $d'_{kj}$ in PAM is defined as
$$ d'_{kj} = \operatorname{sign}(d_{kj}) \left( |d_{kj}| - \lambda \right)_{+}, \qquad (2) $$
where $d_{kj} = (\bar{x}_{kj} - \bar{x}_{j}) / \left( m_k (s_j + s_0) \right)$ with $m_k = \sqrt{1/n_k - 1/n}$, $\lambda \geq 0$ is a threshold parameter that needs to be tuned, and $(\cdot)_{+}$ is the positive part of $(\cdot)$.
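To fix ideas, here is a minimal NumPy sketch of (1)–(2). The function name and array conventions are our own, and the pooled standard deviation uses the usual $n - K$ denominator; this is an illustrative sketch of the formulas above, not the authors' implementation.

```python
import numpy as np

def shrunken_centroids(X, y, K, lam):
    """Shrunken centroids (1) via the soft thresholding in (2).

    X: (n, p) data matrix; y: integer class labels in {0, ..., K-1};
    lam: threshold parameter lambda >= 0.
    Returns the (K, p) matrix of shrunken centroids together with s_j and s_0.
    """
    n, p = X.shape
    nk = np.array([(y == k).sum() for k in range(K)])               # n_k
    xbar_k = np.stack([X[y == k].mean(axis=0) for k in range(K)])   # class centroids
    xbar = X.mean(axis=0)                                           # overall centroid
    # pooled within-class standard deviation s_j and the constant s_0
    within = sum(((X[y == k] - xbar_k[k]) ** 2).sum(axis=0) for k in range(K))
    s = np.sqrt(within / (n - K))
    s0 = np.median(s)
    m = np.sqrt(1.0 / nk - 1.0 / n)                                 # m_k (standard PAM scaling)
    d = (xbar_k - xbar) / (m[:, None] * (s + s0))                   # d_kj
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)        # rule (2)
    return xbar + m[:, None] * (s + s0) * d_shrunk, s, s0           # rule (1)
```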
The classification rule of PAM for a new sample $x^{*} = (x^{*}_{1}, \ldots, x^{*}_{p})$ is

$$ C(x^{*}) = \ell \quad \text{if} \quad \delta_{\ell}(x^{*}) = \min_{k} \delta_{k}(x^{*}), \qquad (3) $$
where $\delta_{k}(x^{*})$ is the discriminant score for class $k$, defined as
$$ \delta_{k}(x^{*}) = \sum_{j=1}^{p} \frac{(x^{*}_{j} - \bar{x}'_{kj})^{2}}{(s_j + s_0)^{2}} - 2 \log(\pi_k), \qquad (4) $$
$\pi_k$ is the proportion of class $k$ samples in the population ($\sum_{k=1}^{K} \pi_k = 1$), $-2 \log(\pi_k)$ is a class prior correction, and $\pi_k$ is estimated by $\hat{\pi}_k = n_k / n$.
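In the same sketch style, the score (4) and rule (3); `cent`, `s` and `s0` are the outputs of the hypothetical `shrunken_centroids` helper above, and `priors` holds the $\pi_k$.

```python
import numpy as np

def discriminant_scores(x_new, cent, s, s0, priors):
    """Discriminant scores (4) for one new sample x_new of length p.

    cent: (K, p) shrunken centroids; priors: length-K vector of the pi_k.
    """
    dist = (((x_new - cent) / (s + s0)) ** 2).sum(axis=1)   # distance term of (4)
    return dist - 2.0 * np.log(priors)                      # class prior correction

def classify(x_new, cent, s, s0, priors):
    """Rule (3): assign x_new to the class with the smallest score."""
    return int(np.argmin(discriminant_scores(x_new, cent, s, s0, priors)))
```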
Variable $j$ is effectively not considered in the classification rule (inactive variable) when all the $d_{kj}$ ($k = 1, \ldots, K$) are shrunken to zero, as then $\bar{x}'_{kj} = \bar{x}_{j}$ for all $k$; we call the other variables active.
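Equivalently, the active set can be read off the matrix of shrunken differences from (2) (the intermediate `d_shrunk` in the sketch above); the helper name is our own.

```python
import numpy as np

def active_variables(d_shrunk):
    """Indices of active variables: those j with d'_kj != 0 for some class k.

    d_shrunk: (K, p) matrix of the shrunken differences from (2).
    """
    return np.flatnonzero(np.any(d_shrunk != 0, axis=0))
```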
Wang and Zhu [12] showed that if the observation $x_i = (x_{i1}, \ldots, x_{ip})$ from class $k$ follows a multivariate normal distribution ($MVN(\mu_k, \Sigma_k)$) and the covariance matrices are the same across different classes and are diagonal ($\Sigma_k = \Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_p^2)$), then (2) is a solution to
$$ \min_{\{d'_{kj}\}} \left\{ \frac{1}{2} \sum_{j=1}^{p} \sum_{k=1}^{K} \left( d_{kj} - d'_{kj} \right)^{2} + J(d') \right\}, \qquad (5) $$
where $d_{kj}$ is defined as in (2) and $J(d')$ is a penalty function; PAM corresponds to $J(d') = \lambda \sum_{j,k} |d'_{kj}|$. Based on the observation that (5) is then a LASSO-type estimator for the $d'_{kj}$ (a short derivation of the soft-thresholding solution is sketched after (7) below), Wang and Zhu [12] proposed two different penalty functions,
$$ J_{ALP}(d') = \lambda \sum_{j=1}^{p} w_j \max \left( |d'_{1j}|, \ldots, |d'_{Kj}| \right), \qquad (6) $$

$$ J_{AHP}(d') = \lambda_{\gamma} \sum_{j=1}^{p} \gamma_j + \lambda_{\theta} \sum_{j=1}^{p} \sum_{k=1}^{K} w_{kj} |\theta_{kj}|, \qquad (7) $$
where $w_j$ and $w_{kj}$ are pre-specified weights and $\lambda$, $\lambda_{\gamma}$ and $\lambda_{\theta}$ are threshold parameters (see Additional file 1 for the definition of $\gamma_j$ and $\theta_{kj}$).
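To make the LASSO connection explicit, consider a single coordinate of (5) with the PAM penalty $J(d') = \lambda \sum_{j,k} |d'_{kj}|$: the problem separates over the pairs $(k, j)$, and each coordinate is the classical soft-thresholding problem (the standard argument, sketched here rather than reproduced from [12]):

```latex
\min_{d'} \; \tfrac{1}{2}\,(d - d')^{2} + \lambda\,|d'|
\quad\Longrightarrow\quad
\hat{d}' =
\begin{cases}
d - \lambda \operatorname{sign}(d), & |d| > \lambda,\\
0, & |d| \le \lambda,
\end{cases}
\;=\; \operatorname{sign}(d)\,(|d| - \lambda)_{+},
```

which follows from setting the subgradient $(d' - d) + \lambda\, \partial |d'|$ to zero and is exactly the rule (2).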
The shrunken centroids, discriminant scores and classification rules have the same form as in PAM; the classification rules that use (6) and (7) are denoted ALP and AHP, respectively.
PAM, ALP and AHP require the estimation of the threshold parameters $\lambda$, $\lambda_{\gamma}$ and $\lambda_{\theta}$. The usual procedure is to use the training data to estimate a cross-validated (CV) error rate for different values of the threshold and to select the threshold that produces the lowest overall error [5] (a minimal sketch of this procedure is given after (8) below). Note that when the threshold is zero, the classification rules of PAM, ALP and AHP are essentially the same as the classification rule of DLDA (with the exception of the added constant $s_0$, which is not considered in DLDA), which is defined as
$$ C^{DLDA}(x^{*}) = \ell \quad \text{if} \quad L_{\ell}(x^{*}) = \min_{k} L_{k}(x^{*}), \qquad L_{k}(x^{*}) = \sum_{j=1}^{p} \frac{(x^{*}_{j} - \bar{x}_{kj})^{2}}{s_j^{2}}, \qquad (8) $$
where $L_k$ is the discriminant score omitting the class prior correction.
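As promised above, a minimal sketch of the CV tuning of the threshold, reusing the hypothetical `shrunken_centroids` and `classify` helpers from the earlier sketches; the random fold assignment and the equal-prior choice (discussed in the next paragraph) are our own illustrative choices.

```python
import numpy as np

def cv_choose_threshold(X, y, K, lambdas, n_folds=5, seed=0):
    """CV error rate for each candidate threshold; return the best threshold.

    Reuses the shrunken_centroids and classify sketches defined above.
    """
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(y)) % n_folds                 # random fold labels
    err = np.zeros(len(lambdas))
    priors = np.full(K, 1.0 / K)                             # equal class priors
    for f in range(n_folds):
        tr, te = fold != f, fold == f
        for li, lam in enumerate(lambdas):
            cent, s, s0 = shrunken_centroids(X[tr], y[tr], K, lam)
            pred = np.array([classify(x, cent, s, s0, priors) for x in X[te]])
            err[li] += np.mean(pred != y[te]) / n_folds      # average over folds
    return lambdas[int(np.argmin(err))], err
```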
In practice, for high-dimensional data the class prior correction contributes little to the discriminant scores ($|L_k| \gg -2 \log(\pi_k)$ and $\delta_k \approx L_k$ for large $p$), while it can bias the NSC classification towards the majority class if all or most of the variables are inactive ($L_k \approx 0$ and $\delta_k \approx -2 \log(\pi_k)$). For these reasons we used equal class priors for all the classes ($-2 \log(1/K)$), similarly to Huang et al. [14]. Moreover, in case of ties the class membership was assigned at random to one of the classes with the smallest discriminant score.
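Put together, the two adjustments amount to a small change in the classification step; a hedged sketch follows (the tolerance used to detect ties is our own choice), used e.g. with `rng = np.random.default_rng(0)`.

```python
import numpy as np

def classify_equal_priors(x_new, cent, s, s0, K, rng):
    """Rule (3) with equal class priors and random tie breaking.

    With equal priors the correction -2*log(1/K) is the same constant for
    every class, so only the distance part of (4) can change the argmin.
    """
    dist = (((x_new - cent) / (s + s0)) ** 2).sum(axis=1)
    delta = dist - 2.0 * np.log(1.0 / K)                   # equal prior correction
    ties = np.flatnonzero(np.isclose(delta, delta.min())) # classes at the minimum
    return int(rng.choice(ties))                          # random pick among ties
```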