Assume that in the training dataset there are in total *p* genes in the microarray chips, and that we have in total *n* chips/samples that have been grouped into *L* classes. Let *a*_{ij} denote the expression value of gene *j* in sample *i*. The training dataset can thus be represented as a matrix

{A}_{n\times p}={({a}_{ij})}_{n\times p}.

We will define two gene scoring functions using the entry values of matrix *A*_{n×p}. These two scoring functions are designed to make better use of both the means and the variations of the gene expression values.

Let *n*_{k} denote the number of samples in class *C*_{k}, for *k* = 1, 2,..., *L* (i.e., {\displaystyle {\sum}_{k=1}^{L}{n}_{k}=n}). Let {\overline{a}}_{kj}=\frac{1}{{n}_{k}}{\displaystyle {\sum}_{i\in {C}_{k}}{a}_{ij}}, which is the average expression value of gene *j* in class *C*_{k}, for *k* = 1, 2,..., *L*. The expression vector ({\overline{a}}_{k1},{\overline{a}}_{k2},\dots ,{\overline{a}}_{kp}) is the *centroid* of class *C*_{k}. Correspondingly, the centroid matrix is

{\overline{A}}_{L\times p}={({\overline{a}}_{kj})}_{L\times p}.

The mean of these centroids is \widehat{A}=\left({\widehat{a}}_{1},{\widehat{a}}_{2},\dots ,{\widehat{a}}_{p}\right), where {\widehat{a}}_{j}=\frac{1}{L}{\displaystyle {\sum}_{k=1}^{L}{\overline{a}}_{kj}}.

For sample *i* belonging to class *C*_{k}, the difference between the expression value of gene *j* and the class mean is *x*_{ij} = |*a*_{ij} - {\overline{a}}_{kj}|. The matrix

{X}_{n\times p}={({x}_{ij})}_{n\times p}

is the *deviation matrix* of the training dataset. Let {\overline{x}}_{kj}=\frac{1}{{n}_{k}}{\displaystyle {\sum}_{i\in {C}_{k}}{x}_{ij}} denote the average deviation for samples in class *C*_{k} with respect to the centroid. The centroid deviation matrix is

{\overline{X}}_{L\times p}={({\overline{x}}_{kj})}_{L\times p}.

Intuitively, gene *j* has strong discriminating power if the class means {\overline{a}}_{kj}, *k* = 1, 2,..., *L*, differ significantly, while the quantities {\overline{x}}_{kj}, *k* = 1, 2,..., *L*, indicating the *intra-class variations*, are all small.

For example, suppose we have a microarray expression matrix *A*_{12×4} as shown below, in which the 12 samples are known to fall into 3 classes (of sizes 3, 4, and 5, top to bottom):

{A}_{12\times 4}=\left(\begin{array}{cccc}0.65& 0.2& 0.2& 0.7\\ 0.85& 1.0& 1.4& 1.0\\ 0.9& 1.2& 0.8& 1.3\\ \hline 0.9& 0.7& 0.6& 0.5\\ 1.1& 1.4& 2.0& 0.7\\ 1.5& 1.8& 1.2& 1.6\\ 1.3& 0.9& 1.0& 1.2\\ \hline 1.2& 1.0& 0.8& 1.5\\ 1.5& 1.7& 1.6& 0.2\\ 1.7& 2.1& 2.3& 1.8\\ 2.0& 2.0& 1.9& 0.3\\ 1.6& 1.2& 1.4& 1.2\end{array}\right).

Then the centroid matrix {\overline{A}}_{3\times 4} is

{\overline{A}}_{3\times 4}=\left(\begin{array}{cccc}0.8& 0.8& 0.8& 1.0\\ 1.2& 1.2& 1.2& 1.0\\ 1.6& 1.6& 1.6& 1.0\end{array}\right),

and the mean of the centroids is

\widehat{A} = (1.2, 1.2, 1.2, 1.0).

The deviation matrix *X*_{12×4} is

{X}_{12\times 4}=\left(\begin{array}{cccc}0.15& 0.6& 0.6& 0.3\\ 0.05& 0.2& 0.6& 0.0\\ 0.1& 0.4& 0.0& 0.3\\ \hline 0.3& 0.5& 0.6& 0.5\\ 0.1& 0.2& 0.8& 0.3\\ 0.3& 0.6& 0.0& 0.6\\ 0.1& 0.3& 0.2& 0.2\\ \hline 0.4& 0.6& 0.8& 0.5\\ 0.1& 0.1& 0.0& 0.8\\ 0.1& 0.5& 0.7& 0.8\\ 0.4& 0.4& 0.3& 0.7\\ 0.0& 0.4& 0.2& 0.2\end{array}\right),

and the intra-class average deviations are

{\overline{X}}_{3\times 4}=\left(\begin{array}{cccc}0.1& 0.4& 0.4& 0.2\\ 0.2& 0.4& 0.4& 0.4\\ 0.2& 0.4& 0.4& 0.6\end{array}\right).
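These computations are straightforward to reproduce. The sketch below (Python with NumPy; the variable names are ours) rebuilds {\overline{A}}_{3\times 4}, *X*_{12×4}, and {\overline{X}}_{3\times 4} from the example matrix:

```python
import numpy as np

# The 12x4 example expression matrix; rows 1-3, 4-7, and 8-12
# form classes 1, 2, and 3, respectively.
A = np.array([
    [0.65, 0.2, 0.2, 0.7], [0.85, 1.0, 1.4, 1.0], [0.9, 1.2, 0.8, 1.3],
    [0.9, 0.7, 0.6, 0.5], [1.1, 1.4, 2.0, 0.7], [1.5, 1.8, 1.2, 1.6],
    [1.3, 0.9, 1.0, 1.2],
    [1.2, 1.0, 0.8, 1.5], [1.5, 1.7, 1.6, 0.2], [1.7, 2.1, 2.3, 1.8],
    [2.0, 2.0, 1.9, 0.3], [1.6, 1.2, 1.4, 1.2],
])
labels = np.array([0] * 3 + [1] * 4 + [2] * 5)  # class index of each sample
L = 3

# Centroid matrix: row k holds the class-k mean expression of each gene.
A_bar = np.array([A[labels == k].mean(axis=0) for k in range(L)])

# Deviation matrix: x_ij = |a_ij - abar_kj| for sample i in class k.
X = np.abs(A - A_bar[labels])

# Centroid deviation matrix: per-class average deviation of each gene.
X_bar = np.array([X[labels == k].mean(axis=0) for k in range(L)])
```

The resulting `A_bar` and `X_bar` match the matrices {\overline{A}}_{3\times 4} and {\overline{X}}_{3\times 4} displayed above.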

Figures 3, 4, 5, 6 illustrate the expression values of these four genes across all 12 samples, with the intra-class means and average deviations also shown. There are three key ideas in our design of gene scoring functions, which we exemplify through these four genes. First of all, gene 1 has quite different mean expression values across the three classes, whereas gene 4 has identical means; gene 1 is therefore intuitively better than gene 4 in terms of discriminating power. Note that the goal of gene selection is to select genes whose means differ significantly across classes. For each gene *j*, the quantity {\widehat{a}}_{j} is the mean of all the centroids on gene *j* and it represents all the samples. {\widehat{a}}_{j} is stable, that is, it does not change when the samples in one class are duplicated (since the number of classes, *L*, and all the means, {\overline{a}}_{kj}, for *k* = 1, 2,..., *L*, do not change). We define the *scatter* of gene *j* to capture the *inter-class* variations, which takes in {\widehat{a}}_{j} as a component:

scatter(j)=\sqrt{\frac{1}{L}{\displaystyle \sum _{k=1}^{L}{({\overline{a}}_{kj}-{\widehat{a}}_{j})}^{2}}}+\frac{1}{2}\underset{w\ne v}{\mathrm{min}}\left|{\overline{a}}_{wj}-{\overline{a}}_{vj}\right|,

in which the square-root term is the (estimated) standard deviation of all the centroids on gene *j*. Clearly, *scatter*(*j*) is a stable function. More discriminatory genes are expected to have larger *scatter*-values. In the following, we prove an upper bound and a lower bound for *scatter*(*j*).
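As a sketch (Python/NumPy, applied to the example centroid matrix; function and variable names are ours), *scatter*(*j*) can be computed directly from the centroids:

```python
import numpy as np

# Centroid matrix of the worked example (rows = classes, columns = genes).
A_bar = np.array([[0.8, 0.8, 0.8, 1.0],
                  [1.2, 1.2, 1.2, 1.0],
                  [1.6, 1.6, 1.6, 1.0]])

def scatter(A_bar, j):
    """Std of the centroids of gene j plus half the minimum pairwise gap."""
    c = A_bar[:, j]
    sd = np.sqrt(np.mean((c - c.mean()) ** 2))
    min_gap = min(abs(c[w] - c[v]) for w in range(len(c)) for v in range(w))
    return sd + 0.5 * min_gap
```

On the example, gene 1 gets *scatter* ≈ 0.527 while gene 4, whose centroids coincide, gets 0, matching the intuition that gene 1 discriminates better.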

**Lemma 1** *Given n arbitrary nonnegative numbers a*_{1}, *a*_{2},..., *a*_{n}, *the inequality* \frac{1}{n}({a}_{1}+{a}_{2}+\dots +{a}_{n})\le \sqrt{\frac{1}{n}({a}_{1}^{2}+{a}_{2}^{2}+\dots +{a}_{n}^{2})} *holds, and it becomes an equality if and only if a*_{1} = *a*_{2} = ... = *a*_{n}.

Lemma 1 can be proven by mathematical induction.

**Lemma 2** *Given n arbitrary nonnegative numbers sorted in order a*_{1} ≤ *a*_{2} ≤ ... ≤ *a*_{n}, *define* \widehat{a}=\frac{1}{n}{\displaystyle {\sum}_{i=1}^{n}{a}_{i}}, \tilde{a}=\frac{1}{2}{\mathrm{min}}_{1\le i\le n-1}({a}_{i+1}-{a}_{i}), *and* S=\sqrt{\frac{1}{n}{\displaystyle {\sum}_{i=1}^{n}{({a}_{i}-\widehat{a})}^{2}}}. *Then*,

\tilde{a} ≤ *S* ≤ (*a*_{n} - *a*_{1}) - \tilde{a}.

PROOF. Note that both *S* and \tilde{a} are nonnegative. Therefore, if \tilde{a} = 0, then *S* ≥ \tilde{a} holds trivially. In the other case, we have *a*_{1} < *a*_{2} < ... < *a*_{n}. Assume without loss of generality that the minimum is achieved at *i* = *k*, that is, \tilde{a}=\frac{1}{2}({a}_{k+1}-{a}_{k}). If \widehat{a} ∈ [*a*_{k}, *a*_{k+1}], from Lemma 1 we have

\frac{1}{2}\left({({a}_{k}-\widehat{a})}^{2}+{({a}_{k+1}-\widehat{a})}^{2}\right)\ge {\left(\frac{1}{2}\left((\widehat{a}-{a}_{k})+({a}_{k+1}-\widehat{a})\right)\right)}^{2}={\tilde{a}}^{2}.

For *i* ≠ *k*, *k* + 1, {({a}_{i}-\widehat{a})}^{2} ≥ {\tilde{a}}^{2}. Therefore, n{S}^{2}={\displaystyle {\sum}_{i=1}^{n}{({a}_{i}-\widehat{a})}^{2}}\ge n{\tilde{a}}^{2}. If \widehat{a} ∈ [*a*_{p}, *a*_{p+1}] but *p* ≠ *k*, similarly we have \frac{1}{2}\left({({a}_{p}-\widehat{a})}^{2}+{({a}_{p+1}-\widehat{a})}^{2}\right)\ge {\left(\frac{1}{2}({a}_{p+1}-{a}_{p})\right)}^{2}\ge {\tilde{a}}^{2}, and for *i* ≠ *p*, *p* + 1, {({a}_{i}-\widehat{a})}^{2} ≥ {\tilde{a}}^{2}. This proves that \tilde{a} ≤ *S*.

The inequality *S* + \tilde{a} ≤ *a*_{n} - *a*_{1} again holds trivially if \tilde{a} = 0, since {({a}_{i}-\widehat{a})}^{2} ≤ {({a}_{n}-{a}_{1})}^{2} for every *i*. Therefore, we may assume that *a*_{1} < *a*_{2} < ... < *a*_{n}. A similar enlarging process gives *S* ≤ max{*a*_{n} - \widehat{a}, \widehat{a} - *a*_{1}}. Since

{a}_{n}-\widehat{a}+\tilde{a}\le {a}_{n}-\frac{1}{2}({a}_{2}+{a}_{1})+\frac{1}{2}({a}_{2}-{a}_{1})={a}_{n}-{a}_{1}

and

\widehat{a}-{a}_{1}+\tilde{a}\le \frac{1}{2}({a}_{n}+{a}_{n-1})-{a}_{1}+\frac{1}{2}({a}_{n}-{a}_{n-1})={a}_{n}-{a}_{1},

we conclude that *S* + \tilde{a} ≤ *a*_{n} - *a*_{1}. □

According to Lemma 2, the following theorem on the bounds on *scatter*(*j*) holds.

**Theorem 3** *For gene j, define* max(*j*) = max_{w≠v}\left|{\overline{a}}_{wj}-{\overline{a}}_{vj}\right| and min(*j*) = min_{w≠v}\left|{\overline{a}}_{wj}-{\overline{a}}_{vj}\right|. *We have* min(*j*) ≤ *scatter* (*j*) ≤ max(*j*).
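The bounds of Theorem 3 can be checked numerically on the example centroids (a small Python/NumPy sketch; names are ours):

```python
import numpy as np

# Centroid matrix of the worked example.
A_bar = np.array([[0.8, 0.8, 0.8, 1.0],
                  [1.2, 1.2, 1.2, 1.0],
                  [1.6, 1.6, 1.6, 1.0]])

def bounds_and_scatter(c):
    """Return (min(j), scatter(j), max(j)) for one column of centroids c."""
    gaps = [abs(c[w] - c[v]) for w in range(len(c)) for v in range(w)]
    s = np.sqrt(np.mean((c - c.mean()) ** 2)) + 0.5 * min(gaps)
    return min(gaps), s, max(gaps)

# Theorem 3: min(j) <= scatter(j) <= max(j) for every gene j.
results = [bounds_and_scatter(A_bar[:, j]) for j in range(A_bar.shape[1])]
```

For gene 1, for instance, this gives min(1) = 0.4 ≤ *scatter*(1) ≈ 0.527 ≤ max(1) = 0.8.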

A differentially expressed gene is expected to have not only large inter-class variations, which can be represented by its *scatter*-value, but also small intra-class variations. Secondly, we define a function based on the deviation matrix *X*_{n×p}and the centroid deviation matrix \overline{X}_{L×p}:

\mu ({\overline{X}}_{L\times p},j)={\widehat{x}}_{j}=\frac{1}{L}{\displaystyle \sum _{k=1}^{L}{\overline{x}}_{kj}}=\frac{1}{L}{\displaystyle \sum _{k=1}^{L}\left(\frac{1}{{n}_{k}}{\displaystyle \sum _{i\in {C}_{k}}{x}_{ij}}\right)},

which is stable. Intuitively, discriminatory genes are expected to have smaller *μ*-values. In the example dataset, genes 1 and 2 have the same mean expression values across all three classes, that is, they have equal *scatter*-values. Nonetheless, *μ*(\overline{X}_{3×4}, 1) = 0.167 and *μ*(\overline{X}_{3×4}, 2) = 0.4, and thus gene 1 is better than gene 2 in this sense.
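The quoted *μ*-values can be verified directly (a Python/NumPy sketch using the example {\overline{X}}_{3\times 4}; names are ours):

```python
import numpy as np

# Centroid deviation matrix of the worked example.
X_bar = np.array([[0.1, 0.4, 0.4, 0.2],
                  [0.2, 0.4, 0.4, 0.4],
                  [0.2, 0.4, 0.4, 0.6]])

def mu(X_bar, j):
    """Mean of the intra-class average deviations of gene j (a stable quantity)."""
    return X_bar[:, j].mean()
```

Here `mu(X_bar, 0)` evaluates to 0.5/3 ≈ 0.167 and `mu(X_bar, 1)` to 0.4, reproducing the comparison of genes 1 and 2 (genes are indexed from 1 in the text, from 0 in the code).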

In the same example, we have *μ*(\overline{X}_{3×4}, 3) = *μ*(\overline{X}_{3×4}, 4) = 0.4. However, for gene 3, the three intra-class average deviations are identical, that is, {\overline{x}}_{k3} = 0.4 for *k* = 1, 2, 3; for gene 4, the scenario is totally different: {\overline{x}}_{k4} = 0.2, 0.4, 0.6 for *k* = 1, 2, 3. This raises the question of what, based on *μ*(\overline{X}_{L×p}, *j*) alone, we can tell about the quality of gene *j*: gene 3 has a smaller maximum intra-class average deviation but a larger minimum intra-class average deviation than gene 4. To further differentiate such genes, thirdly, we define the function *d*_{1}(*j*):

{d}_{1}(j)=\sqrt{\frac{1}{L}{\displaystyle \sum _{k=1}^{L}{\overline{x}}_{kj}^{2}}}.

From Lemma 1, *d*_{1}(*j*) ≥ *μ*(\overline{X}_{L×p}, *j*). *d*_{1}(*j*) is also stable, and in the above example we have *d*_{1}(3) < *d*_{1}(4), which indicates that *d*_{1}(*j*) could be more sensitive, and more conservative, than *μ*(\overline{X}_{L×p}, *j*) in judging gene quality. Another stable function that can be defined based on *μ*(\overline{X}_{L×p}, *j*) is

{d}_{2}(j)=\sqrt{\frac{1}{L}{\displaystyle \sum _{k=1}^{L}\left(\frac{1}{{n}_{k}}{\displaystyle \sum _{i\in {C}_{k}}{x}_{ij}^{2}}\right)}}.

Intuitively, *d*_{2}(*j*) includes more details in its calculation than *d*_{1}(*j*) does. In the above example, gene 2 and gene 3 have the same intra-class average deviations: {\overline{x}}_{k2} = {\overline{x}}_{k3} = 0.4 for *k* = 1, 2, 3. Therefore, we have *d*_{1}(2) = *d*_{1}(3) but *d*_{2}(2) < *d*_{2}(3). Since intuitively gene 2 has a stronger separability than gene 3, *d*_{2}(*j*) could be even more sensitive than *d*_{1}(*j*).
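Both functions can be sketched as follows (Python/NumPy, recomputing the deviations from the example matrix; names are ours):

```python
import numpy as np

# Example expression matrix; rows 1-3, 4-7, 8-12 form classes 1, 2, 3.
A = np.array([
    [0.65, 0.2, 0.2, 0.7], [0.85, 1.0, 1.4, 1.0], [0.9, 1.2, 0.8, 1.3],
    [0.9, 0.7, 0.6, 0.5], [1.1, 1.4, 2.0, 0.7], [1.5, 1.8, 1.2, 1.6],
    [1.3, 0.9, 1.0, 1.2],
    [1.2, 1.0, 0.8, 1.5], [1.5, 1.7, 1.6, 0.2], [1.7, 2.1, 2.3, 1.8],
    [2.0, 2.0, 1.9, 0.3], [1.6, 1.2, 1.4, 1.2],
])
labels = np.array([0] * 3 + [1] * 4 + [2] * 5)
L = 3
A_bar = np.array([A[labels == k].mean(axis=0) for k in range(L)])
X = np.abs(A - A_bar[labels])                       # deviation matrix
X_bar = np.array([X[labels == k].mean(axis=0) for k in range(L)])

def d1(j):
    """Root mean square of the L intra-class average deviations of gene j."""
    return np.sqrt(np.mean(X_bar[:, j] ** 2))

def d2(j):
    """Like d1, but squares each individual deviation before class-averaging."""
    return np.sqrt(np.mean([np.mean(X[labels == k, j] ** 2) for k in range(L)]))
```

On the example this reproduces *d*_{1}(2) = *d*_{1}(3) with *d*_{2}(2) < *d*_{2}(3), and *d*_{1}(3) < *d*_{1}(4) (genes indexed from 1 in the text, from 0 in the code).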

The above two functions *d*_{1}(·) and *d*_{2}(·) basically consider the means of intra-class variations. The following two functions *δ*_{1}(*j*) and *δ*_{2}(*j*) are introduced to capture the variations of intra-class deviations, corresponding to *d*_{1}(·) and *d*_{2}(·), respectively:

\begin{array}{ccc}{\delta}_{1}(j)=\sqrt{\frac{1}{L}{\displaystyle \sum _{k=1}^{L}{\left({\overline{x}}_{kj}-\mu ({\overline{X}}_{L\times p},j)\right)}^{2}}},& \text{and}& {\delta}_{2}(j)=\sqrt{\frac{1}{L}{\displaystyle \sum _{k=1}^{L}\frac{1}{{n}_{k}}{\displaystyle \sum _{i\in {C}_{k}}{\left({x}_{ij}-\mu ({\overline{X}}_{L\times p},j)\right)}^{2}}}}\end{array}.

**Theorem 4**
\begin{array}{ccc}{\delta}_{\text{1}}(j)=\sqrt{{d}_{1}{(j)}^{2}-\mu {({\overline{X}}_{L\times p},j)}^{2}}& and& {\delta}_{\text{2}}(j)=\sqrt{{d}_{2}{(j)}^{2}-\mu {({\overline{X}}_{L\times p},j)}^{2}}\end{array}.

PROOF. The proof follows by expanding and simplifying the defining formulae of *δ*_{1}(*j*) and *δ*_{2}(*j*). □
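As a sanity check, the identities of Theorem 4 hold numerically on the example data (Python/NumPy sketch; names are ours; the `max(..., 0.0)` guard only absorbs floating-point round-off, since Lemma 1 guarantees the radicand is nonnegative):

```python
import numpy as np

A = np.array([
    [0.65, 0.2, 0.2, 0.7], [0.85, 1.0, 1.4, 1.0], [0.9, 1.2, 0.8, 1.3],
    [0.9, 0.7, 0.6, 0.5], [1.1, 1.4, 2.0, 0.7], [1.5, 1.8, 1.2, 1.6],
    [1.3, 0.9, 1.0, 1.2],
    [1.2, 1.0, 0.8, 1.5], [1.5, 1.7, 1.6, 0.2], [1.7, 2.1, 2.3, 1.8],
    [2.0, 2.0, 1.9, 0.3], [1.6, 1.2, 1.4, 1.2],
])
labels = np.array([0] * 3 + [1] * 4 + [2] * 5)
L = 3
A_bar = np.array([A[labels == k].mean(axis=0) for k in range(L)])
X = np.abs(A - A_bar[labels])
X_bar = np.array([X[labels == k].mean(axis=0) for k in range(L)])

errs = []
for j in range(A.shape[1]):
    mu_j = X_bar[:, j].mean()
    d1_j = np.sqrt(np.mean(X_bar[:, j] ** 2))
    d2_j = np.sqrt(np.mean([np.mean(X[labels == k, j] ** 2) for k in range(L)]))
    # delta_1 and delta_2 straight from their definitions...
    delta1 = np.sqrt(np.mean((X_bar[:, j] - mu_j) ** 2))
    delta2 = np.sqrt(np.mean([np.mean((X[labels == k, j] - mu_j) ** 2)
                              for k in range(L)]))
    # ...compared against the closed forms of Theorem 4.
    errs.append(max(abs(delta1 - np.sqrt(max(d1_j**2 - mu_j**2, 0.0))),
                    abs(delta2 - np.sqrt(max(d2_j**2 - mu_j**2, 0.0)))))
```

All four genes give errors at the level of machine precision.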

Similar to functions *d*_{1}(*j*) and *d*_{2}(*j*), for an ideal differentially expressed gene *j*, both *δ*_{1}(*j*) and *δ*_{2}(*j*) are expected to have small values. Moreover, similar to the relation between *d*_{1}(*j*) and *d*_{2}(*j*), *δ*_{2}(*j*) is considered more sensitive than *δ*_{1}(*j*). We define the function *compact*_{k}(*j*) = *d*_{k}(*j*) + *δ*_{k}(*j*), for *k* = 1, 2, to evaluate the intra-class variations for gene *j*. And we define the gene scoring function *s*_{k}(*j*) = *compact*_{k}(*j*)/*scatter*(*j*) to rank the genes according to their differentiability. Note that a smaller value of *s*_{k}(*j*) indicates a higher differentiability.

We denote the gene selection method using *compact*_{1}(*j*) = *d*_{1}(*j*) + *δ*_{1}(*j*) as GS1, and the other using *compact*_{2}(*j*) = *d*_{2}(*j*) + *δ*_{2}(*j*) as GS2. Both GS1 and GS2 are model-free and stable. In each of them, the scores for all genes are calculated and genes are sorted in non-decreasing order of their scores. Since the number of genes, *p*, is typically much larger than the number of samples, *n*, the overall running time to compute this order is *O*(*p* log *p*). In practice, there are several ways to select the informative genes using this order. For example, one may select the top ranked *x* genes for further analysis, or the top ranked *x%* genes, or all the genes with score no larger than some constant, among others.
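Putting the pieces together, GS1 can be sketched end-to-end on the example data (Python/NumPy; names are ours). The formula leaves *s*_{1}(*j*) undefined when *scatter*(*j*) = 0, as for gene 4 here; assigning such genes an infinite score, i.e., ranking them last, is our assumption, not specified in the text:

```python
import numpy as np

A = np.array([
    [0.65, 0.2, 0.2, 0.7], [0.85, 1.0, 1.4, 1.0], [0.9, 1.2, 0.8, 1.3],
    [0.9, 0.7, 0.6, 0.5], [1.1, 1.4, 2.0, 0.7], [1.5, 1.8, 1.2, 1.6],
    [1.3, 0.9, 1.0, 1.2],
    [1.2, 1.0, 0.8, 1.5], [1.5, 1.7, 1.6, 0.2], [1.7, 2.1, 2.3, 1.8],
    [2.0, 2.0, 1.9, 0.3], [1.6, 1.2, 1.4, 1.2],
])
labels = np.array([0] * 3 + [1] * 4 + [2] * 5)
L = 3
A_bar = np.array([A[labels == k].mean(axis=0) for k in range(L)])
X = np.abs(A - A_bar[labels])
X_bar = np.array([X[labels == k].mean(axis=0) for k in range(L)])

def s1(j):
    """GS1 score compact_1(j) / scatter(j); smaller means more discriminative."""
    c = A_bar[:, j]
    min_gap = min(abs(c[w] - c[v]) for w in range(len(c)) for v in range(w))
    scat = np.sqrt(np.mean((c - c.mean()) ** 2)) + 0.5 * min_gap
    mu_j = X_bar[:, j].mean()
    d1_j = np.sqrt(np.mean(X_bar[:, j] ** 2))
    delta1_j = np.sqrt(max(d1_j**2 - mu_j**2, 0.0))   # via Theorem 4
    return (d1_j + delta1_j) / scat if scat > 0 else np.inf

ranking = sorted(range(A.shape[1]), key=s1)   # genes in non-decreasing score order
```

On the example, gene 1 ranks first and gene 4 last, consistent with the discussion above.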

### F-test method

The F-test method [1, 19] is also a single-gene scoring approach. Besides the notations used in our methods, it uses {\sigma}_{k}^{2} to denote the variance of the expression values of gene *j* in the *k*-th class:

{\sigma}_{k}^{2}=\frac{{\displaystyle {\sum}_{i\in {C}_{k}}{({a}_{ij}-{\overline{a}}_{kj})}^{2}}}{{n}_{k}-1},

and {\sigma}^{2}=\frac{{\displaystyle {\sum}_{k=1}^{L}({n}_{k}-1){\sigma}_{k}^{2}}}{n-L} to denote the pooled within-class variance of the whole dataset. Gene *j* has a score defined to be:

F(j)=\frac{{\displaystyle {\sum}_{k=1}^{L}{n}_{k}{({\overline{a}}_{kj}-{\widehat{a}}_{j})}^{2}/(L-1)}}{{\sigma}^{2}}.
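A sketch of computing *F*(*j*) on the example data (Python/NumPy; following the notation above, {\widehat{a}}_{j} is taken as the unweighted mean of the class centroids, and σ² as the pooled within-class variance; names are ours):

```python
import numpy as np

A = np.array([
    [0.65, 0.2, 0.2, 0.7], [0.85, 1.0, 1.4, 1.0], [0.9, 1.2, 0.8, 1.3],
    [0.9, 0.7, 0.6, 0.5], [1.1, 1.4, 2.0, 0.7], [1.5, 1.8, 1.2, 1.6],
    [1.3, 0.9, 1.0, 1.2],
    [1.2, 1.0, 0.8, 1.5], [1.5, 1.7, 1.6, 0.2], [1.7, 2.1, 2.3, 1.8],
    [2.0, 2.0, 1.9, 0.3], [1.6, 1.2, 1.4, 1.2],
])
labels = np.array([0] * 3 + [1] * 4 + [2] * 5)
L, n = 3, 12

def f_score(j):
    groups = [A[labels == k, j] for k in range(L)]
    nk = np.array([len(g) for g in groups])
    means = np.array([g.mean() for g in groups])        # centroids of gene j
    a_hat = means.mean()                                # mean of the centroids
    between = np.sum(nk * (means - a_hat) ** 2) / (L - 1)
    sigma2 = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (n - L)
    return between / sigma2
```

Gene 4, whose class centroids coincide, gets *F* ≈ 0, while gene 1 gets a clearly positive score.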

### Cho's method

Using the same notations as above, Cho's method [17] defines a weight factor *w*_{i} for sample *i*, which is \frac{1}{{n}_{k}} if sample *i* belongs to class *C*_{k}. Let W={\displaystyle {\sum}_{i=1}^{n}{w}_{i}}. The weighted *mean*(*j*) for gene *j* is defined as

mean(j)={\displaystyle \sum _{i=1}^{n}\frac{{w}_{i}}{W}{x}_{ij}}.

The weighted standard deviation is defined as

std(j)=\sqrt{\frac{{\displaystyle {\sum}_{i=1}^{n}{({x}_{ij}-mean(j))}^{2}}}{\frac{n-1}{n}{\displaystyle {\sum}_{i=1}^{n}{w}_{i}}}}.

Then the score of gene *j* is calculated as

C(j)=\frac{mean(j)\times std(j)}{std({\overline{a}}_{j})},

where *std*({\overline{a}}_{j}) is the standard deviation of the centroid expression values ({\overline{a}}_{1j},{\overline{a}}_{2j},\dots ,{\overline{a}}_{Lj}).
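Cho's score can be sketched as follows (Python/NumPy on the example data; we read the denominator factor as (*n* − 1)/*n* and use the sample standard deviation for *std*({\overline{a}}_{j}) — both readings are our assumptions, as is every name below):

```python
import numpy as np

A = np.array([
    [0.65, 0.2, 0.2, 0.7], [0.85, 1.0, 1.4, 1.0], [0.9, 1.2, 0.8, 1.3],
    [0.9, 0.7, 0.6, 0.5], [1.1, 1.4, 2.0, 0.7], [1.5, 1.8, 1.2, 1.6],
    [1.3, 0.9, 1.0, 1.2],
    [1.2, 1.0, 0.8, 1.5], [1.5, 1.7, 1.6, 0.2], [1.7, 2.1, 2.3, 1.8],
    [2.0, 2.0, 1.9, 0.3], [1.6, 1.2, 1.4, 1.2],
])
labels = np.array([0] * 3 + [1] * 4 + [2] * 5)
L, n = 3, 12
nk = np.bincount(labels)
A_bar = np.array([A[labels == k].mean(axis=0) for k in range(L)])
X = np.abs(A - A_bar[labels])         # deviations x_ij

w = 1.0 / nk[labels]                  # w_i = 1/n_k for sample i in class k
W = w.sum()                           # here W = L = 3

def cho_score(j):
    x = X[:, j]
    mean_j = np.sum(w / W * x)                        # weighted mean(j)
    std_j = np.sqrt(np.sum((x - mean_j) ** 2) / ((n - 1) / n * W))
    std_centroid = np.std(A_bar[:, j], ddof=1)        # sample std of the L centroids
    return mean_j * std_j / std_centroid
```

Note that for gene 4 the centroids coincide, so *std*({\overline{a}}_{4}) = 0 and the score is undefined; the sketch is only evaluated on genes with distinct centroids.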