Assume in the training dataset there are in total p genes in the microarray chips, and assume we have in total n chips/samples that have been grouped into L classes. Let a
ij
denote the expression value of gene j in sample i. The training dataset can thus be represented as a matrix
An×p= (a
ij
)n×p.
We will define two gene scoring functions using entry values in matrix An×p. These two scoring functions might be considered to better use the means and the variations of the gene expression values.
Let n
k
denote the number of samples in class C
k
, for k = 1, 2,..., L (i.e., ). Let , which is the average expression value of gene j in class C
k
, for k = 1, 2,..., L. The expression vector () is the centroid of class C
k
. Correspondingly, the centroid matrix is
The mean of these centroids is , where .
For sample i belonging to class C
k
, the difference between the expression value of gene j and the class mean is x
ij
= |a
ij
- |. The matrix
Xn×p= (x
ij
)n×p
is the deviation matrix of the training dataset. Let denote the average deviation for samples in class C
k
with respect to the centroid. The centroid deviation matrix is
Intuitively, gene j has a strong discriminating power if the means , k = 1, 2,..., L, differ significantly and , k = 1, 2,..., L, indicating the intra-class variations, are all small.
For example, suppose we have a microarray expression matrix A12×4 as shown, in which 12 samples have been known in 3 classes:
Then the centroid matrix is
and the mean of the centroids is
(1)
= (1.2, 1.2, 1.2, 1.0).
The deviation matrix X12×4 is
and the intra-class average deviations are
Figures 3, 4, 5, 6 illustrate the expression values of these four genes across all 12 samples, with the intra-class means and average deviations also shown. There are three key ideas in our design of gene scoring functions, which will be exemplified through these four genes. First of all, gene 1 has quite different mean expression values across three classes, compared to gene 4 that has the same means. Therefore, gene 1 is intuitively better than gene 4 in terms of discriminating power. Note that the goal of gene selection is to select genes that have significantly different means across different classes. For each gene j, the quantity is the mean of all the centroids on gene j and it represents all the samples. is stable, that is, it would not change when the samples in one class are duplicated (since the number of classes, L, and all the means, , for k = 1, 2,..., L, do not change). We define the scatter of gene j to capture the inter-class variations, which takes in as a component:
in which the square root is the standard (estimated) deviation of all the centroids on gene j. Clearly seen, scatter(j) is a stable function. More discriminatory genes are expected to have bigger scatter-values. In the following, we prove an upper bound and a lower bound for scatter(j).
Lemma 1 Given n arbitrary nonnegative numbers a1, a2,...,a
n
, the inequality (a1 + a2 + ... + a
n
) ≤ holds, and it becomes equality if and only if a1 = a2 = ... = a
n
.
Lemma 1 can be proven by a mathematical induction.
Lemma 2 Given n arbitrary nonnegative numbers sorted in order a1 ≤ a2 ≤ ... ≤ a
n
, define , min1≤i≤n-1(ai+1- a
i
), and . Then,
(2)
≤ S ≤ (a
n
- a1) - .
PROOF. Note that both S and are nonnegative. Therefore, if = 0, then S ≥ holds trivially. In the other case, we have a1 <a2 < ... <a
n
. Assume without loss of generality that the minimum is achieved at i = k, that is, (ak+1- a
k
). If ∈ [a
k
, ak+1], from Lemma 1, we have
For i ≠ k, k + 1, (a
i
≥ )2 ≥ 2. Therefore, . if ∈ [a
p
, ap+1] but p ≠ k, similarly we will have ((a
p
- )2 + (ap+1- )2) ≥ (ap+1- a
p
)2 ≥ 2 and for i ≠ p, p + 1, (a
i
- )2 ≥ 2. This proves that ≤ S.
Inequality S + ≤ a
n
- a1 holds again if = 0, since (a
i
- )2 ≤ (a
n
- a1)2 for every i. Therefore, we may assume that a1 <a2 < ... <a
n
. A similar enlarging process gives S ≤ max{a
n
- , - a1}. Since
and
we conclude that S + ≤ a
n
- a1. □
According to Lemma 2, the following theorem on the bounds on scatter(j) holds.
Theorem 3 For gene j, define max(j) = maxw≠v and min(j) = minw≠v. We have min(j) ≤ scatter (j) ≤ max(j).
A differentially expressed gene is expected to have not only large inter-class variations, which can be represented by its scatter-value, but also small intra-class variations. Secondly, we define a function based on the deviation matrix Xn×pand the centroid deviation matrix L×p:
which is stable. Intuitively, discriminatory genes are expected to have smaller μ-values. In the example dataset, genes 1 and 2 have the same mean expression values across all three classes, that is, they have the equal scatter values. Nonetheless, μ(3×4, 1) = 0.167 and μ(3×4, 2) = 0.4, and thus gene 1 is better than gene 2 in this sense.
In the same example, we have μ(3×4, 3) = μ(3×4, 4) = 0.4. However, for gene 3, the centroids of three intra-class average deviations are the same, that is, k3 = 0.4 for k = 1, 2, 3; for gene 4, the scenario is totally different, x4 = 0.2, 0.4, 0.6 for k = 1, 2, 3. This raises a question of, basing on μ(L×p, j), what we can tell about the quality of gene j. The contradictory fact is that gene 3 has a smaller maximum intra-class average deviation and a larger minimum intra-class average deviation. To further differentiate the genes, thirdly, we define function d1(j):
From Lemma 1, d1(j) ≥ μ(L×p, j). d1(j) is also stable, and in the above example we have d1(3) <d1(4), which indicates that function d1(j) could be more sensitive and conservative than function μ(L×p, j) on judgment ability. Another stable function can be defined based on μ(L×p,j) is
Intuitively, d2(j) includes more details in its calculation than d1(j) does. In the above example, gene 2 and gene 3 have the same mean expression values across all three classes: 1j= 2j= 3j. Therefore, we have d1(2) = d1(3) but d2(2) <d2(3). Since intuitively gene 2 has a stronger separability than gene 3, d2(j) could be even more sensitive than d1(j).
The above two functions d1(·) and d2(·) basically consider the means of intra-class variations. The following two functions δ1(j) and δ2(j) are introduced to capture the variations of intra-class deviations, corresponding to d1(·) and d2(·), respectively:
Theorem 4
PROOF. The proof is easily done by simplifying the definition formulae for δ1(j) and δ2(j). □
Similar to functions d1(j) and d2(j), for an ideal differentially expressed gene j, both δ1(j) and δ2(j) are expected to have small values. Moreover, similar to the relation between d1(j) and d2(j), δ2(j) is considered more sensitive than δ1(j). We define function compact
k
(j) = d
k
(j) + δ
k
(j), for k = 1, 2, to evaluate the intra-class variations for gene j. And we define the gene scoring function s
k
(j) = compact
k
(j)/scatter(j) to rank the genes according to their differentiability. Note that a smaller value of s
k
(j) indicates a higher differentiability.
We denote the gene selection method using compact1(j) = d1(j) + δ1(j) as GS1, and the other using compact2(j) = d2(j) + δ2(j) as GS2. Both GS1 and GS2 are model-free and stable. In each of them, the scores for all genes are calculated and genes are sorted in non-decreasing order of their scores. Since the number of genes, p, is typically much larger than the number of samples, n, the overall running time to compute this order is O(p log p). In practice, there are several ways to select the informative genes using this order. For example, one may select the top ranked x genes for further analysis, or the top ranked x% genes, or all the genes with score no larger than some constant, among others.
F-test method
F-test method [1, 19] is also a single gene scoring approach. Besides the notations used in our methods, it uses to denote the variance of expression value of gene j in the k-th class:
and to denote the variance in the whole dataset. Gene j has a score defined to be:
Cho's method
Using the same notations used as in the above, Cho's method [17] defines a weight factor w
i
for sample i, which is if sample i belongs to class k. Let . The weighted mean(j) for gene j is defined as
The weighted standard deviation is defined as
Then the score of gene j is calculated as
where std() is the standard deviation of centroid expression values ().