POS approach for binary class problems
Microarray data are usually presented in the form of a gene expression matrix, $X=[x_{ij}]$, such that $X\in\mathbb{R}^{P\times N}$ and $x_{ij}$ is the observed expression value of gene $i$ for tissue sample $j$, where $i=1,\dots,P$ and $j=1,\dots,N$. Each sample is also characterized by a target class label, $y_j$, representing the phenotype of the tissue sample being studied. Let $Y\in\mathbb{R}^{N}$ be the vector of class labels such that its $j$th element, $y_j$, has a single value $c$ which is either 1 or 2.
Analyzing the overlap between the expression intervals of a gene for different classes reveals an important aspect of that gene's usefulness to a classifier. The idea is that a certain gene $i$ can assign samples (patients) to class $c$ because its expression interval in that class does not overlap with its interval in the other class. In other words, gene $i$ can correctly classify the samples whose gene $i$ expression values fall within the expression interval of a single class. For instance, Figure 1a presents expression values of gene $i_1$ for 36 samples belonging to two different classes. Gene $i_1$ is clearly relevant for discriminating between the target classes, because the values of the two classes fall in non-overlapping ranges. Figure 1b, on the other hand, shows expression values for another gene, $i_2$, which looks less useful for distinguishing between these target classes, because the expression values of the two classes have highly overlapping ranges.
POS initially uses the interquartile range approach to define gene masks robustly, so that the masks report the discriminative power of genes on a training set of samples while avoiding outlier effects. Then, two measures are assigned to each gene: the proportional overlapping score (POS) and the relative dominant class (RDC). Analogously to [7], these two novel measures are exploited in the ranking phase to produce the final set of ranked genes. POS is a gene relevance score that estimates the degree of overlap between the expression intervals of the two given classes, taking into account three factors: (1) the length of the overlapping region; (2) the number of overlapped samples; (3) the proportion of each class's contribution to the overlapped samples. The last factor is the reason for the name we gave to our procedure, Proportional Overlapping Scores (POS). The relative dominant class (RDC) of a gene is the class that has the highest proportion, relative to class sizes, of correctly assigned samples.
Definition of core intervals
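For a certain gene $i$, by considering the expression values $x_{ij}$ with a class label $c_j$ for each sample $j$, we can define two expression intervals for that gene, one for each class. The $c$th class interval for gene $i$ is defined as:
$$I_{i,c}=[a_{i,c},\,b_{i,c}],\quad c=1,2 \qquad (1)$$
such that:
$$a_{i,c}=Q_1^{(i,c)}-1.5\,IQR^{(i,c)},\qquad b_{i,c}=Q_3^{(i,c)}+1.5\,IQR^{(i,c)} \qquad (2)$$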
where $Q_1^{(i,c)}$, $Q_3^{(i,c)}$ and $IQR^{(i,c)}$ denote the first and third empirical quartiles and the interquartile range of gene $i$ expression values for class $c$, respectively. Figure 2 shows how expression outliers can extend the underlying intervals if the full range of training expressions is used instead. Based on the defined core intervals, we present the following definitions.
Non-outlier samples set, $V_i$, for gene $i$ is defined as the set of samples whose expression values fall inside the core interval of their own target class. This set can be expressed as:
$$V_i=\{\,j : x_{ij}\in I_{i,c_j},\ j=1,\dots,N\,\} \qquad (3)$$
where $c_j$ is the correct class label for sample $j$.
Total core interval, $I_i$, for gene $i$ is given by the region between the global minimum and global maximum boundaries of the core intervals of both classes. It is defined as:
$$I_i=[a_i,\,b_i] \qquad (4)$$
such that $a_i=\min\{a_{i,1},a_{i,2}\}$ and $b_i=\max\{b_{i,1},b_{i,2}\}$, where $a_{i,c}$ and $b_{i,c}$ respectively represent the minimum and maximum boundaries of the core interval, $I_{i,c}$, of gene $i$ with target class $c=1,2$ (see equations 1 and 2).
The overlap region, $I_i^{v}$, for gene $i$ is defined as the interval yielded by the intersection between the core expression intervals of both target classes. It can be expressed as:
$$I_i^{v}=I_{i,1}\cap I_{i,2} \qquad (5)$$
Overlapping samples set, $V_i^{\omega}$, for gene $i$ is the set containing the samples whose expression values fall within the overlap interval $I_i^{v}$ defined in equation 5. The overlapping samples set can be defined as:
$$V_i^{\omega}=V_i\setminus V_i' \qquad (6)$$
where $V_i'$ represents the non-overlapping samples set, which is defined as follows.
Non-overlapping samples set, $V_i'$, for gene $i$ is defined as the set consisting of the elements of $V_i$ (defined in equation 3) whose expression values do not fall within the overlap interval $I_i^{v}$ (defined in equation 5). In this way, we can define this set as:
$$V_i'=\{\,j\in V_i : x_{ij}\notin I_i^{v}\,\} \qquad (7)$$
For convenience, 〈I〉 notation is used with interval I to represent its length while |.| notation is used with set {.} to represent its size.
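The definitions above can be illustrated with a minimal Python sketch for a single gene. It is not the authors' implementation: the function names `core_interval` and `gene_sets` are hypothetical, the 1.5 multiplier on the interquartile range and the "inclusive" quartile method of `statistics.quantiles` are assumptions (the text only says "empirical quartiles"), and samples are identified by their 0-based position.

```python
from statistics import quantiles


def core_interval(values, k=1.5):
    """Core interval (a, b) of one gene within one class.

    Assumption: a = Q1 - k*IQR and b = Q3 + k*IQR with k = 1.5, and
    quartiles computed with the 'inclusive' method of statistics.quantiles.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def gene_sets(x, labels):
    """Intervals and sample sets for one gene.

    x      : expression values of the gene, one per sample
    labels : class label (1 or 2) of each sample
    """
    bounds = {
        c: core_interval([v for v, l in zip(x, labels) if l == c])
        for c in (1, 2)
    }
    # Non-outlier samples: value lies inside the core interval of its own class.
    non_outlier = {
        j for j, (v, l) in enumerate(zip(x, labels))
        if bounds[l][0] <= v <= bounds[l][1]
    }
    # Total core interval: from the smallest lower to the largest upper boundary.
    total = (min(bounds[1][0], bounds[2][0]), max(bounds[1][1], bounds[2][1]))
    # Overlap region: intersection of the two core intervals (None if empty).
    lo = max(bounds[1][0], bounds[2][0])
    hi = min(bounds[1][1], bounds[2][1])
    overlap = (lo, hi) if lo <= hi else None
    # Overlapping samples: non-outlier samples that fall inside the overlap region.
    overlapping = {j for j in non_outlier if overlap and lo <= x[j] <= hi}
    # Non-overlapping samples: the remaining non-outlier samples.
    non_overlapping = non_outlier - overlapping
    return bounds, total, overlap, non_outlier, overlapping, non_overlapping
```

As a usage illustration with made-up data, calling `gene_sets` on the values 1 to 10 with the first five samples in class 1 and the last five in class 2 gives core intervals [-1, 7] and [4, 12], a total core interval [-1, 12], an overlap region [4, 7], and four overlapping samples (values 4, 5, 6 and 7).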
Gene masks
For each gene, we define a mask based on its observed expression values and the core intervals constructed in subsection ‘Definition of core intervals’. The mask of gene $i$ reports the samples that gene $i$ can unambiguously assign to their correct target classes, i.e. the non-overlapping samples set $V_i'$. Thus, a gene mask represents the capability of the gene to classify each sample correctly, i.e. it represents the gene's classification power. For a particular gene $i$, element $j$ of its mask is set to 1 if the corresponding expression value $x_{ij}$ belongs only to the core expression interval of the single class $c_j$, i.e. if sample $j$ is a member of the set $V_i'$. Otherwise, it is set to zero.
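We define the gene masks matrix $M=[m_{ij}]$, in which the mask of gene $i$ is given by $M_{i.}$ (the $i$th row of $M$), such that the gene mask element $m_{ij}$ is defined as:
$$m_{ij}=\begin{cases}1 & \text{if } j\in V_i'\\ 0 & \text{otherwise}\end{cases} \qquad (8)$$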
Figure 2 shows the constructed core expression intervals $I_{i,1}$ and $I_{i,2}$ associated with a particular gene $i$, along with its gene mask. The gene mask presented in this figure is sorted to match the observations, which are ordered by increasing expression value.
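A gene mask row can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: the function name `gene_mask` is hypothetical, and the per-class core interval boundaries are assumed to be precomputed and passed in as a dictionary. It relies on the observation that a sample lies in its own class interval but outside the overlap region exactly when it lies in its own interval and not in the other class's interval.

```python
def gene_mask(x, labels, bounds):
    """Mask row of one gene.

    x      : expression values of the gene, one per sample
    labels : class label (1 or 2) of each sample
    bounds : {class: (a, b)} core interval of the gene per class
             (assumed to be precomputed elsewhere)
    """
    mask = []
    for value, own in zip(x, labels):
        other = 2 if own == 1 else 1
        in_own = bounds[own][0] <= value <= bounds[own][1]
        in_other = bounds[other][0] <= value <= bounds[other][1]
        # Bit is 1 only when the value lies in the sample's own class
        # interval and outside the other class interval, i.e. the gene
        # assigns the sample unambiguously to its correct class.
        mask.append(1 if in_own and not in_other else 0)
    return mask
```

With made-up core intervals [-1, 7] for class 1 and [4, 12] for class 2, the samples whose values fall in [4, 7] receive a 0 bit regardless of their class, and so does any outlier that falls outside its own class interval.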
The proposed POS measure and relative dominant class assignments
A novel overlapping score is developed to estimate the degree of overlap between different expression intervals. Figures 3a and 3b show two different genes, $i_1$ and $i_2$, with the same length of overlap interval, $\langle I_i^{v}\rangle$, the same length of total core interval, $\langle I_i\rangle$, and the same total number of overlapped samples, $|V_i^{\omega}|$. These figures demonstrate that the ordinary overlapping scores proposed in earlier papers [6, 7] yield the same value for both genes. However, one element differs between the two examples and may also affect the degree of overlap between classes: the distribution of overlapping samples across classes. Gene $i_1$ has six overlapped samples from each class, whereas gene $i_2$ has ten and two overlapping samples from classes 1 and 2, respectively. Taking this into account, gene $i_2$ should be reported as having a lower degree of overlap than gene $i_1$. In this article, we develop a new score, called the proportional overlapping score (POS), that estimates the overlap degree of a gene while taking this element into account, i.e. the proportion of each class's overlapped samples to the total number of overlapping samples.
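POS for a gene $i$ is defined as:
$$POS_i=4\,\frac{\langle I_i^{v}\rangle}{\langle I_i\rangle}\,\frac{|V_i^{\omega}|}{|V_i|}\,\prod_{c=1}^{2}\theta_c \qquad (9)$$
where $\theta_c$ is the proportion of class $c$ samples among overlapping samples. Hence, $\theta_c$ can be defined as:
$$\theta_c=\frac{|V_{i,c}^{\omega}|}{|V_i^{\omega}|} \qquad (10)$$
where $V_{i,c}^{\omega}$ represents the set of overlapping samples belonging to class $c$, i.e. $V_{i,c}^{\omega}=\{\,j\in V_i^{\omega} : c_j=c\,\}$. According to equation 9, genes $i_1$ and $i_2$ in Figures 3a and 3b differ only in the product $\theta_1\theta_2$, which is $(6/12)(6/12)=1/4$ for $i_1$ and $(10/12)(2/12)=5/36$ for $i_2$, so gene $i_2$ receives the lower POS value.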
Larger overlapping intervals or higher numbers of overlapping samples result in a higher POS value. Furthermore, as the proportions θ1 and θ2 get closer to each other, the POS value increases. With the other two factors fixed, the highest degree of overlap for a particular gene is reached when θ1=θ2=0.5. Because the product θ1θ2 is then at most 1/4, we include the multiplier “4” in equation 9 to scale the POS score to lie within the closed interval [0,1]. In this way, a lower score denotes a gene with higher discriminative power.
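The behaviour of the score can be checked numerically with a small Python sketch. It assumes that POS is the product of the three factors listed above (overlap length relative to the total core interval length, number of overlapping samples relative to the number of non-outlier samples, and θ1θ2) scaled by 4; the function name `pos_score`, its argument names, and the numbers in the example are hypothetical.

```python
def pos_score(overlap_len, total_len, n_overlap_by_class, n_non_outlier):
    """Proportional overlapping score of one gene (sketch).

    overlap_len        : length of the overlap region
    total_len          : length of the total core interval
    n_overlap_by_class : (n1, n2) overlapping samples from class 1 and class 2
    n_non_outlier      : size of the non-outlier samples set
    Assumption: the sample factor is normalised by the non-outlier set size.
    """
    n1, n2 = n_overlap_by_class
    n_overlap = n1 + n2
    if n_overlap == 0 or total_len == 0:
        return 0.0  # no overlap at all: the gene separates the classes perfectly
    theta1 = n1 / n_overlap
    theta2 = n2 / n_overlap
    return 4 * (overlap_len / total_len) * (n_overlap / n_non_outlier) * theta1 * theta2


# Two hypothetical genes with identical interval lengths and 12 overlapping
# samples each, split 6/6 and 10/2 between the classes.
score_balanced = pos_score(2.0, 8.0, (6, 6), 36)
score_skewed = pos_score(2.0, 8.0, (10, 2), 36)
```

Under these assumptions the gene with the 10/2 split scores 5/9 of the gene with the 6/6 split, and the score reaches its maximum of 1 only when the two core intervals coincide, every non-outlier sample is overlapping, and the overlapping samples are split evenly between the classes.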
Once the gene mask is defined and the POS index is computed, we assign each gene to its relative dominant class (RDC). RDC for gene $i$ is defined as follows:
$$RDC_i=\underset{c}{\arg\max}\left(\frac{\sum_{j\in U_c} I(m_{ij}=1)}{|U_c|}\right) \qquad (11)$$
where $U_c$ is the set of class $c$ samples, i.e. $U_c=\{\,j : c_j=c\,\}$. Note that $|U_1|+|U_2|=N$, while $m_{ij}$ is the $j$th mask element of gene $i$ (see equation 8). $I(m_{ij}=1)$ is an indicator that equals 1 if $m_{ij}=1$ and zero otherwise.
In this definition, only the samples that belong to the non-overlapping samples set $V_i'$ are considered for each class. These are the samples that the gene can unambiguously assign to their target classes; according to our gene mask definition (see equation 8), they are the samples with 1 bits in the corresponding gene mask. For each class, the proportion of such samples relative to the total sample size of the class is then evaluated. The class with the highest proportion is the relative dominant class of the gene. Ties are broken by assigning the gene to one of the two classes at random. Genes are assigned to their RDC in order to associate each gene with the class it is better able to distinguish. As a result, the number of selected genes can be balanced per class in our final selection process. Evaluating dominance on a relative basis avoids misleading assignments caused by unbalanced class sizes.
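The effect of the relative evaluation can be seen in a short Python sketch. The function name `relative_dominant_class` and the example data are hypothetical, and both classes are assumed to be present among the labels.

```python
import random


def relative_dominant_class(mask, labels, rng=random):
    """Relative dominant class of one gene (sketch).

    mask   : 0/1 mask row of the gene, one bit per sample
    labels : class label (1 or 2) of each sample
    rng    : object providing choice(); used only to break ties at random
    """
    proportions = {}
    for c in (1, 2):
        bits = [m for m, l in zip(mask, labels) if l == c]
        # Share of class c samples that the gene assigns unambiguously.
        proportions[c] = sum(bits) / len(bits)
    if proportions[1] == proportions[2]:
        return rng.choice((1, 2))
    return max(proportions, key=proportions.get)


# Hypothetical unbalanced data: 4 samples in class 1 and 10 in class 2.
labels = [1] * 4 + [2] * 10
mask = [1, 1, 1, 0] + [1] * 5 + [0] * 5
rdc = relative_dominant_class(mask, labels)
```

In this example the gene unambiguously assigns three of the four class 1 samples (0.75) and five of the ten class 2 samples (0.5), so its RDC is class 1 even though class 2 contributes more correctly assigned samples in absolute terms.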
Selecting minimum subset of genes
Selecting a minimum subset of genes is one of the stages of the POS method, in which the information provided by the constructed gene masks and the POS scores is analyzed. This subset is designed to be the smallest one that correctly classifies the maximum number of samples in a given training set while avoiding the effects of expression outliers. Such a procedure allows redundant information to be discarded, e.g. genes with similar expression profiles.
Baralis et al. [25] have proposed a method that is somewhat similar to our procedure for detecting a minimum subset of genes from microarray data. The main differences are that [25] use the expression range to define the intervals which are employed for constructing gene masks, and then apply a set‐covering approach to obtain the minimum feature subset. The same technique is performed by [7] to get a minimum gene subset using a greedy approach rather than the set‐covering.
Let $G$ be a set containing all genes (i.e., $|G|=P$). Also, let $M_{.}(G)$ be its aggregate mask, which is defined as the logical disjunction (logic OR) between all masks corresponding to genes that belong to the set. It can be expressed as follows:
$$M_{.}(G)=\bigvee_{i=1}^{P} M_{i.} \qquad (12)$$
Our objective is to search for the minimum subset, denoted by $G^{*}$, for which $M_{.}(G^{*})$ equals the aggregate mask of the set of all genes, $M_{.}(G)$. In other words, our minimum set of genes should satisfy the following statement:
$$\underset{G^{*}\subseteq G}{\arg\min}\ |G^{*}| \quad \text{subject to} \quad M_{.}(G^{*})=M_{.}(G) \qquad (13)$$
A modified version of the greedy search approach used by [7] is applied. The pseudo code of our procedure is reported in Algorithm 1. Its inputs are the matrix of gene masks, $M$; the aggregate mask of genes, $M_{.}(G)$; and the POS scores. It produces the minimum set of genes, $G^{*}$, as output.
At the initial step ($k=0$), we let $G^{*}=\emptyset$ and $M_{.}(G^{*})=0_N$ (lines 2, 3), where $M_{.}(G^{*})$ is the aggregate mask of the set $G^{*}$, while $0_N$ is a vector of zeros of length $N$. Then, at each iteration $k$, the following steps are performed:
1. The gene(s) with the highest number of mask bits set to 1 is (are) chosen to form the set $G_k$ (line 6). This set cannot be empty as long as the loop condition is still satisfied, i.e. $M_{.}(G^{*})\neq M_{.}(G)$. Under this condition, the selected genes do not yet cover the maximum number of samples that should be covered by our target gene set. Note that our definition of gene masks makes it possible to report in advance which samples should be covered by the minimum subset of genes. Therefore, while that condition holds, there is at least one gene mask with at least one bit set to 1.
2. The gene with the lowest POS score among the genes in $G_k$, if there is more than one, is then selected (line 7). It is denoted by $g_k$.
3. The set $G^{*}$ is updated by adding the selected gene, $g_k$ (line 8).
4. All gene masks are also updated by performing the logical conjunction (logic AND) with the negated aggregate mask of the set $G^{*}$ (line 10). The negated mask of $M_{.}(G^{*})$ is obtained by applying logical negation (logical complement) to this mask. Consequently, only the 1 bits corresponding to the classification of still uncovered samples are considered. Note that $M_{i.}^{(k)}$ represents the updated mask of gene $i$ at the $k$th iteration, such that $M_{i.}^{(0)}$ is its original gene mask, whose elements are computed according to equation 8.
5. The procedure is iterated and ends when no gene mask has any 1 bits left, i.e. the selected genes cover the maximum number of samples. This situation is reached iff $M_{.}(G^{*})=M_{.}(G)$.
Thus, this procedure detects the minimum set of genes required to provide the best classification coverage for a given training set. In addition, the genes within the minimum set, $G^{*}$, are ordered by decreasing number of 1 bits at the time of their selection.
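The iteration described above can be sketched in Python as follows. This is a minimal reading of the prose steps, not a transcription of Algorithm 1: the function name `minimum_gene_subset` is hypothetical, masks are represented as lists of 0/1 integers, genes are identified by their row index, and remaining ties in POS score are resolved in favour of the lowest index.

```python
def minimum_gene_subset(masks, pos_scores):
    """Greedy selection of a small gene subset covering all coverable samples.

    masks      : list of gene mask rows (lists of 0/1), one row per gene
    pos_scores : POS score of each gene (lower means more discriminative)
    Returns the selected gene indices in order of selection.
    """
    n = len(masks[0])
    # Aggregate mask of all genes: logical OR over the rows.
    target = [int(any(row[j] for row in masks)) for j in range(n)]
    covered = [0] * n                       # aggregate mask of the selected set
    current = [list(row) for row in masks]  # masks restricted to uncovered samples
    selected = []
    while covered != target:
        counts = [sum(row) for row in current]
        best = max(counts)
        # Genes covering the largest number of still uncovered samples.
        candidates = [i for i, cnt in enumerate(counts) if cnt == best]
        # Among them, pick the gene with the lowest POS score.
        g = min(candidates, key=lambda i: pos_scores[i])
        selected.append(g)
        # Update the aggregate mask of the selected set (logical OR).
        covered = [c | m for c, m in zip(covered, masks[g])]
        # AND every gene mask with the negated aggregate mask.
        current = [[m & (1 - c) for m, c in zip(row, covered)] for row in current]
    return selected


# Hypothetical example with four genes and four samples.
masks = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
pos_scores = [0.3, 0.1, 0.2, 0.05]
subset = minimum_gene_subset(masks, pos_scores)
```

In this example all four genes initially cover two samples, so the gene with the lowest POS score (index 3) is chosen first; gene 2 then covers both remaining samples and the loop stops. Being greedy, a search of this kind returns a small covering subset but is not in general guaranteed to find the globally smallest one.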
Final gene selection
The POS score alone can rank genes according to their overlapping degree, without taking into account the class that has more correctly assigned samples by each gene (which can be addressed as the dominant class of that gene). Consequently, high‐ranked genes may all have an ability to only correctly classify samples belonging to the same class. Such a case is more likely to happen in situations with unbalanced class‐size distributions. As a result, a biased selection could result. Assigning the dominant class on a relative basis, as proposed in subsection ‘The proposed POS measure and relative dominant class assignments’, and taking these assignments into account during the gene ranking process allows us to overcome this problem.
Therefore, the gene ranking process is performed by considering both POS scores and RDC. Within each relative dominant class $c$ (where $c=1,2$), all genes that have not been chosen in the minimum set, $G^{*}$, and whose RDC equals $c$ are sorted in increasing order of POS values. This yields two disjoint groups (one for each class) of ranked genes. The topmost gene is then selected from each group in turn, in a round-robin fashion, to compose the gene ranking list.
The minimum subset of genes, presented in subsection ‘Selecting minimum subset of genes’, is extended by adding the top ν ranked genes in the gene ranking list, where ν is the required number extending the minimum subset up to the total number of requested genes, r, which is an input of the POS method set by the user. The resulting final set includes the minimum subset of genes regardless of their POS values, because these genes allow the considered classifier to correctly classify the maximum number of training samples.
The pseudo code of the Proportional Overlapping Scores (POS) method is reported in Algorithm 2.
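The ranking and extension steps of this subsection can be sketched in Python as follows. This is an illustrative sketch and not Algorithm 2 itself: the function name `final_selection` is hypothetical, the round-robin is assumed to start with class 1, and when the requested number of genes is smaller than the minimum subset the whole minimum subset is returned, which is an assumption the text does not settle.

```python
from itertools import zip_longest


def final_selection(min_subset, pos_scores, rdc, r):
    """Extend the minimum subset up to r genes (sketch).

    min_subset : gene indices already selected as the minimum subset
    pos_scores : POS score of each gene
    rdc        : relative dominant class (1 or 2) of each gene
    r          : total number of requested genes
    """
    chosen = set(min_subset)
    # One group per class: remaining genes sorted by increasing POS score.
    groups = {
        c: sorted(
            (i for i in range(len(pos_scores)) if i not in chosen and rdc[i] == c),
            key=lambda i: pos_scores[i],
        )
        for c in (1, 2)
    }
    # Round-robin merge: take the top gene of each group in turn.
    ranking = [
        g
        for pair in zip_longest(groups[1], groups[2])
        for g in pair
        if g is not None
    ]
    nu = max(r - len(min_subset), 0)  # number of genes still needed
    return list(min_subset) + ranking[:nu]


# Hypothetical example: six genes, a minimum subset of two, five requested.
selection = final_selection(
    min_subset=[3, 2],
    pos_scores=[0.3, 0.1, 0.2, 0.05, 0.4, 0.15],
    rdc=[1, 2, 1, 2, 1, 2],
    r=5,
)
```

Here the remaining genes form the groups [0, 4] for class 1 and [1, 5] for class 2, the round-robin ranking is 0, 1, 4, 5, and the three top-ranked genes are appended to the minimum subset.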