Very Important Pool (VIP) genes – an application for microarray-based molecular signatures

Background Advances in DNA microarray technology portend that molecular signatures from which microarray will eventually be used in clinical environments and personalized medicine. Derivation of biomarkers is a large step beyond hypothesis generation and imposes considerably more stringency for accuracy in identifying informative gene subsets to differentiate phenotypes. The inherent nature of microarray data, with fewer samples and replicates compared to the large number of genes, requires identifying informative genes prior to classifier construction. However, improving the ability to identify differentiating genes remains a challenge in bioinformatics. Results A new hybrid gene selection approach was investigated and tested with nine publicly available microarray datasets. The new method identifies a Very Important Pool (VIP) of genes from the broad patterns of gene expression data. The method uses a bagging sampling principle, where the re-sampled arrays are used to identify the most informative genes. Frequency of selection is used in a repetitive process to identify the VIP genes. The putative informative genes are selected using two methods, t-statistic and discriminatory analysis. In the t-statistic, the informative genes are identified based on p-values. In the discriminatory analysis, disjoint Principal Component Analyses (PCAs) are conducted for each class of samples, and genes with high discrimination power (DP) are identified. The VIP gene selection approach was compared with the p-value ranking approach. The genes identified by the VIP method but not by the p-value ranking approach are also related to the disease investigated. More importantly, these genes are part of the pathways derived from the common genes shared by both the VIP and p-ranking methods. Moreover, the binary classifiers built from these genes are statistically equivalent to those built from the top 50 p-value ranked genes in distinguishing different types of samples. Conclusion The VIP gene selection approach could identify additional subsets of informative genes that would not always be selected by the p-value ranking method. These genes are likely to be additional true positives since they are a part of pathways identified by the p-value ranking method and expected to be related to the relevant biology. Therefore, these additional genes derived from the VIP method potentially provide valuable biological insights.


Background
DNA microarray technology [1,2] has rapidly advanced due to the intrinsic and unprecedented ability to simultaneously measure gene expression on a whole genome basis. Microarray technology continues to develop and is widely cited as offering much utility for translational science, from improved drug discovery, including target discovery, to improved clinical diagnostics and disease stage determination, prognostics and treatment selection, and more. With the prospect of microarray-derived biomarkers being applied in clinical applications, the bar is substantially raised for identification of informative genes enabling accurate classifiers, and efforts to this end are prevalent in the literature [3][4][5][6][7][8][9][10][11]. More specifically, there is a compelling need to identify a subset of genes from among the more than 20,000 in the entire genome that allow robust classifiers to be developed. The difficulty and challenge is to overcome the intrinsic characteristics of microarray data that contains a substantially small number of samples when compared to the number of genes [12,13]. These characteristics lead to the risk of fitting to noise as genes with high variability unrelated to phenotype masquerade as informative genes. The truly differentiating signals derived from small numbers of experimental replicates are difficult to distinguish in the sea of noise, leading to the appearance of unstable (i.e., non-reproducible) significant gene lists [14][15][16].
Gene selection is synonymous with feature selection or variable selection in machine learning, a process extensively used to mitigate the so called "curse of dimensionality" [17][18][19][20]. Generally, gene selection is done for either hypothesis testing or hypothesis generation. Selecting a subset of genes as molecular signatures or biomarkers that could be used for developing a generalized and accurate classifier for differentiating phenotypes is a hypothesis testing process [21], wherein rigorous validation is needed. On the other hand, identifying a list of putatively relevant genes related to a phenotype or endpoint of interest for subsequent research is a hypothesis generating process [22], wherein validation of the genes is much more relaxed; the genes so identified often shed light on the fundamental molecular mechanisms and biological processes under study.
Selecting and validating an "optimal" set of genes for a molecular signature or biomarker for a robust classifier is a complicated and time-consuming task. An exhaustive search encompassing all possible gene subsets to find the set yielding the smallest error can be an intractable computational task. Worse still, because the number of genes far outnumber samples, the potential for fitting to random noise is high, making stringent testing and validation essential [23,24]. Most methods to select informative genes for classification model development reported in the literature rely on ranked genes by fold change, correlation coefficient, or pvalue from a t-statistic, Wilcoxon statistic, or analysis of variance (ANOVA), or some combination of these [22,[25][26][27][28][29][30]. To a greater or lesser degree, all of these methods yield an informative gene list varying on the sample size, which has led doubt on microarray reliability [14][15][16]. In theory, true phenotype differentiating genes should be expected to express consistently with each other regardless of the sample size. In other words, the list of informative genes as well as the underlying mechanisms inferred by these genes should have nothing to do with the sample size.
In this study, a bagging [31] based new hybrid gene selection approach was investigated to identify informative genes. The rationale of the approach is that informative genes should consistently show significance for different variations of sample size. Accordingly, many re-sampling iterations are conducted to generate different variations of sample size and the frequency of genes exhibiting significance throughout the iterations formed the basis for identification of the informative genes that are considered as a Very Important Pool (VIP) of genes. In reality, the VIP genes can be identified using any existing gene selection approach or their combinations and can be used to derive molecular signatures to build robust classifiers with good generalization capability, or to narrow subsequent research to reveal relevant, fundamental molecular mechanisms in biological processes. In this study, t-statistic and discriminatory analysis are used to evaluate the significance of genes. In the t-statistic, the significant genes are identified based on p-values. In the discriminatory analysis, disjoint Principal Component Analyses (PCAs) are conducted for each class of samples, and those genes with high discrimination power (DP) [32] are identified as significant genes. The VIP genes are those having high frequency of showing significance in the re-sampling iterations. The utility of the proposed approach was demonstrated with nine diverse microarray datasets for identifying the informative genes for classifier development and compared with commonly used p-value ranking gene selection approaches.

Results
The VIP gene selection approach for microarray based molecular signatures was applied to the nine publicly available microarray gene expression datasets described in Table 1. For the purpose of comparison, the p-value ranking method was also used. For each dataset, an unbiased sample splitting, gene selection, and validation dataset prediction process as depicted in Figure 1 was carried out. Briefly, a dataset is first randomly split into a training set with two thirds of the samples and a validation set with the remaining samples. With validation samples set aside, gene selection and classifier development are done using the training samples. Two lists of 50 genes are selected, one using the proposed VIP gene selection approach and the other using p-value ranking. The p-value ranking is based on an unpaired, two-tailed t-statistic with pooled variance estimate. In order to exam whether the VIP gene selection approach can identify informative genes or not, three sets of classifiers were generated, one for the VIP genes, one for the p-value genes and another for the genes uniquely identified by the VIP method (called unique genes hereafter). A Nearest-Centroid[33] classification method was used to develop classifiers. These classifiers are applied to predict the validation samples. The prediction performance of classifiers were compared by accuracies, specificities, sensitivities, and the Matthew's correlation coefficients (MCCs). The definitions of these measures are given in the section titled "materials and methods". The sample splitting, gene selection, and validation dataset prediction steps were repeated 50 times for adequate statistics.
We first compared the classifiers based on the VIP genes with those from the p-value ranking. As shown in Table 2, the VIP classifiers exhibited somewhat better performance compared to the classifiers from the p-value selected genes. The p-values from t-statistic for accuracy, specificity, sensitivity and MCC between two groups of classifiers (the VIP classifiers versus the p-value ranking classifiers) are 0.0027, 0.32, 0.059, and 0.0092, respectively. Therefore, at the 0.05 confidence level, the improvement of classifier measured in MCC and accuracy is significant, but not for specificity and sensitivity. The results indicate that the VIP genes may convey more, but not less, biologically relevant information than the p-value selected genes.
Next, to determine whether the unique genes indeed contribute to the sample differentiation and thus biological relevance, we compared prediction performance of the classifiers built from unique genes with those built from the p-value ranked genes across the nine datasets. The average number of unique genes for each dataset is also listed in Table 2. It was shown that the average perform- The flowchart for the classifier development and validation using three gene sets: (A) Top 50 p-value ranked genes; (B) Top 50 VIP genes; and (C) the unique VIP genes Figure 1 The flowchart for the classifier development and validation using three gene sets: (A) Top 50 p-value ranked genes; (B) Top 50 VIP genes; and (C) the unique VIP genes. Specifically, the data set is first randomly divided into two thirds of samples for training and the remainder for validation. Next, three sets of genes are generated solely based on the training set, and are subsequently used to develop Nearest-Centroid classifiers. Lastly, the classifiers are used to predict the validation samples and their respective prediction performance measured by accuracy (Acc), specificity (Spec), sensitivity (Sens), and Matthew's correlation coefficient (MCC) are calculated. The process is repeated 50 times and the averaged performance metrics are reported in Table 2.  ance metrics (accuracy, specificity, sensitivity, and MCC) for classifiers built from unique genes (number from 14 to 22) are not very different from those built from top 50 pvalue ranked genes for all nine datasets. The difference of each pair of average performance metrics is respectively tested across nine datasets with a null hypothesis that the compared performance metrics (accuracy, specificity, sensitivity, or MCC) is not very different from each other by using a paired and two-tailed t-statistic. The p-values given by t-statistic are 0.63, 0.77, 0.95, and 0.81 for accuracy, specificity, sensitivity, and MCC respectively. Apparently, the differences of all prediction performance metrics among classifiers are not significant at the 0.05 confidence level. This suggests that the unique VIP genes are statistically equivalent as those identified by p-value ranking in distinguishing different types of samples. Therefore, these unique genes could be an additional subset of genes which are equally as important as those selected with pvalue ranking. The existence of additional subsets of classifying genes may imply that there exist multiple biological processes for studied endpoints or co-factors.
Lastly, to gain more understanding of the VIP genes in terms of biology related to the investigated dataset, we further examined the unique genes as well as the common genes shared by the p-value method in the van't Veer dataset using PathArt http://www.jubilantbiosys.com/ ppa.htm through the FDA genomic tool, ArrayTrack http:/ /www.fda.gov/nctr/science/centers/toxicoinformatics/ ArrayTrack/. PathArt is a pathway analysis tool that contains disease related canonical pathways manually created from the literature. The van't Veer dataset contains 24 unique genes and 26 common genes. Of 24 unique genes, ten genes were found in PathArt and were listed in Table  3. Most of these ten genes involve biological processes related to various cancers; for example, IGFBP5 and MMP9 are directly related to breast cancer. We also exam-ined the pathways associated with the 26 common genes and found seven unique genes were involved in seven pathways identified by the common genes (Table 4). These results demonstrate that the unique genes not identified by the p-value ranking could convey additional important information for biological interpretation.

Discussion
Quantitatively assessing the effectiveness of gene selection methods can be problematic owing to several limitations among which selection bias caused by information leakage from training phase to validation phase figures prominently [24]. The most severe bias was described by Ambroise et al. [21] and Simon et al.
[34] as occurring when identifying genes from the entire dataset (i.e., training set and validation set) and using them in cross-validation. Wessels et al.
[35] and Lai et al. [24] describe a less severe bias. Typically, the training samples are used to generate a series of gene subsets, while the performance of a classifier trained with the training samples and tested with the validation samples is applied to estimate the informativeness of each gene subset. The bias derives from the fact that the validation samples are used to select the best performing gene subset. Since optimization of the gene subset is part of the training process, selection of the best gene subset should be conducted with the training samples only. This process as shown in Figure 1 has been carried out in this study to assess the utility of the proposed VIP gene selection method by entirely avoiding bias due to information leakage from validation dataset in training phase.
Classification method selection is another important aspect of developing predictive models from microarray expression data. Many classifiers are created with one or more adjustable parameters that affect not only the prediction accuracy but also the complexity of the classifiers  [24], choosing a classification method with a limited complexity can help prevent over-training, thus providing a more robust predictor. In this study, the simple classification approach Nearest-Centroid was used to develop and compare classifiers based on unique VIP genes and top 50 p-value ranked genes. Since the method lacks a tuneable parameter, risks of overtraining are lessened compared to other methods, as are the chances that differences in prediction accuracy are due to method rather than selected genes.
Commonly used gene selection approaches in DNA microarray data analysis, such as p-value ranking or fold change ranking and others, assume that all genes are stochastic variables that are unrelated to each for purposes of calculating significance. This assumption is inconsistent with the actual biological processes where most genes have some interdependency to and are interlinked with other genes through complex mechanisms and pathways. In contrast, the proposed VIP gene selection approach uses both DPs and p-values to assess the discriminatory capability of genes in differentiating sample types. DPs are calculated from two independent PCAs that fuse discriminating information across whole genes. The interdependence and interlinking effects among genes are embedded within the DP calculation, enhancing rather than reducing many aspects of actual biological processes. Furthermore, the bagging re-sampling technique, which has been used to analyze microarray data for clustering [40][41][42] and classification [43][44][45][46], is used here to mitigate the chance selection of genes. Compared with p-value ranking-type gene selection approaches, the proposed VIP gene selection has great potential to select additional informative genes that can be useful for either biological insights or to improve the prediction performance of classifiers.

Conclusion
The new hybrid gene selection approach was investigated for identifying VIP genes from nine diverse gene expres-sion datasets. The VIP gene selection approach quantifies discriminatory capability for differentiating sample classes using both discrimination analysis and p-value ranking through a bagging sampling process. The classifiers built from those unique VIP genes showed comparable prediction capability to those built from the top 50 t-statistic based p-value ranked genes in predicting the types of unknown samples. Therefore, the VIP gene selection approach could provide an additional subset of genes which are of equivalent performance as those selected with the t-statistic based p-value ranking. The subset of VIP genes could convey additional biological information in terms of associated biological pathways and mechanisms during hypothesis generation. Similarly, the VIP genes could be used to improve molecular fingerprints for use in clinical biomarkers.
The VIP gene selection approach was developed using the programming language Matlab ® 7.0, running on a DELL™ Precision 490 workstation equipped with two Intel ® Dual Core Xeon™ 3.0 GHz processors and 2 GB of memory. The Matlab codes are available upon request.

Algorithm
The VIP gene selection approach combines discriminatory powers derived from two independent principal component analyses and p values from t-statistic to filter genes based on a bagging, re-sampling technique. The algorithmic process is depicted in Figure 2, where the training dataset is composed of n 1 samples of class 1 and n 2 samples of class 2. Samples of class 1 and class 2 are represented by the matrices X 1 and X 2 , respectively. The VIP genes are chosen through the following steps: Tat signaling pathway Acquired immuno deficiency syndrome 1. Randomly select 75% of samples from the training data, X 1 and X 2, using a bagging, re-sampling strategy. The selected samples are represented with X 1m for class 1 and X 2m for class 2.
2. Rank genes by their p-values and only keep the top 100 genes for next step. P-values are calculated from a twotailed and unpaired t-statistic with pooled variance estimate (i.e., equal variances or homoscedastic assumption) on X 1m and X 2m . The remaining data are represented by and , respectively.
3. Rank genes based on their discrimination powers (DPs) and the increment the frequencies of the top 50 genes by one. The calculation of DPs is described in detail in the next section "calculation of discrimination power".
4. Repeat steps one through three 100 times.
5. Rank genes by frequencies and choose the top 50 genes as VIP genes.

Calculation of discrimination power
DPs are calculated from two independent principal component analyses (PCAs). PCA is performed on each pvalue-filtered data, and from step 2. The optimum number of components for each PCA is determined using Malinowski's factor indicator function (IND) [58] with eqs. (1) -(3): where X is either and ; T and P are the score and loading matrices of the PCA; λ i is the i th eigenvalue of the total g eigenvalues; and n and p are the number of samples and the number of genes in the matrix X, respectively. The optimum number (k) of components for the PCA is the one that yields the minimum IND value. The discrimination power (DP j ) for a gene j can be calculated with eq.
where E is one of the four residue matrices E 11 , E 12 , E 22 , and E 21 .

Prediction performance
The prediction performance of a Nearest-Centroid classifier in this study is characterized with four metrics: accuracy, specificity, sensitivity, and the Matthew's correlation coefficient (MCC). The metrics can be calculated from the prediction confusion matrix shown in Table 5 as follows:  The detailed process for identifying a very important pool (VIP) of genes Figure 2 The detailed process for identifying a very important pool (VIP) of genes. X 1 and X 2 are, respectively, the gene expression profiles for class 1 samples and class 2 samples in the training set. X 1m and X 2m are samples randomly selected from X 1 and X 2 in the m th bagging step. and are the genes remaining after filtering genes from X 1m and X 2m , respectively. Malinowski's factor indicator function (IND) is calculated with equations and IND k = RE k /(n -k) 2 , where λ i is the i th eigenvalue of the total g eigenvalues; n is the number of samples and p is the number of genes. The optimum number (k) of components corresponds to the IND minimum. E 11 and E 21 are the residue matrices after projecting X 1m and X 2m into the PCA space for class 1, respectively, while E 22 and E 12 are the residue matrices after projecting X 2m and X 1m into the PCA space for class 2, respectively. The discrimination power (DP) of a gene j is calculated with the equation: , where , , , and are the j columns of residue matrices E 11 , E 12 , E 22 , and E 21 , respectively.

Start
Select 75% of samples (X 1m , X 2m ) from X 1 and X 2 independently using bagging re-sampling strategy  (Table 5).

Disclaimer
The views presented in this article do not necessarily reflect those of the US Food and Drug Administration.