Informative gene selection and the direct classification of tumors based on relative simplicity

Chen, Yuan; Wang, Lifeng; Li, Lanzhi; Zhang, Hongyan; Yuan, Zheming

doi:10.1186/s12859-016-0893-0

Research article
Open access
Published: 20 January 2016

BMC Bioinformatics volume 17, Article number: 44 (2016)

Abstract

Background

Selecting a parsimonious set of informative genes to build a classifier with high generalization performance is the most important task in the analysis of tumor microarray expression data. Many existing gene-pair evaluation methods use only one of two strategies, vertical comparison or horizontal comparison, and therefore cannot highlight the diverse patterns of gene pairs, while individual-gene-ranking methods ignore redundancy and synergy among genes.

Results

Here we propose a novel score measure named relative simplicity (RS). We evaluated gene pairs by integrating vertical comparison with horizontal comparison, and then built an RS-based direct classifier (RS-based DC) on a set of informative genes capable of binary discrimination, using a paired votes strategy. Nine multi-class gene expression datasets involving human cancers were used to validate the performance of the new method. Compared with the nine reference models, RS-based DC achieved the highest average independent test accuracy (91.40 %), the best generalization performance and the smallest average number of informative genes (20.56). Compared with the four reference feature selection methods, RS also achieved the highest average test accuracy with three classifiers (Naïve Bayes, k-Nearest Neighbor and Support Vector Machine), and only RS improved the performance of SVM.

Conclusions

Diverse patterns of gene pairs can be highlighted more fully by integrating the vertical comparison strategy with the horizontal comparison strategy. The DC core classifier can effectively control over-fitting. The RS-based feature selection method combined with the DC classifier can lead to more robust selection of informative genes and higher classification accuracy.

Background

Microarray expression data of cancer tissue samples have the following properties: small sample size yet a large number of features, high noise and redundancy, a remarkable level of background differences among samples and features, and nonlinearity [1, 2]. Selecting a parsimonious set of informative genes to build a robust classifier with high generalization performance is one of the most important tasks in the analysis of microarray expression data, as it can help to discover disease mechanisms, improve the precision of clinical diagnoses and reduce their cost [3].

Gene selection depends on a given evaluation strategy and a defined score. Individual-gene-ranking methods rank genes only by comparing the expression values of the same gene between different classes (a vertical comparison evaluation strategy). This can be very far from the truth, as the deregulation of pathways, rather than of individual genes, may be critical in triggering carcinogenesis [4]. If a gene has a remarkable joint effect with other genes, it should be selected as an informative gene even though it may receive a low rank in an individual-gene-ranking method. This joint effect of genes has been taken into account in most popular existing algorithms, including top scoring pair (TSP) [5, 6], top scoring triplet (TST) [7], top-scoring ‘N’ (TSN) [8], top scoring genes (TSG) [9] and the doublet method [4]. However, the gene-pair score in TSP, the percentage Δ_ij [5, 6], cannot reflect differences in sample size. To fully utilize sample size information, TSG introduces chi-square values as the score for gene pairs [9]. TSP and TSG are both pair-wise gene evaluations that compare the expression values of two different genes within the same sample (a horizontal comparison evaluation strategy), which can help to eliminate the influence of sampling variability due to different subjects [5, 6, 9].
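As a rough illustration of the horizontal comparison behind TSP, the sketch below computes the commonly used TSP score, Δ_ij = |P(X_i < X_j | class 0) − P(X_i < X_j | class 1)|, on toy data. The function name `tsp_score` and the expression values are hypothetical and are not taken from the datasets in this article. Repeating every sample ten times leaves the score unchanged, which is one way to see why a percentage-based score cannot reflect sample size.

```python
import numpy as np

def tsp_score(x_i, x_j, labels):
    """Delta_ij: absolute difference, between the two classes, of the
    fraction of samples in which gene i is expressed below gene j."""
    x_i, x_j, labels = map(np.asarray, (x_i, x_j, labels))
    below = x_i < x_j                      # within-sample (horizontal) comparison
    p0 = below[labels == 0].mean()         # fraction in class 0
    p1 = below[labels == 1].mean()         # fraction in class 1
    return abs(p0 - p1)

# Hypothetical toy data: 3 samples per class, two genes.
labels = np.array([0, 0, 0, 1, 1, 1])
g_a = np.array([1.0, 2.0, 1.5, 5.0, 6.0, 5.5])
g_b = np.array([3.0, 4.0, 3.5, 2.0, 2.5, 3.0])

score = tsp_score(g_a, g_b, labels)

# Same pattern with ten times as many samples: the score does not change.
score_big = tsp_score(np.tile(g_a, 10), np.tile(g_b, 10), np.tile(labels, 10))

print(score, score_big)
```

A chi-square statistic on the same 2×2 table of counts, as used by TSG [9], would grow with the number of samples, whereas the percentage difference above does not.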

At the level of gene pairs, Merja et al. [10] defined two patterns based on rank data, rather than absolute expression, from a data-driven perspective: the consistent reversal of relative expression and consistent relative expression. This premise allows cell types to be organized into their ontogenetic lineage relationships and may reflect regulatory relationships among the genes [10]. Based on absolute expression, the first pattern can be subdivided into a consistent reversal of expression (Pattern I) and a consistent reversal of relative expression (Pattern II) (see Table 1). Similarly, the second pattern can be subdivided into a consistent expression (Pattern III) and a consistent relative expression (Pattern IV). Furthermore, a heterogeneous background expression of samples (Pattern V) and an interaction expression pattern (Pattern VI) can be defined if the influence of sampling variability due to different subjects [9] and paired-gene interactions [11] are considered. Clearly, all twelve genes (G₁ to G₁₂) in Table 1 should be informative genes from a data-driven perspective. However, individual-gene evaluations, which only detect different expression levels between positive and negative samples, cannot highlight Pattern V and Pattern VI. Pair-wise gene evaluation with vertical comparison can highlight most patterns except Pattern V. Only pair-wise gene evaluation with horizontal comparison can highlight Pattern V, even though it cannot detect most other patterns. Therefore, both vertical and horizontal comparisons need to be considered in pair-wise gene evaluation techniques.
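To make the contrast between the two strategies concrete, the following sketch builds a small, hypothetical example in the spirit of Pattern V, where each sample carries its own background level. The numbers are invented for illustration and are not the values of Table 1. The vertical score here is a plain standardized mean difference and the horizontal score is the TSP-style percentage difference; both are stand-ins chosen for simplicity, not the RS score proposed in this article.

```python
import numpy as np

def vertical_score(x, labels):
    """Compare one gene between classes: |difference of class means|
    divided by the pooled standard deviation (illustrative choice)."""
    a, b = x[labels == 0], x[labels == 1]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

def horizontal_score(x_i, x_j, labels):
    """Compare two genes within each sample, then contrast the classes."""
    below = x_i < x_j
    return abs(below[labels == 0].mean() - below[labels == 1].mean())

# Hypothetical data: sample backgrounds differ by about 10 units, while the
# class signal is only the sign of g1 - g2 within each sample.
labels = np.array([0, 0, 0, 1, 1, 1])
g1 = np.array([2.0, 12.0, 22.0, 2.0, 12.0, 22.0])
g2 = np.array([1.0, 11.0, 21.0, 3.0, 13.0, 23.0])

v1 = vertical_score(g1, labels)        # g1 alone: identical values in both classes
v2 = vertical_score(g2, labels)        # g2 alone: small shift, large spread
h = horizontal_score(g1, g2, labels)   # g1 vs g2 inside each sample

print(v1, v2, h)
```

In this toy case neither gene separates the classes well on its own, while the within-sample ordering of the two genes separates them perfectly, which is the situation the text attributes to heterogeneous background expression.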

Table 1 Six patterns for joint effect of gene pairs in binary-class simulation data
