- Software
- Open access
- Published:
JCD-DEA: a joint covariate detection tool for differential expression analysis on tumor expression profiles
BMC Bioinformatics volume 20, Article number: 365 (2019)
Abstract
Background
Differential expression analysis on tumor expression profiles has always been a key issue for subsequent biological experimental validation. It is important how to select features which best discriminate between different groups of patients. Despite the emergence of multivariate analysis approaches, prevailing feature selection methods primarily focus on multiple hypothesis testing on individual variables, and then combine them for an explanatory result. Besides, these methods, which are commonly based on hypothesis testing, view classification as a posterior validation of the selected variables.
Results
Based on previously provided A5 feature selection strategy, we develop a joint covariate detection tool for differential expression analysis on tumor expression profiles. This software combines hypothesis testing with testing according to classification results. A model selection approach based on Gaussian mixture model is introduced in for automatic selection of features. Besides, a projection heatmap is proposed for the first time.
Conclusions
Joint covariate detection strengthens the viewpoint for selecting variables which are not only individually but also jointly significant. Experiments on simulation and realistic data show the effectiveness of the developed software, which enhances the reliability of joint covariate detection for differential expression analysis on tumor expression profiles. The software is available at http://bio-nefu.com/resource/jcd-dea.
Background
Multiple hypothesis testing, which is a situation where more than one hypothesis is evaluated simultaneously [1], has been widely used for differential expression analysis on tumor expression profiles. In order to improve the statistical power, methods that address multiple testing by adjusting the p-value from a statistical test have been widely proposed for controlling the family-wise error rate (FWER) [2], false discovery rate (FDR) [3], q-value [4], etc.
Correspondingly, many tools deriving from multiple hypothesis testing have been produced for detecting differentially expressed genes. The siggenes bioconductor package, which uses the significance analysis of microarrays (SAM) [5], provides a resampling-based multiple testing procedure involving permutations of data. Linear models for microarray data (namely, limma), which help to shrink the estimated sample variances towards an estimate based on all gene variances, provide several common options (e.g., FWER and FDR) for multiple testing [6, 7]. The multtest package provides a wide range of resampling-based methods for both FWER and FDR correction [8]. Besides, a regression framework is proposed to estimate the proportion of null hypotheses conditional on observed covariates for controlling FDR [9].
Apart from multiple hypothesis testing on individual variables, multivariate hypothesis testing which indicates whether two distributions of samples are differential or not (e.g., Hotelling’s t2-test [10]) holds a non-mainstream position, considering the need of high dimensional matrix operation. With the increasing number of multidimensional features, multiple hypothesis testing also has to be provided to multivariate hypothesis testing, which needs more computation. Therefore, testing according to classification results is assured of a common place. Using classifiers (i.e., logistic regression model, supporting vector machine and random forest, etc. [11]), genes which together help to stratify sample populations are regarded as predictive.
In fact, it has been pointed out that hypothesis testing is regarded to be explanatory, while classification-based methods are viewed to be predictive [12]. As to multiple hypothesis testing on individual variables, it may leave out the explanatory signature. It has been found out in our previous researches [13, 14] that an explanatory pair expressed differently between two patient groups may not be composed of individually explanatory variables. As to various dimensional hypothesis testing and classification-based methods, how to select features not only obeying population distribution but also improving prediction accuracy needs to be further discussed. Thus, we proposed joint covariate detection for differential expression analysis on tumor expression profiles [13]. Three improvements have been made. First of all, we made a bottom-up enumeration of features in different dimensions of gene tuples. Secondly, various dimensional hypothesis testing was combined with classification-based method. Thirdly, a resampling procedure involving permutations of data, which was derived from A5 formulation [15], was constructed. Besides, a combined projection using cancer and adjacent normal tissues was made other than treating them separately [16–19], in order to make a better discriminative performance.
In this paper, we propose a joint covariate detection software for differential expression analysis on tumor expression profiles (i.e., abbreviated to JCD-DEA). In addition, we make three more improvements. Firstly, a model selection method based on Gaussian mixture model (GMM) [20] is introduced in, due to the need of automatic selection of features. Secondly, we present a projection heatmap other than traditional expression heatmap, which directly indicates the effectiveness of JCD-DEA. Thirdly, it is further discussed whether the adjacent normal tissues really work or not.
Method
Our JCD-DEA is concisely expressed, as illustrated in Fig. 1. At step A1, combined projection which corresponds to a linear projection (e.g., Fisher’s linear discriminate analysis [11]) of cancer and adjacent normal tissues on each gene is manually selected or not. Once combined projection is selected, two expression profiles which correspond to cancer and adjacent normal tissues respectively are merged into one projection profiles with two kinds of classification labels (e.g., metastasis or not). Dimension reduction projection refers to a linear projection across genes for enumeration of features in different dimensions bigger than one.
At step A2, values of expressions or projections with two kinds of classification labels are resampled at 90% in each dimension. Welch’s t-test is used on the one dimensional values of two categories for hypothesis testing. Permutations of data are alternatively utilized for overcoming the limitation of sample size. In addition, a classifier is trained using resampled 70% specimens and tested using the left 30% samples. An average classification error rate is calculated after certain rounds of resampling. More details about step A1 and step A2 can be seen in [13].
At step A3, hypothesis testing results are combined with those of classification-based testing. Unlike the voting strategy applied in [13], a GMM-based model selection method [20] for automatic feature selection is introduced in. The numbers of Gaussian mixtures for both p-values derived from hypothesis testing and average classification error rates are confirmed respectively. An intersection of features derived from the two minimum-mean-value Gaussian components respectively for hypothesis testing and classification-based testing is obtained and voted with one score for bonus point, as labeled with symbol \(\bigcap \) in Fig. 1. As shown in the flow chart of Fig. 1, step A2 and step A3 are repeated for score accumulation in order to ensure the reliability of the selected candidates.
Based on proposed bottom-up enumeration strategy on features with different dimensions, the above procedure is repeated beneath the upper bound of computing capacity. Tuples with different dimensions are voted and accumulated. GMM-based model selection [20] is again used for selection of features in each dimension. The Gaussian component with the minimum-mean-value for accumulation scores is chosen corresponding to candidates. If there is only one Gaussian component in a certain dimension, no candidates in this dimension are to be selected. Considering the discrimination power, candidates are to be chosen with dimensions as high as possible, as labeled with symbol \(\bigwedge \) in Fig. 1.
At step A5, we present a projection heatmap other than traditional expression heatmap for further decision. Projection values are derived from the expression values of selected candidates using the same projection method at previous steps. In fact, the thought of using a projection heatmap derives from the procedure of accumulations on classification results. Following the treatment of using projections at step A1 and step A2, it is obvious to use projection values for clustering other than to use simple expression values. The performance of candidates with different dimensions is evaluated by their projection heatmaps. According to Occam’s razor criteria [11], a candidate in a lower dimension while with a good clustering result on its projection heatmap is preferred.
Implementation
JCD-DEA is written mainly in MATLAB, distributed under GNU GPLv3. Variables which are either individually differential or jointly significant for distinguishing between groups of samples are identified. Due to the lack of adjacent normal tissues in some cancer diseases (e.g., brain cancer), Fisher’s linear discriminative analysis (LDA) other than corresponding bilinear projection [21] is also considered.
Due to the existence of repeating steps in JCD-DEA, we make a two-step implementation: a client part in Client.zip for analyzing expression profiles on personal computers or workstations, and a server part in Server.zip which is designed to run on cluster servers that using Portable Batch System(PBS) as scheduling program.
Step A1, step A2 and step A3 correspond to a MATLAB m-file S1_feature_selection.m for selection of feature(s) associated with differential expression analysis, as shown in Fig. 2. Parameters for assignment of feature dimension, times of permutation, rounds of iterations for step A2 and step A3, the threshold of prior probability for GMM-based automatic model selection for feature selection and other running environments are set. A display is also made after parameter setting, as shown in Fig. 3.
Step A4 and step A5 correspond to a MATLAB m-file S2_plot_heatmap.m for selection of feature(s) with high accumulation score(s), as shown in Fig. 4. Candidates derived from step A3 are further selected using GMM-based automatic model selection on their accumulation scores. In addition, a projection heatmap is made for indicating the hierarchical clustering result of each selected feature.
Detailed software documentation and tutorial are presented on http://bio-nefu.com/resource/jcd-dea.
Results
Results of the simulated data
In order to exhibit the effectiveness of JCD-DEA, we made a simulated data containing 500 samples equally divided into two categories in a 40 dimensional space. 34 variables of them are independently and identically distributed, each of which keeps a random mean value ranging from 10 to 30 and a same standard deviation 0.01. The left three variable pairs have jointly but not individually significant distributions respectively, subjecting to the following guidelines.
As illustrated in Fig. 5a, the variable pair miRNA-alternative 1 and miRNA-alternative 2 has a good sample distribution form and also a clear category distinction. The mean vectors corresponding to the two categories of samples are (1,1)T and (1.11,0.89)T. The two categories of samples keep a same covariance matrix, which is expressed as \(\left ({\begin {array}{cc} 1&{0.999}\\ {0.999}&1 \end {array}} \right)\).
As to variable pair miRNA-alternative 3 and miRNA-alternative 4, it ought to keep a good sample distribution form but an inferior category distinction. In order to achieve the above objectives, one fifth of samples are randomly and evenly selected and exchanged between the two categories, of which the mean vectors and the covariance matrix keep the same as the former pair before sample exchange, as plotted in Fig. 5b.
Scattered as Fig. 5c, variable pair miRNA-alternative 5 and miRNA-alternative 6 appears an inferior sample distribution form but a superior category distinction. Logically speaking, this might be caused by a very small amount of singular points that significantly different from others with the same label. We’ve found this situation in the expression values of miRNA hsa-mir-450 from data set GSE22058 and make the following surmises for the existence of such points.
-
It is just a special case among the expression values of a particular feature, and the corresponding sample should be removed in statistical view.
-
This is caused by an unbalanced sampling, which means that there might be undiscovered samples between the singular points and others (see Fig. 5d).
In order to achieve the above objectives, five samples of each category are resampled as singular points with their mean vectors (2,0)T and (0,2)T and the corresponding covariance matrix \(\left ({\begin {array}{cc} 0&{0}\\ {0}&0 \end {array}} \right)\).
Figure 5e shows a scatter plot of miRNA-alternative 1 and miRNA-alternative 5, which illustrates a noncorrelation across different variable pairs.
In fact, we made such a simulated data in order to verify the following three facts.
-
Significant feature may not be composed of individual variables expressed differentially between two patient groups.
-
Significant feature ought to keep not only a good sample distribution form but also a clear category distinction.
-
Projection heatmap corresponding to the classifier selected before may present a better clustering result other than traditional expression heatmap.
Fisher’s LDA was utilized for combined projection and dimension reduction projection at step A1 and the classifier at step A2. Besides, 100 rounds of resampling were performed at step A2 and step A3, with GMM priori probability for eliminating redundant Gaussian components set to 0.001. Correspondingly, GMM priori probability used at step A4 was set to 0.001.
A5 scores (i.e., accumulation scores) together with the p-values of Welch’s t-test and the average classification error rate derived from 100 rounds of Fisher’s LDA trained on 70% randomly selected samples and tested on 30% rest samples were calculated. The corresponding pairwise and individual results on simulation data are listed in Tables 1 and 2.
In Table 1, it is found that neither A5 scores nor the average classification error rates of individual miRNAs show significance. Several p-values (e.g., miRNA-null 8 and miRNA-null 22) exhibit false positives. Besides, variable importance of each miRNA is calculated using random forest [22] as listed in Table 1, which also shows no significance.
In Table 2, it is found that the variable pair miRNA-alternative 1 and miRNA-alternative 2 which keeps a statistically good distribution and also a clear category distinction, has the highest A5 score, the minimal p-value and the smallest average of classification error rate. As to the variable pair miRNA-alternative 3 and miRNA-alternative 4 which keeps a statistically good distribution but an inferior category distinction, a smaller p-value and a bigger average of classification error rate are listed. As to the variable pair miRNA-alternative 5 and miRNA-alternative 6 which has a statistically inferior distribution but a superior category distinction, it keeps a bigger p-value and a smaller average of classification error rate. As the result indicates, only the variable pair miRNA-alternative 1 and miRNA-alternative 2 has been selected by JCD-DEA, which shows the effectiveness of our method.
In addition, we made projection heatmaps (i.e., clustering on projection values instead of directly on original expression values) as plotted in Figs. 6a, 7a and 8a with the corresponding traditional heatmaps plotted in Figs. 6b, 7b, 8b. In each sub-figure, the up bar, the middle part and the bottom strip refer to the projection values, the expression values and the classification labels, respectively. Slices of the bottom strip colored in red and black in Fig. 6a are clearly separated, compared with Figs. 7a and 8a. Besides, comparisons within each figure show the effectiveness of using a projection heatmap.
Results of GSE6857
We also performed experiments on GSE6857 which is a public dataset containing 29 samples associated with metastasis cases and 102 samples corresponded to liver cancer without metastasis using linear and bilinear projection. Limited by computing capacity, we have only enumerated features in 2-dimensional space.
Results with GMM priori probability set to 5e-5 are listed in Table 3. Furthermore, only the pair hsa-mir-29b-1No1 and hsa-mir-338No1 has been selected with GMM priori probability set to 1e-5.
However, the result is not very ideal. As shown in Fig. 9a, though the red slices of the bottom strip tend to cluster in the right, there are misclassifications. In fact, when diagnosing whether there is metastasis, patients have been diseased. Thus, expressions of normal tissues might not be meaningful anymore.
On account of this, we made new hierarchical clusterings using linear projection on tumor and normal tissues instead of bilinear projection based on the pair selected above respectively. We found that the result on tumor is better than normal tissues, as shown in Fig. 9b and c. The other two pairs pointed in [13] also have the same situation.
Thus, we performed new experiments using only linear projection on tumor data with GMM priori probability set to 5e-5. Results are presented in Table 4. And only miRNA pair hsa-mir-194-2No1 and hsa-mir-346No1 is selected with GMM priori probability set to 1e-5. Compared to Figs. 9a, 10 indicates that linear projection on tumor tissues have a better clustering result than bilinear projection.
As illustrated in Fig. 10, the clustering result using projection values of the selected 2-dimension feature can achieve a demonstration effect comparable to the heatmap using expression values on dozen of variables (see Fig. 3 in [13]).
Though improvements have been made in Fig. 10, misclassification still exists, possibly due to the inadequate 2-dimension enumeration limited by our computing capacity.
Conclusions
JCD-DEA is a bottom-up enumeration tool for seeking not only explanatory but also predictive variables associated with the categories of patients on tumor expression profiles. Other than prevailing differential expression analysis, we concern various dimensional features expressed differentially on tumor expression profiles. In order to strengthen the reliability of selected candidates, both distribution-based and classification-based testing are considered. In addition, we introduce GMM-based model selection for automatic feature selection, which helps to choose features objectively. Finally, a projection heatmap is proposed for hierarchical clustering. On account of the potential possibilities on complicated distributions of samples, we plan to develop new top-down feature selection methods in the near future.
Availability and requirements
Project name: JCD-DEA
Project home page: http://bio-nefu.com/resource/jcd-dea
Operating system(s): Linux, Windows
Programming language: Matlab (≥R2012b), Python (≥ 3.0)
License: GPLv3
Any restrictions to use by non-academics: none
Abbreviations
- FDR:
-
False discovery rate
- FWER:
-
Family-wise error rate
- GMM:
-
Gaussian mixture model
- JCD-DEA:
-
Joint covariate detection for differential expression analysis
- LDA:
-
Fisher’s linear discriminative analysis
- PBS:
-
Portable Batch System
- SAM:
-
Significance analysis of microarrays
References
Russell S, Meadows LA, Russell RR. Microarray Technology in Practice, 1st ed. San Diego: Academic Press; 2009.
Hochberg Y, Tamhane AC. Multiple comparison procedures. Danvers: John Wiley; 1985.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995; 57:289–300.
Storey JD, Tibshirani R. Statistical significance for genome-wide experiments. Proc Natl Acad Sci. 2003; 100:9440–5.
Storey JD, Tibshirani R. SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays In: Parmigiani G, Garrett ES, Irizarry R, Zeger SL, editors. The Analysis of Gene Expression Data. New York: Springer: 2003. p. 272–90.
Phipson B, Lee S, Majewski IJ, Alexander WS, Smyth GK. Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Ann Appl Stat. 2016; 10:946–63.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43:e47.
Pollard KS, Dudoit S, van der Laan MJ. Multiple testing procedures: the multiset package and application to genomics In: Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S, editors. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer: 2005. p. 249–271.
Boca SM, Leek JT. A direct approach to estimating false discovery rates conditional on covariates. PeerJ. 2018; 6:e6035.
Kanji GK. 100 statistical tests, 3rd ed. Thousand Oaks: SAGE Publication; 2006.
Bishop CM. Pattern recognition and machine learning. New York: Springer; 2006.
Shmueli G. To Explain or to Predict?Stat Sci. 2010; 25:289–311.
Zhao XD, Wang L, Chen GS. Joint covariate detection on expression profles for identifying microRNAs related to venous metastasis in hepatocellular carcinoma. Sci Rep. 2017; 7:1–11.
Sun CQ, Zhao XD. Joint covariate detection on expression profiles forselecting prognostic mirnas in glioblastoma. Biomed Res Int. 2017; 2:1–10.
Xu L. In: Sun CY, Fang F, Zhou ZH, Yang WK, Liu ZY, (eds).Integrative hypothesis test and a5 Formulation: sample pairing delta, case control study, and boundary based statistics. Berlin, Heidelberg: Springer; 2013, pp. 887–902.
Schmitz R, Wright GW, Huang DW, Johnson CA, Phelan JD, Wang JQ, Roulland S, Kasbekar M, Young RM, Shaffer AL, Hodson DJ, Xiao WM, Yu X, Yang YD, Zhao H, Xu WH, Liu XL, Zhou B, Du W, Chan WC, Jaffe ES, Gascoyne RD, Connors JM, Campo E, Lopez-Guillermo A, Rosenwald A, Ott G, Delabie J, Rimsza LM, Tay KWK, Zelenetz AD, Leonard JP, Bartlett NL, Tran B, Shetty J, Zhao YM, Soppet DR, Pittaluga S, Wilson WH, Staudt LM. Genetics and pathogenesis of diffuse large B-cell lymphoma. N Engl J Med. 2018; 378:1396–407.
Wong CM, Wong CCL, Lee JMF, Fan DNY, Au SLK, Ng IOL. Sequential alterations of microrna expression in hepatocellular carcinoma development and venous metastasis. Hepatology. 2012; 55:1453–61.
Budhu A, Jia HL, Forgues M, Liu CG, Goldstein D, Lam A, Zanetti KA, Ye QH, Qin LX, Croce CM, Tang ZY, Wang XW. Identification of metastasis-related microRNAs in hepatocellular carcinoma. Hepatology. 2008; 47:897–907.
Ye QH, Qin LX, Forgues M, He P, Kim JW, Peng AC, Simon R, Li Y, Robles AI, Chen YD, Ma ZC, Wu ZQ, Ye SL, Liu YK, Tang ZY, Wang XW. Predicting hepatitis B virus–positive metastatic hepatocellular carcinomas using gene expression profiling and supervised machine learning. Nat Med. 2003; 9:416–23.
Huang T, Peng H, Zhang K. Model selection for Gaussian mixture models. Stat Sin. 2017; 27:147–69.
Xu L. Bi-linear matrix-variate analyses, integrative hypothesis tests, and case-control studies. Appl Inform. 2015; 2:1–17.
Breiman L. Random forests. Mach Learn. 2001; 45:5–32.
Acknowledgements
Not applicable.
Funding
This work has been supported by the financial support of Fundamental Research Funds for the Central Universities (No. 2572018BH01), National Undergraduate Innovation Project (No. 201710225127) and Specialized Personnel Start-up Grant (Also National Construction Plan of World-class Universities and First-class Disciplines, No. 41113237). The funding body of Fundamental Research Funds for the Central Universities played an important role in the design of the study, collection, analysis and interpretation of data and in writing the manuscript.
Availability of data and materials
The public dataset analysed during the current study is available in the GEO repository. GSE6857 is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6857.
The simulated data can be downloaded on http://bio-nefu.com/resource/jcd-dea.
Dataset GSE22058 containing singular point is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10457.
Author information
Authors and Affiliations
Contributions
XDZ conceived the general project and supervised it. YL, YNL, YMW were the principal developers. YL developed the main the graphical user interface parts. YNL has made the supplementary experiments on simulated data. YMW wrote the client-server connection related part and the software documentation. YL and YMW built the server components. XDZ wrote the underlying source code and the original manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Pairwise results on simulation data with a descending order of A5 scores. (PDF 153 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, Y., Liu, Y., Wu, Y. et al. JCD-DEA: a joint covariate detection tool for differential expression analysis on tumor expression profiles. BMC Bioinformatics 20, 365 (2019). https://doi.org/10.1186/s12859-019-2893-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859-019-2893-3