 Software
 Open Access
 Published:
JCDSA: a joint covariate detection tool for survival analysis on tumor expression profiles
BMC Bioinformatics volume 19, Article number: 187 (2018)
Abstract
Background
Survival analysis on tumor expression profiles has always been a key issue for subsequent biological experimental validation. It is crucial how to select features which closely correspond to survival time. Furthermore, it is important how to select features which best discriminate between lowrisk and highrisk group of patients. Common features derived from the two aspects may provide variable candidates for prognosis of cancer.
Results
Based on the provided twostep feature selection strategy, we develop a joint covariate detection tool for survival analysis on tumor expression profiles. Significant features, which are not only consistent with survival time but also associated with the categories of patients with different survival risks, are chosen. Using the miRNA expression data (Level 3) of 548 patients with glioblastoma multiforme (GBM) as an example, miRNA candidates for prognosis of cancer are selected. The reliability of selected miRNAs using this tool is demonstrated by 100 simulations. Furthermore, It is discovered that significant covariates are not directly composed of individually significant variables.
Conclusions
Joint covariate detection provides a viewpoint for selecting variables which are not individually but jointly significant. Besides, it helps to select features which are not only consistent with survival time but also associated with prognosis risk. The software is available at http://bionefu.com/resource/jcdsa.
Background
Due to the limited effectiveness of current clinical diagnoses, expression profiles are utilized for informing variables, which are not only associated with the categories of patients with different survival risks but also consistent with survival time [1]. Commonly, Cox proportional hazards regression analysis is used to seek relevant variables considering the continuity of the patients’ survival outcomes with right censoring [2]. As to small sample data with high dimension, Cox proportional hazards regression has to be combined with methods using dimension reduction or shrinkage such as partial least squares [3] and principal component analysis [4]. However, these approaches only provide a combination of variables. Besides, treestructured survival analysis [5], random survival forests [6] and that associated with hazards regression [7] are proposed for selection of features associated with survival outcomes. Anyway, these topdown strategies provide so many variable candidates that the real features which may reveal the possible molecular cause of different survival risks are inevitably submerged.
In contrast, univariable hazards regression analyses have been placed firmly in the mainstream. Bottomup strategies with different constraints such as leastangle regression [8] and sparse kernel [9] are utilized for providing variables associated with survival time. To the best of our knowledge, we are the first to present joint covariate detection [1] that combines significant variables consistent with survival time and associated with the categories of patients. Other than individually significant variables, we concentrate on bottomup enumeration of feature tuples, each component of which is either individually significant or not. This thought is inspired by Integrative Hypothesis Testing [10], which is used for selecting features differentially expressed between different groups of patients. Unlike Integrative Hypothesis Testing, joint covariate detection is faced with continuous survival time other than labels representing different categories of patients.
In this paper, we further divide the provided feature selection into two steps, i.e., selection of variables associated with survival outcomes and further feature selection for discrimination between patients with different survival risks. In addition, we develop a joint covariate detection tool for survival analysis on tumor expression profiles (i.e. JCDSA), which helps to conveniently select significant features either on a cluster or a workstation, even on a personal computer. Matlab R2012b and Python 3 are utilized as the development platform. miRNA expression data (Level 3) of 548 patients with GBM downloaded from TCGA (http://cancergenome.nih.gov) and the simulated data are considered to be the examples. Compared with the prevailing method named as random survival forests (i.e. RSF), JCDSA shows better experimental results, which demonstrates the effectiveness of our method.
Implementation
In order to elucidate joint covariate detection in brief, a schematic diagram is illustrated in Fig. 1 (Notations: x _{(i)} and β denote the expression levels of sample i and the regression coefficients of the detected variables, respectively. The summation in the denominator is over all subjects in the risk set at ordered survival time t_{(i)}, denoted by R(t_{(i)}). \(z_{k}^{0}\) denotes a null statistics by a random rearrangement of survival outcomes. The estimator of the expected number of deaths in highrisk group is denoted by \(\hat {e}_{1i}\), expressed as \(\hat {e}_{1i}\) = \(\frac {n_{1i}d_{i}}{n_{i}}\), where n _{ i } and d _{ i } represent the number at risk and of deaths at the observation of ordered survival time t_{(i)}, n_{1i} denotes the number at risk in highrisk group. The estimator of the variance of d_{1i} on the hypergeometric distribution is defined as \(\hat {v}_{1i}\) = \(\frac {n_{1i}n_{0i}d_{i}(n_{i}  d_{i})}{n_{i}^{2}(n_{i}  1)}\), where n _{0i} denotes the number at risk in lowrisk group. Q\(_{r}^{0}\) denotes a null statistics by a random rearrangement of survival outcomes). Input data is considered as expression profiles with survival time and censoring states of patients. Output data refers to selected features. Joint covariate detection corresponds to twostep feature selection, i.e., selection of features associated with survival outcomes and selection of features for discriminating between two risk groups.
Features associated with survival outcomes
We first consider to select features associated with survival time. A bottomup enumeration on ktuple with k variables is made. As to each ktuple, Cox proportional hazards regression analysis [2] is introduced. By making the maximum partial likelihood estimation on the partial likelihood function, we obtain k estimated regression coefficients on which Wald statistics are made. Furthermore, a permutation test is made on each Wald statistic. The ktuple with each component corresponding to a significant p value is regarded as a candidate feature associated with survival outcomes. More details can be seen in [1].
Features for discriminating between two risk groups
We then intend to select features for discriminating between lowrisk and highrisk group of patients, which conforms to doctors’ daily decision making process. As to each patient, a risk score which is the linear portion of the expression values using the Cox regression coefficients is calculated. A preassigned risk score is utilized as a cutoff value for stratification between highrisk and lowrisk group of patients. Logrank test is made. Furthermore, a permutation test is presented on each tuple, which has been selected to be associated with survival outcomes. The ktuple with a significant p value is regarded as a candidate feature for discriminating between two risk groups. More details can be also seen in [1].
Brief overview of the software
Our software, which is implemented in Matlab R2012b or other later versions, can work on different computational platforms (e.g., a cluster, a workstation, even a personal computer). Therefore, it contains two parts, i.e., client and server. Selection of features associated with survival outcomes is accomplished by two Matlab mfiles (i.e., ’/Client/S1_feature_selection.m’ and ’/Server/S1_feature_selection_on_server.m’). A further selection of features for stratification of patients is fulfilled by a Matlab mfile ’Client/S2_plot_draw.m’. If this program is implemented on a workstation or a personal computer, only the client part is needed. That is to say, users only need to concentrate on two GUIs (i.e., ’/Client/S1_feature_selection.m’ and ’Client/S2_plot_draw.m’) on the client part. Otherwise, the server part is also in demand. Data communications and environment configurations are actualized using Python 3. More details can be seen in the user’s guide on the website: http://bionefu.com/resource/jcdsa.
Results
According to the presented twostep feature selection strategy, we first consider selecting features associated with survival outcomes. Figure 2 illustrates this step. Cancer type can be selected or input by clicking the right side arrow if it is not supported in the type list. Other selections in the setting frame can be also made, details of which are listed in user’s guide. Before running at full speed, JCDSA estimates the finishing seconds which helps to make a further decision. After its completion, the result which records p value(s) of each ktuple is stored in ’/Client/Data/S1’. Figure 3 further illustrates the step of selecting features associated with survival outcomes (i.e., Step 2.1). By setting the threshold of the p value corresponding to permutation test on Wald statistic, features associated with survival outcomes are selected.
Using the miRNA expression data (Level 3) of 548 patients with GBM as an example, individually significant miRNAs and significant miRNAs in pairs are listed in Tables 1 and 2, respectively. After making careful comparisons between Tables 1 and 2, we conclude that significant features in high dimension may not be composed of individually significant miRNAs. Taking the significant pair miR10b and miR222 as an example, miR10b is not listed in Table 1, which shows that it is not individually significant. This phenomenon reveals the advantage of using joint covariate detection.
Together, Figs. 3, 4 and 5 illustrate the feature selection step for discriminating between two risk groups. In Fig. 3, after choosing the files that represent the original data and the result corresponding to significant features associated with survival time at Step 2.2, the software runs to Step 2.3 and Step 2.4.
As shown in Fig. 4, KaplanMeier analysis with parameters derived from logrank test and Harrell’s concordance index is made for further selection of features, which helps to discriminate between highrisk and lowrisk group of patients. Meanwhile, the result of risk score analysis is illustrated in Fig. 5. Correspondingly, results which refer to significant features are stored in ’Client/Data/S2/S2_3’ and ’Client/Data/S2/S2_4’, respectively.
In order to show the effectiveness of our method, we implemented the prevailing method named as random survival forests (i.e. RSF) on the miRNA expression data (Level 3) of 548 patients with GBM for comparison. 1000 binary survival trees were made, with each terminal node containing a minimum of d_{0}=10 unique deaths. We made 1000 permutations on each variable, and obtained the variable importance (VIMP) for each variable. The result is listed in Table 3.
After making careful comparisons between Tables 2 and 3, we find that miR10b is still unimportant, as it is not listed in Table 3. This phenomenon reveals the advantage of using joint covariate detection other than RSF. In fact, the individually significant miR222 keeps a p=0.0012 corresponding to logrank test with 10000 rounds of permutation. As to significant pair (i.e., miR222 and miR10b), it keeps a p=0.0002 which corresponds to logrank test with 10000 rounds of permutation. As to miR10b, it keeps a p=0.285, which is individually insignificant.
We simulated data under 40 independent dimensions, from which we assigned two to be significant. That is, the survival time S is defined as S=exp(−Xβ+ε), where X is the simulated gene expression matrix and β=[0.9,0.1,0.001,...,0.001]_{40} denotes the coefficient parameter. ε∼N(0,2). The sample size n is 50. The censoring states are generated, and yield 10 percent censoring for the simulated data.
The experimental results on simulated data are listed in Tables 4, 5 and 6, respectively. The significant pair closely associated with simulated survival outcomes are selected out, as shown in Table 5. In contrast, miRNAalternative 2 which is in absence in Table 4 shows insignificant (p=0.939), and illustrates less important in Table 6. These results demonstrate the effectiveness of our method. The simulated data and full tables corresponding to Tables 4, 5 and 6 can be downloaded on the website: http://bionefu.com/resource/jcdsa.
In order to show that selected variables are improbable false positive or false negative ones, we repeated the simulations above for 100 times with an enlarged sample size (n=500). The experimental results are illustrated in Fig. 6. Figure 6a denotes the p values (p<1e−3) of the significant pair through 100 times of simulation. However, miRNAalternative 2 individually shows less important, as illustrated in Fig. 6b. Comparisons between Fig. 6a and b indicate that the significant features are probably not composed of individually significant univariables. Figure 6c and d report the number of positive pairs and individuals through 100 times of simulation, respectively. No false negative results are discovered. In Fig. 6c, the maximum number of false positive pair is three, which indicates a small probability of false positive pair 0.0038 (i.e., \(3/C_{40}^{2}\)). As to Fig. 6d, the maximum number of false positive individual is also three; yet, the probability of false positive individual is 0.075 (i.e., 3/40).
Discussion
There are several states needed to be discussed. First, it is the significant multivariable other than combinations of individually significant univariables that contributes to selection of features not only consistent with survival outcomes but also associated with stratification of patients under different survival risks. This fact has been demonstrated by our experimental results in this paper. Second, components of each significant multivariable may keep a low correlation. This phenomenon has been discovered when experiments on the simulated data were made. Further evidence is still needed. Third, the correction for multiple hypothesis testing is absent, considering the computational cost of calculating FDR, q value, the adjusted p values, etc. on each pair or each highdimension tuple of variables. However, simulations are made, which demonstrate the effectiveness of our method.
Conclusion
Our joint covariate detection for survival analysis provides a new viewpoint for selecting variable candidates which are not individually but jointly significant. Following a twostep variable selection strategy, we propose a software (i.e., JCDSA) in order to help users to select features which are not only consistent with survival time but also associated with prognosis risk. JCDSA can be adapted for many categories of cancer. Users can easily operate it and conveniently obtain the experimental results for subsequent biological experimental validation.
Availability and requirements
Project name: JCDSA Project home page: http://bionefu.com/resource/jcdsaOperating system(s): Linux, Windows Programming language: Matlab (≥R2012b), Python (≥ 3.0) License: GPL (≥2) Any restrictions to use by nonacademics: none
Abbreviations
 GUI:

Graphical user interface
 GBM:

Glioblastoma multiforme
 JCDSA:

Joint covariate detection for survival analysis
 TCGA:

The Cancer Genome Atlas
References
 1
Sun CQ, Zhao XD. Joint covariate detection on expression profiles for selecting prognostic mirnas in glioblastoma. Biomed Res Int. 2017; 2:1–10.
 2
Cox DR. Regression models and life tables (with discussion). J R Stat Soc Series B. 1972; 34:187–220.
 3
Li H, Gui J. Partial cox regression analysis for highdimensional microarray gene expression data. Bioinformatics. 2004; 20:208–15.
 4
Li L, Li H. Dimension reduction methods for microarrays with application to censored survival data. Bioinformatics. 2004; 20:3406–12.
 5
Wallace ML. Timedependent treestructured survival analysis with unbiased variable selection through permutation tests. Stat Med. 2013; 33:4790–804.
 6
Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008; 2:841–60.
 7
Kawaguchi A, Yajima N, Tsuchiya N, Homma J, Sano M, Natsumeda M, Takahashi H, Fujii Y, Kakuma T, Yamanaka R. Gene expression signaturebased prognostic risk score in patients with glioblastoma. Cancer Sci. 2013; 104:1205–10.
 8
Gui J, Li HZ. Penalized cox regression analysis in the highdimensional and lowsample size settings, with applications to microarray gene expression data. Bioinformatics. 2005; 21:3001–8.
 9
Evers L, Messow CM. Sparse kernel methods for highdimensional survival data. Bioinformatics. 2008; 24:1632–8.
 10
Xu L. Bilinear matrixvariate analyses, integrative hypothesis tests, and casecontrol studies. Appl Inform. 2015; 2:1–39.
Acknowledgements
Not applicable.
Funding
This work has been supported by the financial support of Fundamental Research Funds for the Central Universities (No. 2572018BH01), National Undergraduate Innovation Project (No. 201610225050) and Specialized Personnel Startup Grant (Also National Construction Plan of Worldclass Universities and Firstclass Disciplines, No. 41113237). The funding body of Fundamental Research Funds for the Central Universities played an important role in the design of the study, collection, analysis and interpretation of data and in writing the manuscript.
Availability of data and materials
The dataset analysed during the current study is available in the TCGA repository, http://cancergenome.nih.gov. The simulated data can be downloaded on http://bionefu.com/resource/jsdca.
Author information
Affiliations
Contributions
XDZ conceived the general project and supervised it. YMW^{1}, YNL, YMW^{3} and XDZ were the principal developers. YMW^{1} has rewritten almost all the frontend codes and has majorly made the revision on the manuscript. YNL has made the supplementary experiments on new simulated data, which helps to illustrate the effectiveness of JCDSA on avoiding false positives. YS tested the software and made the improvement. XDZ wrote the underlying source code and the original manuscript. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Xudong Zhao.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Wu, Y., Liu, Y., Wang, Y. et al. JCDSA: a joint covariate detection tool for survival analysis on tumor expression profiles. BMC Bioinformatics 19, 187 (2018) doi:10.1186/s1285901822133
Received
Accepted
Published
DOI
Keywords
 Feature selection
 Expression profiles
 Survival analysis
 Prognosis
 Cancer