 Research
 Open access
 Published:
A novel approach to the analysis of Overall Survival (OS) as response with ProgressionFree Interval (PFI) as condition based on the RNAseq expression data in The Cancer Genome Atlas (TCGA)
BMC Bioinformatics volume 25, Article number: 300 (2024)
Abstract
Background
Overall Survival (OS) and ProgressionFree Interval (PFI) as survival times have been collected in The Cancer Genome Atlas (TCGA). It is of biomedical interest to consider their dependence in pathway detection and survival prediction. We intend to develop novel methods for integrating PFI as condition based on parametric survival models for identifying pathways associated with OS and predicting OS.
Results
Based on the framework of conditional probability, we developed a family of frailtybased parametricmodels for this purpose, with exponential or Weibull distribution as baseline. We also considered two classes of existing methods with PFI as a covariate. We evaluated the performance of three approaches by analyzing RNAseq expression data from TCGA for lung squamous cell carcinoma and lung adenocarcinoma (LUNG), brain lower grade glioma and glioblastoma multiforme (GBMLGG), as well as skin cutaneous melanoma (SKCM). Our focus was on fourteen general cancerrelated pathways. The 10fold crossvalidation was employed for the evaluation of predictive accuracy. For LUNG, p53 signaling and cell cycle pathways were detected by all approaches. Furthermore, three approaches with the consideration of PFI demonstrated significantly better predictive performance compared to the approaches without the consideration of PFI. For GBMLGG, ten pathways (e.g., Wnt signaling, JAKSTAT signaling, ECMreceptor interaction, etc.) were detected by all approaches. Furthermore, three approaches with the consideration of PFI demonstrated better predictive performance compared to the approaches without the consideration of PFI. For SKCM, p53 signaling pathway was detected only by our Weibullbaselinebased model. And three approaches with the consideration of PFI demonstrated significantly better predictive performance compared to the approaches without the consideration of PFI.
Conclusions
Based on our study, it is necessary to incorporate PFI into the survival analysis of OS. Furthermore, PFI is a survivaltype time, and improved results can be achieved by our conditionalprobabilitybased approach.
Background
Overall Survival (OS) and ProgressionFree Interval (PFI) as survival times have been widely employed to demonstrate the clinical impact of treatments, covariates, or biomarkers [1, 2]. In this study, Overall Survival (OS), defined as the duration from diagnosis to death from any cause, and ProgressionFree Interval (PFI), defined as the time from diagnosis to the occurrence of a new tumor event were selected as survival times. In biomedical research, the Cox proportional hazards model [3] is predominantly utilized to analyze OS or PFI separately. The principal objective of these studies is to understand the marginal effects of certain covariates on a chosen survival time and to facilitate prediction. The detection of pathways is pivotal in revealing the intricate biological processes and interactions that significantly impact patient outcomes [4, 5]. Furthermore, it is evident that there exists some dependence between OS and PFI, and their dependence may play a critical role in unraveling the mechanisms underlying disease progression [6,7,8]. Therefore, it is necessary to consider their dependence in pathway detection and survival prediction. In this study, we intend to develop novel methods for integrating PFI as condition based on parametric survival models for identifying pathways associated with OS and predicting OS.
These methods can be developed within the framework of a bivariate survival analysis, focusing on how to obtain an appropriate conditional distribution for the second survival time \(T_2\), when observing a survival time \(T_1\) (whether censored or not). Although current methodologies were available for analyzing the related bivariate survival problem, they often exhibit limitations in their applicability and flexibility [9, 10]. Day et al. [9] explored a similar problem, but they assumed that times were not sequential. In contrast, Henderson et al. [10] regarded times as sequential, yet their interest was confined to instances where the observed survival time was not censored. However, Day et al. [9] and Henderson et al. [10] did not consider the situation that observed survival time was censored. In this study, we proposed an approach to accommodate this situation.
Based on the framework of conditional probability distribution, we developed a family of novel frailtybased methods [11,12,13] for this purpose. Our frailtybased methods can accommodate the situation when the observed survival time as condition is censored and also capture certain nonlinear relationships. In our frailty models, we consider a gamma distribution for the frailty. We developed two sharedfrailtymodel based methods, with exponential or Weibull as baseline hazard function. The estimators are obtained by the maximum likelihood method. We derived the related likelihood ratio test to identify covariates associated with the future survival event when the observed survival event is considered as condition. Furthermore, we achieved the prediction of future survival event.
In the literature [10], there are two classes of existing methods related to this analysis: both with \(T_1\) as a covariate; \(T_2\) was modeled by traditional parametric regression models or Cox regression models. In the Cox regression model, \(T_1\) can be used as a timevarying/timedependent effect [14]. Both methods mirrored the approach suggested by Gail et al. [15] in their analysis of multiple tumor times.
We evaluated the performance of three approaches by analyzing RNAseq expression data from TCGA for lung squamous cell carcinoma and lung adenocarcinoma (LUNG), brain lower grade glioma and glioblastoma multiforme (GBMLGG), as well as skin cutaneous melanoma (SKCM). Our focus was on fourteen general cancerrelated pathways (Table 2). We utilized the existing singlesample Gene Set Enrichment Analysis (ssGSEA) algorithm [16], and each pathway is summarized as a covariate. The 10fold crossvalidation approach was employed for the evaluation of predictive accuracy, which was based on the timedependent Receiver Operating Characteristic (ROC) methodology [17,18,19]. In Fig. 1, we present the workflow diagram of our framework, which highlights the key steps and processes. To further illustrate our frailtybased methods, we evaluated the impact of censoring on likelihood ratio test and parameter estimation (In the Supplementary Materials). Then, we also demonstrated the advantages of frailtybased methods over traditional regression models (In the Supplementary Materials).
Methods
We first provide an overview of regression models used in survival analysis, beginning with the Cox proportional hazards model and traditional regression models, and then extending to shared gamma frailty models.
Background: Cox proportional hazards model
The Cox proportional hazards model [3] is a widely used method in survival analysis that relates the time of an event to covariates without making specific assumptions about the underlying hazard function. The model can be described by the following hazard function:
where \(\lambda (tx)\) is the hazard function, \(\lambda _0(t)\) is the baseline hazard, \(\beta\) represents the coefficients, and x represents the covariates.
Background: timevarying Cox model
The Cox model can be extended to handle timevarying/timedependent effects [20], where covariates change over time or coefficients change over time. The hazard function for a Cox model with timevarying covariates or timevarying coefficients can be written as:
where x(t) denotes the covariates that change over time, \(\beta (t)\) denotes the coefficients that change over time. This extension allows for more flexibility in modeling scenarios where the risk factors change as time progresses.
Background: baseline hazard functions and regression approach
Two baseline hazard functions are considered in the analysis: the exponential distribution and the Weibull distribution (Table 1). In practice, traditional regression model is often used to analysis ordered survival events. The probability density function of the traditional regression model is [10]
where \(\delta\), \(\lambda\), \(\beta\) and \(\gamma\) represent unknown parameter, x represents covariate, \(t_1\) and \(t_2\) represent two survival times.
Conditional models
Background: shared gamma frailty models
Shared frailty models are widely used models in survival analysis [21, 22]. In these models, we assume that the two time components are \(T_1\) and \(T_2\). In shared frailty models, we assume that \(T_1\) and \(T_2\) are conditionally independent given a shared unobservable random effect, namely the frailty Z. We consider that the frailty Z follows a gamma distribution (\(Z\sim \Gamma (\theta ,\theta ), \quad \theta >0\)). Given the gamma frailty Z with the observed covariate x, the conditional hazard function and the conditional cumulative hazard function of \(T_1\) and \(T_2\) are as follow
where \(\lambda _{10}(t)\) and \(\lambda _{20}(t)\) are baseline hazard functions, \(\Lambda _{10}(t)\) and \(\Lambda _{20}(t)\) are baseline cumulative hazard functions.
The conditional survival functions of \(T_1\) and \(T_2\) are as follow
Because \(T_1\) and \(T_2\) are conditionally independent given the gamma frailty Z, the bivariate survival function of \(T_1\)and \(T_2\) given the gamma frailty Z is
Then, the bivariate density function of \(T_1\) and \(T_2\) is (The derivation details are described in the Supplementary Materials.)
Conditional distribution
The conditional density function of \(T_2\) given \(T_1\) is (The derivation details are described in the Supplementary Materials.)
In practice, it is often to observe right censored \(T_1\) and \(T_2\). The corresponding conditional probabilities are as follow (The derivation details are described in the Supplementary Materials)
It can be seen from the above formulas that our models actually do not assume that \(T_1 \le T_2\). However, in practice, it is often to observe \(T_1 \le T_2\). Based on the above relationships, we find that the conditional probability can maintain an increasingly monotone relationship between \(T_1\) and \(T_2\) (regardless of whether \(T_1\) is censored or not), which is commonly observed in practice. For example, in the TCGA datasets, if the tumor recurrence could be delayed (PFI), then the survival probability (OS) could be higher.
Model interpretation
The hazard functions of conditional model are as follow
It can be seen from the above formulas that the predictive score of evaluating individual hazard is \(\theta e^{\beta _2x}+\Lambda _{10}(t_1)e^{(\beta _1\beta _2)x}\). We can obtain the relationship of the individual hazard and covariate x as follow

1.
\(\beta _2 \ge 0,\beta _1 \le \beta _2.\)
The conditional hazard function increases with respect to x;

2.
\(\beta _2 \ge 0,\beta _1 > \beta _2.\)
The conditional hazard function with respect to x increases at \([0,\frac{1}{\beta _1}\log \frac{\beta _2\theta }{\Lambda _{10}(t_1)(\beta _1\beta _2)}]\) and decreases at \((\frac{1}{\beta _1}\log \frac{\beta _2\theta }{\Lambda _{10}(t_1)(\beta _1\beta _2)},+\infty ]\);

3.
\(\beta _2 < 0, \beta _1 \ge \beta _2.\)
The conditional hazard function decreases with respect to x;

4.
\(\beta _2< 0,\beta _1 < \beta _2.\)
The conditional hazard function with respect to x decreases at \([0,\frac{1}{\beta _1}\log \frac{\beta _2\theta }{\Lambda _{10}(t_1)(\beta _1\beta _2)}]\) and increases at \((\frac{1}{\beta _1}\log \frac{\beta _2\theta }{\Lambda _{10}(t_1)(\beta _1\beta _2)},+\infty ]\).
Clearly, censored \(t_1\) does not affect the relationship above.
Theoretical results
We have proven the five lemmas required for our models in the Supplementary Materials. In \({\textbf {Lemma\;3}}\) and \({\textbf {Lemma\;4}}\), we generalize two baseline hazard functions to any positive function and generalize the frailty distribution to the powervariancefunction (PVF) family [11].
Likelihoodbased estimation
We assume that the sample data were \((t_{1i},t_{2i},\delta _{1i},\delta _{2i},x_i \quad i=1 \dots n)\), where \(\delta _{1i}=0\) represents censored \(t_{1i}\), \(\delta _{1i}=1\) represents uncensored \(t_{1i}\), \(\delta _{2i}=0\) represents censored \(t_{2i}\) and \(\delta _{2i}=1\) represents uncensored \(t_{2i}\).
In the exponentialbaselinebased model, the baseline hazard functions are
The likelihood function \(L(\theta ,\lambda _1,\lambda _2,\beta _1,\beta _2)\) is
In the Weibullbaselinebased model, the baseline hazard functions are
The likelihood function \(L(\theta ,\lambda _1,\lambda _2,\beta _1,\beta _2,\rho _1,\rho _2)\) is
The maximum likelihood estimators asymptotically follow normal distributions with mean being the true value of the parameters and covariance matrix being the inverse of the nfold Fisher information matrix by the \({\textbf {Lemma\;1}}\) (In the Supplementary Materials). We calculated the maximum likelihood estimates by the \({\textbf {optim}}\) function in R or the NLMIXED Procedure in SAS.
Variance and confidence interval
In Section Likelihoodbased Estimation, we have stated that the estimated variances of the maximum likelihood estimates of \(\beta _1\) and \(\beta _2\) can be calculated by the inverse of the nfold Fisher information matrix and we can calculate the estimated variances of the maximum likelihood estimates of \(\beta _1\) and \(\beta _2\) by the \({\textbf {optim}}\) function in R or the NLMIXED Procedure in SAS. The maximum likelihood estimates of \(\beta _1\) and \(\beta _2\) asymptotically follow normal distributions. Therefore, the \(95\%\) confidence interval of \(\beta _1\) and \(\beta _2\) are as follow: assume the estimated standard deviations of \(\beta _1\) and \(\beta _2\) are respectively \(\hat{s_1},\hat{s_2}\), the estimated nfold Fisher information matrix is \({\hat{H}}\) and the estimation of \(\beta _1\) and \(\beta _2\) are respectively \(\hat{\beta _1},\hat{\beta _2}\), the \(95\%\) confidence interval of \(\beta _1\) and \(\beta _2\) are respectively
Hypothesis testing
When we consider \(T_1\), the relationship between \(T_2\) and the covariate x is not only related to \(\beta _2\), it may also be related to \(\beta _1\). According to \({\textbf {Lemma\;2}}\) (In the Supplementary Materials), \(\beta _1=\beta _2=0\) is equivalent to that \(T_2\) is independent of the covariate x given \(T_1\) in three shared frailty models.
Therefore, we can consider the following hypothesis test
For the above hypothesis test, we used the likelihood ratio test.
The likelihood ratio statistics are as follow
According to \({\textbf {Lemma\;5}}\) (In the Supplementary Materials), the above likelihood ratio test statistics have asymptotical distributions, i.e. the double loglikelihood ratio test statistics asymptotically follow chisquare distribution with two degrees of freedom (\(\chi ^2(2)\)). Therefore, we test the covariate x by the likelihood ratio test. The above likelihood ratio test statistics \(\Lambda\) can be calculated using the \({\textbf {optim}}\) function in R or the NLMIXED Procedure in SAS.
Singlesample gene set enrichment analysis (ssGSEA)
We utilized the R package \({\textbf {GSVA}}\) to perform ssGSEA [16] at the individual sample level, using pathways as gene sets. The pathway information was obtained from the Molecular Signatures Database (MSigDB) [23, 24].
Kaplan–Meier (K–M) curve
We utilized the R package \({\textbf {survival}}\) to create a Kaplan–Meier (K–M) [25] curves. The K–M curves estimates and visualizes the survival probability over time.
Timedependent receiver operating characteristic (ROC)
We use the timedependent ROC methodology (a modified version of the conventional ROC methodology) to evaluate the predictive accuracy of our models [17,18,19]. The areas under the timedependent ROC curves (AUC(t)) are often used to compare the predictive accuracy of several prediction models. Since Heagerty [17, 18] proposed the timedependent ROC methodology and some definitions of cases and controls, there are many methods to estimate the timedependent ROC curve. We choose a nonparametric estimator of the timedependent ROC curve using the inverse probability of censoring weighting (IPCW) approach [19]. We plot AUC(t) of IPCW estimators by the R package \({\textbf {timeROC}}\).
Results
TCGA data
The RNA sequencing expression data from The Cancer Genome Atlas (TCGA) database were considered. Liu et al. [26] recently curated and standardized genomic and survival information for the TCGA dataset into the TCGA PanCancer Clinical Data Resource (TCGACDR). Therefore, we downloaded genomic and survival data for the TCGA datasets from UCSC Xena Browser (https://xenabrowser.net/datapages/). In this study, Overall Survival (OS), defined as the duration from diagnosis to death from any cause, and ProgressionFree Interval (PFI), defined as the time from diagnosis to the occurrence of a new tumor event were selected as endpoints because these were deemed to be relatively accurate endpoints by Liu et al. But since PFI contains information about OS, we did not consider the case where PFI equals OS. Then the analysis was conducted with the PFI as \(T_1\) and OS as \(T_2\). Time is measured in days. Among 186 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, we primarily focus on the analytical findings for a subset specifically associated with cancer. There exists a total of sixteen pathways in KEGG but fourteen of them are catalogued within the MSigDB. In this study, ssGSEA was performed on these fourteen pathways (Table 2) to calculate cancerrelated predictive scores.
In this study, we focused on cancer datasets with relatively large sample sizes. After excluding missing values and removing cases where PFI equals OS, there are three datasets that can be used for patient survival analysis. Therefore, our subsequent analyses are based on lung squamous cell carcinoma and lung adenocarcinoma (LUNG; n=306), brain lower grade glioma and glioblastoma multiforme (GBMLGG; n=276), as well as skin cutaneous melanoma (SKCM; n=268).
LUNG data
Data description
From Fig. 2, we can observe that PFI and OS are mostly concentrated in the interval [0, 4000]. It is evident from Fig. 2 that PFI and OS are highly positively correlated. Among these patients, the censoring proportion for PFI is \(0.0\%\) while the censoring proportion for OS is \(35.6\%\). An intuitive overview of gene set activity across the lung cancer samples is provided by the heatmap of the GSVA output in the LUNG dataset, shown in Figure S1 (In the Supplementary Materials).
Pathway detection
Six different models were performed to identify pathways significantly associated with OS: “\(\hbox {Cox}_1\)” represents the Cox regression model (consider corresponding pathway as only covariate), “\(\hbox {Cox}_{{time}}\)” represents the Cox regression model (consider corresponding pathway as a covariate, but PFI as a timevarying/timedependent effect, using timevarying coefficient), “\(\hbox {Regression}_1\)” represents the traditional regression model (consider corresponding pathway as only covariate), “\(\hbox {Regression}_2\)” represents the traditional regression model (consider corresponding pathway and PFI as covariates), the exponentialbaselinebased model, the Weibullbaselinebased model. Table 2 presents the analysis results of six models:

Notably, these six models all detected p53 signaling pathway and cell cycle pathway significantly positively correlated with OS.
It is wellknown that p53 signaling pathway and cell cycle pathway are closely linked to the development of lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD) [27,28,29,30].
Predictive performance
In our study, we employed a 10fold crossvalidation method to compare the predictive performance of the models. Initially, the entire dataset was divided into ten subsets of equal size. This method involved ten independent cycles of training and validation. In each cycle, a unique subset was selected as the validation set, while the remaining nine subsets were combined to form the training set. This approach ensured that each subset served as a validation set once and as part of the training set at other times. Following each training and validation cycle, the predictive score of models on the current validation set was assessed. Upon completion of all ten cycles, the predictive scores from the ten validation sets (encompassing the entire dataset) were aggregated to derive the predictive score of models for the full dataset.
We compared the predictive performance of six models. For each fold (10fold crossvalidation), based on the training data, the association pvalues were first computed; the pairwise pearson correlations were then calculated. A network adjacency matrix was constructed accordingly based on the absolute correlation value (>0.6); for each connected component in the network, only the pathway with the smallest associated pvalue was selected. (Please see Table S13–S15 in the supplementary materials for the related details, including the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model, the \(\hbox {Regression}_2\) model, the exponentialbaselinebased model, the Weibullbaselinebased model.) The predictive performance was evaluated using the timedependent ROC methodology. The predictive scores of the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model and the \(\hbox {Regression}_2\) model are all based on linear predictors. The predictive scores of the exponentialbaselinebased model and the Weibullbaselinebased model are based on \(\theta e^{\varvec{\beta ^T_2x}}+\Lambda _{10}(t_1)e^{\varvec{(\beta _1\beta _2)^Tx}}\) (see details in Section Model Interpretation).
In Fig. 3 and Figure S5 (In the Supplementary Materials), we can observe that our Weibullbaselinebased model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate significantly better predictive performance in comparison to the \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model in [0, 3000]. And we also can observe that the exponentialbaselinebased model, the \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model demonstrate poor predictive capabilities, with the AUC values being approximately 0.5.
GBMLGG data
Data description
From Fig. 2, we can similarly observe that PFI and OS are mostly concentrated in the interval [0, 4000]. It is evident from Fig. 2 that PFI and OS are highly positively correlated. Among these patients, the censoring proportion for PFI is \(0.0\%\) while the censoring proportion for OS is \(34.1\%\). An intuitive overview of gene set activity across the glioma cancer samples is provided by the heatmap of the GSVA output in the GBMLGG dataset, shown in Figure S2 (In the Supplementary Materials).
Pathway detection
Similar six different models were performed to identify pathways significantly associated with OS. Table 3 presents the analysis results of six models:

The TGFbeta signaling pathway exhibited a significantly negative correlation with OS in the \(\hbox {Cox}_1\) and the \(\hbox {Regression}_1\) models. In contrast, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model detected that TGFbeta signaling pathway is not significant correlated with OS. Similarly, the Weibull model detected that TGFbeta signaling pathway is not significant correlated with OS.

The \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model both detected MAPK signaling pathway significantly negatively correlated with OS. Similarly, the Weibull model detected that MAPK signaling pathway is significantly negatively correlated with OS. But the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) models detected that MAPK signaling pathway is not significant correlated with OS.

The \(\hbox {Regression}_1\) model detected PPAR signaling pathway significantly positively correlated with OS. The Weibull model detected that PPAR signaling pathway is not significant correlated with OS.

Notably, the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model, the \(\hbox {Regression}_2\) model and the Weibull model all detected that eight pathways (JAKSTAT signaling, ECMreceptor interaction, cytokinecytokine receptor interaction, focal adhesion, VEGF signaling, apoptosis, p53 signaling, cell cycle) are significantly positively correlated with OS, two pathways (Wnt signaling, mTOR signaling) are significantly negatively correlated with OS.
For brain lower grade glioma (LGG) and glioblastoma multiforme (GBM), many pathways have been extensively studied. For instance, it has been studied that Wnt signaling pathway, JAKSTAT signaling pathway, ECMreceptor interaction pathway, cytokinecytokine receptor interaction pathway and focal adhesion pathway are related to the growth and survival of tumors [31,32,33,34,35].
Predictive performance
Similarly, we employed a 10fold crossvalidation method to compare the predictive performance of the models. We compared the predictive performance of six models. For each fold (10fold crossvalidation), based on the training data, the association pvalues were first computed; the pairwise pearson correlations were then calculated. A network adjacency matrix was constructed accordingly based on the absolute correlation value (>0.6); for each connected component in the network, only the pathway with the smallest associated pvalue was selected. (Please see Table S16–S18 in the supplementary materials for the related details, including the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model, the \(\hbox {Regression}_2\) model, the exponentialbaselinebased model, the Weibullbaselinebased model.) The predictive performance was evaluated using the timedependent ROC methodology. The predictive scores of the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model and the \(\hbox {Regression}_2\) model are all based on linear predictors. The predictive scores of the exponentialbaselinebased model and the Weibullbaselinebased model are based on \(\theta e^{\varvec{\beta ^T_2x}}+\Lambda _{10}(t_1)e^{\varvec{(\beta _1\beta _2)^Tx}}\) (see details in Section Model Interpretation).
In Fig. 4 and Figure S6 (In the Supplementary Materials), we can observe that our Weibullbaselinebased model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate better predictive performance in comparison to the \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model. And we can observe that the exponentialbaselinebased model demonstrate poor predictive capabilities, with the AUC values being approximately 0.5.
SKCM data
Data description
From Fig. 5, we can similarly observe that PFI and OS are mostly concentrated in the interval [0, 7000]. It is evident from Fig. 5 that PFI and OS are highly positively correlated. Among these patients, the censoring proportion for PFI is \(0.0\%\) while the censoring proportion for OS is \(39.6\%\). An intuitive overview of gene set activity across the skin cancer samples is provided by the heatmap of the GSVA output in the SKCM dataset, shown in Figure S3 (In the Supplementary Materials).
Pathway detection
Similar six different models were performed to identify pathways significantly associated with OS. Table 4 presents the analysis results of six models:

The Wnt signaling pathway and mTOR signaling pathway exhibited a significantly positive correlation with OS in the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model. Similarly, the Weibull model detected that Wnt signaling pathway and mTOR signaling pathway are significantly positive correlated with OS. In contrast, the \(\hbox {Cox}_1\) and the \(\hbox {Regression}_1\) models detected that Wnt signaling pathway and mTOR signaling pathway are not significantly correlated with OS.

The Weibull model detected p53 signaling pathway significantly positively correlated with OS, but this correlation was not detected by other models.

Notably, the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model, the \(\hbox {Regression}_2\) model and the Weibull model all detected that two pathways (adherens junction, cell cycle) are significantly positively correlated with OS, three pathways (JAKSTAT signaling, cytokinecytokine receptor interaction, apoptosis) are significantly negatively correlated with OS.
For skin cutaneous melanoma (SKCM), many pathways have been extensively studied. For instance, it has been shown that JAKSTAT signaling pathway is related to the disease development, as well as cell cycle pathway, p53 signaling pathway and apoptosis pathway [36]; it also wellknown that mTOR signaling pathway plays an important role in tumor progression [37].
Predictive performance
Similarly, we employed a 10fold crossvalidation method to compare the predictive performance of the models. We compared the predictive performance of six models. For each fold (10fold crossvalidation), based on the training data, the association pvalues were first computed; the pairwise pearson correlations were then calculated. A network adjacency matrix was constructed accordingly based on the absolute correlation value (>0.6); for each connected component in the network, only the pathway with the smallest associated pvalue was selected. (Please see Table S19–S21 in the supplementary materials for the related details, including the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model, the \(\hbox {Regression}_2\) model, the exponentialbaselinebased model, the Weibullbaselinebased model.) The predictive performance was evaluated using the timedependent ROC methodology. The predictive scores of the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model and the \(\hbox {Regression}_2\) model are all based on linear predictors. The predictive scores of the exponentialbaselinebased model and the Weibullbaselinebased model are based on \(\theta e^{\varvec{\beta ^T_2x}}+\Lambda _{10}(t_1)e^{\varvec{(\beta _1\beta _2)^Tx}}\) (see details in Section Model Interpretation).
In Fig. 6 and Figure S7 (In the Supplementary Materials), we can observe that our Weibullbaselinebased model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate significantly better predictive performance in comparison to the \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model in [0, 4500]. And we also can observe that the exponentialbaselinebased model, the \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model demonstrate poor predictive capabilities, with the AUC values being approximately 0.5.
Discussion and conclusions
We presented some comparisons of several models using TCGA datasets for LUNG, GBMLGG and SKCM. For LUNG, p53 signaling and cell cycle pathways were detected significantly positively correlated with OS by all approaches. Our Weibullbaselinebased model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate significantly better predictive performance in comparison to the approaches without the consideration of PFI in [0, 3000]. For GBMLGG, eight pathways (JAKSTAT signaling, ECMreceptor interaction, cytokinecytokine receptor interaction, focal adhesion, VEGF signaling, apoptosis, p53 signaling, cell cycle) were detected significantly positively correlated with OS by all approaches, two pathways (Wnt signaling, mTOR signaling) were detected significantly negatively correlated with OS by all approaches. Our Weibullbaselinebased model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate better predictive performance in comparison to the approaches without the consideration of PFI. For SKCM, p53 signaling pathway was detected significantly positively correlated with OS by our Weibullbaselinebased model but not by the other approaches. Our Weibullbaselinebased model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate significantly better predictive performance in comparison to the approaches without the consideration of PFI in [0, 4500]. In the Supplementary Materials, we present additional analysis based on a cohort of breast invasive carcinoma (BRCA) samples (n=111). For BRCA, MAPK signaling pathway was detected significantly negatively correlated with OS by our Weibullbaselinebased model but not by the other approaches. Our Weibullbaselinebased model demonstrate significantly better predictive performance in comparison to the approaches without the consideration of PFI in [0, 3000]. Based on our study, it is necessary to incorporate PFI into a survival analysis so that additional pathways associated with OS may be discovered, and the predictive performance for OS can be improved. PFI is a survivaltype time. Based on the conditional probability framework, further improved results can be obtained by our proposed approaches. We also performed Cox regression with timevarying/timedependent covariates for prediction performance evaluation [38], and our conclusion still remains, please see Figs. S46–S49 (In the Supplementary Materials) for more details. In practice, it may be generally difficult to obtain PFI in advance. In some instances, the PFI may be observed in advance [39, 40]. If PFI could be observed in advance, it would enhance the prediction of OS. This would encourage us to collect information about PFI as much as possible. Due to the correlations between PFI and OS, the prediction performance on OS can be significantly enhanced. In the analysis of bivariate correlated survival data, it is common to construct frailty models. This approach helps account for the correlation and unobserved heterogeneity between bivariate survival times.
Due to the complexity of frailty models, parameter estimation through maximum likelihood optimization may be sensitive to the choice of initial values. Therefore, we recommend exploring multiple initial values during model application to ensure appropriate parameter estimation. The baseline hazard function for Weibullbaseline based data exhibits a strictly increasing trend. The exponentialbaselinebased model and the piecewiseexponentialbaselinebased model (In the Supplementary Materials) are less likely to capture this trend. So in the Weibullbaseline based data, we observed that the likelihood ratio test distribution does not asymptotically follow \(\chi ^2(2)\) for the exponentialbaselinebased model and piecewiseexponentialbaselinebased model (In the Supplementary Materials). And the corresponding parameter estimation and hypothesis testing are abnormal. Furthermore, when identifying associated covariates, excessive zero values for covariate may result in the failure of the likelihood ratio test. In this case, appropriate transformations to “x” (covariate) can solve this issue. In our study, we use bulk RNAseq data. When analyzing bulk RNAseq data, mixed cell types can lead to confounding results. Deconvolution can separate the gene expression levels of each cell type, thereby avoiding the confusion caused by cell mixture. We obtained a ranked list of genes through the deconvolution method (NMF), and then used Gene Set Enrichment Analysis (gseapy module in Python) to assess the significance of fourteen cancerrelated pathways. The specific results are shown in Table S37–S40.
For the piecewiseexponentialbaselinebased model (In the Supplementary Materials), there are several methods to divide the interval, such as quantiles or empirical hazard plots [41]. In our simulation studies (In the Supplementary Materials), the intervals of the piecewiseexponentialbaselinebased model match the optimal intervals (consistent with the data generation process). In this case, its performance is also not clearly higher to the Weibullbaselinebased model. We primarily focused on the frailty following a gamma distribution. Other distributions like lognormal or inverse Gaussian can also be considered [42], but the computing will become complicated. Our models primarily utilize shared frailty models. However, it is worth noting that they can also be expanded to the multivariable frailty models, as demonstrated in prior works [43,44,45,46]. Additionally, another generalization of shared frailty models can be achieved through the incorporation of copula models, as explored in previous studies [47,48,49]. When confronted with high censoring rates, our models can be optimized by using the multiple imputation method, as indicated in previous studies [50, 51]. Moreover, our models exhibit the versatility to accommodate not only random right censoring but also random left censoring or random interval censoring, as explored in the literature [52,53,54,55]. To enhance our model adaptability, the baseline hazard function can be approximated with spline functions [8]. The spline functions provide a more flexible approach to capture the shape of the hazard function.
Availability of data and materials
For each type of tumor, we downloaded TCGA genomic data and survival data based on the following dataset ID from UCSC Xena Browser (https://xenabrowser.net/datapages/).
\(\bullet\) Lung Cancer: TCGA.LUNG.sampleMap/HiSeqV2\(\_\)PANCAN and survival/LUNG\(\_\)survival.txt; \(\bullet\) Lower grade glioma and glioblastoma: TCGA.GBMLGG.sampleMap/HiSeqV2\(\_\)PANCAN and survival/GBMLGG\(\_\)survival.txt; \(\bullet\) Melanoma: TCGA.SKCM.sampleMap/HiSeqV2\(\_\)PANCAN and survival/SKCM\(\_\)survival.txt; \(\bullet\) Breast Cancer: TCGA.BRCA.sampleMap/HiSeqV2\(\_\)PANCAN and survival/BRCA\(\_\)survival.txt. The code used in this study is available at https://github.com/1LinBo/Survival/tree/main.
Abbreviations
 OS:

Overall Survival
 PFI:

ProgressionFree Interval
 TCGA:

The Cancer Genome Atlas
 LUNG:

Lung squamous cell carcinoma and lung adenocarcinoma
 GBMLGG:

Brain lower grade glioma and glioblastoma multiforme
 SKCM:

Skin cutaneous melanoma
 ssGSEA:

Singlesample Gene Set Enrichment Analysis
 ROC:

Receiver Operating Characteristic
 PVF:

Powervariancefunction
 MSigDB:

Molecular Signatures Database
 KM:

KaplanMeier
 IPCW:

Inverse probability of censoring weighting
 TCGACDR:

TCGA PanCancer Clinical Data Resource
 KEGG:

Kyoto Encyclopedia of Genes and Genomes
 GSEA:

Gene Set Enrichment Analysis
 LUSC:

Lung squamous cell carcinoma
 LUAD:

Lung adenocarcinoma
 LGG:

Brain lower grade glioma
 GBM:

Glioblastoma multiforme
References
Hudis CA, Barlow WE, Costantino JP. Proposal for standardized definitions for efficacy end points in adjuvant breast cancer trials: the STEEP system. J Clin Oncol. 2007;25(15):2127–32.
Punt CJA, Buyse M, Köhne C. Endpoints in adjuvant treatment trials: a systematic review of the literature in colon cancer and proposed definitions for future trials. J Natl Cancer Inst. 2007;99(13):998–1003.
David Cox R, David R. Regression models and life tables. J Roy Stat Soc. 1972;34(2):187–220.
Xie HY, Wang WJ, Sun FY, et al. Proteomics analysis to reveal biological pathways and predictive proteins in the survival of highgrade serous ovarian cancer. Sci Rep. 2017;7(1):9896.
Lyudovyk O, Shen YF, Tatonetti NP, et al. Pathway analysis of genomic pathology tests for prognostic cancer subtyping. J Biomed Inform. 2019;98: 103286.
Michiels S, Le MA, Buyse M, et al. Surrogate endpoints for overall survival in locally advanced head and neck cancer: metaanalyses of individual patient data. Lancet Oncol. 2009;10(4):341–50.
Rondeau V, Pignon JP, Michiels S. A joint model for the dependence between clustered times to tumour progression and deaths: a metaanalysis of chemotherapy in head and neck cancer. Stat Methods Med Res. 2015;24(6):711–29.
Emura T, Nakatochi M, Murotani K, et al. A joint frailtycopula model between tumour progression and death for metaanalysis. Stat Methods Med Res. 2017;26(6):2649–66.
Day R, Bryant J, Lefkopoulou M. Adaptation of bivariate frailty models for prediction, with application to biological markers as prognostic indicators. Biometrika. 1997;84(1):45–56.
Henderson R, Prince H. Choice of conditional models in bivariate survival. Stat Med. 2000;19(4):563–74.
Duchateau L, Janssen P. The frailty model. Berlin: Springer; 2008.
Manda SOM. A nonparametric frailty model for clustered survival data. Commun StatTheory Methods. 2011;40(5):863–75.
Gasperoni F, Ieva F, Paganoni AM, et al. Nonparametric frailty Cox models for hierarchical timetoevent data. Biostatistics. 2020;21(3):531–44.
Rogoz B, de l’Aulnoit A H, Duhamel A, et al. Thirtyyear trends of survival and timevarying effects of prognostic factors in patients with metastatic breast cancer—a single institution experience. Clin Breast Cancer. 2018;18(3):246–53.
Gail MH, Santner TJ, Brown CC. An analysis of comparative carcinogenesis experiments based on multiple times to tumor. Biometrics. 1980;1:255–66.
Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNAseq data. BMC Bioinf. 2013;1:1.
Heagerty PJ, Lumley T, Pepe MS. Timedependent ROC curves for censored survival data and a diagnostic marker. Biometrics. 2000;56(2):337–44.
Heagerty PJ, Zheng Y. Survival model predictive accuracy and ROC curves. Biometrics. 2005;61(1):92–105.
Blanche P, Dartigues JF, JacqminGadda H. Estimating and comparing timedependent areas under receiver operating characteristic curves for censored event times with competing risks. Stat Med. 2013;32(30):5381–97.
Fisher LD, Lin DY. Timedependent covariates in the Cox proportionalhazards regression model. Annu Rev Public Health. 1999;20(1):145–57.
Mwikali Muli A, HouwingDuistermaat J, Gusnanto A. Use of shared gamma frailty model in analysis of survival data in twins. Use of Shared Gamma Frailty Model in Analysis of Survival Data in Twins, 2021; pp. 45–58.
Esayas LM, Akessa GM, Kifle DD. Application of parametric shared frailty models to analyze timetodeath of gastric cancer patients. J Gastrointest Cancer. 2023;54(1):104–16.
Mootha VK, Lindgren CM, Eriksson K. PGC1\(\alpha\)responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34(3):267–73.
Subramanian A, Tamayo P, Mootha V. Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50.
Kaplan Edward L, Meier Paul. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53(282):457–81.
Liu JF, Lichtenberg T, Hoadley KA. An integrated TCGA pancancer clinical data resource to drive highquality survival outcome analytics. Cell. 2018;173(2):400–16.
Kaiser AM, Gatto A, Hanson KJ. p53 governs an AT1 differentiation programme in lung cancer suppression. Nature. 2023;619(7971):851–9.
Bretz AC, Gittler MP, Charles JP. \(\Delta\)Np63 activates the Fanconi anemia DNA repair pathway and limits the efficacy of cisplatin treatment in squamous cell carcinoma. Nucl Acids Res. 2016;44(7):3204–18.
Liu W, Du Q, Mei T. Comprehensive analysis the prognostic and immune characteristics of mitochondrial transportrelated gene SFXN1 in lung adenocarcinoma. BMC Cancer. 2024;24(1):94.
Ye G, Luo H, Zhang T. Knockdown of RNF183 suppressed proliferation of lung adenocarcinoma cells via inactivating the STAT3 signaling pathway. Cell Cycle. 2022;21(9):948–60.
Precilla SD, Biswas I, Kuduvalli SS. Crosstalk between PI3K/AKT/mTOR and WNT/\(\beta\)Catenin signaling in GBMCould combination therapy checkmate the collusion? Cell Signal. 2022;95: 110350.
Heynckes S, Daka K, Franco P. Crosslink between Temozolomide and PDL1 immunecheckpoint inhibition in glioblastoma multiforme. BMC Cancer. 2019;19:1–7.
Jiang Y, He J, Guo Y. Identification of genes related to lowgrade glioma progression and prognosis based on integrated transcriptome analysis. J Cell Biochem. 2020;121(5–6):3099–111.
Zan X, Li L. Construction of lncRNAmediated ceRNA network to reveal clinically relevant lncRNA biomarkers in glioblastomas. Oncol Lett. 2019;17(5):4369–74.
Alowaidi F, Hashimi SM, Alqurashi N. Cripto1 overexpression in U87 glioblastoma cells activates MAPK, focal adhesion and ErbB pathways. Oncol Lett. 2019;18(3):3399–406.
Zhang W, Zhao H, Chen J. Mining database for the expression and gene regulation network of JAK2 in skin cutaneous melanoma. Life Sci. 2020;253: 117600.
Jiang Y, Hu X, Wang Z. RPTOR mutation: a novel predictor of efficacious immunotherapy in melanoma. Invest New Drugs. 2024;42(1):60–9.
Therneau T, Crowson C, Atkinson E. Using time dependent covariates and time dependent coefficients in the Cox model. Surv Vignettes. 2017;2(3):1–25.
Hamada T, Nakai Y, Isayama H, et al. Progressionfree survival as a surrogate for overall survival in firstline chemotherapy for advanced pancreatic cancer. Eur J Cancer. 2016;65:11–20.
Buyse M, Burzykowski T, Carroll K, et al. Progressionfree survival is a surrogate for survival in advanced colorectal cancer. J Clin Oncol. 2007;25(33):5218–24.
Ma Y. Flexible isotonic regression in survival data analysis. PhD thesis, The George Washington University, 2010.
Hanagal DD. Modeling survival data using frailty models. Berlin: Springer; 2011.
Yashin AI, Vaupel JW, Iachine IA. Correlated individual frailty: an advantageous approach to survival analysis of bivariate data. Math Popul Stud. 1995;5(2):145–59.
Wienke A. Frailty models in survival analysis. Chapman and Hall/CRC, 2010.
Martins A, Aerts M, Hens N, et al. Correlated gamma frailty models for bivariate survival time data. Stat Methods Med Res. 2019;28(10–11):3437–50.
Ng SK, Tawiah R, Mclachlan GJ, et al. Joint frailty modeling of timetoevent data to elicit the evolution pathway of events: a generalized linear mixed model approach. Biostatistics. 2021;1:1.
Romeo JS, Meyer R, Gallardo DI. Bayesian bivariate survival analysis using the power variance function copula. Lifetime Data Anal. 2018;24(2):355–83.
Emura T, Matsui S, Rondeau V. Survival analysis with correlated endpoints: Joint FrailtyCopula models. Springer; 2019.
Sofeu CL, Emura T, Rondeau V. A joint frailtycopula model for metaanalytic validation of failure time surrogate endpoints in clinical trials. Biom J. 2021;63(2):423–46.
Pan W. A multiple imputation approach to Cox regression with intervalcensored data. Biometrics. 2000;56(1):199–203.
Lam KF, Xu Y, Cheung TL. A multiple imputation approach for clustered intervalcensored survival data. Stat Med. 2010;29(6):680–93.
Goggins WB, Finkelstein DM. A proportional hazards model for multivariate intervalcensored failure time data. Biometrics. 2000;56(3):940–3.
Kim MY, Xue X. The analysis of multivariate intervalcensored survival data. Stat Med. 2002;21(23):3715–26.
Bellamy SL, Li Y, Ryan LM, et al. Analysis of clustered and interval censored data from a communitybased study in asthma. Stat Med. 2004;23(23):3607–21.
Wong MCM, Lam KF, Lo ECM. Bayesian analysis of clustered intervalcensored data. J Dent Res. 2005;84(9):817–21.
Acknowledgements
During the preparation of this work the authors used ChatGPT in order to improve language and readability. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Funding
YL was partially supported by a startup fund from the University of Science and Technology of China. KJ was partially supported by Anhui Major Science and Technology Project under grant number 2020b07050001. MZ was partially supported by the National Natural Science Foundation of China under grant number 12126604 and R &D project of Pazhou Lab (Huangpu) under Grant2023K0609.
Author information
Authors and Affiliations
Contributions
BL, KJ, MZ, and YL designed the research; BL and YL developed the methods; BL, KW, YW (Yueguo Wang), QL, YW (Yulan Wang), JS, WW, YY, HW, SZ, KJ and YL contributed to the acquisition, analysis, and interpretation of the data; BL and YL drafted the manuscript; KJ, MZ and YL revised the manuscript; All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics Approval and Consent to Participate
This study adhered to the established guidelines of TCGA and UCSC, consequently, the requirement for ethical approval and patient informed consent was waived (https://xenabrowser. net/datapages/).
Consent for Publication
Not applicable.
Conflict of interest
The authors declare that they have no competing financial interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Rights and permissions
Open Access This article is licensed under a Creative Commons AttributionNonCommercialNoDerivatives 4.0 International License, which permits any noncommercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/byncnd/4.0/.
About this article
Cite this article
Lin, B., Wang, K., Yuan, Y. et al. A novel approach to the analysis of Overall Survival (OS) as response with ProgressionFree Interval (PFI) as condition based on the RNAseq expression data in The Cancer Genome Atlas (TCGA). BMC Bioinformatics 25, 300 (2024). https://doi.org/10.1186/s12859024058971
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859024058971