Skip to main content

A novel approach to the analysis of Overall Survival (OS) as response with Progression-Free Interval (PFI) as condition based on the RNA-seq expression data in The Cancer Genome Atlas (TCGA)

Abstract

Background

Overall Survival (OS) and Progression-Free Interval (PFI) as survival times have been collected in The Cancer Genome Atlas (TCGA). It is of biomedical interest to consider their dependence in pathway detection and survival prediction. We intend to develop novel methods for integrating PFI as condition based on parametric survival models for identifying pathways associated with OS and predicting OS.

Results

Based on the framework of conditional probability, we developed a family of frailty-based parametric-models for this purpose, with exponential or Weibull distribution as baseline. We also considered two classes of existing methods with PFI as a covariate. We evaluated the performance of three approaches by analyzing RNA-seq expression data from TCGA for lung squamous cell carcinoma and lung adenocarcinoma (LUNG), brain lower grade glioma and glioblastoma multiforme (GBMLGG), as well as skin cutaneous melanoma (SKCM). Our focus was on fourteen general cancer-related pathways. The 10-fold cross-validation was employed for the evaluation of predictive accuracy. For LUNG, p53 signaling and cell cycle pathways were detected by all approaches. Furthermore, three approaches with the consideration of PFI demonstrated significantly better predictive performance compared to the approaches without the consideration of PFI. For GBMLGG, ten pathways (e.g., Wnt signaling, JAK-STAT signaling, ECM-receptor interaction, etc.) were detected by all approaches. Furthermore, three approaches with the consideration of PFI demonstrated better predictive performance compared to the approaches without the consideration of PFI. For SKCM, p53 signaling pathway was detected only by our Weibull-baseline-based model. And three approaches with the consideration of PFI demonstrated significantly better predictive performance compared to the approaches without the consideration of PFI.

Conclusions

Based on our study, it is necessary to incorporate PFI into the survival analysis of OS. Furthermore, PFI is a survival-type time, and improved results can be achieved by our conditional-probability-based approach.

Peer Review reports

Background

Overall Survival (OS) and Progression-Free Interval (PFI) as survival times have been widely employed to demonstrate the clinical impact of treatments, covariates, or biomarkers [1, 2]. In this study, Overall Survival (OS), defined as the duration from diagnosis to death from any cause, and Progression-Free Interval (PFI), defined as the time from diagnosis to the occurrence of a new tumor event were selected as survival times. In biomedical research, the Cox proportional hazards model [3] is predominantly utilized to analyze OS or PFI separately. The principal objective of these studies is to understand the marginal effects of certain covariates on a chosen survival time and to facilitate prediction. The detection of pathways is pivotal in revealing the intricate biological processes and interactions that significantly impact patient outcomes [4, 5]. Furthermore, it is evident that there exists some dependence between OS and PFI, and their dependence may play a critical role in unraveling the mechanisms underlying disease progression [6,7,8]. Therefore, it is necessary to consider their dependence in pathway detection and survival prediction. In this study, we intend to develop novel methods for integrating PFI as condition based on parametric survival models for identifying pathways associated with OS and predicting OS.

These methods can be developed within the framework of a bivariate survival analysis, focusing on how to obtain an appropriate conditional distribution for the second survival time \(T_2\), when observing a survival time \(T_1\) (whether censored or not). Although current methodologies were available for analyzing the related bivariate survival problem, they often exhibit limitations in their applicability and flexibility [9, 10]. Day et al. [9] explored a similar problem, but they assumed that times were not sequential. In contrast, Henderson et al. [10] regarded times as sequential, yet their interest was confined to instances where the observed survival time was not censored. However, Day et al. [9] and Henderson et al. [10] did not consider the situation that observed survival time was censored. In this study, we proposed an approach to accommodate this situation.

Based on the framework of conditional probability distribution, we developed a family of novel frailty-based methods [11,12,13] for this purpose. Our frailty-based methods can accommodate the situation when the observed survival time as condition is censored and also capture certain nonlinear relationships. In our frailty models, we consider a gamma distribution for the frailty. We developed two shared-frailty-model based methods, with exponential or Weibull as baseline hazard function. The estimators are obtained by the maximum likelihood method. We derived the related likelihood ratio test to identify covariates associated with the future survival event when the observed survival event is considered as condition. Furthermore, we achieved the prediction of future survival event.

In the literature [10], there are two classes of existing methods related to this analysis: both with \(T_1\) as a covariate; \(T_2\) was modeled by traditional parametric regression models or Cox regression models. In the Cox regression model, \(T_1\) can be used as a time-varying/time-dependent effect [14]. Both methods mirrored the approach suggested by Gail et al. [15] in their analysis of multiple tumor times.

We evaluated the performance of three approaches by analyzing RNA-seq expression data from TCGA for lung squamous cell carcinoma and lung adenocarcinoma (LUNG), brain lower grade glioma and glioblastoma multiforme (GBMLGG), as well as skin cutaneous melanoma (SKCM). Our focus was on fourteen general cancer-related pathways (Table 2). We utilized the existing single-sample Gene Set Enrichment Analysis (ssGSEA) algorithm [16], and each pathway is summarized as a covariate. The 10-fold cross-validation approach was employed for the evaluation of predictive accuracy, which was based on the time-dependent Receiver Operating Characteristic (ROC) methodology [17,18,19]. In Fig. 1, we present the workflow diagram of our framework, which highlights the key steps and processes. To further illustrate our frailty-based methods, we evaluated the impact of censoring on likelihood ratio test and parameter estimation (In the Supplementary Materials). Then, we also demonstrated the advantages of frailty-based methods over traditional regression models (In the Supplementary Materials).

Fig. 1
figure 1

The work flow diagram of our framework

Methods

We first provide an overview of regression models used in survival analysis, beginning with the Cox proportional hazards model and traditional regression models, and then extending to shared gamma frailty models.

Background: Cox proportional hazards model

The Cox proportional hazards model [3] is a widely used method in survival analysis that relates the time of an event to covariates without making specific assumptions about the underlying hazard function. The model can be described by the following hazard function:

$$\begin{aligned} \lambda (t|x) = \lambda _0(t) \exp (\beta x), \end{aligned}$$

where \(\lambda (t|x)\) is the hazard function, \(\lambda _0(t)\) is the baseline hazard, \(\beta\) represents the coefficients, and x represents the covariates.

Background: time-varying Cox model

The Cox model can be extended to handle time-varying/time-dependent effects [20], where covariates change over time or coefficients change over time. The hazard function for a Cox model with time-varying covariates or time-varying coefficients can be written as:

$$\begin{aligned} \lambda (t|x(t))= & {} \lambda _0(t) \exp (\beta x(t)), \\ \lambda (t|x)= & {} \lambda _0(t) \exp (\beta (t) x), \end{aligned}$$

where x(t) denotes the covariates that change over time, \(\beta (t)\) denotes the coefficients that change over time. This extension allows for more flexibility in modeling scenarios where the risk factors change as time progresses.

Background: baseline hazard functions and regression approach

Two baseline hazard functions are considered in the analysis: the exponential distribution and the Weibull distribution (Table 1). In practice, traditional regression model is often used to analysis ordered survival events. The probability density function of the traditional regression model is [10]

$$\begin{aligned} f(t_2|x, t_1)=e^{-{t_2}^\delta \lambda e^{\beta x+\gamma log(t_1)}}e^{\beta x+\gamma log(t_1)}\lambda \delta {t_2}^{\delta -1}. \end{aligned}$$

where \(\delta\), \(\lambda\), \(\beta\) and \(\gamma\) represent unknown parameter, x represents covariate, \(t_1\) and \(t_2\) represent two survival times.

Table 1 Two baseline hazard functions for shared gamma frailty models and regression approach

Conditional models

Background: shared gamma frailty models

Shared frailty models are widely used models in survival analysis [21, 22]. In these models, we assume that the two time components are \(T_1\) and \(T_2\). In shared frailty models, we assume that \(T_1\) and \(T_2\) are conditionally independent given a shared unobservable random effect, namely the frailty Z. We consider that the frailty Z follows a gamma distribution (\(Z\sim \Gamma (\theta ,\theta ), \quad \theta >0\)). Given the gamma frailty Z with the observed covariate x, the conditional hazard function and the conditional cumulative hazard function of \(T_1\) and \(T_2\) are as follow

$$\begin{aligned} \lambda _i(t|Z, x)= & {} Z\lambda _{i0}(t)e^{\beta _ix}\qquad i=1,2; \\ \Lambda _i(t|Z, x)= & {} Z\Lambda _{i0}(t)e^{\beta _ix} \qquad i=1,2. \end{aligned}$$

where \(\lambda _{10}(t)\) and \(\lambda _{20}(t)\) are baseline hazard functions, \(\Lambda _{10}(t)\) and \(\Lambda _{20}(t)\) are baseline cumulative hazard functions.

The conditional survival functions of \(T_1\) and \(T_2\) are as follow

$$\begin{aligned} S_1(t|Z, x)= & {} e^{-Z\Lambda _{10}(t)e^{\beta _1x}}; \\ S_2(t|Z, x)= & {} e^{-Z\Lambda _{20}(t)e^{\beta _2x}}. \end{aligned}$$

Because \(T_1\) and \(T_2\) are conditionally independent given the gamma frailty Z, the bivariate survival function of \(T_1\)and \(T_2\) given the gamma frailty Z is

$$\begin{aligned} S(t_1,t_2|Z, x)=e^{-Z(\Lambda _{10}(t_1)e^{\beta _1x}+\Lambda _{20}(t_2)e^{\beta _2x})}. \end{aligned}$$

Then, the bivariate density function of \(T_1\) and \(T_2\) is (The derivation details are described in the Supplementary Materials.)

$$\begin{aligned} \begin{aligned} f(t_1,t_2|x)&=\lambda _{10}(t_1)\lambda _{20}(t_2)e^{(\beta _1+\beta _2)x}\frac{(\theta +1){\theta }^{\theta +1}}{(\theta +\Lambda _{10}(t_1)e^{\beta _1x}+\Lambda _{20}(t_2)e^{\beta _2x})^{\theta +2}}. \end{aligned} \end{aligned}$$

Conditional distribution

The conditional density function of \(T_2\) given \(T_1\) is (The derivation details are described in the Supplementary Materials.)

$$\begin{aligned} \begin{aligned} P(T_2=t_2|T_1=t_1, X=x)&=\lambda _{20}(t_2)e^{\beta _2x}\frac{(\theta +1)(\theta +\Lambda _{10}(t_1)e^{\beta _1x})^{\theta +1}}{(\theta +\Lambda _{10}(t_1)e^{\beta _1x}+\Lambda _{20}(t_2)e^{\beta _2x})^{\theta +2}}. \end{aligned} \end{aligned}$$

In practice, it is often to observe right censored \(T_1\) and \(T_2\). The corresponding conditional probabilities are as follow (The derivation details are described in the Supplementary Materials)

$$\begin{aligned}{} & {} \begin{aligned} P(T_2=t_2|T_1 \ge t_1, X=x)&=\lambda _{20}(t_2)\theta e^{\beta _2x}\frac{(\theta +\Lambda _{10}(t_1)e^{\beta _1x})^\theta }{(\theta +\Lambda _{10}(t_1)e^{\beta _1x}+\Lambda _{20}(t_2)e^{\beta _2x})^{\theta +1}}; \end{aligned} \\{} & {} \begin{aligned} P(T_2 \ge t_2|T_1=t_1, X=x)&=\frac{(\theta +\Lambda _{10}(t_1)e^{\beta _1x})^{\theta +1}}{(\theta +\Lambda _{10}(t_1)e^{\beta _1x}+\Lambda _{20}(t_2)e^{\beta _2x})^{\theta +1}}; \end{aligned} \\{} & {} \begin{aligned} P(T_2 \ge t_2|T_1 \ge t_1, X=x)&=\frac{(\theta +\Lambda _{10}(t_1)e^{\beta _1x})^{\theta }}{(\theta +\Lambda _{10}(t_1)e^{\beta _1x}+\Lambda _{20}(t_2)e^{\beta _2x})^{\theta }}. \end{aligned} \end{aligned}$$

It can be seen from the above formulas that our models actually do not assume that \(T_1 \le T_2\). However, in practice, it is often to observe \(T_1 \le T_2\). Based on the above relationships, we find that the conditional probability can maintain an increasingly monotone relationship between \(T_1\) and \(T_2\) (regardless of whether \(T_1\) is censored or not), which is commonly observed in practice. For example, in the TCGA datasets, if the tumor recurrence could be delayed (PFI), then the survival probability (OS) could be higher.

Model interpretation

The hazard functions of conditional model are as follow

$$\begin{aligned}{} & {} \begin{aligned} \lambda (t_2|T_1=t_1, X=x)&=\frac{\partial [-log(\int _{t_2}^{+\infty }P(T_2=t|T_1=t_1)\textrm{dt})]}{\partial t_2}\\ {}&=(\theta +1)\frac{\lambda _{20}(t_2)e^{\beta _2x}}{\theta +\Lambda _{10}(t_1)e^{\beta _1x}+\Lambda _{20}(t_2)e^{\beta _2x}}\\&=(\theta +1)\frac{\lambda _{20}(t_2)}{\theta e^{-\beta _2x}+\Lambda _{10}(t_1)e^{(\beta _1-\beta _2)x}+\Lambda _{20}(t_2)}; \end{aligned} \\{} & {} \begin{aligned} \lambda (t_2|T_1 \ge t_1, X=x)&=\frac{\partial [-log(\int _{t_2}^{+\infty }P(T_2=t|T_1 \ge t_1)\textrm{dt})]}{\partial t_2}\\ {}&=\theta \frac{\lambda _{20}(t_2)e^{\beta _2x}}{\theta +\Lambda _{10}(t_1)e^{\beta _1x}+\Lambda _{20}(t_2)e^{\beta _2x}}\\&=\theta \frac{\lambda _{20}(t_2)}{\theta e^{-\beta _2x}+\Lambda _{10}(t_1)e^{(\beta _1-\beta _2)x}+\Lambda _{20}(t_2)}. \end{aligned} \end{aligned}$$

It can be seen from the above formulas that the predictive score of evaluating individual hazard is \(\theta e^{-\beta _2x}+\Lambda _{10}(t_1)e^{(\beta _1-\beta _2)x}\). We can obtain the relationship of the individual hazard and covariate x as follow

  1. 1.

    \(\beta _2 \ge 0,\beta _1 \le \beta _2.\)

    The conditional hazard function increases with respect to x;

  2. 2.

    \(\beta _2 \ge 0,\beta _1 > \beta _2.\)

    The conditional hazard function with respect to x increases at \([0,\frac{1}{\beta _1}\log \frac{\beta _2\theta }{\Lambda _{10}(t_1)(\beta _1-\beta _2)}]\) and decreases at \((\frac{1}{\beta _1}\log \frac{\beta _2\theta }{\Lambda _{10}(t_1)(\beta _1-\beta _2)},+\infty ]\);

  3. 3.

    \(\beta _2 < 0, \beta _1 \ge \beta _2.\)

    The conditional hazard function decreases with respect to x;

  4. 4.

    \(\beta _2< 0,\beta _1 < \beta _2.\)

The conditional hazard function with respect to x decreases at \([0,\frac{1}{\beta _1}\log \frac{\beta _2\theta }{\Lambda _{10}(t_1)(\beta _1-\beta _2)}]\) and increases at \((\frac{1}{\beta _1}\log \frac{\beta _2\theta }{\Lambda _{10}(t_1)(\beta _1-\beta _2)},+\infty ]\).

Clearly, censored \(t_1\) does not affect the relationship above.

Theoretical results

We have proven the five lemmas required for our models in the Supplementary Materials. In \({\textbf {Lemma\;3}}\) and \({\textbf {Lemma\;4}}\), we generalize two baseline hazard functions to any positive function and generalize the frailty distribution to the power-variance-function (PVF) family [11].

Likelihood-based estimation

We assume that the sample data were \((t_{1i},t_{2i},\delta _{1i},\delta _{2i},x_i \quad i=1 \dots n)\), where \(\delta _{1i}=0\) represents censored \(t_{1i}\), \(\delta _{1i}=1\) represents uncensored \(t_{1i}\), \(\delta _{2i}=0\) represents censored \(t_{2i}\) and \(\delta _{2i}=1\) represents uncensored \(t_{2i}\).

In the exponential-baseline-based model, the baseline hazard functions are

$$\begin{aligned} \lambda _{10}(t_1)=\lambda _1,\lambda _{20}(t_2)=\lambda _2,\Lambda _{10}(t_1)=\lambda _1t_1,\Lambda _{20}(t_2)=\lambda _2t_2, \quad \lambda _1>0,\lambda _2>0. \end{aligned}$$

The likelihood function \(L(\theta ,\lambda _1,\lambda _2,\beta _1,\beta _2)\) is

$$\begin{aligned} \begin{aligned} \prod _{i=1}^n&(\lambda _{2}e^{\beta _2x_i}\frac{(\theta +1)(\theta +\lambda _1t_{1i}e^{\beta _1x_i})^{\theta +1}}{(\theta +\lambda _1t_{1i}e^{\beta _1x_i}+\lambda _2t_{2i}e^{\beta _2x_i})^{\theta +2}})^{\delta _{1i}\delta _{2i}}(\lambda _{2}e^{\beta _2x_i}\theta \frac{(\theta +\lambda _{1}t_{1i}e^{\beta _1x_i})^\theta }{(\theta +\lambda _{1}t_{1i}e^{\beta _1x_i}+\lambda _{2}t_{i2}e^{\beta _2x_i})^{\theta +1}})^{(1-\delta _{1i})\delta _{2i}}\\&(\frac{(\theta +\lambda _{1}t_{1i}e^{\beta _1x_i})^{\theta +1}}{(\theta +\lambda _{1}t_{1i}e^{\beta _1x_i}+\lambda _{2}t_{2i}e^{\beta _2x_i})^{\theta +1}})^{\delta _{1i}(1-\delta _{2i})}(\frac{(\theta +\lambda _{1}t_{1i}e^{\beta _1x_i})^{\theta }}{(\theta +\lambda _{1}t_{1i}e^{\beta _1x_i}+\lambda _{2}t_{2i}e^{\beta _2x_i})^{\theta }})^{(1-\delta _{1i})(1-\delta _{2i})}. \end{aligned} \end{aligned}$$

In the Weibull-baseline-based model, the baseline hazard functions are

$$\begin{aligned} \lambda _{10}(t_1)= & {} \lambda _1\rho _1 {t_1}^{\rho _1-1},\lambda _{20}(t_2)=\lambda _2\rho _2 {t_2}^{\rho _2-1},\Lambda _{10}(t_1)=\lambda _1{t_1}^{\rho _1},\Lambda _{20}(t_2)=\lambda _2{t_2}^{\rho _2}, \\{} & {} \lambda _1>0,\lambda _2>0,\rho _1>0,\rho _2>0. \end{aligned}$$

The likelihood function \(L(\theta ,\lambda _1,\lambda _2,\beta _1,\beta _2,\rho _1,\rho _2)\) is

$$\begin{aligned} \begin{aligned} \prod _{i=1}^n&(\lambda _{2}\rho _2 {t_{2i}}^{\rho _2-1}e^{\beta _2x_i}\frac{(\theta +1)(\theta +\lambda _1{t_{1i}}^{\rho _1}e^{\beta _1x_i})^{\theta +1}}{(\theta +\lambda _1{t_{1i}}^{\rho _1}e^{\beta _1x_i}+\lambda _2{t_{2i}}^{\rho _2}e^{\beta _2x_i})^{\theta +2}})^{\delta _{1i}\delta _{2i}} \\&(\lambda _{2}\rho _2 {t_{2i}}^{\rho _2-1}e^{\beta _2x_i}\theta \frac{(\theta +\lambda _{1}{t_{1i}}^{\rho _1}e^{\beta _1x_i})^\theta }{(\theta +\lambda _{1}{t_{1i}}^{\rho _1}e^{\beta _1x_i}+\lambda _{2}{t_{i2}}^{\rho _2}e^{\beta _2x_i})^{\theta +1}})^{(1-\delta _{1i})\delta _{2i}} \\&(\frac{(\theta +\lambda _{1}{t_{1i}}^{\rho _1}e^{\beta _1x_i})^{\theta +1}}{(\theta +\lambda _{1}{t_{1i}}^{\rho _1}e^{\beta _1x_i}+\lambda _{2}{t_{2i}}^{\rho _2}e^{\beta _2x_i})^{\theta +1}})^{\delta _{1i}(1-\delta _{2i})}\\&(\frac{(\theta +\lambda _{1}{t_{1i}}^{\rho _1}e^{\beta _1x_i})^{\theta }}{(\theta +\lambda _{1}{t_{1i}}^{\rho _1}e^{\beta _1x_i}+\lambda _{2}{t_{2i}}^{\rho _2}e^{\beta _2x_i})^{\theta }})^{(1-\delta _{1i})(1-\delta _{2i})}. \end{aligned} \end{aligned}$$

The maximum likelihood estimators asymptotically follow normal distributions with mean being the true value of the parameters and covariance matrix being the inverse of the n-fold Fisher information matrix by the \({\textbf {Lemma\;1}}\) (In the Supplementary Materials). We calculated the maximum likelihood estimates by the \({\textbf {optim}}\) function in R or the NLMIXED Procedure in SAS.

Variance and confidence interval

In Section Likelihood-based Estimation, we have stated that the estimated variances of the maximum likelihood estimates of \(\beta _1\) and \(\beta _2\) can be calculated by the inverse of the n-fold Fisher information matrix and we can calculate the estimated variances of the maximum likelihood estimates of \(\beta _1\) and \(\beta _2\) by the \({\textbf {optim}}\) function in R or the NLMIXED Procedure in SAS. The maximum likelihood estimates of \(\beta _1\) and \(\beta _2\) asymptotically follow normal distributions. Therefore, the \(95\%\) confidence interval of \(\beta _1\) and \(\beta _2\) are as follow: assume the estimated standard deviations of \(\beta _1\) and \(\beta _2\) are respectively \(\hat{s_1},\hat{s_2}\), the estimated n-fold Fisher information matrix is \({\hat{H}}\) and the estimation of \(\beta _1\) and \(\beta _2\) are respectively \(\hat{\beta _1},\hat{\beta _2}\), the \(95\%\) confidence interval of \(\beta _1\) and \(\beta _2\) are respectively

$$\begin{aligned} {[}\hat{\beta _1}-1.96*\hat{s_1},\hat{\beta _1}+1.96*\hat{s_1}]=[\hat{\beta _1}-1.96*\sqrt{{\hat{H}}^{-1}(1,1)},\hat{\beta _1}+1.96*\sqrt{{\hat{H}}^{-1}(1,1)}]; \\ {[}\hat{\beta _2}-1.96*\hat{s_2},\hat{\beta _2}+1.96*\hat{s_2}]=[\hat{\beta _2}-1.96*\sqrt{{\hat{H}}^{-1}(2,2)},\hat{\beta _2}+1.96*\sqrt{{\hat{H}}^{-1}(2,2)}]. \end{aligned}$$

Hypothesis testing

When we consider \(T_1\), the relationship between \(T_2\) and the covariate x is not only related to \(\beta _2\), it may also be related to \(\beta _1\). According to \({\textbf {Lemma\;2}}\) (In the Supplementary Materials), \(\beta _1=\beta _2=0\) is equivalent to that \(T_2\) is independent of the covariate x given \(T_1\) in three shared frailty models.

Therefore, we can consider the following hypothesis test

$$\begin{aligned} H_0:\beta _1=\beta _2=0. \quad vs \quad H_1:\beta _1\ne 0 \quad or \quad \beta _2\ne 0. \end{aligned}$$

For the above hypothesis test, we used the likelihood ratio test.

The likelihood ratio statistics are as follow

$$\begin{aligned} \Lambda= & {} \frac{\sup _{(\theta ,\lambda _1,\lambda _2,\beta _1,\beta _2)\in R^5}L(\theta ,\lambda _1,\lambda _2,\beta _1,\beta _2)}{\sup _{(\theta ,\lambda _1,\lambda _2,\beta _1,\beta _2)\in (R,R,R,0,0)}L(\theta ,\lambda _1,\lambda _2,\beta _1,\beta _2)}; \\ \Lambda= & {} \frac{\sup _{(\theta ,\lambda _1,\lambda _2,\rho _1,\rho _2,\beta _1,\beta _2)\in R^7}L(\theta ,\lambda _1,\lambda _2,\rho _1,\rho _2,\beta _1,\beta _2)}{\sup _{(\theta ,\lambda _1,\lambda _2,\rho _1,\rho _2,\beta _1,\beta _2)\in (R,R,R,R,R,0,0)}L(\theta ,\lambda _1,\lambda _2,\rho _1,\rho _2,\beta _1,\beta _2)}; \end{aligned}$$

According to \({\textbf {Lemma\;5}}\) (In the Supplementary Materials), the above likelihood ratio test statistics have asymptotical distributions, i.e. the double log-likelihood ratio test statistics asymptotically follow chi-square distribution with two degrees of freedom (\(\chi ^2(2)\)). Therefore, we test the covariate x by the likelihood ratio test. The above likelihood ratio test statistics \(\Lambda\) can be calculated using the \({\textbf {optim}}\) function in R or the NLMIXED Procedure in SAS.

Single-sample gene set enrichment analysis (ssGSEA)

We utilized the R package \({\textbf {GSVA}}\) to perform ssGSEA [16] at the individual sample level, using pathways as gene sets. The pathway information was obtained from the Molecular Signatures Database (MSigDB) [23, 24].

Kaplan–Meier (K–M) curve

We utilized the R package \({\textbf {survival}}\) to create a Kaplan–Meier (K–M) [25] curves. The K–M curves estimates and visualizes the survival probability over time.

Time-dependent receiver operating characteristic (ROC)

We use the time-dependent ROC methodology (a modified version of the conventional ROC methodology) to evaluate the predictive accuracy of our models [17,18,19]. The areas under the time-dependent ROC curves (AUC(t)) are often used to compare the predictive accuracy of several prediction models. Since Heagerty [17, 18] proposed the time-dependent ROC methodology and some definitions of cases and controls, there are many methods to estimate the time-dependent ROC curve. We choose a nonparametric estimator of the time-dependent ROC curve using the inverse probability of censoring weighting (IPCW) approach [19]. We plot AUC(t) of IPCW estimators by the R package \({\textbf {timeROC}}\).

Results

TCGA data

The RNA sequencing expression data from The Cancer Genome Atlas (TCGA) database were considered. Liu et al. [26] recently curated and standardized genomic and survival information for the TCGA dataset into the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR). Therefore, we downloaded genomic and survival data for the TCGA datasets from UCSC Xena Browser (https://xenabrowser.net/datapages/). In this study, Overall Survival (OS), defined as the duration from diagnosis to death from any cause, and Progression-Free Interval (PFI), defined as the time from diagnosis to the occurrence of a new tumor event were selected as endpoints because these were deemed to be relatively accurate endpoints by Liu et al. But since PFI contains information about OS, we did not consider the case where PFI equals OS. Then the analysis was conducted with the PFI as \(T_1\) and OS as \(T_2\). Time is measured in days. Among 186 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, we primarily focus on the analytical findings for a subset specifically associated with cancer. There exists a total of sixteen pathways in KEGG but fourteen of them are catalogued within the MSigDB. In this study, ssGSEA was performed on these fourteen pathways (Table 2) to calculate cancer-related predictive scores.

In this study, we focused on cancer datasets with relatively large sample sizes. After excluding missing values and removing cases where PFI equals OS, there are three datasets that can be used for patient survival analysis. Therefore, our subsequent analyses are based on lung squamous cell carcinoma and lung adenocarcinoma (LUNG; n=306), brain lower grade glioma and glioblastoma multiforme (GBMLGG; n=276), as well as skin cutaneous melanoma (SKCM; n=268).

LUNG data

Data description

Fig. 2
figure 2

The scatter plot of PFI and OS. The solid points represent the data with both PFI and OS not censored. The empty points represent the data with PFI or OS censored. The x-axis represents PFI while y-axis represents OS. a LUNG Dataset. b GBMLGG Dataset

Table 2 Fourteen cancer-related pathways and analysis results of six models for OS in LUNG dataset

From Fig. 2, we can observe that PFI and OS are mostly concentrated in the interval [0, 4000]. It is evident from Fig. 2 that PFI and OS are highly positively correlated. Among these patients, the censoring proportion for PFI is \(0.0\%\) while the censoring proportion for OS is \(35.6\%\). An intuitive overview of gene set activity across the lung cancer samples is provided by the heatmap of the GSVA output in the LUNG dataset, shown in Figure S1 (In the Supplementary Materials).

Pathway detection

Six different models were performed to identify pathways significantly associated with OS: “\(\hbox {Cox}_1\)” represents the Cox regression model (consider corresponding pathway as only covariate), “\(\hbox {Cox}_{{time}}\)” represents the Cox regression model (consider corresponding pathway as a covariate, but PFI as a time-varying/time-dependent effect, using time-varying coefficient), “\(\hbox {Regression}_1\)” represents the traditional regression model (consider corresponding pathway as only covariate), “\(\hbox {Regression}_2\)” represents the traditional regression model (consider corresponding pathway and PFI as covariates), the exponential-baseline-based model, the Weibull-baseline-based model. Table 2 presents the analysis results of six models:

  • Notably, these six models all detected p53 signaling pathway and cell cycle pathway significantly positively correlated with OS.

It is well-known that p53 signaling pathway and cell cycle pathway are closely linked to the development of lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD) [27,28,29,30].

Fig. 3
figure 3

AUC(\(t_2\)) curve within 5 years or survival curve for \(t_2\) in LUNG dataset. a AUC(\(t_2\)) curve. b Survival curve for \(t_2\). In plot, x-axis represents time \(t_2\) when y-axis represents AUC(\(t_2\)) or survival probability. “Weibull” represents the Weibull-baseline-based model, “\(\hbox {Cox}_{{time}}\)” represents the Cox regression model (consider pathway scores that are not highly correlated as covariates, but PFI as a time-varying effect, using time-varying coefficient), “\(\hbox {Regression}_2\)” represents the traditional regression model (consider pathway scores that are not highly correlated and PFI as covariates). The solid lines represent AUC(\(t_2\)) curves or survival curve for \(t_2\). The dashed lines represent the pointwise confidence intervals of AUC(\(t_2\)) curves or survival curve for \(t_2\)

Predictive performance

In our study, we employed a 10-fold cross-validation method to compare the predictive performance of the models. Initially, the entire dataset was divided into ten subsets of equal size. This method involved ten independent cycles of training and validation. In each cycle, a unique subset was selected as the validation set, while the remaining nine subsets were combined to form the training set. This approach ensured that each subset served as a validation set once and as part of the training set at other times. Following each training and validation cycle, the predictive score of models on the current validation set was assessed. Upon completion of all ten cycles, the predictive scores from the ten validation sets (encompassing the entire dataset) were aggregated to derive the predictive score of models for the full dataset.

We compared the predictive performance of six models. For each fold (10-fold cross-validation), based on the training data, the association p-values were first computed; the pairwise pearson correlations were then calculated. A network adjacency matrix was constructed accordingly based on the absolute correlation value (>0.6); for each connected component in the network, only the pathway with the smallest associated p-value was selected. (Please see Table S13–S15 in the supplementary materials for the related details, including the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model, the \(\hbox {Regression}_2\) model, the exponential-baseline-based model, the Weibull-baseline-based model.) The predictive performance was evaluated using the time-dependent ROC methodology. The predictive scores of the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model and the \(\hbox {Regression}_2\) model are all based on linear predictors. The predictive scores of the exponential-baseline-based model and the Weibull-baseline-based model are based on \(\theta e^{-\varvec{\beta ^T_2x}}+\Lambda _{10}(t_1)e^{\varvec{(\beta _1-\beta _2)^Tx}}\) (see details in Section Model Interpretation).

In Fig. 3 and Figure S5 (In the Supplementary Materials), we can observe that our Weibull-baseline-based model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate significantly better predictive performance in comparison to the \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model in [0, 3000]. And we also can observe that the exponential-baseline-based model, the \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model demonstrate poor predictive capabilities, with the AUC values being approximately 0.5.

GBMLGG data

Data description

From Fig. 2, we can similarly observe that PFI and OS are mostly concentrated in the interval [0, 4000]. It is evident from Fig. 2 that PFI and OS are highly positively correlated. Among these patients, the censoring proportion for PFI is \(0.0\%\) while the censoring proportion for OS is \(34.1\%\). An intuitive overview of gene set activity across the glioma cancer samples is provided by the heatmap of the GSVA output in the GBMLGG dataset, shown in Figure S2 (In the Supplementary Materials).

Table 3 Fourteen cancer-related pathways and analysis results of six models for OS in GBMLGG dataset

Pathway detection

Similar six different models were performed to identify pathways significantly associated with OS. Table 3 presents the analysis results of six models:

  • The TGF-beta signaling pathway exhibited a significantly negative correlation with OS in the \(\hbox {Cox}_1\) and the \(\hbox {Regression}_1\) models. In contrast, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model detected that TGF-beta signaling pathway is not significant correlated with OS. Similarly, the Weibull model detected that TGF-beta signaling pathway is not significant correlated with OS.

  • The \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model both detected MAPK signaling pathway significantly negatively correlated with OS. Similarly, the Weibull model detected that MAPK signaling pathway is significantly negatively correlated with OS. But the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) models detected that MAPK signaling pathway is not significant correlated with OS.

  • The \(\hbox {Regression}_1\) model detected PPAR signaling pathway significantly positively correlated with OS. The Weibull model detected that PPAR signaling pathway is not significant correlated with OS.

  • Notably, the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model, the \(\hbox {Regression}_2\) model and the Weibull model all detected that eight pathways (JAK-STAT signaling, ECM-receptor interaction, cytokine-cytokine receptor interaction, focal adhesion, VEGF signaling, apoptosis, p53 signaling, cell cycle) are significantly positively correlated with OS, two pathways (Wnt signaling, mTOR signaling) are significantly negatively correlated with OS.

For brain lower grade glioma (LGG) and glioblastoma multiforme (GBM), many pathways have been extensively studied. For instance, it has been studied that Wnt signaling pathway, JAK-STAT signaling pathway, ECM-receptor interaction pathway, cytokine-cytokine receptor interaction pathway and focal adhesion pathway are related to the growth and survival of tumors [31,32,33,34,35].

Fig. 4
figure 4

AUC(\(t_2\)) curve within 5 years or survival curve for \(t_2\) in GBMLGG dataset. a AUC(\(t_2\)) curve. b Survival curve for \(t_2\). In plot, x-axis represents time \(t_2\) when y-axis represents AUC(\(t_2\)) or survival probability. “Weibull” represents the Weibull-baseline-based model, “\(\hbox {Cox}_{{time}}\)” represents the Cox regression model (consider pathway scores that are not highly correlated as covariates, but PFI as a time-varying effect, using time-varying coefficient), “\(\hbox {Regression}_2\)” represents the traditional regression model (consider pathway scores that are not highly correlated and PFI as covariates). The solid lines represent AUC(\(t_2\)) curves or survival curve for \(t_2\). The dashed lines represent the pointwise confidence intervals of AUC(\(t_2\)) curves or survival curve for \(t_2\)

Predictive performance

Similarly, we employed a 10-fold cross-validation method to compare the predictive performance of the models. We compared the predictive performance of six models. For each fold (10-fold cross-validation), based on the training data, the association p-values were first computed; the pairwise pearson correlations were then calculated. A network adjacency matrix was constructed accordingly based on the absolute correlation value (>0.6); for each connected component in the network, only the pathway with the smallest associated p-value was selected. (Please see Table S16–S18 in the supplementary materials for the related details, including the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model, the \(\hbox {Regression}_2\) model, the exponential-baseline-based model, the Weibull-baseline-based model.) The predictive performance was evaluated using the time-dependent ROC methodology. The predictive scores of the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model and the \(\hbox {Regression}_2\) model are all based on linear predictors. The predictive scores of the exponential-baseline-based model and the Weibull-baseline-based model are based on \(\theta e^{-\varvec{\beta ^T_2x}}+\Lambda _{10}(t_1)e^{\varvec{(\beta _1-\beta _2)^Tx}}\) (see details in Section Model Interpretation).

In Fig. 4 and Figure S6 (In the Supplementary Materials), we can observe that our Weibull-baseline-based model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate better predictive performance in comparison to the \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model. And we can observe that the exponential-baseline-based model demonstrate poor predictive capabilities, with the AUC values being approximately 0.5.

SKCM data

Data description

Fig. 5
figure 5

The scatter plot of PFI and OS in the SKCM dataset. The solid points represent the data with both PFI and OS not censored. The empty points represent the data with PFI or OS censored. The x-axis represents PFI while y-axis represents OS

From Fig. 5, we can similarly observe that PFI and OS are mostly concentrated in the interval [0, 7000]. It is evident from Fig. 5 that PFI and OS are highly positively correlated. Among these patients, the censoring proportion for PFI is \(0.0\%\) while the censoring proportion for OS is \(39.6\%\). An intuitive overview of gene set activity across the skin cancer samples is provided by the heatmap of the GSVA output in the SKCM dataset, shown in Figure S3 (In the Supplementary Materials).

Table 4 Fourteen cancer-related pathways and analysis results of six models for OS in SKCM dataset

Pathway detection

Similar six different models were performed to identify pathways significantly associated with OS. Table 4 presents the analysis results of six models:

  • The Wnt signaling pathway and mTOR signaling pathway exhibited a significantly positive correlation with OS in the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model. Similarly, the Weibull model detected that Wnt signaling pathway and mTOR signaling pathway are significantly positive correlated with OS. In contrast, the \(\hbox {Cox}_1\) and the \(\hbox {Regression}_1\) models detected that Wnt signaling pathway and mTOR signaling pathway are not significantly correlated with OS.

  • The Weibull model detected p53 signaling pathway significantly positively correlated with OS, but this correlation was not detected by other models.

  • Notably, the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model, the \(\hbox {Regression}_2\) model and the Weibull model all detected that two pathways (adherens junction, cell cycle) are significantly positively correlated with OS, three pathways (JAK-STAT signaling, cytokine-cytokine receptor interaction, apoptosis) are significantly negatively correlated with OS.

For skin cutaneous melanoma (SKCM), many pathways have been extensively studied. For instance, it has been shown that JAK-STAT signaling pathway is related to the disease development, as well as cell cycle pathway, p53 signaling pathway and apoptosis pathway [36]; it also well-known that mTOR signaling pathway plays an important role in tumor progression [37].

Fig. 6
figure 6

AUC(\(t_2\)) curve within 5 years or survival curve for \(t_2\) in SKCM dataset. a AUC(\(t_2\)) curve. b Survival curve for \(t_2\). In plot, x-axis represents time \(t_2\) when y-axis represents AUC(\(t_2\)) or survival probability. “Weibull” represents the Weibull-baseline-based model, “\(\hbox {Cox}_{{time}}\)” represents the Cox regression model (consider pathway scores that are not highly correlated as covariates, but PFI as a time-varying effect, using time-varying coefficient), “\(\hbox {Regression}_2\)” represents the traditional regression model (consider pathway scores that are not highly correlated and PFI as covariates). The solid lines represent AUC(\(t_2\)) curves or survival curve for \(t_2\). The dashed lines represent the pointwise confidence intervals of AUC(\(t_2\)) curves or survival curve for \(t_2\)

Predictive performance

Similarly, we employed a 10-fold cross-validation method to compare the predictive performance of the models. We compared the predictive performance of six models. For each fold (10-fold cross-validation), based on the training data, the association p-values were first computed; the pairwise pearson correlations were then calculated. A network adjacency matrix was constructed accordingly based on the absolute correlation value (>0.6); for each connected component in the network, only the pathway with the smallest associated p-value was selected. (Please see Table S19–S21 in the supplementary materials for the related details, including the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model, the \(\hbox {Regression}_2\) model, the exponential-baseline-based model, the Weibull-baseline-based model.) The predictive performance was evaluated using the time-dependent ROC methodology. The predictive scores of the \(\hbox {Cox}_1\) model, the \(\hbox {Cox}_{{time}}\) model, the \(\hbox {Regression}_1\) model and the \(\hbox {Regression}_2\) model are all based on linear predictors. The predictive scores of the exponential-baseline-based model and the Weibull-baseline-based model are based on \(\theta e^{-\varvec{\beta ^T_2x}}+\Lambda _{10}(t_1)e^{\varvec{(\beta _1-\beta _2)^Tx}}\) (see details in Section Model Interpretation).

In Fig. 6 and Figure S7 (In the Supplementary Materials), we can observe that our Weibull-baseline-based model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate significantly better predictive performance in comparison to the \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model in [0, 4500]. And we also can observe that the exponential-baseline-based model, the \(\hbox {Cox}_1\) model and the \(\hbox {Regression}_1\) model demonstrate poor predictive capabilities, with the AUC values being approximately 0.5.

Discussion and conclusions

We presented some comparisons of several models using TCGA datasets for LUNG, GBMLGG and SKCM. For LUNG, p53 signaling and cell cycle pathways were detected significantly positively correlated with OS by all approaches. Our Weibull-baseline-based model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate significantly better predictive performance in comparison to the approaches without the consideration of PFI in [0, 3000]. For GBMLGG, eight pathways (JAK-STAT signaling, ECM-receptor interaction, cytokine-cytokine receptor interaction, focal adhesion, VEGF signaling, apoptosis, p53 signaling, cell cycle) were detected significantly positively correlated with OS by all approaches, two pathways (Wnt signaling, mTOR signaling) were detected significantly negatively correlated with OS by all approaches. Our Weibull-baseline-based model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate better predictive performance in comparison to the approaches without the consideration of PFI. For SKCM, p53 signaling pathway was detected significantly positively correlated with OS by our Weibull-baseline-based model but not by the other approaches. Our Weibull-baseline-based model, the \(\hbox {Cox}_{{time}}\) model and the \(\hbox {Regression}_2\) model demonstrate significantly better predictive performance in comparison to the approaches without the consideration of PFI in [0, 4500]. In the Supplementary Materials, we present additional analysis based on a cohort of breast invasive carcinoma (BRCA) samples (n=111). For BRCA, MAPK signaling pathway was detected significantly negatively correlated with OS by our Weibull-baseline-based model but not by the other approaches. Our Weibull-baseline-based model demonstrate significantly better predictive performance in comparison to the approaches without the consideration of PFI in [0, 3000]. Based on our study, it is necessary to incorporate PFI into a survival analysis so that additional pathways associated with OS may be discovered, and the predictive performance for OS can be improved. PFI is a survival-type time. Based on the conditional probability framework, further improved results can be obtained by our proposed approaches. We also performed Cox regression with time-varying/time-dependent covariates for prediction performance evaluation [38], and our conclusion still remains, please see Figs. S46–S49 (In the Supplementary Materials) for more details. In practice, it may be generally difficult to obtain PFI in advance. In some instances, the PFI may be observed in advance [39, 40]. If PFI could be observed in advance, it would enhance the prediction of OS. This would encourage us to collect information about PFI as much as possible. Due to the correlations between PFI and OS, the prediction performance on OS can be significantly enhanced. In the analysis of bivariate correlated survival data, it is common to construct frailty models. This approach helps account for the correlation and unobserved heterogeneity between bivariate survival times.

Due to the complexity of frailty models, parameter estimation through maximum likelihood optimization may be sensitive to the choice of initial values. Therefore, we recommend exploring multiple initial values during model application to ensure appropriate parameter estimation. The baseline hazard function for Weibull-baseline based data exhibits a strictly increasing trend. The exponential-baseline-based model and the piecewise-exponential-baseline-based model (In the Supplementary Materials) are less likely to capture this trend. So in the Weibull-baseline based data, we observed that the likelihood ratio test distribution does not asymptotically follow \(\chi ^2(2)\) for the exponential-baseline-based model and piecewise-exponential-baseline-based model (In the Supplementary Materials). And the corresponding parameter estimation and hypothesis testing are abnormal. Furthermore, when identifying associated covariates, excessive zero values for covariate may result in the failure of the likelihood ratio test. In this case, appropriate transformations to “x” (covariate) can solve this issue. In our study, we use bulk RNA-seq data. When analyzing bulk RNA-seq data, mixed cell types can lead to confounding results. Deconvolution can separate the gene expression levels of each cell type, thereby avoiding the confusion caused by cell mixture. We obtained a ranked list of genes through the deconvolution method (NMF), and then used Gene Set Enrichment Analysis (gseapy module in Python) to assess the significance of fourteen cancer-related pathways. The specific results are shown in Table S37–S40.

For the piecewise-exponential-baseline-based model (In the Supplementary Materials), there are several methods to divide the interval, such as quantiles or empirical hazard plots [41]. In our simulation studies (In the Supplementary Materials), the intervals of the piecewise-exponential-baseline-based model match the optimal intervals (consistent with the data generation process). In this case, its performance is also not clearly higher to the Weibull-baseline-based model. We primarily focused on the frailty following a gamma distribution. Other distributions like log-normal or inverse Gaussian can also be considered [42], but the computing will become complicated. Our models primarily utilize shared frailty models. However, it is worth noting that they can also be expanded to the multivariable frailty models, as demonstrated in prior works [43,44,45,46]. Additionally, another generalization of shared frailty models can be achieved through the incorporation of copula models, as explored in previous studies [47,48,49]. When confronted with high censoring rates, our models can be optimized by using the multiple imputation method, as indicated in previous studies [50, 51]. Moreover, our models exhibit the versatility to accommodate not only random right censoring but also random left censoring or random interval censoring, as explored in the literature [52,53,54,55]. To enhance our model adaptability, the baseline hazard function can be approximated with spline functions [8]. The spline functions provide a more flexible approach to capture the shape of the hazard function.

Availability of data and materials

For each type of tumor, we downloaded TCGA genomic data and survival data based on the following dataset ID from UCSC Xena Browser (https://xenabrowser.net/datapages/).

\(\bullet\) Lung Cancer: TCGA.LUNG.sampleMap/HiSeqV2\(\_\)PANCAN and survival/LUNG\(\_\)survival.txt; \(\bullet\) Lower grade glioma and glioblastoma: TCGA.GBMLGG.sampleMap/HiSeqV2\(\_\)PANCAN and survival/GBMLGG\(\_\)survival.txt; \(\bullet\) Melanoma: TCGA.SKCM.sampleMap/HiSeqV2\(\_\)PANCAN and survival/SKCM\(\_\)survival.txt; \(\bullet\) Breast Cancer: TCGA.BRCA.sampleMap/HiSeqV2\(\_\)PANCAN and survival/BRCA\(\_\)survival.txt. The code used in this study is available at https://github.com/1LinBo/Survival/tree/main.

Abbreviations

OS:

Overall Survival

PFI:

Progression-Free Interval

TCGA:

The Cancer Genome Atlas

LUNG:

Lung squamous cell carcinoma and lung adenocarcinoma

GBMLGG:

Brain lower grade glioma and glioblastoma multiforme

SKCM:

Skin cutaneous melanoma

ssGSEA:

Single-sample Gene Set Enrichment Analysis

ROC:

Receiver Operating Characteristic

PVF:

Power-variance-function

MSigDB:

Molecular Signatures Database

K-M:

Kaplan-Meier

IPCW:

Inverse probability of censoring weighting

TCGA-CDR:

TCGA Pan-Cancer Clinical Data Resource

KEGG:

Kyoto Encyclopedia of Genes and Genomes

GSEA:

Gene Set Enrichment Analysis

LUSC:

Lung squamous cell carcinoma

LUAD:

Lung adenocarcinoma

LGG:

Brain lower grade glioma

GBM:

Glioblastoma multiforme

References

  1. Hudis CA, Barlow WE, Costantino JP. Proposal for standardized definitions for efficacy end points in adjuvant breast cancer trials: the STEEP system. J Clin Oncol. 2007;25(15):2127–32.

    Article  PubMed  Google Scholar 

  2. Punt CJA, Buyse M, Köhne C. Endpoints in adjuvant treatment trials: a systematic review of the literature in colon cancer and proposed definitions for future trials. J Natl Cancer Inst. 2007;99(13):998–1003.

    Article  PubMed  Google Scholar 

  3. David Cox R, David R. Regression models and life tables. J Roy Stat Soc. 1972;34(2):187–220.

    Article  Google Scholar 

  4. Xie HY, Wang WJ, Sun FY, et al. Proteomics analysis to reveal biological pathways and predictive proteins in the survival of high-grade serous ovarian cancer. Sci Rep. 2017;7(1):9896.

    Article  PubMed  PubMed Central  Google Scholar 

  5. Lyudovyk O, Shen YF, Tatonetti NP, et al. Pathway analysis of genomic pathology tests for prognostic cancer subtyping. J Biomed Inform. 2019;98: 103286.

    Article  PubMed  PubMed Central  Google Scholar 

  6. Michiels S, Le MA, Buyse M, et al. Surrogate endpoints for overall survival in locally advanced head and neck cancer: meta-analyses of individual patient data. Lancet Oncol. 2009;10(4):341–50.

    Article  PubMed  Google Scholar 

  7. Rondeau V, Pignon JP, Michiels S. A joint model for the dependence between clustered times to tumour progression and deaths: a meta-analysis of chemotherapy in head and neck cancer. Stat Methods Med Res. 2015;24(6):711–29.

    Article  PubMed  Google Scholar 

  8. Emura T, Nakatochi M, Murotani K, et al. A joint frailty-copula model between tumour progression and death for meta-analysis. Stat Methods Med Res. 2017;26(6):2649–66.

    Article  PubMed  Google Scholar 

  9. Day R, Bryant J, Lefkopoulou M. Adaptation of bivariate frailty models for prediction, with application to biological markers as prognostic indicators. Biometrika. 1997;84(1):45–56.

    Article  Google Scholar 

  10. Henderson R, Prince H. Choice of conditional models in bivariate survival. Stat Med. 2000;19(4):563–74.

    Article  CAS  PubMed  Google Scholar 

  11. Duchateau L, Janssen P. The frailty model. Berlin: Springer; 2008.

    Google Scholar 

  12. Manda SOM. A nonparametric frailty model for clustered survival data. Commun Stat-Theory Methods. 2011;40(5):863–75.

    Article  Google Scholar 

  13. Gasperoni F, Ieva F, Paganoni AM, et al. Non-parametric frailty Cox models for hierarchical time-to-event data. Biostatistics. 2020;21(3):531–44.

    Article  PubMed  Google Scholar 

  14. Rogoz B, de l’Aulnoit A H, Duhamel A, et al. Thirty-year trends of survival and time-varying effects of prognostic factors in patients with metastatic breast cancer—a single institution experience. Clin Breast Cancer. 2018;18(3):246–53.

  15. Gail MH, Santner TJ, Brown CC. An analysis of comparative carcinogenesis experiments based on multiple times to tumor. Biometrics. 1980;1:255–66.

    Article  Google Scholar 

  16. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinf. 2013;1:1.

    Google Scholar 

  17. Heagerty PJ, Lumley T, Pepe MS. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics. 2000;56(2):337–44.

    Article  CAS  PubMed  Google Scholar 

  18. Heagerty PJ, Zheng Y. Survival model predictive accuracy and ROC curves. Biometrics. 2005;61(1):92–105.

    Article  PubMed  Google Scholar 

  19. Blanche P, Dartigues JF, Jacqmin-Gadda H. Estimating and comparing time-dependent areas under receiver operating characteristic curves for censored event times with competing risks. Stat Med. 2013;32(30):5381–97.

    Article  PubMed  Google Scholar 

  20. Fisher LD, Lin DY. Time-dependent covariates in the Cox proportional-hazards regression model. Annu Rev Public Health. 1999;20(1):145–57.

    Article  CAS  PubMed  Google Scholar 

  21. Mwikali Muli A, Houwing-Duistermaat J, Gusnanto A. Use of shared gamma frailty model in analysis of survival data in twins. Use of Shared Gamma Frailty Model in Analysis of Survival Data in Twins, 2021; pp. 45–58.

  22. Esayas LM, Akessa GM, Kifle DD. Application of parametric shared frailty models to analyze time-to-death of gastric cancer patients. J Gastrointest Cancer. 2023;54(1):104–16.

    Article  Google Scholar 

  23. Mootha VK, Lindgren CM, Eriksson K. PGC-1\(\alpha\)-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34(3):267–73.

    Article  CAS  PubMed  Google Scholar 

  24. Subramanian A, Tamayo P, Mootha V. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Kaplan Edward L, Meier Paul. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53(282):457–81.

    Article  Google Scholar 

  26. Liu JF, Lichtenberg T, Hoadley KA. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell. 2018;173(2):400–16.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Kaiser AM, Gatto A, Hanson KJ. p53 governs an AT1 differentiation programme in lung cancer suppression. Nature. 2023;619(7971):851–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Bretz AC, Gittler MP, Charles JP. \(\Delta\)Np63 activates the Fanconi anemia DNA repair pathway and limits the efficacy of cisplatin treatment in squamous cell carcinoma. Nucl Acids Res. 2016;44(7):3204–18.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Liu W, Du Q, Mei T. Comprehensive analysis the prognostic and immune characteristics of mitochondrial transport-related gene SFXN1 in lung adenocarcinoma. BMC Cancer. 2024;24(1):94.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Ye G, Luo H, Zhang T. Knockdown of RNF183 suppressed proliferation of lung adenocarcinoma cells via inactivating the STAT3 signaling pathway. Cell Cycle. 2022;21(9):948–60.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Precilla SD, Biswas I, Kuduvalli SS. Crosstalk between PI3K/AKT/mTOR and WNT/\(\beta\)-Catenin signaling in GBM-Could combination therapy checkmate the collusion? Cell Signal. 2022;95: 110350.

    Article  Google Scholar 

  32. Heynckes S, Daka K, Franco P. Crosslink between Temozolomide and PD-L1 immune-checkpoint inhibition in glioblastoma multiforme. BMC Cancer. 2019;19:1–7.

    Article  Google Scholar 

  33. Jiang Y, He J, Guo Y. Identification of genes related to low-grade glioma progression and prognosis based on integrated transcriptome analysis. J Cell Biochem. 2020;121(5–6):3099–111.

    Article  CAS  PubMed  Google Scholar 

  34. Zan X, Li L. Construction of lncRNA-mediated ceRNA network to reveal clinically relevant lncRNA biomarkers in glioblastomas. Oncol Lett. 2019;17(5):4369–74.

    CAS  PubMed  PubMed Central  Google Scholar 

  35. Alowaidi F, Hashimi SM, Alqurashi N. Cripto-1 overexpression in U87 glioblastoma cells activates MAPK, focal adhesion and ErbB pathways. Oncol Lett. 2019;18(3):3399–406.

    PubMed  PubMed Central  Google Scholar 

  36. Zhang W, Zhao H, Chen J. Mining database for the expression and gene regulation network of JAK2 in skin cutaneous melanoma. Life Sci. 2020;253: 117600.

    Article  CAS  PubMed  Google Scholar 

  37. Jiang Y, Hu X, Wang Z. RPTOR mutation: a novel predictor of efficacious immunotherapy in melanoma. Invest New Drugs. 2024;42(1):60–9.

    Article  CAS  PubMed  Google Scholar 

  38. Therneau T, Crowson C, Atkinson E. Using time dependent covariates and time dependent coefficients in the Cox model. Surv Vignettes. 2017;2(3):1–25.

    Google Scholar 

  39. Hamada T, Nakai Y, Isayama H, et al. Progression-free survival as a surrogate for overall survival in first-line chemotherapy for advanced pancreatic cancer. Eur J Cancer. 2016;65:11–20.

    Article  PubMed  Google Scholar 

  40. Buyse M, Burzykowski T, Carroll K, et al. Progression-free survival is a surrogate for survival in advanced colorectal cancer. J Clin Oncol. 2007;25(33):5218–24.

    Article  CAS  PubMed  Google Scholar 

  41. Ma Y. Flexible isotonic regression in survival data analysis. PhD thesis, The George Washington University, 2010.

  42. Hanagal DD. Modeling survival data using frailty models. Berlin: Springer; 2011.

    Book  Google Scholar 

  43. Yashin AI, Vaupel JW, Iachine IA. Correlated individual frailty: an advantageous approach to survival analysis of bivariate data. Math Popul Stud. 1995;5(2):145–59.

    Article  CAS  PubMed  Google Scholar 

  44. Wienke A. Frailty models in survival analysis. Chapman and Hall/CRC, 2010.

  45. Martins A, Aerts M, Hens N, et al. Correlated gamma frailty models for bivariate survival time data. Stat Methods Med Res. 2019;28(10–11):3437–50.

    Article  PubMed  Google Scholar 

  46. Ng SK, Tawiah R, Mclachlan GJ, et al. Joint frailty modeling of time-to-event data to elicit the evolution pathway of events: a generalized linear mixed model approach. Biostatistics. 2021;1:1.

    Google Scholar 

  47. Romeo JS, Meyer R, Gallardo DI. Bayesian bivariate survival analysis using the power variance function copula. Lifetime Data Anal. 2018;24(2):355–83.

    Article  PubMed  Google Scholar 

  48. Emura T, Matsui S, Rondeau V. Survival analysis with correlated endpoints: Joint Frailty-Copula models. Springer; 2019.

  49. Sofeu CL, Emura T, Rondeau V. A joint frailty-copula model for meta-analytic validation of failure time surrogate endpoints in clinical trials. Biom J. 2021;63(2):423–46.

    Article  PubMed  Google Scholar 

  50. Pan W. A multiple imputation approach to Cox regression with interval-censored data. Biometrics. 2000;56(1):199–203.

    Article  CAS  PubMed  Google Scholar 

  51. Lam KF, Xu Y, Cheung TL. A multiple imputation approach for clustered interval-censored survival data. Stat Med. 2010;29(6):680–93.

    Article  CAS  PubMed  Google Scholar 

  52. Goggins WB, Finkelstein DM. A proportional hazards model for multivariate interval-censored failure time data. Biometrics. 2000;56(3):940–3.

    Article  CAS  PubMed  Google Scholar 

  53. Kim MY, Xue X. The analysis of multivariate interval-censored survival data. Stat Med. 2002;21(23):3715–26.

    Article  PubMed  Google Scholar 

  54. Bellamy SL, Li Y, Ryan LM, et al. Analysis of clustered and interval censored data from a community-based study in asthma. Stat Med. 2004;23(23):3607–21.

    Article  PubMed  Google Scholar 

  55. Wong MCM, Lam KF, Lo ECM. Bayesian analysis of clustered interval-censored data. J Dent Res. 2005;84(9):817–21.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

During the preparation of this work the authors used ChatGPT in order to improve language and readability. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Funding

YL was partially supported by a start-up fund from the University of Science and Technology of China. KJ was partially supported by Anhui Major Science and Technology Project under grant number 2020b07050001. MZ was partially supported by the National Natural Science Foundation of China under grant number 12126604 and R &D project of Pazhou Lab (Huangpu) under Grant2023K0609.

Author information

Authors and Affiliations

Authors

Contributions

BL, KJ, MZ, and YL designed the research; BL and YL developed the methods; BL, KW, YW (Yueguo Wang), QL, YW (Yulan Wang), JS, WW, YY, HW, SZ, KJ and YL contributed to the acquisition, analysis, and interpretation of the data; BL and YL drafted the manuscript; KJ, MZ and YL revised the manuscript; All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yinglei Lai.

Ethics declarations

Ethics Approval and Consent to Participate

This study adhered to the established guidelines of TCGA and UCSC, consequently, the requirement for ethical approval and patient informed consent was waived (https://xenabrowser. net/datapages/).

Consent for Publication

Not applicable.

Conflict of interest

The authors declare that they have no competing financial interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Lin, B., Wang, K., Yuan, Y. et al. A novel approach to the analysis of Overall Survival (OS) as response with Progression-Free Interval (PFI) as condition based on the RNA-seq expression data in The Cancer Genome Atlas (TCGA). BMC Bioinformatics 25, 300 (2024). https://doi.org/10.1186/s12859-024-05897-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-024-05897-1

Keywords