Predicting survival outcomes using subsets of significant genes in prognostic marker studies with microarrays
© Matsui; licensee BioMed Central Ltd. 2006
Received: 01 September 2005
Accepted: 20 March 2006
Published: 20 March 2006
Genetic markers hold great promise for refining our ability to establish precise prognostic prediction for diseases. The development of comprehensive gene expression microarray technology has allowed the selection of relevant marker genes from a large pool of candidate genes in early-phased, developmental prognostic marker studies. The primary analytical task in such studies is to select a small fraction of relevant genes, typically from a list of significant genes, for further investigation in subsequent studies.
We develop a methodology for predicting survival outcomes using subsets of significant genes in prognostic marker studies with microarrays. Key components in this methodology include building prediction models, assessing predictive performance of prediction models, and assessing significance of prediction results. As particular specifications, we assume Cox proportional hazard models with a compound covariate. For assessing predictive accuracy, we propose to use the cross-validated log partial likelihood. To assess significance of prediction results, we apply permutation procedures in cross-validated prediction. As an additional key component peculiar to prognostic prediction, we also consider incorporation of standard prognostic factors. The methodology is evaluated using both simulated and real data.
The developed methodology for prognostic prediction using a subset of significant genes can provide new insights based on predictive capability, possibly incorporating standard prognostic factors, in selecting a fraction of relevant genes for subsequent studies.
Genetic markers hold great promise for refining our ability to establish precise prognostic prediction for diseases. The development of comprehensive, gene expression microarray technology has allowed the selection of relevant marker genes from a large pool of candidate genes in early-phased, developmental prognostic marker studies for various cancers including diffuse large B-cell lymphoma , follicular lymphoma , acute myeloid leukemia , lung adenocarcinoma , and metastatic kidney cancer . The selected genes will be further investigated in subsequent studies using technically simpler, but more reliable assays such as multiplexed quantitative reverse-transcriptase polymerase chain reaction (RT-PCR) in formalin-fixed, paraffin-embedded tissue sections for routine clinical use [6, 7]. Accordingly, the primary task in early-phased prognostic marker studies with microarrays would be to select a small fraction of relevant genes for subsequent studies. To this end, multiple testing to identify genes associated with prognosis is typically adopted as primary analysis, which may provide a list of significant genes.
Prediction analysis using subsets of significant genes may supplement the primary analysis. It can provide information regarding predictive capability for subsets of significant genes. More importantly, provided that appropriate measures of predictive accuracy for survival outcomes are established, it may indicate another 'cut-off' for a list of significant genes on the basis of predictive accuracy through gene filtering other than the criteria to control false positives in multiple testing. Despites many methods for prediction analysis of survival outcomes proposed in bioinformatics and biostatistics literature, including application of partial least squares [8–10] and ridge regression [11, 12], most methods intend to use the full set of genes for prediction without regard to the primary analysis.
In this article, we develop a methodology for predicting survival outcomes using subsets of significant genes. Key components in this methodology include building prediction models, assessing predictive performance of prediction models, and assessing significance of prediction results. Here, given the first two components, we can perform gene filtering. For each component, we consider particular specifications or procedures to illustrate the methodology. In building prediction models, we assume Cox proportional hazard models with a compound covariate [13, 14]. In assessing performance of prediction models, measures of explained variation for Cox regression models [15–17] may not aim to measure the performance of prediction models on future patients, i.e., predictive accuracy. We propose to use the cross-validated log partial likelihood [18, 19] to measure predictive accuracy. To assess significance of prediction results, we apply the permutation procedures in cross-validated prediction proposed by Radmacher et al. . As an additional key component peculiar to prognostic prediction, we also consider incorporation of standard prognostic factors, because it is important to determine whether a new genetic marker adds prognostic information to that already contained in the more established prognostic factors . The performance of the methodology will be evaluated using simulated data and real data from a lymphoma study.
The simplest approach of gene filtering is based on the marginal association between each gene expression and survival time [1–5]. For patient i in the training set, let h i (t) be the hazard function and xj,ibe the expression level for gene j. For gene j, we assume the univariate Cox regression model,
h i (t) = hj,0(t) exp(β j xj,i) (1)
where hj,0(t) is the baseline hazard function and β j is a parameter. Gene filtering is based on a test of hypothesis β j = 0 (e.g., a score or Wald test ). Genes are typically ranked on the basis of the value of absolute standardized test statistic. Gene filtering can be based on the number of genes  or a P-value cut-off [1, 2, 5]. A standardized score or Wald test statistic for testing hypothesis β j = 0 is asymptotically normal with unit variance and mean equal to D1/2β j σ j where is the variance of expression levels across patients for gene j and D is the expected number of events . The gene filtering is thus based on the hazard ratio associated with a change of standard deviation in gene expression for a given number of events.
For the set of K selected genes (j1, ..., j K ), the compound covariate for patient i is defined as
where is the standardized test statistic obtained in the gene filtering for the selected gene j k (k = 1, ..., K). The definition of the compound covariates weights by means of standardized test statistics has been suggested for generalized linear models in Radmacher et al. . This weighting policy reflects the criterion in the gene filtering step. Another possible policy is to use an estimate of β j , in stead of z j , as the weight for gene j (e.g., Beer et al. ). Our weighting policy gives higher weight to genes with larger variance, which would yield a more robust predictor for subsequent validation studies because the expression profiles for genes with larger variance would be more reproducible.
The compound covariate can be regarded as a prognostic index; patients with large values of the compound covariate may have poor prognosis. We assume the following Cox model to relate the compound covariate to the survival time,
h i (t) = h0(t) exp(ψ c i ) (3)
where h0(t) is the baseline hazard function and ψ is a parameter. The compound covariate for a new patient with the vector of expression level for the selected genes can be calculated by replacing with in (2), which is used for the prediction of survival time for that patient.
where l m (ψ) = l T (ψ) - l(-m)(ψ) is the difference between the partial log likelihood for the entire training set and that with the m th group of patients excluded as the test set, and is the value of ψ that maximizes l(-m)(ψ) for m = 1, ..., M. As to the number of cross-validation groups, M = 10 or 5 are reasonable choices especially for computationally burdensome analyses for large samples . A low value of ACVL corresponds to high predictive accuracy. ACVL reduces to the prediction residual error of sum of square (PRESS) in normal linear models .
When using M-fold cross-validation, it is critical that all aspects of model building including gene filtering are re-performed for each of M rounds of cross-validation to avoid selection bias [24, 25]. If we choose the cut-offs in gene filtering that minimizes ACVL, an independent validation set would be needed to have unbiased estimate of predictive accuracy because of the optimization process in model building for the training set. Matsui  demonstrated that the bias due to the optimization can be substantial in a class prediction problem from gene expression profiling using 6,000 genes for 48 bladder tumors.
A limitation of AVCL is that it is difficult to interpret for non-statisticians. Some graphical displays may be helpful to interpret the results. If an independent validation set is available, a simple way is to divide the validation set into some groups based on the value of the prognostic index and compare survival curves between groups using a log-rank test. The same type of assessment can also be performed for cross-validated test sets from the training set, but a usual log-rank test is not valid because the groups are not pre-specified independently of the survival time. A permutation procedure which permutates survival time to expression profile is available to have a correct P-value [14, 5]. In this procedure, the same cross-validated model building process, with some optimization process such as choosing optimal cut-off based on ACVL, if any, is performed to permutated data to obtain a null distribution of the log-rank statistic. This permutation procedure can also be useful for assessing the statistical significance of (the minimized) ACVL, in which one may obtain a null distribution of (the minimized) ACVL.
Adjustment for prognostic factors
Let u i be a vector of prognostic factors for patient i. For gene j, we assume the Cox model, instead of (1),
where η j is a vector of parameters. Gene filtering is based on a test of hypothesis β j = 0 in model (4). This is to select genes after adjustment for the prognostic factors. For the set of K selected genes (j1, ..., j K ), we calculate the compound covariate c i in (2) using a standardized test statistic for the hypothesis . Then we assume the Cox model, instead of (3),
h i (t) = h0(t) exp(γ'u i +ψ c i ) (5)
where γ and ψ are parameters. We assess the predictive accuracy based on ACVL for model (5). The prediction is based on the prognostic index, where and are estimates of γ and ψ, respectively, obtained from the training set.
Analyses should test whether new systems add predictiveness once outcome is adjusted for the effect of standard prognostic factors . For the validation set, a graphical display similar to that described in the previous section may be drawn for each stratum by prognostic factors and compare survival curves using a log-rank test for each stratum or a stratified log-rank test. For cross-validated test sets, a stratum-adjusted permutation procedure would be useful, in which the observed value of the log-rank statistic or (the minimized) ACVL are referred to their null distribution obtained by permutating survival time to expression profile within each stratum.
In this section, we assessed adequacy of choosing the cut-offs in gene filtering for the training set based on ACVL for the Cox model (3) through a small simulation study. We simulated data on 2,000 genes for 100 patients. Of the 2,000 genes, 50 genes were configured to be informative, i.e., these genes are associated with survival time. For informative genes, the distribution of expression was normal with mean 0 and standard deviation 1 (supposing a standardized expression data across patients for each gene). We considered exchangeable correlation matrices with correlation coefficient ρ of 0.2 or 0.7. In addition, we considered the correlation matrix obtained from the data from the lymphoma study by Rosenwald et al.  for the top 50 genes in the gene filtering with model (1). The range of correlation was -0.53 to 0.98. The informative genes were associated with survival time via a multivariate proportional hazard model,
λ0(t) exp(β'x) (6)
The averaged ACVL for several values of K across 100 simulations. Of the 2,000 genes, 50 genes were configured to be informative. The number of patients was 100.
ρ = 0.2
ρ = 0.7
We illustrated the developed methodology using the data from Rosenwald et al.  for diffuse large B-Cell lymphoma http://llmpp.nih.gov/DLBCL/. Briefly, this study collected gene expression data from cDNA microarrays using pretreatment biopsy specimens and clinical data for 240 diffuse large B-Cell lymphoma patients. Clinical data included the International Prognostic Index (IPI) Score , which is a composite score reflecting age, tumor stage, performance status, lactate dehydrogenase level, and the number of sites of extranodal disease, before treatment as a prognostic factor. 7399 microarray features were subject to analysis for predicting the survival time after treatment. In the prediction analysis, the patients were randomly divided into two groups: the training set comprised 160 patients and the validation set comprised 80 patients. The number of events (the median survival in year) was 88 (3.9) for the training set and 50 (3.6) for the validation set.
The selected genes from the gene filtering with no adjustment for IPI Score and from that with the adjustment by gene-expression signature.
In this article, we have developed a methodology for predicting survival outcomes using subsets of significant genes in early-phased, developmental prognostic marker studies with DNA microarrays. Key components in this methodology include development of prediction models, assessment of predictive capability of prediction models, and assessment of significance of prediction results. To illustrate the methodology, we introduced a particular prediction model, Cox regression models with a compound covariate, and a particular measure of predictive accuracy, ACVL. Although adequacy of them was indicated through their application to simulated data and real data, further studies to evaluate their performance through comparison with other specifications or methods would be warranted.
With respect to specification of prediction model, Bair and Tibshirani  recently developed a semi-supervised method that adopted principal components analysis for developing a compound index using subsets of significant genes from the supervised, univariate Cox regression analysis with model (1). They used the first principal component, in stead of the compound covariate c i in (2), as a single covariate in the Cox model (3). The semi-supervised method using subsets of significant genes performed well compared with various methods using a combination of all of the genes for some cancer datasets and simulated datasets. More recently, Li and Gui  considered the application of partial least square and proposed multivariate Cox regression models using the first few orthogonal compound covariates for a full set of genes. The use of the second or higher orthogonal compound covariates proposed by Li and Gui, in addition to the first compound covariate like c i in (2), for subsets of significant genes has the potential to improve predictive accuracy. A comparison study for several prediction methods including those described above using subsets of significant genes is ongoing and the result will appear in a future report. Our method with the compound covariate c i in (2) and the method proposed by Li and Gui using subsets of significant genes are expected to perform well because they are purely supervised. One potential drawback for applying principal components and partial least squares in practice is that they need a complete expression dataset with no missing values for the set of K selected genes. Because there is generally a large number of missing values in the dataset, a complete case analysis  will entail a substantial efficiency loss. As such, these methods may necessitate a data imputation step prior to model fitting. Meanwhile, the univariate standardized test statistics as the weights in compound covariate c i in (2) can be calculated using all the observed expression levels for the set of K selected genes, i.e., an available data analysis  can be performed.
The significance of integrating gene expression profiling into prognostic prediction studies is to improve the predictive capability attainable only using standard prognostic factors. However, it is rare that prognostic factors are incorporated in the prediction analysis using gene expression data in the literature. As an additional key component of our methodology, we considered selection of relevant genes with the adjustment for prognostic factors. The selected genes have the potential to be genetic markers unrelated to the prognostic factors. In such analysis, it is crucial to demonstrate additional information gain from genetic markers. We provided ways to assess this gain both for an independent validation set and cross-validated test sets. Comparison of the selected genes between without and with adjustment for prognostic factors would provide some insights in understanding biological mechanisms in the disease progression and help determine a set of genes for further investigation in subsequent studies. It is advisable that the comparison is supplemented by analyses of differentially expressed genes across different levels of prognostic factors.
We develop a methodology for predicting survival outcomes using a subset of significant genes in prognostic marker studies with microarrays. The adequacy of the methodology was demonstrated through its application to simulated and real data. Our methodology can provide new insights based on predictive capability, possibly incorporating standard prognostic factors, in selecting a fraction of relevant genes for subsequent studies.
We would thank Seiichiro Yamamoto of National Cancer Center Research Institute, Takeharu Yamanaka of Translational Research Informatics Center, and Richard Simon of Biometric Research Branch, National Cancer Institute for helpful discussions on an earlier draft of this article.
- Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, et al.: The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Eng J Med 2002, 346: 1937–1947. 10.1056/NEJMoa012914View ArticleGoogle Scholar
- Dave SS, Wright G, Tan B, Rosenwald A, Gascoyne RD, Chan WC, Fisher RI, et al.: Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells. N Eng J Med 2004, 351: 2159–2169. 10.1056/NEJMoa041869View ArticleGoogle Scholar
- Bullinger L, Dohner K, Bair E, Frohling S, Schlenk RF, Tibshirani R, Dohner H, Pollack JR: Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. N Eng J Med 2004, 350: 1605–1616. 10.1056/NEJMoa031046View ArticleGoogle Scholar
- Beer DG, Kardia SLR, Huang CC, Giordano TJ, Levin AM, Misek DE, et al.: Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine 2002, 8: 816–824.PubMedGoogle Scholar
- Vasselli JR, Shih JH, Iyengar SR, Maranchie J, Riss J, Worrell R, et al.: Predicting survival in patients with metastatic kidney cancer by gene-expression profiling in the primary tumor. Proc Nat Acad Sci USA 2003, 100: 6958–6963. 10.1073/pnas.1131754100PubMed CentralView ArticlePubMedGoogle Scholar
- Lossos IS, Czerwinski DK, Alizadeh AA, Wechser MA, Tibshirani R, Botstein D, Levy R: Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N Eng J Med 2004, 350: 1828–1837. 10.1056/NEJMoa032520View ArticleGoogle Scholar
- Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, et al.: A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Eng J Med 2004, 351: 2817–2826. 10.1056/NEJMoa041588View ArticleGoogle Scholar
- Nguyen DV, Rocke DM: Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics 2002, 18: 1625–1632. 10.1093/bioinformatics/18.12.1625View ArticlePubMedGoogle Scholar
- Park PJ, Tian L, Kohane IS: Linking gene expression data with patient survival times using partial least squares. Bioinformatics 2002, 18(Suppl 1):S120–127.View ArticlePubMedGoogle Scholar
- Li H, Gui J: Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics 2004, 20(Suppl 1):i208-i215. 10.1093/bioinformatics/bth900View ArticlePubMedGoogle Scholar
- Hastie T, Tibshirani R: Efficient quadratic regularization for expression arrays. Biostatistics 2004, 5: 329–340.View ArticlePubMedGoogle Scholar
- Pawitan Y, Bjohle J, Wedren S, Humphreys K, Skoog L, Huang F, Amler L, Shaw P, Hall P, Bergh J: Gene expression profiling for prognosis using Cox regression. Stat Med 2004, 15: 1767–1780. 10.1002/sim.1769View ArticleGoogle Scholar
- Tukey JW: Tightening the clinical trial. Controlled Clin Trials 1993, 14: 266–285. 10.1016/0197-2456(93)90225-3View ArticlePubMedGoogle Scholar
- Radmacher MD, McShane LM, Simon R: A paradigm for class prediction using gene expression profiles. J Comp Biol 2002, 9: 505–511. 10.1089/106652702760138592View ArticleGoogle Scholar
- Korn E, Simon R: Measures of explained variation for survival data. Stat Med 1990, 9: 487–503.View ArticlePubMedGoogle Scholar
- Schemper M, Henderson R: Predictive accuracy and explained variation in Cox regression. Biometrics 2000, 56: 249–255. 10.1111/j.0006-341X.2000.00249.xView ArticlePubMedGoogle Scholar
- O'Quigley J, Xu R: Explained variation in proportional hazards regression. In Handbook of statistics in Clinical Oncology. 2nd edition. Edited by: Crowley JJ. Ankerst DA, Boca Raton: Chapman & Hall/CRC Press; 2006; 347–363; 2001:397–409.Google Scholar
- Verweij M, Houwelingen V: Cross-validation in survival analysis. Stat Med 1993, 12: 2305–2314.View ArticlePubMedGoogle Scholar
- Pauler DK, Hardin J, Faulkner JR, LeBlanc M, Crowley JJ: Survival analysis with gene expression arrays. In Handbook of Statistics, Advances in Survival Analysis. Volume 23. Edited by: Balakrishnan N, Rao CR. Amsterdam: Elsevier; 2004:675–688.View ArticleGoogle Scholar
- Simon R, Altman DG: Statistical aspects of prognostic factor studies in oncology. Br J Cancer 1994, 69: 979–985.PubMed CentralView ArticlePubMedGoogle Scholar
- Kalbfleisch JD, Prentice RL: The Statistical Analysis of Failure Time Data. 2nd edition. Wiley; 2002.View ArticleGoogle Scholar
- Hsieh FY, Lavori PW: Sample-size calculations for the Cox proportional hazards regression model with nonbinary covariates. Controlled Clin Trials 2000, 21: 552–560. 10.1016/S0197-2456(00)00104-5View ArticlePubMedGoogle Scholar
- Molinaro AM, Simon R, Pfeiffer RM: Prediction error estimation: a comparison of resampling methods. Bioinformatics 2005, 21: 3301–3307. 10.1093/bioinformatics/bti499View ArticlePubMedGoogle Scholar
- Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Nat Acad Sci USA 2002, 99: 6562–6566. 10.1073/pnas.102102699PubMed CentralView ArticlePubMedGoogle Scholar
- Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Nat Cancer Inst 2003, 95: 14–18.View ArticlePubMedGoogle Scholar
- Matsui S: Statistical Applications using DNA microarrays for cancer diagnosis and prognosis. In Handbook of statistics in Clinical Oncology. 2nd edition. Edited by: Crowley JJ, Ankerst DA. Boca Raton: Chapman and Hall/CRC Press; 419–436.
- The International Non-Hodgkin's Lymphoma Prognostic Factors Project: A predictive model for aggressive non-Hodgkin's lymphoma. N Engl J Med 1993, 329: 987–994. 10.1056/NEJM199309303291402View ArticleGoogle Scholar
- Bair E, Tibshirani R: Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology 2004, 2: 0511–0522. 10.1371/journal.pbio.0020108View ArticleGoogle Scholar
- Little RJA, Rubin DB: Statistical Analysis with Missing Data. 2nd edition. Wiley; 2001.Google Scholar
- Heagerty PJ, Lumley T, Pepe MS: Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 2000, 56: 337–344. 10.1111/j.0006-341X.2000.00337.xView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.