Meta-analysis of several gene lists for distinct types of cancer: A simple way to reveal common prognostic markers
© Yang and Sun. 2007
Received: 16 October 2006
Accepted: 06 April 2007
Published: 06 April 2007
Although prognostic biomarkers specific for particular cancers have been discovered, microarray analysis of gene expression profiles, supported by integrative analysis algorithms, helps to identify common factors in molecular oncology. Similarities of Ordered Gene Lists (SOGL) is a recently proposed approach to meta-analysis suitable for identifying features shared by two data sets. Here we extend the idea of SOGL to the detection of significant prognostic marker genes from microarrays of multiple data sets. Three data sets for leukemia and the other six for different solid tumors are used to demonstrate our method, using established statistical techniques.
We describe a set of significantly similar ordered gene lists, representing outcome comparisons for distinct types of cancer. This kind of similarity could improve the diagnostic accuracies of individual studies when SOGL is incorporated into the support vector machine algorithm. In particular, we investigate the similarities among three ordered gene lists pertaining to mesothelioma survival, prostate cancer recurrence and glioma survival. The similarity-driving genes are related to the outcomes of patients with lung cancer with a hazard ratio of 4.47 (p = 0.035). Many of these genes are involved in the breakdown of ECM proteins regulating angiogenesis, and may be used for further research on prognostic markers and molecular targets of gene therapy for cancers.
The proposed method and its application show the potential of such meta-analyses in clinical studies of gene expression profiles.
Changes in gene expression levels could reflect clinically distinct conditions. Genome-wide perspectives of gene expression can now be obtained, and these can be combined with other currently-used criteria to identify predictors of clinical outcome for specific cancers [1–7]. Also, distinct gene expression profiles can reportedly determine molecular treatment responses, e.g. in cancer . Thus it is possible to discover biomarkers from gene expression profiles that help to predict outcomes, and this emphasizes the need in biomedical research to combine results from similar experiments in order to identify diagnostic or prognostic disease markers.
Much recent research has confirmed that microarray results are comparable among different laboratories, especially when a common platform and a common set of procedures are used [9–13]. Integrative analysis that evaluates cancer transcriptome data in the context of data from other sources has received attention recently (reviewed by Rhodes and Chinnaiyan). An important emerging argument concerns the uniformity of cancer metastases as well as the evolution of malignancy in primary tumors [15–17]. Grutzmann et al. ran a meta-analysis of four studies of pancreatic cancer and validated the signatures they identified using RT-PCR and immunohistochemistry. In particular, Glinsky and colleagues published an 11-gene signature that is displayed consistently in stem cell self-renewal pathways and is a powerful predictor of prognosis in 11 distinct types of cancer. These results exemplify the clinical application of meta-analysis signatures detected in different cancer stages or types. Rhodes et al. presented a comprehensive investigation of 40 data sets and identified a robust signature of differentially expressed genes when cancer and normal tissues were compared. A recent study identified lists of differentially regulated genes that also significantly overlap with genes regulated by the tumor suppressors p16 and pRB. This work helps to translate genome-wide expression analyses into clinically useful cancer markers. Meta-analysis is a powerful tool for identifying and validating marker genes in the above studies [13, 18]. However, in these studies, meta-signatures are identified on the basis of the individual genes used for analysis. Segal et al. divided genes into sets and reported that certain sets show coherent behavior across a diverse group of clinical conditions.
Another recent publication compared gene expression in two conditions to generate a gene list for each study, and then detected significant Similarities of Ordered Gene Lists (SOGL)  from different studies. The above two approaches extend the determination of significance from single study analysis to meta-analysis.
However, none of the above studies involving multiple cancers mentions independent prediction, which is a key bridge between molecular knowledge and clinical application. In particular, the SOGL approach can detect similarities between two gene lists, irrespective of significant differences between them, because it does not rely on differential gene expression having strong effects in each single list, but rather on consistent changes across multiple lists. SOGL is similar to other non-parametric statistical tests, except that it uses different weighting schemes for ranks. The idea is to give higher weights to the genes that are more differentially expressed, and to sum all the weighted orders to quantify the similarity. This approach allows the significance of similarity to be decided during meta-analysis and identifies the genes responsible for the similarity. In contrast to previous methods, SOGL does not depend on the definition of a particular "significance" threshold for a single study. Thus it is superior to other methods for detecting signatures in studies with weak effects or small sample sizes.
However, the similarities among gene lists are not guaranteed to be transferable. With the discovery of common cancer signatures, there is a need to extend the method from two lists to several. Therefore, to meta-analyze many microarray profiles together, and to analyze outcome in highly noisy data, we have extended the SOGL method in this paper from the comparison of two gene lists to the comparison of multiple gene lists, which is useful for meta-analysis of microarray data. When the gene lists show similarity, we ask whether the similarity-driving genes improve the predictive power of a single study. To this end, we evaluate SOGL in two ways. One is to compare the accuracy of prediction by meta-analysis with that of individual analysis, which has already been successfully demonstrated for multiple cancer microarray data sets. The other is to compare the traditional highest t-score with SOGL in selecting variables for classification, which has not been done before in the context of cross-validation and class prediction. Finally, we discuss the predictive capacity of the similarity-driving genes detected in three solid tumors, and demonstrate it on another independent cancer data set.
Table: Clinical information about the microarray studies we collected (samples with outcome notation).
The data set described by Ross used a relatively newly designed microarray platform with 132 representative cases from another data set with 327 cases. Therefore a significant similarity between the gene lists generated from these two data sets was expected. Adding another data set on leukemia outcome, we applied SOGL to the comparison of more than two gene lists. Thus we performed the meta-analysis allowing partially common samples to generate an "artificial" similarity. However, finding a similarity in gene lists between samples run on different platforms is not our interest here, as many programs would find this. The question we address is whether our method improves the accuracy of prediction from individual studies when there is significant similarity.
Recurrence of breast cancer and lymph node status of breast cancer;
The same two lists, and neuroendocrine status of lung cancer;
Survival of mesothelioma and glioma, and recurrence of prostate cancer;
The above set of lists and the lymph node status of breast cancer or neuroendocrine status of lung cancer.
All the above sets of gene lists achieved higher pAUC (partial area under curve) scores  than most other comparisons. A pAUC-score evaluates the degree of overlap between two distributions. Note that a higher pAUC-score shows a greater likelihood that the estimated SOGL scores exceed chance in our method, and a larger α indicates more similarities at the higher ends of the gene lists. This finding supports the emerging notion that when prognosis is poor, there are commonalities among distinct types of cancer in the dysregulation of gene expression, implying that poor prognosis is sometimes independent of the original cancer type. In contrast, this kind of similarity was not so significant when more than 4 of the studies we collected were compared, demonstrating that the similarities spanning tumor tissues are limited.
The similarity-driving genes found in the G, M, and P studies
Nevertheless, the z-statistic puts less weight on variances than a classical t-statistic. These genes contain a high proportion of known prognostic marker genes and represent biological processes involved in tumor progression and metastasis. To evaluate over-representation of GO annotations in gene lists calculated from a specific microarray (Affymetrix Hgu95av2), we ran hypergeometric tests to compute p-values. Each test evaluates the likelihood that the corresponding number of annotations would occur in a random list of genes of the same size. Interestingly, 4 of them are genes for the human extracellular matrix (ECM)-receptor interaction pathway (hypergeometric test p = 1e-6), namely COL4A1, COL1A2, COL5A2 and FN1. Moreover, 7 of our short-list of 13 genes encode ECM proteins and regulators of ECM assembly, namely FN1, BGN, POSTN, COL4A1, COL11A1, COL1A2 and COL5A2. The other 5 genes have roles in angiogenesis: ANXA2, CPE, MDK, IGFBP3, and PTGDS (3 transcripts). Although ANXA2 (annexin A2) is a substrate for a variety of protein kinases, and plays an important role in plasmin regulation and in cancer cell invasiveness and metastasis, ANXA2P3 (annexin A2 pseudogene 3) is a novel marker that has not been reported previously. We discuss these genes in more detail in a later section.
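The hypergeometric over-representation test described above can be illustrated with a minimal sketch. It computes the upper-tail probability of seeing at least k annotated genes in a list of n genes drawn from a universe of N genes, K of which carry the annotation. The function name and all numbers in the example are hypothetical and chosen only for illustration; they are not the counts used in the study.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation of interest."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical numbers (assumptions, not taken from the study):
# 9000 annotated genes on the chip, 80 of them in the pathway,
# and a short-list of 13 genes containing 4 pathway members.
p_value = hypergeom_upper_tail(k=4, N=9000, K=80, n=13)
print(f"p = {p_value:.2e}")
```

A small p-value here means that a random gene list of the same size would rarely contain as many annotated genes as the observed list.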
21 significantly similar comparisons of the ordered gene lists, with the same labels as used in Figure 3
A B L
A B M
A B P
A B G
A L P
B M G
M P G
A B L M
A B L P
A B L G
A B M G
A B P G
A L M P
A L M G
A M P G
L M P G
A B L M G
A B M P G
A L M P G
We want to emphasize that we did not test the statistical significance of the association between the identified genes and survival outcomes by fitting a Cox proportional-hazards model to each gene. We believe that the consensus change of these genes as a group carries information, and that this information is of critical importance in elucidating the complex genetic architecture of tumor progression, e.g. particular biochemical pathways. In fact, two of the small set of transcripts are insulin-like growth factor binding protein-3 (IGFBP3), over-expression of which has already been annotated as a promoter of apoptosis in cancer cells, activated by p53 [30, 31]. Moreover, other in vivo or in vitro studies have recently and independently detected increased expression of COL5A2 in colorectal cancer, increased expression of PTTG1 correlated with poor prognosis in glioma, and down-regulation of PTGDS as an important variable in liver and bladder cancer cells and in malignantly progressing forms of oral tissue [34–36].
Treatment of cancer patients is known to affect prognosis in several ways. For an identical tumor, prognosis may be good if the condition has been diagnosed in good time but hopeless otherwise. Also, the set of genes that show significant changes of expression in one specific tumor includes genes that are significant for prognosis. Genes that are recognized statistically, especially in small data sets, might be of little value for new patients. In contrast, the genes that show consistent changes across all prognostic gene lists have key roles in cancer development and progression. Therefore, to detect universal prognostic markers, integrated analysis based on large patient groups is required, and significance needs to be judged at the meta-analysis stage. SOGL quantifies and tests the similarities between two or more gene lists. The genes driving the similarity are those with prominent ranks in all the lists compared. Notwithstanding personal and other influences, these genes may genuinely indicate molecular alterations common among neoplasias. Another serious concern for bioinformatics researchers is the arbitrary or over-fitted choice of statistical approach, which yields far-from-reliable gene sets. Information about clinical outcomes is unstable and weak because the differences among individuals might be large, and the challenge is to overcome this problem. Our results show that the SOGL method complements previous methods and is robust. In our limited data, the marker genes identified on the basis of one effect size concur with those based on another. Although not strongly superior, SOGL tends to be more accurate than the highest t-score for variable selection in meta-analysis. Studies that in isolation do not provide solid evidence for differential gene expression may present striking similarities in their gene lists. Thus SOGL can identify consensus signals from either strong or weak effects, independently of an arbitrary threshold.
Moreover, it would be of greater interest to apply SOGL to the exploration of disease mechanisms based on these genes that change in consensus. Unlike approaches that target only the "best" marker, the results of SOGL might include genes that are deemed "redundant" by a particular "threshold" of significance or correlation in an individual study. Co-regulated genes in a biological pathway, genes in a parallel pathway, and genes having epistatic actions are in fact of critical importance in elucidating the complex genetic architecture of a complex disease. Thus SOGL might be used to uncover hidden patterns of genes on microarrays. Instead of distinctions of significance or correlation, it focuses on the genes relevant to the condition of interest that are consistently changed across multiple studies.
Biologists usually compare independent studies addressing the same research question to confirm findings. It is also possible to compare studies from slightly different but related contexts in order to discover common markers. This study is an attempt to mine cancer data sets for common molecular features shared among phenotypically different types of cancer that involve distinct biological underpinnings, disease progression, diagnosis and prognosis. We detected and confirmed that significant similarities span several kinds of cancer. This result supports the emerging notion that different types of tumors for which prognosis is poor share common disorders in the regulation of gene expression. This implies that poor prognosis sometimes develops independently of the original cancer type.
A substantial literature suggests that the similarity-driving genes are promising as tumor markers and as targets for tumor therapy. The genes common to the top ends of the lists for the outcomes of the three cancers studied here include those originally used by Singh and Gordon [5, 38] for outcome prediction, such as IGFBP3. FN1 has also been used in real-time PCR-based multigene outcome predictive models for lymphoma and prostate cancer. Expression of POSTN is reportedly associated with bone metastasis from breast cancer and has been proposed as a prognostic marker in lung tumor invasion. Dysregulation of ANXA2 has been reported in human bone cancer metastases and is correlated with the clinical prognosis of prostate cancer. Additional supporting evidence for the prognostic value of the genes in Table 2, from in vitro and in vivo experiments, is cited in the last section of the Results.
Our most striking finding, however, is that the genes detected from fold changes (MDK, CPE, POSTN, COL4A1, COL11A1, COL1A2, COL5A2, IGFBP3, FN1, ANXA2, BGN and PTGDS) and all 4 genes detected with the z-statistic as effect size (PTTG1, COL5A2, IGFBP3 and PTGDS) are associated with angiogenesis. Angiogenesis leads to the formation of a large anastomosing vascular network, allowing tumor growth, intravasation and the spread of metastases. MDK, which plays an important role in the intercellular interactions involved in angiogenesis, is reported to be strongly correlated with poor prognosis in a large number of cases irrespective of tissue type [45–50]. Another gene, CPE, is relatively down-regulated in the three poor-outcome samples of carcinoid tumors, and takes part in producing angiogenic factors upon the maturation of follicle stimulating hormone. Generally, the breakdown of ECM proteins, which correlates with angiogenesis, is an essential step in cancer invasion and metastasis. We found that up-regulation of 7 genes involved with the ECM is associated with poor cancer outcomes. ECM-related genes that promoted the strongest proliferation, including POSTN, BGN, type I collagen and type IV collagen, have already been identified as cancer markers, and might be molecular targets for gene therapy. In addition, BGN and PTGDS have recently been reported in an in vitro angiogenesis system. The oncogenic potential of PTTG1 has been well characterized in mouse fibroblast (NIH3T3) cells, in which it induces proliferation and promotes tumor formation and angiogenesis. It has been reported as a prognostic marker for tumor invasiveness and metastasis and is suggested to be a potent human oncogene. These findings suggest that by inhibiting angiogenesis, it may be possible to restrict the blood supply to tumors and limit their ability to grow and metastasize.
Our results support the anti-angiogenic hypothesis concerning polymeric FN1 and ANXA2 and suggest more candidate markers. Because the similarities among multiple tumor tissues cannot be identified by speculation, we believe that further meta-analysis of more data will aid research on prognostic markers of many cancers.
For a small clinical trial, it is important to summarize all the evidence obtained and combine it with evidence from other trials or laboratory studies. Combined with suitable statistical algorithms, meta-analysis enables general conclusions to be drawn, develops support for hypotheses, and produces an estimate of the overall effects of a program. This study suggests that our meta-analysis of gene lists for different clinical or physiological phenotypes provides an opportunity to detect biologically relevant gene dysregulation shared between different phenotypes, possibly leading to improved diagnostic accuracy or to insights into the molecular mechanisms that link different phenotypes. To this end, SOGL is superior to other gene selection measures for meta-analysis of clinical microarrays in handling study-to-study differences. It focuses on the genes relevant to the condition of interest that are consistently changed across multiple studies, rather than on distinctions of significance or correlation. Our study has assessed its potential for identifying prognostic markers of multiple cancer types from studies by different laboratories, especially for studies with large inter-individual variation or small sample sizes. The proposed method complements and extends existing algorithms for research on gene expression.
In addition, our results suggest and confirm that a common molecular mechanism underlies the poor outcomes of several kinds of cancer. The genes we detected have important implications for our understanding of the potential involvement of angiogenesis in the malignant progression of primary tumors. This suggests that meta-analysis has considerable potential in clinical studies of gene expression profiles, which is a focus of active research for computer-assisted diagnosis. To ensure reproducibility of our biological findings, larger numbers of samples representing a greater percentage of the disease are required. It is expected that further studies incorporating more data sets with larger numbers of samples will identify universal prognostic markers in cancer.
In transcriptional research, the raw data have to be corrected for different conditions by normalization. We normalized all raw profiling files on an additive scale by pre-processing methods for stabilizing variance . "An additive scale" means transforming the intensities to a scale where the variance is approximately independent of the mean intensity. This can be achieved by calibrating for sample-to-sample variations through shifting and scaling, or by log-transforming the data. For simplicity, we focused on the published microarray studies of cancer outcomes based on Affymetrix chips, which have sufficient data and have gained acceptance in recent years because of the reliable annotation and identification and the good hybridization characteristics of oligonucleotides with wide-ranging expression levels . Only the best-matched transcripts  were used to compare studies based on different chips.
The definitions of outcomes for all the studies we collected strictly followed those of the original papers. To evaluate the power of signature detection in transcript expression and the accuracy of prediction by our adopted method, we integrated all the relatively non-malignant outcomes as "good". In contrast, the patients were grouped as "poor" if they suffered shorter survival or if there was recurrence within the observed time. The data sets were:
Leukemia C: The data came from research on adult T-cell acute lymphoblastic leukemia (ALL) . The good prognosis group consisted of 7 patients in complete clinical remission (CCR) and 2 patients who had not relapsed within two years. The poor prognosis group consisted of 6 refractory patients and 12 who had relapsed within two years.
Leukemia Y: The data included 327 children with leukemia. Excluding the patients without outcome information, the good group consisted of 201 CCR patients, while the poor responder group consisted of 44 patients with different types of relapse.
Leukemia R: Gene expression profiles of 93 patients with prognostic information from the above study were examined by Ross et al. using another microarray chip. The good prognosis group consisted of 71 CCR patients. The poor prognosis group consisted of 16 relapsed patients and 6 patients with secondary AML.
Mesothelioma: A prognostic study on mesothelioma, a lethal neoplasia of the pleura . The good responder group consisted of 8 patients who survived more than seventeen months, while the 10 patients in the poor responder group survived less than six months.
Prostate: This comparison was constructed from 21 prostate tumor samples with respect to recurrence following surgery. The good prognosis group consisted of 13 patients who had shown no relapse for at least four years, and the poor outcome group consisted of 8 patients who relapsed.
Glioma: This comparison was based on the data of Shai et al. . The good prognosis group consisted of 8 primary (not secondary) glioblastoma multiforme (GBM) patients of various pathological types and grades with a survival time of more than three years, while the poor responder group consisted of 10 malignant glioma patients who survived less than one year.
Breast 1: The data were taken from a prognostic study of primary breast tumors by Huang et al. . In total, 37 patients were included. The good prognosis group consisted of 19 "low-risk" patients, and the poor responder group consisted of 18 patients identified as "high-risk" by their lymph-node status.
Breast 2: These data were also described by Huang et al. . Here, however, the prognostic groups were defined directly by clinical outcome. The good responder group consisted of 34 patients who were recurrence-free over three years, while the poor responder group consisted of 18 patients who suffered recurrent disease within the first three years after surgery.
Lung: The data included 126 adenocarcinoma (one subtype of lung cancer) cases without metastases reported by Bhattacharjee et al. The lung cancer data set did not define the outcome classification for each case. However, the authors reported that the neuroendocrine C2 adenocarcinomas were associated with a less favorable survival outcome. Therefore the poor responder group consisted of 9 neuroendocrine C2 adenocarcinoma patients, while the good responder group consisted of all the other 117 adenocarcinoma patients.
Illustration of the cardinality of the intersection $O_n(G_D)$, with $O_n(G_{1,2,3})$ as the example
where decreasing weights (w) are used as: . In this way, we strengthen the two ends of the integrated transcript orders. By setting the parameter α, one can calibrate the weights to decide how deeply these gene orders are to be investigated.
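The weighted score for several gene lists can be sketched as follows. The weight formula is missing from the text above, so this sketch assumes exponentially decaying weights exp(-α·n). That is an assumption made here for illustration, not a formula confirmed by this article. The function names and the toy gene lists are hypothetical. For every depth n the code counts the genes shared by the first n entries of all lists and by the last n entries of all lists, and sums these counts with decreasing weights.

```python
from math import exp

def top_overlap(rankings, n):
    """Number of genes that appear among the first n entries of every ranking."""
    common = set(rankings[0][:n])
    for ranking in rankings[1:]:
        common &= set(ranking[:n])
    return len(common)

def similarity_score(rankings, alpha):
    """Weighted sum over all depths n of the overlaps at both ends of the lists.
    Assumption: the weight at depth n is exp(-alpha * n)."""
    length = len(rankings[0])
    # Reversing every list turns "last n entries" into "first n entries".
    reversed_rankings = [ranking[::-1] for ranking in rankings]
    score = 0.0
    for n in range(1, length + 1):
        weight = exp(-alpha * n)
        score += weight * (top_overlap(rankings, n) + top_overlap(reversed_rankings, n))
    return score

# Toy example with three orderings of six hypothetical genes.
a = ["g1", "g2", "g3", "g4", "g5", "g6"]
b = ["g2", "g1", "g3", "g5", "g4", "g6"]
c = ["g1", "g3", "g2", "g4", "g6", "g5"]
identical = similarity_score([a, a, a], alpha=0.5)
mixed = similarity_score([a, b, c], alpha=0.5)
print(identical, mixed)
```

A larger α concentrates the score on the extreme ends of the lists, while a smaller α lets deeper ranks contribute.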
where w is the false positive rate and ROC(w) is the true positive rate. A high pAUC-score indicates good separation. Given a parameter α_i, the separation of alternative scores and noise scores indicates the similarity between the leading genes in these gene orders. From a predefined finite grid of parameters, we can then pick the value that provides the best discrimination between signal and noise. The significance was then evaluated for the chosen α. To this end, we simulated the distribution of the similarity score under the assumption of unrelated lists, generating B (= 1000) sets of random ranks to calculate the random scores. Significance was evaluated by computing an empirical p-value for the observed score from the B random scores.
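The permutation step can be sketched as below. Every list is shuffled independently B times, a score is computed for each shuffled set, and the empirical p-value is the fraction of random scores that reach the observed score. The "+1" correction in the p-value and the simple stand-in score `top_k_overlap` are assumptions made for this illustration; the article does not state which estimator was used, and the actual analysis would plug in the weighted similarity score.

```python
import random

def empirical_p_value(observed, random_scores):
    """Fraction of random scores at least as large as the observed score.
    Assumption: a +1 correction keeps the estimate away from exactly zero."""
    hits = sum(1 for score in random_scores if score >= observed)
    return (hits + 1) / (len(random_scores) + 1)

def random_scores_for(rankings, score_fn, B=1000, seed=0):
    """Scores obtained after independently shuffling every list, repeated B times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(B):
        shuffled = []
        for ranking in rankings:
            copy = list(ranking)
            rng.shuffle(copy)
            shuffled.append(copy)
        scores.append(score_fn(shuffled))
    return scores

def top_k_overlap(rankings, k=3):
    """Hypothetical stand-in score: genes shared by the top k of all lists."""
    common = set(rankings[0][:k])
    for ranking in rankings[1:]:
        common &= set(ranking[:k])
    return len(common)

genes = [f"g{i}" for i in range(30)]
lists = [genes, genes, genes]  # three identical orderings as an extreme case
observed = top_k_overlap(lists)
null_scores = random_scores_for(lists, top_k_overlap)
p = empirical_p_value(observed, null_scores)
print(observed, p)
```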
The similarity-driving genes should be consistently represented among the leading items in the gene orders. Given α*, one can compute a cutoff rank n* such that the genes up to rank n* account for 95% of the score S_α*. Note that the SOGL score is the sum of the scores for the two ends. Thus we compute the similarity scores for up- and down-regulation separately, ignoring genes for which the isolated up- or down-regulation yields scores no higher than the 99th percentile of the random scores. The expected random scores are given by B (= 1000) shuffled orderings. For example, if a certain significance is due to the most strongly down-regulated genes but not to the most strongly up-regulated genes, we ignore the intersection of up-regulated genes.
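The choice of the cutoff rank n* can be sketched as a cumulative sum over the weighted overlaps. The sketch takes the overlap counts per depth as input and, as an assumption made only for illustration, uses weights exp(-α·n). The function name `cutoff_rank` and the example counts are hypothetical.

```python
from math import exp

def cutoff_rank(overlaps, alpha, fraction=0.95):
    """Smallest depth n* at which the cumulative weighted overlap reaches
    `fraction` of the full score. overlaps[n - 1] is the overlap at depth n.
    Assumption: the weight at depth n is exp(-alpha * n)."""
    contributions = [exp(-alpha * n) * count
                     for n, count in enumerate(overlaps, start=1)]
    total = sum(contributions)
    if total == 0:
        return None  # no overlap at any depth, so no cutoff exists
    running = 0.0
    for n, contribution in enumerate(contributions, start=1):
        running += contribution
        if running >= fraction * total:
            return n
    return len(contributions)

# Hypothetical overlap counts for depths 1 to 10.
example_overlaps = [1, 2, 2, 3, 4, 4, 5, 6, 6, 7]
n_star = cutoff_rank(example_overlaps, alpha=0.5)
print(n_star)
```

The genes in the intersection of the lists up to depth n* would then be taken as the similarity-driving genes.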
We expected that combined studies would predict the outcome for single patients better than a single study can, assuming that there is commonality in the dysregulation of gene expression for certain malignant processes. To validate this assumption, we calculated the number of correct predictions for each study in two steps. (1) All patients from three similar studies were mixed into one integrated data set to cross-validate the outer and inner loops. This resulted in a vote matrix containing the number of times each sample was assigned to each class in the outer cross-validation loop. We counted the coincidences between true class and consensus class for samples study by study to obtain three tables. (2) The same cross-validation was run on each single data set to obtain independent tables for each study. For both steps, we repeated the cross-validation with the same stratified strategy (class-balanced folds) and adopted identical variable selection and classification methods. We assume that we can combine data from different studies into one replicated data set if the gene lists are significantly similar for a certain two-condition test.
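Counting the coincidences between true class and consensus class from a vote matrix can be sketched as below. The data layout (a dictionary of vote counts per sample), the function names and the example samples are hypothetical. Ties are broken by taking the first class in dictionary order, which is an arbitrary choice made for this illustration.

```python
def consensus_classes(votes):
    """votes maps a sample id to {class label: number of times assigned}.
    Returns the most frequently assigned class for every sample."""
    return {sample: max(counts, key=counts.get) for sample, counts in votes.items()}

def count_correct_by_study(votes, true_class, study_of):
    """Number of samples per study whose consensus class equals the true class."""
    consensus = consensus_classes(votes)
    correct = {}
    for sample, predicted in consensus.items():
        study = study_of[sample]
        correct.setdefault(study, 0)
        if predicted == true_class[sample]:
            correct[study] += 1
    return correct

# Hypothetical vote matrix from ten outer cross-validation runs.
votes = {
    "m1": {"good": 8, "poor": 2},
    "m2": {"good": 3, "poor": 7},
    "p1": {"good": 1, "poor": 9},
}
true_class = {"m1": "good", "m2": "good", "p1": "poor"}
study_of = {"m1": "M", "m2": "M", "p1": "P"}
correct = count_correct_by_study(votes, true_class, study_of)
print(correct)
```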
We next compared SOGL with the traditional highest t-score for selecting variables for prediction, carrying out the same classification and patient clustering strategy before meta-analysis. To avoid study-to-study bias or prevalence of smaller sample sizes, we randomly drew class-balanced and study-balanced training sets. "Class-balanced" means that we guarantee the combined training set comprises approximately half poor-outcome and half good-outcome patients. "Study-balanced" means that we guarantee the training set contains all the different tumors, and keeps more or less the same proportion of each. Patients not used in the combined training set were used for validation. For SOGL variable selection, we focused on a fixed number of orderings to calculate the similarity score, and selected the intersection that accounts for 95% of the score. The resulting variables were used to predict the outcomes for patients in the associated validation set after tuning the hyperparameters of the SVM. After this step, we recorded the number of selected genes, then picked the same number of genes with the highest t-statistic to estimate the accuracy of prediction in the validation set. The above training/validation step was iterated D (= 500) times using an SVM with a linear kernel from the R package e1071. To compare the two variable-selection methods, we drew a Receiver Operating Characteristic (ROC) curve from the error metrics generated by the D repeats of the training/validation step for each test. ROC is a plot of the true positive rate (TPR) on the y axis against the false positive rate (FPR) on the x axis for the different possible cut-off points of a diagnostic test. Thus, for every observed FPR, we calculated the mean value of the corresponding TPR to plot the point on the ROC curve.
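One way to draw a training set that is both class-balanced and study-balanced is to sample the same number of patients from every combination of study and outcome. The sketch below follows that idea. The function name, the number drawn per combination and the patient identifiers are hypothetical; the group sizes are those listed above for the mesothelioma (M), prostate (P) and glioma (G) studies. How the original analysis drew its training sets is not specified beyond the description in the text, so this is only one possible reading.

```python
import random
from collections import defaultdict

def balanced_split(patients, per_cell, seed=0):
    """patients is a list of (patient_id, study, outcome) tuples.
    Draws up to `per_cell` patients from every (study, outcome) combination
    for training; all remaining patients form the validation set."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for patient_id, study, outcome in patients:
        cells[(study, outcome)].append(patient_id)
    train = []
    for key in sorted(cells):
        members = cells[key]
        train.extend(rng.sample(members, min(per_cell, len(members))))
    chosen = set(train)
    validation = [pid for pid, _, _ in patients if pid not in chosen]
    return train, validation

# Group sizes of the M, P and G studies; identifiers are made up.
sizes = {("M", "good"): 8, ("M", "poor"): 10,
         ("P", "good"): 13, ("P", "poor"): 8,
         ("G", "good"): 8, ("G", "poor"): 10}
patients = [(f"{study}_{outcome}_{i}", study, outcome)
            for (study, outcome), n in sizes.items() for i in range(n)]
train, validation = balanced_split(patients, per_cell=5)
print(len(train), len(validation))
```

Drawing equal numbers per combination gives half good-outcome and half poor-outcome patients overall and the same share for every tumor type, at the cost of leaving larger studies with more validation samples.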
Let u be the number of true good-outcome patients predicted to have a good prognosis; v the number of true good-outcome patients predicted to have a bad prognosis; t the number of true bad-outcome patients predicted to have a good prognosis; and s the number of true bad-outcome patients predicted to have a bad prognosis. Taking poor outcome as the positive class, the true positive rate (sensitivity) and the false positive rate (1 - specificity) are:
TPR(poor-outcome) = s/(s + t); FPR(poor-outcome) = v/(u + v).
We measured the area under the ROC curves to evaluate the difference between SOGL and t-scores.
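The ROC construction and the area measurement can be sketched together. The rates follow the definitions of u, v, t and s given above; the TPR is averaged for every observed FPR, and the area is computed with the trapezoidal rule. Anchoring the curve at (0, 0) and (1, 1) and using the trapezoidal rule are assumptions of this sketch, and the counts in the example are hypothetical.

```python
from collections import defaultdict

def rates(u, v, t, s):
    """(TPR, FPR) for the poor-outcome class from the four counts."""
    return s / (s + t), v / (u + v)

def mean_roc(points):
    """Average the TPR for every observed FPR. Returns sorted (FPR, mean TPR)
    pairs. Assumption: the curve is anchored at (0, 0) and (1, 1) unless those
    FPR values were observed."""
    by_fpr = defaultdict(list)
    for tpr, fpr in points:
        by_fpr[fpr].append(tpr)
    curve = {fpr: sum(values) / len(values) for fpr, values in by_fpr.items()}
    curve.setdefault(0.0, 0.0)
    curve.setdefault(1.0, 1.0)
    return sorted(curve.items())

def auc(curve):
    """Trapezoidal area under a sorted list of (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical (u, v, t, s) counts from three training/validation repeats.
repeats = [(10, 3, 2, 8), (9, 4, 3, 7), (11, 2, 4, 6)]
points = [rates(*counts) for counts in repeats]
curve = mean_roc(points)
area = auc(curve)
print(f"AUC = {area:.3f}")
```

Computing this area once for SOGL-selected genes and once for genes selected by the highest t-score gives two numbers that can be compared directly.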
We thank Dr. C Lottaz from the Max Planck Institute for Molecular Genetics (Berlin) and Dr. J Jaeger from the Swiss company Hamilton for helpful discussions about the statistical analysis. We are grateful to Prof. Z Ai from the University of Illinois at Chicago for careful reading of the draft. This research has been supported by the Natural Science Foundation 60671018, 60121101, National High Technology Research and Development Program of China 863-2005AA231070 and Southeast University Foundation XJ0711279.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.