Description of datasets
The study utilizes two publicly available cancer datasets. The first is a subset of the breast cancer dataset collected at The Norwegian Radium Hospital, Oslo . The original dataset contained 85 tissue samples representing 84 patients, including 78 breast carcinomas, 3 fibroadenomas, and 4 normal breast tissue samples. Given that a common interest in theme-driven experiments is to distinguish subtypes of tumours, only data from a 51-sample subset of locally advanced breast carcinomas were used in the current study. The gene expression data of these 51 samples were downloaded from the Stanford Microarray Database (SMD) . The clinical data were downloaded from the web supplement related to this dataset.
We also obtained a lung cancer dataset collected at the Howard Hughes Medical Institute . The original dataset consisted of 67 human lung tumours and 6 normal lung tissue samples representing 56 individuals. All tumour samples were used in this study. The gene expression data of the 67 samples were downloaded from SMD. The clinical data were downloaded from the web supplement related to this dataset.
The gene expression data for each tissue sample of the breast dataset were retrieved in separate Excel spreadsheets (one spreadsheet per sample) from SMD. All blank probes (i.e. the empty spots on the array) as well as all customised non-mappable probes were discarded (since genes in each geneset were represented by Entrez Gene accession numbers, it would have been difficult to include customised probes since no mapping tool converts those to Entrez Gene accession numbers). As a result, only those probes that were mappable IMAGE clones remained in the filtered dataset. Some clones referred to mitochondrial genes and were also discarded. Only adequately measured probes were used in this experiment, defined as a probe with fluorescent hybridization signals at least 1.5-fold greater than the local background signal in the reference channel. For each adequately measured probe, the normalised log ratio to the base 2 was extracted, whereas for inadequate probes a missing value (NA) was assigned. Since the dataset was derived from four distinct batches of cDNA microarrays, some probes only had values for a subset of samples where such probes were unique to the array type of those samples. In the cases where arrays contained replicated probes (i.e. the same IMAGE clone identifiers) a single average log ratio was used. Only those probes for which technically adequate measurements were obtained from at least 60% of the samples were used . To correct for any systematic bias resulting from different amounts of cDNA in the reference channel, the probes of the gene expression subset were median centred (i.e. the gene-wise median expression value of a probe was subtracted from each log-ratio in the probe).
A similar procedure was carried out for the lung cancer dataset, with the only difference being the number of microarray types from which the dataset was derived and the number of probes in the arrays. The lung dataset came from six different batches of cDNA microarrays but, as opposed to the breast dataset, all six microarray types had approximately the same set of probes. For this reason it was possible to use a stricter criterion for filtering probes, with 80% of adequate measurements being required for a probe to be used .
Both the breast and lung cancer datasets were accompanied by extensive clinical data. Information relevant to this study included the overall survival (OS) time and its status (alive versus dead) for both datasets, with relapse free survival (RFS) time and its status (no relapse versus relapse) also being available for the breast cancer dataset. The OS status of the breast dataset was simplified from four different values (0 = no evidence of disease, NED; 1 = alive with disease, AWD; 2 = dead of disease, DOD; 3 = dead of other causes, DOC) to two (0 = NED, AWD, DOC; 1 = DOD), since our study aim was to focus on patient survival.
Description of genesets
Predefined sets of genes were constructed from three distinct functional annotation groups: Biocarta pathways, KEGG pathways, and GO terms. The genesets were obtained from the human gene annotation file hosted at the Cancer Genome Anatomy Project (CGAP) web site . For each functional group, all pathways or terms were retrieved along with all genes (Entrez Gene identifiers) that belonged to them. As a result, 179 genesets from KEGG, 314 from Biocarta, and 6,032 from GO were generated. One other set included was the original Chang et al. gene expression signature of fibroblasts in response to serum (CSR as defined by the authors), which was the only geneset made up of IMAGE clone identifiers.
Genesets that contained small (< 5) or very substantial (> 1,000) numbers of genes were excluded, as they are likely to be of limited biological utility. Entrez Gene accession numbers were converted to IMAGE clone identifiers using the web tool Clone/Gene ID Converter . A geneset was also discarded if its genes were mapped to less than 5 or more than 2,500 probes. Additionally genesets that resulted in very uneven clusters were discarded (see next section), as well as a small number of genesets whose corresponding data was difficult to cluster because of excessive numbers of missing values. Overall, 1,082 genesets were tested in the breast cancer dataset, and 1,426 in the lung cancer set.
Patient classification and initial statistical analyses
Patient classification and statistical analyses were performed with R (version 2.7) . For each geneset and each cancer dataset, the gene expression values of the geneset probes (i.e. mapped genes) that were present in the cancer dataset were retrieved. The samples of this subset of gene expression data were clustered by average linkage clustering using a correlation-based distance matrix d
where ρxy is the Pearson's correlation coefficient of x and y. In some cases, clustering could not be performed because missing values in the gene expression data would have generated missing values in the distance matrix. Since all pairwise distances are needed to cluster the data, any geneset that produced a distance matrix with missing values was discarded. After clustering, samples were segregated into two groups based on the first bifurcation of the hierarchical sample dendrogram, and only cancer samples with available clinical data were used to perform survival analysis.
Correlation of the clustered groups of patients to survival time was assessed using a log-rank test (first hypothesis test). Although many researchers have chosen to use a log-rank test in their survival studies, it has been shown that this test breaks down with small sample sizes, such as when at least one cluster or group of patients consists of very few samples . To avoid extremely uneven clusters we discarded genesets where less than 10% of the samples were in the smaller cluster. For each survival estimate (i.e. breast OS, breast RFS, and lung OS), a p-value p1 was calculated.
Empirical distributions of p-values
A distribution of p-values1 for random genesets was approximated for each type of geneset (i.e. Biocarta, GO, and KEGG) and for each survival estimate (i.e. breast OS, breast RFS, and lung OS) using 100,000 random sets of unrelated genes, and their prognostic power was assessed using the above methods (i.e. hierarchical clustering and a log-rank test). The number of random genesets was chosen according to a stability analysis which showed that 100,000 permutations were enough for the purposes of our study (Additional Files 10, 11, 12). The permutation results were used to generate "Biocarta-like", "GO-like", and "KEGG-like" empirical distributions of p-values1 for both breast and lung cancer datasets. A distinction was made for the three different types of genesets because the "universe" of genes (i.e. the genes that have an annotation) is not the same. For example, the universe of GO genes contains approximately 14,000 genes, whereas the universe of Biocarta genes only has around 3,000 genes. Selecting genes from a wider, or more restricted, universe than the real one may result in biased estimates . Additionally, since an association is present between geneset size and significance (Figure 1, Additional Files 1, 2), the sizes of the randomly generated genesets were matched to those of the original genesets so as to include the same proportion of geneset sizes in each distribution.
Association between the theme of a geneset and survival time was assessed with an empirical test based on the previously modelled distributions (second hypothesis test). An empirical p-value p2 was calculated for each survival estimate for the original 1,082 and 1,426 genesets analysed in breast and lung cancer, respectively, which was defined by:
where is the number of random genesets from the appropriate empirical distribution which yielded p-values less than or equal to the p-value p1 obtained from the log-rank test for the geneset of interest at the first hypothesis test, and is the total number of random genesets. To control for the multiple testing of genesets in each hypothesis test we used a false discovery rate (FDR) method, defined as the proportion of expected false positive findings over the total number of alternative hypotheses accepted at the specified significance level . FDR1 and FDR2 values were calculated for p-values of the first and second hypothesis tests, respectively.
In order to compare our CSR signature results to those of Chang et al., three CSR-like empirical distributions of p-values1 for random genesets (breast cancer OS, breast cancer RFS, and lung cancer OS), were constructed from 100,000 random sets of unrelated genes. For each distribution, the randomly generated genesets were chosen to match the size of the CSR signature. Furthermore, the universe of genes was designed so as to only include genes that could have been selected in the original experiment, based on the procedures described by Chang et al.