Gene expression anti-profiles as a basis for accurate universal cancer signatures
© Corrada Bravo et al.; licensee BioMed Central Ltd. 2012
Received: 13 June 2012
Accepted: 17 October 2012
Published: 22 October 2012
Early screening for cancer is arguably one of the greatest public health advances over the last fifty years. However, many cancer screening tests are invasive (digital rectal exams), expensive (mammograms, imaging) or both (colonoscopies). This has spurred growing interest in developing genomic signatures that can be used for cancer diagnosis and prognosis. However, progress has been slowed by heterogeneity in cancer profiles and the lack of effective computational prediction tools for this type of data.
We developed anti-profiles as a first step towards translating experimental findings suggesting that stochastic across-sample hyper-variability in the expression of specific genes is a stable and general property of cancer into predictive and diagnostic signatures. Using single-chip microarray normalization and quality assessment methods, we developed an anti-profile for colon cancer in tissue biopsy samples. To demonstrate the translational potential of our findings, we applied the signature developed in the tissue samples, without any further retraining or normalization, to screen patients for colon cancer based on genomic measurements from peripheral blood in an independent study (AUC of 0.89). This method achieved higher accuracy than the signature underlying commercially available peripheral blood screening tests for colon cancer (AUC of 0.81). We also confirmed the existence of hyper-variable genes across a range of cancer types and found that a significant proportion of tissue-specific genes are hyper-variable in cancer. Based on these observations, we developed a universal cancer anti-profile that accurately distinguishes cancer from normal regardless of tissue type (ten-fold cross-validation AUC > 0.92).
We have introduced anti-profiles as a new approach for developing cancer genomic signatures that specifically takes advantage of gene expression heterogeneity. We have demonstrated that anti-profiles can be successfully applied to develop peripheral-blood based diagnostics for cancer and used anti-profiles to develop a highly accurate universal cancer signature. By using single-chip normalization and quality assessment methods, no further retraining of signatures developed by the anti-profile approach would be required before their application in clinical settings. Our results suggest that anti-profiles may be used to develop inexpensive and non-invasive universal cancer screening tests.
Early detection through mass screening remains one of the most effective approaches for reducing health care costs [1–4] and mortality [5–10] due to cancer. Despite the benefits, there remain significant barriers to cancer screening including cost [11, 12], lack of insurance [11, 13], and anxiety or embarrassment about invasive procedures [11, 12, 14]. There are also cancer types for which mass-screening tools have not been developed [15, 16]. Reducing the cost and inconvenience of screening may lead to increased early screening and potentially improve patient and health economic outcomes.
Peripheral blood-based genomic signatures are a promising avenue for developing non-invasive cancer biomarkers [17–21]. However, lack of stable markers in cancer gene expression profiles and associated blood samples has made finding robust screening biomarkers difficult. Here we take advantage of a new theoretical model for evolutionary fitness that suggests that a defining characteristic of cancer is increased epigenetic and gene expression variability. Supporting evidence was provided by the observation of increased variability in DNA methylation across five different cancer types. This model implies that a stable characteristic is that certain genes will consistently show higher across-sample variability in cancer as compared to normal samples. We present a statistical technique that leverages this characteristic: it identifies genes that show normal variation in healthy samples but hyper-variability across tumor samples, and uses these genes to predict outcome through what we refer to as an anti-profile. We define an anti-profile score for a specific sample as the number of hyper-variable genes for which expression in that sample falls outside a defined range of normal expression (see Methods for details). We illustrate the technique on a colon cancer dataset, suggest its potential by predicting cancer in a peripheral blood dataset, and explore the possibility of a universal cancer predictor by simultaneously predicting outcome with data from 52 cancer types. All datasets were obtained from public repositories.
We complement our novel statistical approach with new biological insights related to cancer. For the colon cancer anti-profiles we incorporate the finding that consistent decreases in methylation are observed along large (5kb – 10Mb) genomic blocks. Specifically, we only considered genes that lie inside these blocks for the colon cancer anti-profile. For the universal anti-profile we incorporated the finding that genes showing epigenetic hyper-variability in cancer tend to be tissue specific genes [23–25]. We therefore restricted genes in our universal cancer anti-profile to tissue-specific genes.
Gene expression variability and stochasticity have been studied previously in the context of normal populations [26, 27], with recent work exploring the role of genetic variants in altering expression variation and stochasticity. Of particular interest is recent work showing a link between variation in normal populations and HIV susceptibility. Only recently, however, has a direct association between gene expression variability and disease been studied, in neurological disease [23, 30] and cancer. We show that increased variability in specific genes is a characteristic feature in many cancer types that can be used for prediction. The anti-profile method we propose here applies, in the predictive setting, ideas from existing statistical methods developed to identify and model outliers in gene expression due to cancer [31, 32]. Here we expand these ideas and leverage our knowledge of and experience with preprocessing and normalization of high-throughput expression data to describe and demonstrate the effectiveness of the anti-profile method to develop signatures based on technology ready to be used in clinical settings (through quality assessment and normalization) and a general and stable cancer marker (increased gene expression hypervariability of specific genes).
We developed the anti-profile method as a simple and robust approach to define cancer genomic signatures by specifically taking advantage of heterogeneity in cancer. An important first step in our approach is to normalize raw gene expression data, an often-overlooked but key issue in the development of genomic signatures based on microarray data. Standard microarray normalization methods cannot be used when developing clinical diagnostics since they require multiple samples and normalized values depend on which samples are normalized together [33, 34]. This means that signatures can only be translated to the clinic after independent retraining of the signatures is performed with single-sample normalization techniques. For all signatures developed here, we employ a recently developed single-sample normalization technique for microarrays and a single-array quality metric. Since signatures are developed with single-sample normalization, they can be directly used as clinical diagnostics, without further retraining.
To determine the relationship between gene expression hyper-variability and CpG DNA methylation hyper-variability, we examined a publicly available DNA methylation dataset comparing colon cancer with matched normal colon tissue on the Illumina HumanMethylation 27k BeadChip array (see Methods). We found that there is significant overlap between genes with hyper-variable expression in colon cancer and promoter region CpG hyper-variable methylation (Fisher’s exact test OR=2.41, P=0.005, see Methods). We then repeated the experiment on the two colon cancer expression datasets using CpG hyper-variable methylation to select anti-profile genes and observed worse prediction performance (AUC=0.84 and AUC=0.97). Enrichment of hyper-variable CpG DNA methylation in blocks of hypo-methylation for this dataset has been previously reported. Considering the reduced coverage of the 27k array, which is biased towards CpG islands, this prediction result indicates the advantage of using hypo-methylation blocks in cancer as a stable and comprehensive proxy for methylation hyper-variability in the absence of suitable direct measurements.
We collected and manually curated a set of 6,172 cancer and normal microarray samples in biopsies (n=4,950 and n=1,222 respectively) from 59 tumor types and 102 normal tissue types across 176 different studies in the Gene Expression Omnibus (GEO). Additional file 1: Table S1 lists the GEO accession number of experiments included in the dataset after removing samples that did not pass the single-chip quality filtering criteria, along with the tissue or tumor type and clinical characteristics annotated in each experiment. These data represent all the clinical information available about each of these samples in GEO. For each tissue or tumor type the number of biological replicates varied and for seven tissue types (adrenal cortex, colon, endometrium, kidney, skin, stomach and vulva) we had at least 10 samples of each of normal tissue and corresponding tumor type.
Our results suggest that the universally consistent gene expression hyper-variability we report here cannot be fully ascribed to cellular heterogeneity in cancer samples. For a gene to show hyper-variability in cancer due to cellular heterogeneity, it must also be a marker for a number of distinct cell types in a heterogeneous cellular mixture found in a tumor. However, we found that a large number (45%) of universally hyper-variable genes in cancer are not consistently expressed in any of the normal tissues in our dataset (we say a gene is consistently expressed for a tissue if it is expressed in at least 95% of the normal samples for that tissue, see Methods section). This implies that, for almost half of the universally hyper-variable genes in cancer, hyper-variability cannot be the result of a heterogeneous mixture of markers for different cellular subtypes since these genes are usually silenced in normal tissues. Also, while hyper-variable genes are enriched in the set of tissue-specific genes, we found that the majority of tissue-specific genes are not consistently hyper-variable (64%). The vast majority of tissue-specific genes show hyper-variability in a small number of cancer types (Additional file 1: Figure S6) as expected from a histologically heterogeneous sample. This suggests that the lack of regulation of the particular tissue-specific genes that are consistently hyper-variable across cancer types represents a specific and general characteristic of cancer.
We also investigated the relationship between cancer-specific hyper-variability and tissue-specificity in the seven tissues for which we have sufficient samples of both normal and cancer. We found that the vast majority (95-99%) of hyper-variable genes in each of these cancers are not tissue-specific for the corresponding normal tissue (Additional file 1: Table S5). However, hyper-variable genes in each of these cancers are enriched in the set of genes that are specific for the corresponding normal tissue, although the number of genes is small. This small set of genes could indeed include those where hyper-variability in that specific cancer is due to cellular heterogeneity, as normal cells may be included in varying proportions in these tumor samples. We also looked at the relationship between cancer-specific differential expression, determined using Empirical Bayes methods as fold-change greater than 1 and significance less than 10% FDR, and tissue-specificity in the same seven tissues. As with hyper-variability, we found that the vast majority of differentially expressed genes in each of these cancers are not tissue-specific for the corresponding normal tissue. However, in contrast to hyper-variable genes, there is no enrichment of differentially expressed genes in the set of genes that are specific for the corresponding normal tissue.
Considering this finding, we investigated the relationship between cellular-specificity and the colon cancer peripheral blood result reported above. We determined genes that are specific to strictly one of two types of lymphocytes for which we had five or more samples in our dataset (CD4+ and CD31+ T-cells) and found that 12% of the genes used in the peripheral blood colon cancer anti-profile fall under this category. Furthermore, lymphocyte-specific genes are enriched in the set of genes with hyper-variable expression in colon cancer inside colon cancer hypo-methylation blocks (Fisher’s exact test OR 3.0, P=1.2e-11). This suggests that we cannot rule out that varying lymphocyte composition in the peripheral blood samples of colon cancer patients may drive the prediction performance of the peripheral blood anti-profile.
We used pathological tumor stage or grade annotation available for a subset of the samples used in the leave-one-tissue-out cross-validation experiment to determine if heterogeneity across samples in pathological tumor stage or grade may explain the increased gene expression variability observed in anti-profile genes used for prediction. For each of the leave-one-tissue-out experiments reported in Figure 4, we used an F-test to find genes that are differentially expressed across pathological stages or grades (FDR<0.1, Additional file 1: Table S6). We then applied a Fisher exact test to determine if the 100-gene anti-profile signature used in the leave-one-tissue-out experiment overlapped this set of differentially expressed genes. We found very few genes that are differentially expressed across pathological tumor stage or grade for adrenal cortex, stomach and vulva (22, 2 and 4 respectively). For the remaining experiments no substantial overlap was observed (OR<2, P-value>0.05). This suggests that increased gene expression variability in anti-profile genes is not explained by heterogeneity of pathological tumor stage or grade in our samples.
We have introduced and developed gene expression anti-profiles for cancer biomarker discovery. Anti-profiles explicitly model increased gene expression variability in cancer to define robust and reproducible gene expression signatures capable of accurately distinguishing tumor samples from healthy controls. We have developed an anti-profile signature in tissue samples from a colon cancer study and validated our signature in a second independent validation set, collected by a different experimental group. We have also applied this signature directly, without retraining, to classify patients with cancer from normals on the basis of genomic measurements in peripheral blood.
We note that Mammaprint [46, 47], one of the most successful genomic cancer biomarkers, fits our notion of an anti-profile: its score is calculated based on the correlation between the test sample and a good prognosis gene expression profile. The failure of other, more complex genomic methods to outperform Mammaprint may be due to their reliance on defining specific cancer profiles. While both Mammaprint and our anti-profile method classify samples based on deviation from a reference profile, there are two significant differences in the way Mammaprint and the anti-profile method achieve this: 1) Mammaprint uses tumor samples with good prognosis to determine the reference profile. Since these are tumor samples many of the genes used in the profile may exhibit high variability across the good prognosis group. Defining a stable and robust reference profile is essential to the success of this type of method. 2) Mammaprint uses correlation to measure how samples deviate from the reference profile. Our anti-profile method instead uses a robust measure where deviation is based on the number of the genes for which expression falls outside normal ranges of expression, which are themselves estimated using robust methods. It may be possible to improve on the accuracy of the Mammaprint test by adopting a more robust anti-profile based on the methods presented in this paper.
In this case we can use the anti-profile score, that is, the number of genes in the anti-profile where expression deviates from a normal range of expression obtained from normal breast tissue samples, to determine prognosis. Since this score is based on stable expression in normal tissues, it may be more robust than calculating correlation to a mean signature for tumors with good prognosis that would show high variability. This will require that more samples of both normal breast tissue and tumor are available on platforms for which robust, single-chip normalization methods exist.
In addition to developing a peripheral blood signature for colon cancer, we have confirmed the existence of hyper-variable genes across 59 distinct cancer types. We also provide evidence of the close relationship between hyper-variability across cancer types and tissue-specific gene expression. Consistent with these observations on tissue-specificity, gene ontology category enrichment analysis found that categories involving development, organ morphogenesis and differentiation are enriched with hyper-variable genes and the remaining gene categories enriched with hyper-variable genes involved cellular interaction with extracellular matrix, e.g., adhesion, localization and collagen catabolic processing or in cell locomotion and cellular component movement. These results argue strongly against the observed hyper-variability being a consequence of sample heterogeneity in the cancer samples.
Incorporating this general result on tissue-specificity and hyper-variability we developed anti-profiles able to classify tissue samples across multiple tissue and cancer types, even when a specific cancer/tissue type is not included in the original training set. Our cross-validation results suggest that consistent hyper-variability of a small set of tissue-specific genes is a stable mark of cancer across tissue types. Our results also suggest the potential for developing peripheral blood signatures for cancer diagnostics on the basis of anti-profiles.
In the course of achieving these results we have used recently developed statistical preprocessing methods to remove potential artifacts in a way that is applicable to single clinical samples. This is an unusual approach, as genomic signatures are typically derived after applying population-level pre-processing such as RMA or artifact removal such as surrogate variable analysis. That we achieve such high accuracy in public data, which is known to be subject to a broad range of technical and biological artifacts, speaks to the strength of our methods.
We downloaded CEL files for 6,172 Affymetrix HGU133plus2 microarrays from 176 studies in the Gene Expression Omnibus (GEO). CEL files were preprocessed with the frma single-chip procedure. Expression measurements were standardized using Gene Expression Barcode z-scores. We removed arrays that were deposited multiple times into the repository (Euclidean distance between arrays less than 1). We used the GNUSE metric to assess array quality and removed all arrays from studies with median GNUSE greater than 1.25 and removed individual arrays with GNUSE greater than 1.2. We did further hand curation to retain only normal tissue and cancer samples (n=688 and n=4,138 respectively). Additional file 1: Table S1 contains the complete list of studies and samples used in the reported analyses including the type of clinical annotation available for each sample. The curated and preprocessed data are available for download at http://cbcb.umd.edu/~hcorrada/antiProfiles.
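The two-level GNUSE filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `filter_arrays`, the input layout (a dictionary of per-study GNUSE values) and the example numbers are all assumptions made for the sketch; only the two cutoffs (1.25 for the study median, 1.2 for individual arrays) come from the text.

```python
from statistics import median

def filter_arrays(gnuse_by_study, study_cutoff=1.25, array_cutoff=1.2):
    """Return {study: [array ids]} for arrays passing both quality filters.

    gnuse_by_study maps a study id to {array id: GNUSE value}
    (hypothetical layout chosen for this sketch).
    """
    kept = {}
    for study, arrays in gnuse_by_study.items():
        # Drop the whole study when its median GNUSE exceeds the study cutoff.
        if median(arrays.values()) > study_cutoff:
            continue
        # Otherwise drop only the individual arrays above the array cutoff.
        passing = [a for a, g in arrays.items() if g <= array_cutoff]
        if passing:
            kept[study] = passing
    return kept

# Made-up GNUSE values for two hypothetical studies.
example = {
    "study_A": {"a1": 1.00, "a2": 1.10, "a3": 1.30},
    "study_B": {"b1": 1.40, "b2": 1.50, "b3": 1.10},
}
kept = filter_arrays(example)
```

In this toy input, study_B is removed as a whole because its median GNUSE is above 1.25, even though one of its arrays would pass the per-array cutoff.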
We used the HGU133plus2 probeset annotation from Ensembl (version 15, gene dataset version: GRCh37.p5) to map probesets to genes and obtain each gene’s transcription start site. In the colon cancer anti-profile, we only consider probesets for genes with transcription start sites inside blocks of DNA methylation change (genomic coordinates available at http://www.nature.com/ng/journal/v43/n8/extref/ng.865-S2.xls). We use the log ratio of standard deviations across samples as a statistic to select probesets for the anti-profile: $r_g = \log_2(s_{gc}/s_{gn})$, where $s_{gc}$ is the across-sample standard deviation of expression for probeset $g$ among the colon tumor samples, and $s_{gn}$ is the across-sample standard deviation of expression for probeset $g$ among the normal samples. The anti-profile includes probesets with $r_g > 1$ (standard deviation in cancer is more than twice that of normal).
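As a rough illustration of the selection statistic, the sketch below computes the log2 ratio of across-sample standard deviations per probeset and keeps those above 1. The function names, the dictionary layout and the expression values are hypothetical; the sketch also assumes the sample standard deviation (n−1 denominator) and that no probeset has zero variance in normals.

```python
import math
from statistics import stdev

def log_sd_ratio(cancer_values, normal_values):
    """log2 of (SD across tumor samples / SD across normal samples)."""
    # Assumes the normal SD is non-zero; a real pipeline would need a guard.
    return math.log2(stdev(cancer_values) / stdev(normal_values))

def select_probesets(cancer, normal, threshold=1.0):
    """cancer / normal: {probeset: [expression across samples]}.

    Returns {probeset: ratio} for probesets whose ratio exceeds the threshold.
    """
    selected = {}
    for probeset in cancer:
        r = log_sd_ratio(cancer[probeset], normal[probeset])
        if r > threshold:
            selected[probeset] = r
    return selected

# Made-up expression values: p1 is far more variable in tumors, p2 is not.
normal = {"p1": [1.0, 1.1, 0.9, 1.0], "p2": [1.0, 2.0, 3.0, 4.0]}
cancer = {"p1": [0.0, 2.0, 1.0, 3.0], "p2": [1.0, 2.0, 3.0, 5.0]}
selected = select_probesets(cancer, normal)
```

Because the statistic is on the log2 scale, a threshold of 1 corresponds to a tumor standard deviation at least double the normal one.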
Normal regions of expression are defined for each probeset as median expression ± 5 median absolute deviations of expression in the normal samples. We found that our results are quite insensitive to the choice of median absolute deviation multiplier (Additional file 1: Figure S8). The anti-profile score for a specific sample is then the number of probesets outside their respective range of normal expression. A cutoff score can be used to turn the anti-profile score into a classification: scores greater than the cutoff are classified as cancer, scores lower than the cutoff are classified as normal. A specific cutoff can be determined according to a prescribed objective: e.g. maximize accuracy, or maximize specificity at a given sensitivity in a held-aside test set. We used area under the ROC curve to measure anti-profile accuracy and the DeLong method as implemented in the pROC package to test for differences in AUC.
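The scoring and classification steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions: it uses the raw (unscaled) median absolute deviation, treats a score equal to the cutoff as normal, and computes AUC with the rank-based (Mann-Whitney) formula rather than the pROC package. All function names and the toy data are hypothetical.

```python
from statistics import median

def normal_range(normal_values, k=5.0):
    """Median +/- k raw MADs of expression in normal samples."""
    m = median(normal_values)
    mad = median([abs(v - m) for v in normal_values])
    return (m - k * mad, m + k * mad)

def anti_profile_score(sample, ranges):
    """Number of probesets whose expression falls outside the normal range."""
    return sum(1 for p, (lo, hi) in ranges.items()
               if not (lo <= sample[p] <= hi))

def classify(score, cutoff):
    # Assumption: a score exactly equal to the cutoff is called normal.
    return "cancer" if score > cutoff else "normal"

def auc(cancer_scores, normal_scores):
    """Probability that a cancer score exceeds a normal score (ties count 1/2)."""
    wins = 0.0
    for c in cancer_scores:
        for n in normal_scores:
            wins += 1.0 if c > n else 0.5 if c == n else 0.0
    return wins / (len(cancer_scores) * len(normal_scores))

# Made-up normal expression for two probesets.
normals = {"g1": [5.0, 5.1, 4.9, 5.0, 5.2], "g2": [2.0, 2.2, 1.8, 2.1, 1.9]}
ranges = {p: normal_range(v) for p, v in normals.items()}
tumor_like = {"g1": 7.0, "g2": 0.5}
normal_like = {"g1": 5.05, "g2": 2.1}
tumor_score = anti_profile_score(tumor_like, ranges)
normal_score = anti_profile_score(normal_like, ranges)
```

Counting out-of-range probesets, rather than averaging deviations, keeps the score insensitive to how far any single gene strays, which matches the robust intent described in the text.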
We downloaded a publicly available dataset of methylation levels of 22 matched colon normal/tumor samples assayed using Illumina’s HumanMethylation 27k array (GEO accession number GSE17648). Methylation measurements were used with no further preprocessing. Differences in methylation variability were determined using an F-test and significance determined at 1% false discovery rate. For each probeset in our expression data we found the CpG inside its promoter region (defined as 1000bp upstream and 250bp downstream) nearest to the transcription start site. We determined significant expression hyper-variability using an F-test at 1% false discovery rate to determine overlap between expression hyper-variability and DNA methylation hyper-variability.
We obtained peripheral blood Affymetrix HGU133plus2 samples from colon cancer patients and healthy controls from two studies (the first from the study authors, and the second from GEO with accession number GSE10715). Arrays were preprocessed with fRMA and normalized using the gene expression barcode. Arrays with GNUSE values >1.2 were removed, which left 15 colon cancer samples and 15 normal samples from the first study. Median GNUSE for the second study was 1.46, so that study was not included in the analysis (all but three cancer samples had GNUSE >1.2 in this study).
We defined the anti-profile from colon tissue by combining samples from the two colon cancer biopsy datasets used in the Gene Expression Antiprofiles Results section [38, 40, 52]. Probesets were included in the anti-profile and regions of normal expression defined as described above. No retraining was done to test on the blood dataset. The list of genes and corresponding median and median absolute deviation of expression are given in Additional file 2: Table S3.
To assess the sensitivity to signature size of the accuracy of the peripheral blood signature, we tested signatures of increasing size with genes included in order of decreasing hyper-variability across colon tumor samples (Additional file 1: Figure S1). While the signature reported in the manuscript obtained an AUC of 0.89, similar AUCs are obtained with signatures with about 500–2000 genes inside blocks indicating that the prediction result reported in the manuscript is not very sensitive to the specific signature size chosen. To ascertain significance of the prediction results obtained we performed a randomization test: for each signature size, we generated 1000 signatures with randomly selected subsets of genes of the appropriate size to build each anti-profile. Ranges of normal expression do not change since these are defined from the colon tissue dataset. We used the proportion of random signatures obtaining an AUC greater than or equal to the anti-profile of the corresponding size as a measure of uncertainty. Results that showed significantly high AUC were signatures that include about 500–2000 of the top hyper-variable genes inside methylation blocks.
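The randomization test can be sketched as below. This is a minimal illustration under assumptions: the pool from which random signatures are drawn (`gene_pool`) is not specified in the text, so here it is simply a list supplied by the caller; the function names, the fixed random seed and the toy data are hypothetical. Normal ranges are held fixed across random signatures, as described.

```python
import random

def auc(cancer_scores, normal_scores):
    """Rank-based AUC: P(cancer score > normal score), ties count 1/2."""
    wins = 0.0
    for c in cancer_scores:
        for n in normal_scores:
            wins += 1.0 if c > n else 0.5 if c == n else 0.0
    return wins / (len(cancer_scores) * len(normal_scores))

def score(sample, ranges, genes):
    """Anti-profile score restricted to the genes in one signature."""
    return sum(1 for g in genes
               if not (ranges[g][0] <= sample[g] <= ranges[g][1]))

def randomization_test(signature, gene_pool, ranges, cancer, normal,
                       n_random=1000, seed=0):
    """Proportion of same-size random signatures with AUC >= the observed AUC."""
    rng = random.Random(seed)

    def signature_auc(genes):
        return auc([score(s, ranges, genes) for s in cancer],
                   [score(s, ranges, genes) for s in normal])

    observed = signature_auc(signature)
    hits = 0
    for _ in range(n_random):
        random_signature = rng.sample(gene_pool, len(signature))
        if signature_auc(random_signature) >= observed:
            hits += 1
    return observed, hits / n_random

# Toy data: only g0 is out of range in cancer samples.
gene_pool = [f"g{i}" for i in range(10)]
ranges = {g: (-1.0, 1.0) for g in gene_pool}
cancer = [{g: (3.0 if g == "g0" else 0.0) for g in gene_pool} for _ in range(5)]
normal = [{g: 0.0 for g in gene_pool} for _ in range(5)]
observed_auc, p_value = randomization_test(["g0"], gene_pool, ranges,
                                           cancer, normal)
```

In the toy data a random one-gene signature matches the observed AUC only when it happens to pick g0, so the returned proportion is small but non-zero.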
To determine probesets that exhibit hypervariable expression in cancer we compute a variance ratio statistic across multiple tissues. We restrict this computation to tissues and cancer types with more than 10 samples in our dataset (list given in Figure 3). We compute the standard deviation of expression for probeset $g$ separately for each normal tissue $t$ ($s_{gt}$) and each cancer type $c$ ($s_{gc}$). We define the variance ratio statistic $u_g$ (Additional file 1: Figure S2) as $u_g = \log_2(\mathrm{mean}_c\, s_{gc} / \mathrm{mean}_t\, s_{gt})$.
To define the universal normal range of expression we use a similar method: we compute median expression for each gene $g$ on each tissue $t$ separately ($m_{gt}$) along with the median absolute deviation ($\mathrm{mad}_{gt}$). The universal range is then defined as $m_g \pm 5 \cdot \mathrm{mad}_g$, where $m_g = \mathrm{median}_t(m_{gt})$ and $\mathrm{mad}_g = \mathrm{median}_t(\mathrm{mad}_{gt})$. The list of hyper-variable genes ($u_g > 1$) and associated median expression and median absolute deviation of expression are provided in Additional file 3: Table S4.
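The universal statistic and range for a single probeset can be sketched as follows. This is an illustration only: the function names, the per-tissue dictionary layout and the example values are hypothetical, and the sketch assumes the sample standard deviation and the raw (unscaled) median absolute deviation.

```python
import math
from statistics import mean, median, stdev

def universal_ratio(cancer_by_type, normal_by_tissue):
    """log2(mean over cancer types of SD / mean over normal tissues of SD)."""
    s_c = mean([stdev(v) for v in cancer_by_type.values()])
    s_t = mean([stdev(v) for v in normal_by_tissue.values()])
    return math.log2(s_c / s_t)

def mad(values):
    """Raw median absolute deviation (no scaling constant applied)."""
    m = median(values)
    return median([abs(v - m) for v in values])

def universal_range(normal_by_tissue, k=5.0):
    """Median of per-tissue medians +/- k times the median of per-tissue MADs."""
    m_g = median([median(v) for v in normal_by_tissue.values()])
    mad_g = median([mad(v) for v in normal_by_tissue.values()])
    return (m_g - k * mad_g, m_g + k * mad_g)

# Made-up expression values for one probeset.
normal_by_tissue = {
    "colon": [4.0, 5.0, 6.0],
    "kidney": [5.0, 6.0, 7.0],
    "skin": [6.0, 7.0, 8.0],
}
cancer_by_type = {
    "colon_tumor": [1.0, 5.0, 9.0],
    "kidney_tumor": [2.0, 6.0, 10.0],
}
u = universal_ratio(cancer_by_type, normal_by_tissue)
lo, hi = universal_range(normal_by_tissue)
```

Averaging the standard deviations per tissue before taking the ratio means that tissues with many samples do not dominate the statistic.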
To define tissue-specific genes, we tabulated the number of samples in which a gene is expressed (defined as gene expression barcode z-score greater than 2.54) for each tissue in our dataset with more than 10 normal samples. Tissue-specific genes were defined as those in which the gene is expressed in more than 95% of the samples of at most three tissues. Fisher’s exact test was used to determine enrichment of hyper-variable genes in the set of tissue-specific genes (Additional file 1: Figure S5).
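The tissue-specificity rule can be sketched as below. This is a hedged illustration: the function names and input layout are hypothetical, and the sketch assumes that a tissue-specific gene must be consistently expressed in at least one tissue (the text states only the upper bound of three).

```python
def consistently_expressed_tissues(z_by_tissue, z_cutoff=2.54, frac=0.95):
    """Tissues in which the gene is expressed in more than `frac` of samples.

    z_by_tissue maps a tissue name to the gene's barcode z-scores across
    that tissue's normal samples (hypothetical layout for this sketch).
    """
    tissues = []
    for tissue, zs in z_by_tissue.items():
        expressed = sum(1 for z in zs if z > z_cutoff)
        if expressed > frac * len(zs):
            tissues.append(tissue)
    return tissues

def is_tissue_specific(z_by_tissue, max_tissues=3):
    # Assumption: at least one tissue is required, at most max_tissues allowed.
    n = len(consistently_expressed_tissues(z_by_tissue))
    return 1 <= n <= max_tissues

# Made-up z-scores, 20 normal samples per tissue.
on, off = [3.0] * 20, [0.0] * 20
gene_a = {"liver": on, "colon": off, "skin": off, "kidney": off}
gene_b = {"liver": on, "colon": on, "skin": on, "kidney": on}
gene_c = {"liver": [3.0] * 18 + [0.0] * 2, "colon": off, "skin": off, "kidney": off}
```

Here gene_a counts as tissue-specific, gene_b is expressed in too many tissues, and gene_c falls short of the 95% requirement in its only candidate tissue.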
Gene ontology (GO) enrichment analysis was done using a hyper-geometric test for association between hyper-variable genes (defined as $u_g > 1$) and GO terms. We used the implementation in the Bioconductor GOstats package. We used the q-value method to control for multiple hypothesis testing and report enriched categories with Q<0.05 in Additional file 1: Table S2.
We performed two types of cross-validation experiments to quantify the accuracy of universal cancer anti-profiles. The first was ten-fold cross-validation: data were randomly split into 10 equal-sized subsets, retaining the proportion of normal and cancer samples from the full dataset in each subset. Each of the 10 subsets (or folds) was used sequentially as a test set, scored using an anti-profile trained on the remaining 90% of the data. Training includes all steps: 1) filtering to include only tissue-specific probesets, 2) computing the universal variance ratio $u_g$, 3) selecting the top 100 genes based on the ratio statistic, and 4) computing the universal normal range of expression.
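A stratified ten-fold split of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name, the round-robin assignment and the fixed seed are assumptions made for the sketch, and the training and scoring steps are only indicated in comments.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

# Toy labels: 80 cancer and 20 normal samples.
labels = ["cancer"] * 80 + ["normal"] * 20
folds = stratified_folds(labels)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Every training step (probeset filtering, variance ratio, top-100
    # selection, normal ranges) would be rerun here on train_idx only,
    # and the resulting anti-profile used to score test_idx.
```

Rerunning gene selection inside each fold, rather than once on the full data, is what keeps the held-out fold from influencing the signature it is scored with.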
The other type of cross-validation experiment was carried out on the 7 tissues for which we had at least 10 samples each of normal tissue and tumor. For each tissue type, we performed a leave-one-tissue-out experiment by using all samples (normal and corresponding tumor type) as test set and scored them using an anti-profile trained on the remaining data. This ensures that no samples from the corresponding tissue (normal or cancer) are included in the training set. Again, all steps required to train the anti-profile were done completely for each leave-one-tissue-out fold.
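The leave-one-tissue-out split can be sketched in the same spirit. The record layout (dictionaries with `tissue` and `status` keys, where a tumor sample carries the name of its tissue of origin) and the function name are hypothetical choices for this illustration.

```python
def leave_one_tissue_out(samples, tissues):
    """Yield (held-out tissue, training samples, test samples) per tissue.

    Each sample is a dict with a 'tissue' key naming the tissue of origin
    for both normal and tumor samples (hypothetical layout).
    """
    for held_out in tissues:
        test = [s for s in samples if s["tissue"] == held_out]
        train = [s for s in samples if s["tissue"] != held_out]
        yield held_out, train, test

# Toy sample annotations.
samples = [
    {"id": 1, "tissue": "colon", "status": "normal"},
    {"id": 2, "tissue": "colon", "status": "cancer"},
    {"id": 3, "tissue": "kidney", "status": "normal"},
    {"id": 4, "tissue": "kidney", "status": "cancer"},
    {"id": 5, "tissue": "skin", "status": "cancer"},
]
splits = list(leave_one_tissue_out(samples, ["colon", "kidney"]))
```

Because both the normal and tumor samples of the held-out tissue go to the test set, the trained anti-profile has seen neither the tissue's normal expression nor its cancer.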
To classify a new sample we count the number of anti-profile genes whose expression falls outside the normal range (Figure 2A). A large number of genes with expression outside the normal range, corresponding to a high anti-profile score, is indicative of cancer. To develop a predictor for new samples, a cutoff must be defined on the number of genes outside the normal range. If the anti-profile score is less than the cutoff, the sample is classified as normal; if it is greater than the cutoff, the sample is classified as cancer.
HCB, JTL and RAI conceived, designed and performed experiments, analyzed data and drafted the manuscript; VP performed experiments and analyzed data; MM contributed reagents. All authors read and approved the final manuscript.
This work was partially funded by the National Institutes of Health R01 grant GM083084.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.