GCOD - GeneChip Oncology Database
- Fenglong Liu†1,
- Joseph A White†2,
- Corina Antonescu2,
- Daniel Gusenleitner2 and
- John Quackenbush2
© Liu et al; licensee BioMed Central Ltd. 2011
Received: 13 September 2010
Accepted: 3 February 2011
Published: 3 February 2011
DNA microarrays have become a nearly ubiquitous tool for the study of human disease, and nowhere is this more true than in cancer. With hundreds of studies and thousands of expression profiles representing the majority of human cancers completed and in public databases, the challenge has been effectively accessing and using this wealth of data.
To address this issue we have collected published human cancer gene expression datasets generated on the Affymetrix GeneChip platform, and carefully annotated those studies with a focus on providing accurate sample annotation. To facilitate comparison between datasets, we implemented a consistent data normalization and transformation protocol and then applied stringent quality control procedures to flag low-quality assays.
The resulting resource, the GeneChip Oncology Database, is available through a publicly accessible website that provides several query options and analytical tools through an intuitive interface.
Although gene expression microarrays have been widely used to study human disease, by far the most extensive application has been to the analysis of human cancers. Despite the large number of array experiments deposited in public databases such as GEO and ArrayExpress, our ability to perform meta-analyses of these data to discover cross-cutting patterns has been hampered by both the heterogeneous nature of the data and the lack of consistent annotation of the experimental samples. Although there have been some attempts to organize these data in resources such as Oncomine and Genevestigator, both focus on analyses of subsets of the data and neither fully addresses the problem of integration across studies.
To overcome these limitations, we developed GCOD, the GeneChip Oncology Database, a freely-available web-accessible resource focused on gene expression profiles in cancer collected on the Affymetrix GeneChip platform. Relative to other resources, GCOD has three distinguishing features that we believe greatly enhance its overall utility. First, since GCOD focuses on expression data derived from a single platform and on studies where raw data are available, all datasets in GCOD are uniformly processed and properly scaled such that levels of gene expression in multiple samples across studies are comparable. Second, quality control protocols have been implemented in GCOD so that samples from hybridizations of questionable quality are identified and removed, improving the reliability of any subsequent data analysis. Third, and most importantly, sample annotations are manually curated based on descriptions in the paper and provided in a tabular format that is compatible with most microarray data analysis packages.
GCOD has a number of advantages over other databases. First, the data have been reprocessed to provide normalized and scaled values that can be compared across studies. This is not possible at GEO as it is simply an access portal that has no online analysis tools. Although ArrayExpress has several basic analysis tools, the data are not consistently normalized, making global analyses and their interpretation difficult. Oncomine provides access to a variety of data types, but access is limited for non-paying users so that certain data are not available. Genevestigator is freely accessible to academic users, but limits result sets and access to analysis tools for those who have not paid. Genevestigator also includes only data from GEO and not those in ArrayExpress. In contrast, GCOD contains a more comprehensive collection of cancer data, available without restriction, and includes a set of basic analysis tools.
Construction and Content
Raw data (CEL) files from experiments run on the Affymetrix GeneChip platform are identified based on the keyword "cancer" and downloaded from public databases. These were first processed using the MAS5.0 algorithm (mas5) implemented in the Bioconductor package 'affy' to get detection calls and the 3' to 5' signal ratios for the GAPDH and β-ACTIN probesets. All the CEL files were then normalized using RMA [5, 6] (rma in the affy package) for each experimental group (study). After RMA normalization, expression values are scaled such that the mean of each experiment is set equal to a common value.
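The scaling step described above can be illustrated with a minimal Python sketch. It is not the GCOD implementation (the text states that processing was done with the Bioconductor affy package); it assumes that "experiment" means a single hybridization, that scaling is multiplicative, and that the common value is 8.0. The function name, sample names and values are hypothetical.

```python
from statistics import mean

def scale_to_common_mean(arrays, target=8.0):
    """Rescale each array so that its mean expression equals `target`.

    `arrays` maps a sample name to a list of normalized expression values.
    The target value and the multiplicative form are assumptions made for
    illustration only.
    """
    scaled = {}
    for name, values in arrays.items():
        factor = target / mean(values)  # per-array scaling factor
        scaled[name] = [v * factor for v in values]
    return scaled

# Hypothetical normalized values for two hybridizations.
arrays = {
    "sample_A": [6.0, 8.0, 10.0],
    "sample_B": [3.0, 4.0, 5.0],
}
scaled = scale_to_common_mean(arrays)
```

After scaling, every array in this sketch has the same mean, which is the property that makes expression levels from different studies directly comparable.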
For each GeneChip platform, probeset definitions and other annotation are obtained from CDF (chip description file) files supplied by Affymetrix.
Table: Number of Arrays in GCOD Grouped by Cancer Type (columns: cancer type, Number of Arrays; rows include cancer cell lines, germ cell cancer, head and neck cancer)
After curation and normalization, data are loaded into an ETL (extract, transform and load) database via a series of Perl scripts designed specifically for the formats produced during curation. The ETL database is used for loading and cleaning the data prior to transfer to a QA/QC database; the schema for the ETL database is normalized to 3rd normal form. Oracle sqlldr is used to bulk-load gene expression data from the flat-files written by the Perl scripts into target database tables. These data are marked with a 'data set' identifier (dba_id), so that the results of an analysis can be rapidly accessed. Once the data for a study are completely loaded and checked, they are transferred from the ETL database to our QA/QC database by SQL insert statements issued on an Oracle database link. The QA/QC database is the data source for our internal web site, which we use to evaluate the presentation and completeness of the data in the web pages generated by our Apache server. After inspection, the data are transferred to the production GCOD database by SQL insert statements issued on an Oracle database link.
The main fact table, dbadata, is partitioned into 3 separate tablespaces based on the analysis algorithm used to generate the data (either MAS5, RMA or RMA + scaling) and indexed on those foreign keys most commonly used in our queries. This produces exceptionally fast retrieval times. The dimension tables are not partitioned. The GCOD web site relies on information provided by the Gene Index (TGI) Resourcerer database which also resides in the production database server. The TGI database supplies GenBank accessions, PubMed identifiers and pre-computed mappings between various Affymetrix chips.
Web site implementation
The GCOD website is implemented on an Apache web server through a series of Perl/CGI/DBI scripts. These scripts use the CGI interface to present web pages, and the DBI interface to access and query the production GCOD database. The Perl/CGI/DBI scripts also access Resourcerer to obtain pre-computed mappings between GenBank, RefSeq, probeset and array-type identifiers. This allows us to map probesets on one Affymetrix chip to probesets on another Affymetrix chip (whether they are identical or not). Analyses of GCOD data presented on the website are performed in R through direct system calls.
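The idea of mapping probesets between chips through shared sequence identifiers can be sketched as follows. This is only an illustration in Python, not the Resourcerer method or the site's Perl code; the function name, probeset names and accession strings are hypothetical placeholders, and the real pre-computed mappings may use different matching rules.

```python
from collections import defaultdict

def map_probesets(source_annot, target_annot):
    """Map each source probeset to target probesets sharing an accession.

    Both arguments map a probeset name to a set of sequence accessions.
    """
    # Index the target chip by accession for fast lookup.
    by_accession = defaultdict(set)
    for probeset, accessions in target_annot.items():
        for acc in accessions:
            by_accession[acc].add(probeset)

    mapping = {}
    for probeset, accessions in source_annot.items():
        hits = set()
        for acc in accessions:
            hits |= by_accession.get(acc, set())
        mapping[probeset] = sorted(hits)  # empty list when nothing matches
    return mapping

# Hypothetical annotation for two chips.
chip_a = {"a_001_at": {"ACC1"}, "a_002_at": {"ACC2", "ACC3"}, "a_003_at": {"ACC9"}}
chip_b = {"b_101_at": {"ACC1"}, "b_102_s_at": {"ACC3"}, "b_103_at": {"ACC2"}}
mapping = map_probesets(chip_a, chip_b)
```

A probeset can map to several probesets on the other chip, or to none, so the result is a list per probeset rather than a single name.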
Utility and Discussion
The study-centered views allow users to browse the list of published studies and to search for datasets of interest. For each individual study the title of the publication, a summary of samples and experimental factors involved, and the total number of hybridizations on the specified array type are displayed. Listed next to each study name are three separate options: access QC information, compare experimental groups using a t-test, and download the dataset. The QC information (Figure 3B) includes two scatter plots showing the quality control information derived from the MAS 5.0 algorithm; one is the percentage of 'Present' calls vs. scaling factor (target = 500) and the other is the 3' to 5' signal ratios of the control transcripts GAPDH versus β-ACTIN; both plots identify questionable hybridizations as outliers in the graphs.
To provide basic utilities for comparison between phenotypic groups defined by the curated sample annotation, we implemented Student's t-test with data-trimming filters, p-value thresholds for significance, and Bonferroni correction for multiple testing. Results are presented in a table that lists significant genes together with group means, standard deviations, differences between the means, degrees of freedom and raw (and corrected) p-values. The results are sorted by p-value and each probeset name is linked to expression values for all the samples in our database. Users can browse the results and download them as a text file.
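The two statistical pieces named above, the pooled-variance Student's t statistic and the Bonferroni correction, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code run by the site, which performs its analysis in R. The function names and example values are hypothetical, and converting the t statistic to a p-value is omitted because it needs a t-distribution function from a statistics library.

```python
from math import sqrt
from statistics import mean, variance

def student_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance t-test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df

def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical expression values for one probeset in two groups.
t_stat, dof = student_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])

# Hypothetical raw p-values for three probesets.
corrected = bonferroni([0.01, 0.04, 0.5])
```

Bonferroni correction is conservative: with tens of thousands of probesets per array, only very small raw p-values stay below a typical significance threshold after correction.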
The download function offers users options to choose either MAS 5.0 or RMA normalized data, whether or not to exclude the data points from questionable hybridizations flagged by the QC filter, and whether the data should be trimmed by removal of data (rows) with less than the specified percentage of MAS 5.0 'Present' calls across samples (columns). Sample annotations classified by experimental factors will be listed on the header section of the downloaded data table with an option to arrange the columns by any user-chosen experimental factor. In addition, RMA normalized expression data, in the format of the BioConductor eSet data object, can be downloaded with the "R Data Download" link.
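The 'Present'-call trimming option can be illustrated with a short Python sketch. It assumes detection calls are encoded with the MAS 5.0 letters P (Present), M (Marginal) and A (Absent), and that a row is kept when its percentage of P calls is at least the chosen threshold; the function name and data are hypothetical.

```python
def trim_by_present_calls(calls, min_present_pct):
    """Return probesets whose share of 'P' calls meets the threshold.

    `calls` maps a probeset name to its detection calls across samples.
    Rows with less than `min_present_pct` percent Present calls are dropped.
    """
    kept = []
    for probeset, row in calls.items():
        present_pct = 100.0 * sum(1 for call in row if call == "P") / len(row)
        if present_pct >= min_present_pct:
            kept.append(probeset)
    return kept

# Hypothetical detection calls for three probesets across four samples.
calls = {
    "probe_1": ["P", "P", "A", "P"],  # 75% Present
    "probe_2": ["A", "A", "M", "P"],  # 25% Present
    "probe_3": ["A", "A", "A", "A"],  # 0% Present
}
kept = trim_by_present_calls(calls, min_present_pct=50)
```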
Example Gene Identifier Entries for Gene-Centric Searches
33218_at, 1802_s_at, 216836_s_at, 210930_s_at
Locus Link/Entrez ID
ESRRB ERBB2 HER-2
Hs.446352, ERBB2, HER2, 33218_at
tumor, estrogen receptor
As an example, we examined VEGF expression across all of the studies represented in GCOD. Figure 4 shows the expression of all four probesets on the HG-U133A array that are annotated as VEGF probes. For most cancers, there is little difference in expression for VEGF; however, the renal cancer studies clearly show differential expression. One can also look at individual probesets. Figure 5 shows the individual mean normalized expression values of probeset 212717_x_at (VEGF) as box plots for each sample grouping from the Copland kidney cancer study. The normalized data values used in generating the box plots can be downloaded in each case.
The expression of VEGF appears decreased compared to normal kidney samples in the Corbin et al. study of Wilms' tumor. We believe these observations to be accurate because published experimental work using RT-PCR describes decreased VEGF expression in Wilms' tumor. Compared to clear-cell renal cell carcinoma (ccRCC) (Lenburg et al.), which is a highly vascularized tumor, Wilms' tumor is an early childhood nephroblastoma that is non-invasive; elevated VEGF would be expected in ccRCC, but not necessarily in Wilms' tumor. The published data appear to support our expression-based observation that VEGF expression is elevated in ccRCC.
Results from t-test contrasting normal versus tumor samples from the Lenburg renal cancer study
calbindin 1, 28kDa
solute carrier family 13 (sodium-dependent dicarboxylate transporter), member 3
solute carrier family 22 (organic anion transporter), member 8
pipecolic acid oxidase
uromodulin (uromucoid, Tamm-Horsfall glycoprotein)
calbindin 1, 28kDa
serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 5
deiodinase, iodothyronine, type I
aldehyde dehydrogenase 6 family, member A1
X-prolyl aminopeptidase (aminopeptidase P) 2, membrane-bound
solute carrier family 7 (cationic amino acid transporter, y+ system), member 8
solute carrier family 7 (cationic amino acid transporter, y+ system), member 8
vitamin D (1,25- dihydroxyvitamin D3) receptor
cyclin-dependent kinase (CDC2-like) 10
hypothetical protein FLJ20920
vascular endothelial growth factor
Calbindin-D28k, CALB1, is a vitamin D dependent, calcium binding protein that is expressed in several tissues including the kidney, pancreas and brain. CALB1 acts to buffer calcium concentration in the blood and tissues, and may have regulatory properties similar to other calcium binding proteins, e.g. calmodulin and troponin-C. HPPD is part of the tyrosine catabolic pathway; it converts 4-hydroxyphenylpyruvate to homogentisate, which is subsequently catabolized to acetoacetate and fumarate. HPPD is expressed in the liver and kidneys, as well as cerebral cortex, cerebellum and hippocampus. Mutations in HPPD result in type III tyrosinemia, a hereditary condition in which mild mental retardation and seizures occur due to the accumulation of tyrosine and phenylalanine in the blood. Oddly enough, inactivation or deletion of HPPD alleviates the effects of type I tyrosinemia, which is caused by deficiency of fumarylacetoacetase (the last enzyme in the tyrosine catabolism pathway) and the resulting accumulation of fumarylacetoacetate, succinylacetoacetate and derivatives [15, 16].
Table: Number of Arrays in GCOD Grouped by Array Platform (columns: array platform, Number of Arrays)
QC filters identify potentially poor hybridizations (Figures 3C and 3D) based on criteria that include: a) scaling factor values greater than 100, b) actin_ratio greater than 10 and gapdh_ratio greater than 10, and c) present call (detection) percentage less than 10. Hybridizations that match any of these criteria are flagged for exclusion, but are only excluded if the user selects the option to do so. Hybridizations matching criteria a), b) and c) represent 0.124%, 3.70% and 1.94% of the data, respectively; in total, 5.63% of the hybridizations match one or more of these criteria.
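The three filters can be expressed as a small Python sketch. It reads criterion b) literally, so both ratios must exceed 10 for the flag to be raised; that reading is an assumption. The names actin_ratio and gapdh_ratio come from the text above, while the function name, remaining field names and flag labels are hypothetical.

```python
def qc_flags(hyb):
    """Return the list of QC criteria that a hybridization triggers."""
    flags = []
    if hyb["scaling_factor"] > 100:  # criterion a)
        flags.append("scaling_factor")
    if hyb["actin_ratio"] > 10 and hyb["gapdh_ratio"] > 10:  # criterion b)
        flags.append("signal_ratio")
    if hyb["present_pct"] < 10:  # criterion c)
        flags.append("present_calls")
    return flags

# Hypothetical QC metrics for two hybridizations.
good = {"scaling_factor": 1.2, "actin_ratio": 1.5, "gapdh_ratio": 1.1, "present_pct": 45.0}
poor = {"scaling_factor": 150.0, "actin_ratio": 12.0, "gapdh_ratio": 11.0, "present_pct": 8.0}

good_flags = qc_flags(good)
poor_flags = qc_flags(poor)
```

Returning the list of triggered criteria, rather than a single pass/fail value, keeps the flag separate from the decision to exclude, which matches the behaviour described above where exclusion is left to the user.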
The GCOD web site provides access to normalized and scaled gene expression data from analyses of a variety of cancer types. The site provides filtering based on QC analysis of the data, and the ability to do t-tests based on the experimental parameters for the individual studies in the database. The GCOD site also offers the option to download data for each study. In the near future we plan to augment the GCOD web site to include: a) additional data QC metrics, b) a cancer gene signatures search function, and c) a batch search function. Lastly, new data sets are added to GCOD as they become available.
This research was supported by institutional funds from the Dana-Farber Cancer Institute, NLM Grant # 5R01 LM008795-04, and NSF Grant # DBI-0649614. We thank Priya Karanam for assistance with database maintenance and resolving data retrieval issues.
- Barrett Tanya, Dennis B Troup, Stephen E Wilhite, Ledoux Pierre, Rudnev Dmitry, Evangelista Carlos, Irene F Kim, Soboleva Alexandra, Maxi Tomashevsky and Ron Edgar: Nucleic Acids Research. 2007. Vol. 35, Database issue, D760-D765
- Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, Sansone S: Nucleic Acids Research. 2003, 31: 68–71. 10.1093/nar/gkg091
- Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Neoplasia. 2004, 6(1):1–6.
- Bolstad BM, Irizarry R, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193. 10.1093/bioinformatics/19.2.185
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4: 249–264. 10.1093/biostatistics/4.2.249
- Tsai J, Sultana R, Lee Y, Pertea G, Karamycheva S, Antonescu V, Cho J, Parvizi B, Cheung F, Quackenbush J: Genome Biology. 2001, 2(11): software:0002.1–0002.4. 10.1186/gb-2001-2-11-software0002
- Claire L. Wilson, Crispin J. Miller: Simpleaffy: a BioConductor package for Affymetrix Quality Control and data analysis. Bioinformatics 2005, 21(18):3683–3685. 10.1093/bioinformatics/bti605
- Gumz ML, Zou H, Kreinest PA, Childs AC, Belmonte LS, LeGrand SN, Wu KJ, Luxon BA, Sinha M, Parker AS, Sun LZ, Ahlquist DA, Wood CG, Copland JA: Secreted frizzled-related protein 1 loss contributes to tumor phenotype of clear cell renal cell carcinoma. Clin Cancer Res 2007, 13(16):4740–4749. 10.1158/1078-0432.CCR-07-0143
- Corbin M, de Reynies A, Rickman DS, Berrebi D, Boccon-Gibod L, Cohen-Gogo S, Fabre M, Jaubert F, Faussillon M, Yilmaz F, Sarnacki S, Landman-Parker J, Patte C, Schleiermacher G, Antignac C, Jeanpierre C: WNT/b-Catenin Pathway Activation in Wilms Tumors: A Unifying Mechanism with Multiple Entries? Genes, Chromosomes & Cancer 2009, 48: 816–827.
- Baudry D, Faussillon M, Cabanis MO, Rigolet M, Zucker JM, Patte C, Sarnacki S, Boccon-Gibod L, Junien C, Jeanpierre C: Changes in WT1 splicing are associated with a specific gene expression profile in Wilms' tumour. Oncogene 2002, 21: 5566–5573. 10.1038/sj.onc.1205752
- Lenburg ME, Liou LS, Gerry NP, Frampton GM, Cohen HT, Christman MF: Previously unidentified changes in renal cell carcinoma gene expression identified by parametric analysis of microarray data. BMC Cancer 2003, 3: 31–49. 10.1186/1471-2407-3-31
- Baldewijns MM, van Vlodrop IJH, Vermeulen PB, Soetekouw PMMB, van Engeland M, de Bruïne AP: VHL and HIF signalling in renal cell carcinogenesis. J Pathol 2010, 221: 125–138. 10.1002/path.2689
- Christakos S, Gabrielides C, Rhoten WB: Vitamin D-dependent calcium binding proteins: Chemistry, Distribution, Functional Considerations, and Molecular Biology. Endocr Rev 1989, 10(1):3–26. 10.1210/edrv-10-1-3
- Neve S, Aarenstrup L, Tornehave D, Rahbek-Nielsen H, Corydon TJ, Roepstorff P, Kristiansen K: Tissue distribution, intracellular localization and proteolytic processing of rat 4-hydroxyphenylpyruvate dioxygenase. Cell Biol Int 2003, 27(8):611–24. 10.1016/S1065-6995(03)00117-3
- Kavana M, Moran GR: The Interaction of (4-Hydroxyphenyl)pyruvate Dioxygenase with the Specific Inhibitor 2-[2-Nitro-4-(trifluoromethyl)benzoyl]-1,3-cyclohexanedione. Biochemistry 2003, 42: 10238–10245. 10.1021/bi034658b
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.