Improved precision and accuracy for microarrays using updated probe set definitions
BMC Bioinformatics volume 8, Article number: 48 (2007)
Microarrays enable high throughput detection of transcript expression levels. Different investigators have recently introduced updated probe set definitions to more accurately map probes to our current knowledge of genes and transcripts.
We demonstrate that updated probe set definitions provide both better precision and accuracy in probe set estimates compared to the original Affymetrix definitions. We show that the improved precision mainly depends on the increased number of probes that are integrated into each probe set, but we also demonstrate an improvement when the same number of probes is used.
Updated probe set definitions not only offer expression levels that are more accurately associated with genes and transcripts but also improve the estimated transcript expression levels. These results give support for the use of updated probe set definitions for analysis and meta-analysis of microarray data.
Microarrays have been used for the last decade to analyze the global gene expression programs of different biological processes and disease states. During that time, the methodologies for background adjustment, normalization and probe set summaries, among others, have been improved, and it is likely that further efforts will enable better analysis of microarray data. The rapidly growing use of microarrays in biology has resulted in large public gene expression repositories that include thousands of arrays for the most commonly used Affymetrix platforms. The integrated analysis of this wealth of gene expression data raises both new possibilities and challenges.
Most of the commonly used Affymetrix platforms were designed before the respective genomes were fully sequenced. Therefore, these platforms have many probes that were designed from consensus sequences of clusters of Expressed Sequence Tags. In the original Affymetrix probe set definitions several probe sets often map to the same gene (e.g. they may target different transcript isoforms), and some integrative microarray studies use ad-hoc heuristics such as the average or maximum to integrate these values into a single expression estimate [5, 6].
To address these problems, updated probe set definitions have been generated by re-annotating the existing probes on Affymetrix platforms to better reflect the transcript information and gene annotations available today [7, 8]. These pioneering studies have shown that updated probe set definitions affect approximately 20–30% of all probe sets, and thus a large portion of the gene estimates. As a consequence, the genes identified as differentially expressed using the original and updated probe set definitions show only 50% overlap [7, 8]. Updated probe set definitions that map probes to transcript annotations, such as ensEMBL transcripts, RefSeq and Entrez GeneID, are now available and can easily be integrated into Bioconductor packages such as affy and gcrma. Updated probe annotations have also been shown to improve the cross-platform reproducibility of microarray experiments [9, 10].
The use of updated probe set definitions represents a significant improvement in mapping the platform probe signals to genes, transcripts and even exon expression levels and will presumably become the standard procedure. There is, however, no study that has evaluated the impact of updated probe set definitions on precision and accuracy in the estimated expression levels. In this study we provide such a comparison and show that updated probe set definitions have significantly better precision (reproducibility) and accuracy than the original probe set definitions. These results give support for a widespread use of updated probe set definitions in analyzing and re-analyzing microarray data.
Re-analyses of raw data using updated probe set definitions
We investigated how updated probe set definitions affect the estimated expression levels by re-analyzing a gene expression data set using both the original (NetAffx) and six updated probe set definitions (custom CDFs). Previously, this data set was used to estimate the precision and accuracy in microarray experiments across laboratories and platforms using the original probe set definitions. The data set was generated by creating two RNA samples, which differed in the expression of only a few genes. Both samples were hybridized to two arrays each by five different labs. Within each lab the two pairs of replicates were used to estimate the precision and accuracy (see below) by analyzing the log2 relative expression level measurements between the two samples. Because five labs performed the identical experiment, this data set provides a good opportunity to study the effects of selected probe set definitions, since the estimated precision and accuracy obtained in each lab can be summarized to provide a robust assessment of the effects. We therefore re-analyzed the raw data for the Affymetrix HG-U133A arrays generated in the five laboratories to estimate probe set expression levels using six recently published updated probe set definitions and the default probe set definitions provided by Affymetrix (NetAffx). The six updated probe set definitions (custom CDFs) re-mapped the probes on the array to (i) ensEMBL exons, (ii) ensEMBL genes, (iii) ensEMBL transcripts, (iv) Entrez GeneIDs, (v) RefSeq transcripts and (vi) UniGene IDs.
Significant improvement in precision using updated probe set definitions
We first investigated the effect of using updated probe set definitions on precision, which measures the reproducibility and variability of the data. As described previously, we defined precision as the correlation between the relative log2 expression ratios of the two RNA samples obtained from the two pairs of replicates (i.e. A1/B1 vs A2/B2). Precision is a clear indication of experimental performance: a correlation of 1 indicates perfect precision while a correlation of 0 indicates no precision. For each lab we calculated the precision using each of the probe set definitions. The mean precision difference for each updated probe set definition as compared with the original probe set definitions is reported in Table 1. The significance of each difference in precision was assessed by a two-tailed paired t-test using the precision differences obtained from the five labs. The precision was significantly improved for all updated probe set definitions except for the ensEMBL exons (Table 1), for which it was significantly worse (commented on below). The improvement was most obvious when using RMA estimated expression levels (Table 1).
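The paired comparison across labs described above can be sketched as follows. This is a minimal illustration, not the authors' code: the per-lab precision values are invented for the example, the function names are hypothetical, and the two-tailed p-value is obtained by numerically integrating the Student t density rather than by calling a statistics library.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


def paired_t_test(updated, original):
    """Return (mean difference, t statistic, two-tailed p-value).

    Assumes the per-lab differences are not all identical
    (otherwise the standard deviation is zero).
    """
    diffs = [u - o for u, o in zip(updated, original)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return mean(diffs), t, two_tailed_p(t, n - 1)


# Hypothetical precision values for five labs (illustration only).
precision_original = [0.60, 0.55, 0.62, 0.58, 0.50]
precision_updated = [0.66, 0.60, 0.69, 0.63, 0.57]

mean_diff, t_stat, p_value = paired_t_test(precision_updated, precision_original)
print(f"mean difference={mean_diff:.3f}, t={t_stat:.2f}, p={p_value:.5f}")
```

With only five labs the test has four degrees of freedom, so a difference has to be quite consistent across labs to reach significance.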
The decrease in precision for probe set definitions mapping to ensEMBL exons was likely due to the smaller number of probes that map to each exon (compared to the whole transcript). We therefore calculated the mean number of probes mapping to each probe set using the different probe set definitions. Indeed, the mean number of probes per probe set is lower for ensEMBL exons (Table 2). Using fewer probes when estimating an expression level likely increases the variance and lowers the precision. Likewise, the improved precision for the other updated probe set definitions could be due to a larger number of probes mapping to each probe set, since the mean number of probes is higher than for the original probe set definitions (Table 2). We therefore analyzed the precision as a function of the number of probes used to estimate each probe set (Figure 1a) for all probe set definitions, averaged across the five labs. To enable this analysis we had to group probe sets in bins of 4, as too few probe sets would otherwise give unreliable precision estimates. We found a positive correlation between the number of probes per probe set and precision. However, the updated probe set definitions appear to achieve better precision than the original, even when similar numbers of probes were integrated into the signal estimates (see probe intervals 10–13 and 14–17 in Figure 1a). The numbers of probe sets defined by a particular number of probes are presented in Figure 1b. Similar results were obtained when analyzing the data from each of the five labs independently [see Additional file 1]. Thus, updated probe set definitions yield significant improvements in precision.
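The binned analysis can be sketched as below. This is an illustrative reconstruction under assumptions: the bin width of 4 comes from the text, the bin offset (`first_bin_start=2`) is inferred from the quoted intervals 10–13 and 14–17, and the `min_sets` threshold, the function names and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def precision_by_probe_count(probe_sets, width=4, first_bin_start=2, min_sets=3):
    """Group probe sets by probe count and compute precision within each bin.

    probe_sets: iterable of (n_probes, log2 ratio replicate 1, log2 ratio replicate 2).
    Returns {(lowest, highest probe count in bin): precision}.
    Bins with fewer than min_sets probe sets are skipped (assumed threshold).
    """
    bins = defaultdict(lambda: ([], []))
    for n_probes, m1, m2 in probe_sets:
        start = first_bin_start + width * ((n_probes - first_bin_start) // width)
        bins[start][0].append(m1)
        bins[start][1].append(m2)
    result = {}
    for start in sorted(bins):
        m1s, m2s = bins[start]
        if len(m1s) >= min_sets:
            result[(start, start + width - 1)] = pearson(m1s, m2s)
    return result


# Toy data (illustration only): replicates agree in one bin and disagree in the other.
toy = [
    (10, 1.0, 1.0), (11, -0.5, -0.5), (12, 2.0, 2.0), (13, 0.3, 0.3),
    (14, 1.0, -1.0), (15, -0.5, 0.5), (16, 2.0, -2.0), (17, 0.3, -0.3),
    (5, 0.1, 0.2),  # lone probe set: its bin is too small and is skipped
]
for interval, value in precision_by_probe_count(toy).items():
    print(interval, round(value, 3))
```

Comparing such per-bin values between the original and an updated definition is what separates the effect of probe number from the effect of re-annotation.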
Significant improvement in accuracy using updated probe set definitions
We next investigated the accuracy in detecting differentially expressed genes when using the updated probe set definitions. Accuracy was defined [11] to estimate how close the microarray estimates are to the "real" expression changes. Most often the "real" expression is measured using RT-PCR (real time PCR). To assess how accurate the estimates obtained with the updated probe set definitions were, we compared the differential expression detected with microarrays to that measured by RT-PCR for 16 genes, for each of the probe set definitions. The accuracy was defined as the slope of a linear regression between RT-PCR and microarray data (i.e. an accuracy of 1.0 is optimal). We calculated the difference in accuracy for each lab between the updated probe set definitions and the standard probe set definition and then asked if the mean accuracy difference (averaged across the five labs) was significant using a paired t-test (two-tailed distribution). Significant improvements in accuracy were observed (when data were normalized using RMA) for all but the UniGene definition. The mean accuracy differences between the updated probe set definitions and the standard probe set definition, as well as the p-values calculated using the paired t-test, are shown in Table 3. The slopes estimated from the five different labs were in general in good agreement, as evidenced by the low standard deviations in Figure 2 and Table 3.
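The slope-based accuracy measure can be illustrated with an ordinary least-squares fit. The sketch below assumes that the microarray log2 ratios are regressed on the RT-PCR log2 ratios with an intercept; the text does not state the regression direction explicitly, and the function name and the numbers are made up for the example.

```python
def regression_slope(x, y):
    """Least-squares slope of y regressed on x (intercept included)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


# Hypothetical log2 ratios (illustration only): the microarray reports
# roughly half of the change measured by RT-PCR.
rtpcr_log2 = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
array_log2 = [0.5 * v + 0.1 for v in rtpcr_log2]

accuracy = regression_slope(rtpcr_log2, array_log2)
print(f"accuracy (slope) = {accuracy:.2f}")
```

Under this convention a slope below 1.0 means the array underestimates the magnitude of the change relative to RT-PCR, and a slope closer to 1.0 indicates better accuracy.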
Accurate probe set definitions are essential for integrating the probe signals from a microarray experiment into a set of expression levels. Different investigators have recently introduced updated probe set definitions [7, 8] that more accurately map probes to genes and transcripts. The updated probe set definitions for Affymetrix arrays use fewer or more probes per probe set (by removing erroneous or non-specific probes and by pooling several probe sets targeting the same gene/transcript) and also estimate fewer probe sets (i.e. transcripts or genes) as compared with the original annotations. As a consequence, the number of probe pairs per probe set is no longer identical across all probe sets. We therefore investigated how the updated probe set definitions, with a variable number of probe pairs integrated into each probe set estimate, would affect the precision and accuracy of estimated expression levels. We initially hypothesized that the more stringent selection of probes included in each probe set might have a negative impact on precision, as fewer probes would be included in some probe sets. Such a result would have argued for caution in using updated probe set definitions.
We show that using updated probe set definitions (custom CDFs) improves both the precision and accuracy of the relative expression level estimates. The improvement in precision depends mainly on the increased number of probe pairs per probe set (Figure 1a). Furthermore, an improvement was also detected in comparisons where similar numbers of probe pairs were used, which indicates that the re-annotation improves the expression estimate, presumably by removing erroneous or non-specific probes that otherwise add noise. The observed improvement in accuracy may also be due to removal of erroneous probes that would otherwise lower the estimated differential expression. Improving the precision and accuracy affects the possible inferences from an experiment. For example, a microarray study with increased precision will likely improve the ability to identify differentially expressed genes, due to a lower variation within the biological groups. Similarly, the improvement in accuracy leads to prediction of relative expression level changes that better reflect the 'real change' (as measured by more precise methods). This is the first assessment of the impact of updated probe set definitions (and custom CDFs) on these two fundamental measurements, and our results strongly argue for a widespread use of updated probe set definitions.
More accurate probe set definitions will also be important for studies comparing microarray expression levels to sequence features, e.g. on pre-mRNAs. The correct mapping of pre-mRNA sequences to expression levels will likely improve the possible inferences (e.g. the identification of cis-regulatory elements). Therefore, we predict that using updated probe set definitions will be important for studies on post-transcriptional regulation, e.g. at the level of miRNA targets and alternative splicing.
The public repositories (e.g. Gene Expression Omnibus, ArrayExpress and Stanford Microarray Database) contain a wealth of gene expression data that could be used for re-analysis and meta-analysis. However, only experiments that are deposited as raw data can be re-analyzed to take advantage of the updated probe set definitions. It is therefore troublesome that still only a limited portion of the data in the public repositories is available as raw data to be used for future comparative microarray analysis, e.g. using updated probe set definitions.
Updated probe set definitions not only offer expression levels that are more accurately associated with genes and transcripts but also show improvements in the estimated transcript expression levels. These results give further support for a widespread use of updated probe set definitions for analysis and meta-analysis of microarray data.
Gene expression data
We re-analyzed the gene expression data generated in five different labs using the same RNA hybridized to HG-U133A Affymetrix arrays. The raw data files (i.e. CEL files) were downloaded from the authors' URL. Each lab produced a comparison of two different samples in duplicate.
We downloaded version 7 of the updated probe set definitions generated for the HG-U133A platform from the authors' URL. In addition to the original Affymetrix definitions, we considered the six updated probe set definitions that mapped probes to ensEMBL exons (Hs133A_Hs_ENSE_7), ensEMBL transcripts (Hs133A_Hs_ENST_7), ensEMBL genes (Hs133A_Hs_ENSG_7), Entrez Gene IDs (Hs133A_Hs_ENTREZG_7), RefSeq (Hs133A_Hs_REFSEQ_7) and UniGene (Hs133A_Hs_UG_7).
Probe set summaries were calculated for each laboratory (4 arrays per lab) using three different methods for expression level estimation (MAS5, RMA and GCRMA) and seven different CDF files (the original and the six custom CDF files), resulting in twenty-one different probe set estimates per array. All calculations were performed in R using the Bioconductor packages affy and gcrma and the default settings for MAS5, RMA and GCRMA. We named the custom CDF files as previously described.
The data sets generated in each lab consisted of sample A hybridized to two arrays, A1 and A2, and sample B hybridized to two arrays, B1 and B2. Following Irizarry and co-workers, precision was defined as the Pearson correlation between the log2 ratios of A1/B1 and A2/B2. The precision presented in Table 1 was calculated on MAS5, RMA and GCRMA estimated probe set signals, using the different custom CDF files independently. Figure 1 shows the precision as a function of the number of probes integrated into a probe set for the GCRMA generated expression levels.
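As a concrete illustration of this definition, the sketch below computes the precision for one lab from four arrays. It assumes the probe set signals are already on the log2 scale (as RMA and GCRMA output is), so that a log2 ratio is a simple difference; MAS5 signals would need a log2 transform first. The function names and the toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def precision(a1, b1, a2, b2):
    """Correlation between log2(A1/B1) and log2(A2/B2) across probe sets.

    All inputs are log2-scale signals, one value per probe set, in the same order.
    """
    m1 = [a - b for a, b in zip(a1, b1)]  # log2 ratio, replicate pair 1
    m2 = [a - b for a, b in zip(a2, b2)]  # log2 ratio, replicate pair 2
    return pearson(m1, m2)


# Toy log2 signals for four probe sets (illustration only).
a1 = [8.0, 9.5, 7.2, 10.1]
b1 = [8.1, 8.5, 7.0, 11.0]
a2 = [8.2, 9.4, 7.1, 10.3]
b2 = [8.0, 8.6, 7.2, 11.1]

print(f"precision = {precision(a1, b1, a2, b2):.3f}")
```

Because most genes do not differ between the two samples, their log2 ratios are dominated by noise, which is why this correlation is a sensitive measure of reproducibility.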
The relative change in expression levels of 16 genes was previously measured by RT-PCR, and we downloaded the corresponding log2 ratios from the authors' webpage. The accuracy measures how the magnitude of differential expression on a specific platform compares to the difference obtained by a more precise method, e.g. RT-PCR. We used the annotations from NetAffx to map the probe sets of these 16 genes to the different updated probe set definition identifiers [see Additional file 2]. The accuracy was defined as the slope between the RT-PCR and microarray log2 ratios, determined by a linear regression. The accuracy for the different custom CDF files on RMA expression values is presented in Table 3. Accuracy estimates using GCRMA [see Additional file 3] were also calculated.
1. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4: 249–264. 10.1093/biostatistics/4.2.249
2. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193. 10.1093/bioinformatics/19.2.185
3. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31: e15. 10.1093/nar/gng015
4. Larsson O, Wennmalm K, Sandberg R: Comparative microarray analysis. Omics 2006, 10: 381–397. 10.1089/omi.2006.10.381
5. Wennmalm K, Wahlestedt C, Larsson O: The expression signature of in vitro senescence resembles mouse but not human aging. Genome Biol 2005, 6: R109. 10.1186/gb-2005-6-13-r109
6. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005, 102: 15545–15550. 10.1073/pnas.0506580102
7. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005, 33: e175. 10.1093/nar/gni179
8. Gautier L, Moller M, Friis-Hansen L, Knudsen S: Alternative mapping of probes to genes for Affymetrix chips. BMC Bioinformatics 2004, 5: 111. 10.1186/1471-2105-5-111
9. Carter SL, Eklund AC, Mecham BH, Kohane IS, Szallasi Z: Redefinition of Affymetrix probe sets by sequence overlap with cDNA microarray probes reduces cross-platform inconsistencies in cancer-associated gene expression measurements. BMC Bioinformatics 2005, 6: 107. 10.1186/1471-2105-6-107
10. Elo LL, Lahti L, Skottman H, Kylaniemi M, Lahesmaa R, Aittokallio T: Integrating probe-level expression changes across generations of Affymetrix arrays. Nucleic Acids Res 2005, 33: e193. 10.1093/nar/gni193
11. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, Martinez-Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott A, Wilson M, Yang Y, Ye SQ, Yu W: Multiple-laboratory comparison of microarray platforms. Nat Methods 2005, 2: 345–350. 10.1038/nmeth756
12. Larsson O, Perlman DM, Fan D, Reilly CS, Peterson M, Dahlgren C, Liang Z, Li S, Polunovsky VA, Wahlestedt C, Bitterman PB: Apoptosis resistance downstream of eIF4E: posttranscriptional activation of an anti-apoptotic transcript carrying a consensus hairpin structure. Nucleic Acids Res 2006, 34: 4375–4386. 10.1093/nar/gkl558
13. Farh KK, Grimson A, Jan C, Lewis BP, Johnston WK, Lim LP, Burge CB, Bartel DP: The widespread impact of mammalian MicroRNAs on mRNA repression and evolution. Science 2005, 310: 1817–1821. 10.1126/science.1121158
14. Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R: NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res 2005, 33: D562–6. 10.1093/nar/gki022
15. Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, Holloway E, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma A, Sansone S, Brazma A: ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 2005, 33 Database Issue: D553–5.
16. Ball CA, Awad IA, Demeter J, Gollub J, Hebert JM, Hernandez-Boussard T, Jin H, Matese JC, Nitzberg M, Wymore F, Zachariah ZK, Brown PO, Sherlock G: The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res 2005, 33 Database Issue: D580–2.
17. Larsson O, Sandberg R: Lack of correct data format and comparability limits future integrative microarray research. Nat Biotechnol 2006, 24: 1322–1323. 10.1038/nbt1106-1322
18. Multiple Lab Comparison of Microarray Platforms Web Page [http://www.biostat.jhsph.edu/~ririzarr/techcomp]
R.S. is supported by a postdoctoral fellowship from the Knut and Alice Wallenberg Foundation and O.L. is supported by a postdoctoral fellowship from the Swedish Research Council.
RS and OL designed and performed the experiment and wrote the manuscript. Both authors read and approved the final manuscript.