Improved precision and accuracy for microarrays using updated probe set definitions
© Sandberg and Larsson. 2007
Received: 26 September 2006
Accepted: 08 February 2007
Published: 08 February 2007
Microarrays enable high-throughput detection of transcript expression levels. Different investigators have recently introduced updated probe set definitions to more accurately map probes to our current knowledge of genes and transcripts.
We demonstrate that updated probe set definitions provide both better precision and accuracy in probe set estimates compared to the original Affymetrix definitions. We show that the improved precision mainly depends on the increased number of probes that are integrated into each probe set, but we also demonstrate an improvement when the same number of probes is used.
Updated probe set definitions not only offer expression levels that are more accurately associated with genes and transcripts but also improve the estimated transcript expression levels. These results support the use of updated probe set definitions for analysis and meta-analysis of microarray data.
Microarrays have been used for the last decade to analyze the global gene expression programs of different biological processes and disease states. During that time, methodologies for background adjustment, normalization and probe set summarization, among others, have been improved, and it is likely that further efforts will enable better analysis of microarray data. The exponentially growing use of microarrays in biology has resulted in large public gene expression repositories that include thousands of arrays for the most commonly used Affymetrix platforms. The integrated analysis of this wealth of gene expression data raises both new possibilities and new challenges.
Most of the commonly used Affymetrix platforms were designed before the respective genomes were fully sequenced. Therefore, these platforms have many probes that were designed from consensus sequences of clusters of Expressed Sequence Tags. In the original Affymetrix probe set definitions, several probe sets often map to the same gene (e.g. they may target different transcript isoforms), and some integrative microarray studies use ad-hoc heuristics such as the average or maximum to integrate these values into a single expression estimate [5, 6].
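The ad-hoc heuristics mentioned above can be illustrated with a minimal Python sketch. It is not taken from the cited studies: the function name `collapse_probe_sets`, the probe set identifiers and the gene labels are all hypothetical, and the sketch assumes that probe set values are already summarized on a common (e.g. log2) scale.

```python
from collections import defaultdict
from statistics import mean


def collapse_probe_sets(expression, probe_set_to_gene, how="mean"):
    """Collapse probe-set-level values into one value per gene.

    expression: dict mapping probe set id -> expression value.
    probe_set_to_gene: dict mapping probe set id -> gene id.
    how: "mean" or "max", the two heuristics named in the text.
    """
    if how not in ("mean", "max"):
        raise ValueError("how must be 'mean' or 'max'")
    by_gene = defaultdict(list)
    for probe_set, value in expression.items():
        gene = probe_set_to_gene.get(probe_set)
        if gene is not None:  # probe sets without a gene mapping are skipped
            by_gene[gene].append(value)
    reducer = mean if how == "mean" else max
    return {gene: reducer(values) for gene, values in by_gene.items()}


# Hypothetical example: two probe sets target GENE_A, one targets GENE_B.
expression = {"ps_1": 8.0, "ps_2": 10.0, "ps_3": 6.5}
mapping = {"ps_1": "GENE_A", "ps_2": "GENE_A", "ps_3": "GENE_B"}
print(collapse_probe_sets(expression, mapping, how="mean"))
print(collapse_probe_sets(expression, mapping, how="max"))
```

The two heuristics can give different gene-level values from the same probe sets (here 9.0 versus 10.0 for the hypothetical GENE_A), which is one reason a principled re-mapping of probes to genes is attractive.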
To address these problems, updated probe set definitions have been generated by re-annotating the existing probes on Affymetrix platforms to better reflect the transcript information and gene annotations available today [7, 8]. These pioneering studies have shown that updated probe set definitions affect approximately 20–30% of all probe sets, and thus a large portion of the gene estimates. As a consequence, the genes identified as differentially expressed using the original and updated probe set definitions show only 50% overlap [7, 8]. Updated probe set definitions that map probes to transcript annotations, such as ensEMBL transcripts, RefSeq and Entrez GeneID, are now available and can easily be integrated into Bioconductor packages such as affy and gcrma. Updated probe annotations have also been shown to improve the cross-platform reproducibility of microarray experiments [9, 10].
The use of updated probe set definitions represents a significant improvement in mapping the platform probe signals to gene, transcript and even exon expression levels and will presumably become the standard procedure. There is, however, no study that has evaluated the impact of updated probe set definitions on the precision and accuracy of the estimated expression levels. In this study we provide such a comparison and show that updated probe set definitions have significantly better precision (reproducibility) and accuracy than the original probe set definitions. These results support a widespread use of updated probe set definitions in analyzing and re-analyzing microarray data.
We investigated how updated probe set definitions affect the estimated expression levels by re-analyzing a gene expression data set using both the original (NetAffx) and six updated probe set definitions (custom CDFs). Previously, this data set was used to estimate the precision and accuracy of microarray experiments across laboratories and platforms using the original probe set definitions. The data set was generated by creating two RNA samples, which differed in the expression of only a few genes. Both samples were hybridized to two arrays each by five different labs. Within each lab, the two pairs of replicates were used to estimate the precision and accuracy (see below) by analyzing the log2 relative expression level measurements between the two samples. Because five labs performed the identical experiment, this data set provides a good opportunity to study the effects of selected probe set definitions, since the precision and accuracy estimates obtained in each lab can be summarized to provide a robust assessment of the effects. We therefore used this data set to compare six recently published updated probe set definitions with the default probe set definitions provided by Affymetrix (NetAffx). We re-analyzed the raw data for the Affymetrix HG-U133A arrays generated in the five laboratories to estimate probe set expression levels using each of the six updated probe set definitions and the default probe set definitions. The six updated probe set definitions (custom CDFs) re-mapped the probes on the array to i. ensEMBL exons, ii. ensEMBL genes, iii. ensEMBL transcripts, iv. Entrez GeneIDs, v. RefSeq transcripts and vi. UniGene ids.
Improved precision using updated probe set definitions
p-values reported in this section (table layout not recoverable): 0.094, 0.057, 0.0092, 0.056, 0.018, 0.052, 0.0041, 0.00025, 0.0071, 0.00011, 0.00045, 2.9E-06, 0.000051, 0.045, 0.28, 0.019, 0.062, 0.0071
Characteristics of probe set definitions (table columns: probe set definition; number of probe sets; mean number of probe pairs per probe set)
Improved accuracy using updated probe set definitions
Accurate probe set definitions are essential for integrating the probe signals from a microarray experiment into a set of expression levels. Different investigators have recently introduced updated probe set definitions [7, 8] that more accurately map probes to genes and transcripts. The updated probe set definitions for Affymetrix arrays use fewer or more probes per probe set (by removing erroneous or non-specific probes and by pooling several probe sets targeting the same gene/transcript) and also estimate fewer probe sets (i.e. transcripts or genes) than the original annotations. As a consequence, the number of probe pairs per probe set is no longer identical across all probe sets. We therefore investigated how updated probe set definitions, with a variable number of probe pairs integrated into each probe set estimate, would affect the precision and accuracy of the estimated expression levels. We initially hypothesized that the more stringent selection of probes included in each probe set might have a negative impact on precision, as fewer probes would be included in some probe sets. Such a result would have argued for caution in using updated probe set definitions.
We show that using updated probe set definitions (custom CDFs) improves both the precision and the accuracy of the relative expression level estimates. The improvement in precision depends mainly on the increased number of probe pairs per probe set (Figure 1a). Furthermore, an improvement was also detected in comparisons where a similar number of probe pairs was used, which indicates that the re-annotation improves the expression estimate, presumably by removing erroneous or non-specific probes that otherwise add noise. The observed improvement in accuracy may also be due to the removal of erroneous probes that would otherwise lower the estimated differential expression. Improving precision and accuracy affects the inferences that can be drawn from an experiment. For example, a microarray study with increased precision will likely improve the ability to identify differentially expressed genes, due to lower variation within the biological groups. Similarly, the improvement in accuracy leads to predicted relative expression level changes that better reflect the 'real change' (as measured by more precise methods). This is the first assessment of the impact of updated probe set definitions (and custom CDFs) on these two fundamental measurements, and our results strongly argue for a widespread use of updated probe set definitions.
More accurate probe set definitions will also be important for studies comparing microarray expression levels to sequence features, e.g. on pre-mRNAs. The correct mapping of pre-mRNA sequences to expression levels will likely improve the possible inferences (e.g. the identification of cis-regulatory elements). Therefore, we predict that using updated probe set definitions will be important for studies on post-transcriptional regulation, e.g. at the level of miRNA targets and alternative splicing.
The public repositories (e.g. Gene Expression Omnibus, ArrayExpress and Stanford Microarray Database) contain a wealth of gene expression data that could be used for re-analysis and meta-analysis. However, only experiments that are deposited as raw data can be re-analyzed to take advantage of the updated probe set definitions. It is therefore troubling that still only a limited portion of the data in the public repositories is available as raw data for future comparative microarray analysis, e.g. using updated probe set definitions.
Updated probe set definitions not only offer expression levels that are more accurately associated with genes and transcripts but also show improvements in the estimated transcript expression levels. These results give further support for a widespread use of updated probe set definitions for analysis and meta-analysis of microarray data.
We re-analyzed the gene expression data generated in five different labs using the same RNA hybridized to HG-U133A Affymetrix arrays. The raw data files (i.e. CEL files) were downloaded from the authors' URL. Each lab produced a comparison of two different samples in duplicate.
We downloaded version 7 of the updated probe set definitions generated for the HG-U133A platform from the authors' URL. We considered the six updated probe set definitions that mapped probes to ensEMBL exons (Hs133A_Hs_ENSE_7), ensEMBL transcripts (Hs133A_Hs_ENST_7), ensEMBL genes (Hs133A_Hs_ENSG_7), Entrez Gene IDs (Hs133A_Hs_ENTREZG_7), RefSeq (Hs133A_Hs_REFSEQ_7) and UniGene (Hs133A_Hs_UG_7), in addition to the original Affymetrix definitions.
Probe set summaries were calculated for each laboratory (4 arrays per lab) using three different methods for expression level estimation (MAS5, RMA and GCRMA) and seven different CDF files (the original and the six custom CDFs), resulting in twenty-one different probe set estimates per array. All calculations were performed in R using the Bioconductor packages affy and gcrma and the default settings for MAS5, RMA and GCRMA. We named the custom CDF files as previously described.
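The count of twenty-one estimates per array follows from crossing the three summarization methods with the seven probe set definitions. The short Python sketch below only enumerates these combinations and does not perform any summarization (the analysis itself was done in R). The CDF names are those given above; the label "NetAffx" for the original definitions and the function name `analysis_grid` are assumptions made for illustration.

```python
from itertools import product

# Summarization methods named in the text.
METHODS = ("MAS5", "RMA", "GCRMA")

# "NetAffx" is used here as a label for the original Affymetrix
# definitions (an assumption); the other names are the version 7
# custom CDF names listed in the text.
CDFS = (
    "NetAffx",
    "Hs133A_Hs_ENSE_7",
    "Hs133A_Hs_ENST_7",
    "Hs133A_Hs_ENSG_7",
    "Hs133A_Hs_ENTREZG_7",
    "Hs133A_Hs_REFSEQ_7",
    "Hs133A_Hs_UG_7",
)


def analysis_grid():
    """Return every (method, CDF) pair evaluated per array."""
    return list(product(METHODS, CDFS))


grid = analysis_grid()
print(len(grid))  # 3 methods x 7 definitions
```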
The data set generated in each lab consisted of sample A hybridized to two arrays, A1 and A2, and sample B hybridized to two arrays, B1 and B2. Following Irizarry and co-workers, precision was defined as the Pearson correlation between the log2 ratios of A1/B1 and A2/B2. The precision presented in Table 1 was calculated on MAS5, RMA and GCRMA estimated probe set signals, using the different custom CDF files independently. Figure 1 shows the precision as a function of the number of probes integrated into a probe set for the GCRMA generated expression levels.
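The precision measure defined above can be sketched in a few lines of Python. This is an illustration of the definition, not the code used in the study. It assumes that the four input vectors hold linear-scale probe set signals in the same probe set order; for values that are already on the log2 scale (as RMA and GCRMA typically report), the log2 ratio becomes a simple difference. The function names are hypothetical.

```python
import math


def log2_ratios(sample_a, sample_b):
    """Per-probe-set log2(A/B) for linear-scale, strictly positive signals."""
    return [math.log2(a / b) for a, b in zip(sample_a, sample_b)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def precision(a1, b1, a2, b2):
    """Correlation between the replicate log2 ratios A1/B1 and A2/B2."""
    return pearson(log2_ratios(a1, b1), log2_ratios(a2, b2))
```

A precision close to 1 means that the two replicate comparisons within a lab report nearly the same relative expression changes across probe sets.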
The relative change in expression levels of 16 genes was previously measured by RT-PCR, and we downloaded the corresponding log2 ratios from the authors' webpage. The accuracy measures how the magnitude of differential expression on a specific platform compares to the difference obtained by a more precise method, e.g. RT-PCR. We used the annotations from NetAffx to map the probe sets of these 16 genes to the different updated probe set definition identifiers [see Additional file 2]. The accuracy was defined as the slope between the RT-PCR and microarray log2 ratios, determined by a linear regression. The accuracy for the different custom CDF files on RMA expression values is presented in Table 2. Accuracy estimates using GCRMA [see Additional file 3] were also calculated.
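The accuracy measure can be sketched as an ordinary least-squares slope. The text does not state which variable is regressed on which; the sketch below assumes that the microarray log2 ratios are regressed on the RT-PCR log2 ratios, so that a slope below 1 indicates that the array compresses the changes seen by RT-PCR. The function names and the example values are hypothetical and are not data from the study.

```python
def regression_slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def accuracy(rtpcr_log2_ratios, array_log2_ratios):
    """Slope of microarray log2 ratios against RT-PCR log2 ratios.

    The regression direction (array on RT-PCR) is an assumption.
    """
    return regression_slope(rtpcr_log2_ratios, array_log2_ratios)


# Hypothetical values: the array reports half the RT-PCR change.
rtpcr = [-2.0, -1.0, 0.0, 1.0, 2.0]
array = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(accuracy(rtpcr, array))
```

Under this convention, a slope of 1 would mean that the microarray reproduces the RT-PCR log2 ratios on average, so a probe set definition that moves the slope toward 1 is the more accurate one.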
R.S. is supported by a postdoctoral fellowship from the Knut and Alice Wallenberg Foundation and O.L. is supported by a postdoctoral fellowship from the Swedish Research Council.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.