Transcript-based redefinition of grouped oligonucleotide probe sets using AceView: High-resolution annotation for microarrays
© Lu et al. 2007
Received: 28 August 2006
Accepted: 29 March 2007
Published: 29 March 2007
Extracting biological information from high-density Affymetrix arrays is a multi-step process that begins with the accurate annotation of microarray probes. Shortfalls in the original Affymetrix probe annotation have been described; however, few studies have provided rigorous solutions for routine data analysis.
Using AceView, a comprehensive human transcript database, we have reannotated the probes by matching them to RNA transcripts instead of genes. Based on this transcript-level annotation, a new probe set definition was created in which every probe in a probe set maps to a common set of AceView gene transcripts. In addition, using artificial data sets we identified that a minimal probe set size of 4 is necessary for reliable statistical summarization. We further demonstrate that applying the new probe set definition can detect specific transcript variants contributing to differential expression and it also improves cross-platform concordance.
We conclude that our transcript-level reannotation and redefinition of probe sets complement the original Affymetrix design. Redefinitions introduce probe sets whose sizes may not support reliable statistical summarization; therefore, we advocate using our transcript-level mapping redefinition in a secondary analysis step rather than as a replacement. Knowing which specific transcripts are differentially expressed is important to properly design probe/primer pairs for validation purposes. For convenience, we have created custom chip-description-files (CDFs) and annotation files for our new probe set definitions that are compatible with Bioconductor, Affymetrix Expression Console or third party software.
Affymetrix GeneChips™ [1, 2] are widely used in biomedical research for genome-wide expression profiling. The level of gene expression is typically summarized from a probe set composed of several 25-mer probes designed to span a target region based on a UniGene cluster. Summarized expression measurements for a probe set are derived using a variety of algorithms, including MAS5.0, model-based expression indices (MBEI), robust multi-chip average (RMA) [5, 6], and the position-dependent nearest neighbor (PDNN) algorithm [7, 8].
Significant effort has been placed on extracting accurate and robust expression measurements summarized from multiple probes using a variety of statistical algorithms [9–11]. Recently, with the public release of microarray probe sequences, attention has been paid to the accuracy of individual probe annotations and its impact on gene expression data [12–17]. Probes within a probe set can be both ambiguous (non-specific, i.e. targeting multiple genes) and heterogeneous (targeting different transcript variants from one gene). For example, examination of the probe sequences incorporated in the Affymetrix Human Genome U95Av2 Array indicates that 10.5% of the probes are nonspecific and 9.3% are mistargeted. Moreover, interpretation of probe signal is complicated by probes cross-hybridizing to similar sequences and to transcript variants arising from alternative splicing [15, 16]. Grouping probes that map to different targets may create divergent signals that significantly influence expression measurements from stochastic-model-based summarization approaches (e.g. RMA). For example, stray signal arising from probes with multiple targets within a probe set has been shown to contribute to misleading biological relationships. A more nuanced approach to estimating expression levels calls for consideration of alternative splicing, as more than half of all genes in the human genome are alternatively spliced. While the UniGene-based definition of Affymetrix probe sets may be sufficient to provide overall differential gene expression estimates, it is inadequate for distinguishing or preserving signal data arising from different transcript variants [15, 19].
Several groups have explored the effects of using alternative microarray annotations. By matching probe sequences to an up-to-date Reference Sequence (RefSeq) database [20, 21], Gautier et al. investigated an "alternative mapping" approach, wherein probes were grouped together if they matched a common RefSeq transcript and were excluded from a probe set if they matched 2 or more RefSeq entries. While this approach increases the specificity of each probe set, it might prove impractical in the long term: as the RefSeq database continues to grow, more probes will match multiple entries and probe sets will erode over time. Carter et al. adopted a redefinition of Affymetrix probe sets in which probes were matched against cDNA clones on spotted arrays. Their method showed improved concordance of expression measurements, hinting that concordant annotation would support concordance of results. In contrast, when they used the AceView transcript database to match Affymetrix probe sets containing probes that could be sequence matched to the same transcript sequence as the cDNA clone (Shared Transcript probes), they found relatively low cross-platform consistency compared to direct sequence overlap. They postulated that the low correlation might be due to a number of factors, including the presence of splice variants, the probes being subject to different cross-hybridization patterns, or incorrect clone sequence predictions. More recently, Dai et al. provided a method for redefining Affymetrix probe sets using several gene and transcript databases. In their regrouping strategy, all probes that match a single transcript or gene are simply grouped into a probe set. These approaches, however, did not account for the heterogeneous manner in which individual probes can target transcripts. Hence, the expression signal from a given probe set is summarized across probes that individually map to varied and/or multiple sets of transcript variants.
Statistics of probe-to-transcript mapping and redefinition of probe sets
U133 Plus 2
Total probe sets from Affymetrix
Total probe sets, newly defined
Identical to original Affymetrix probe sets
Derived from 1 Affymetrix probe set
Derived from >1 Affymetrix probe sets
Probe sets containing ≥ 4 probes
Total unique probe sequences (_at probes)
Probes matching ≥ 1 transcript in AceView
Probes matching ≥ 1 transcript in RefSeq
Total number of AceView transcripts matched
Total number of RefSeq transcripts matched
Given a specific probe-to-transcript mapping, defining a probe set is straightforward: probes that are mapped to the same set of transcripts naturally belong to a common probe set. There are two ways in which a new probe set can be formed: it can be derived solely from a single Affymetrix probe set or it can be formed by merging probes from 2 or more Affymetrix probe sets. One example of the first scenario is shown in Figure 1 (top panel). The Affymetrix probe set 34666_at on the GeneChip U95Av2 contains 16 probes and targets the RefSeq entry NM_000636 which encodes superoxide dismutase, mitochondrial (SOD2). A "higher resolution" detailed view based on our probe-to-transcript mapping shows that this probe set actually maps to 3 AceView transcripts. Using our new probe set definition, the first five probes form one probe set since they all match transcripts SOD2.bAug05, SOD2.cAug05 and SOD2.iAug05; the next 8 probes match the transcript set SOD2.bAug05 and SOD2.cAug05, so they form another probe set, and the last 3 only match transcript SOD2.bAug05 to form yet another group. Note that, in our new probe set definition, probe sets never share probes, but the transcripts they represent may overlap.
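The grouping rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the probe names (`34666_at-p1` and so on) and the function name are hypothetical, and only the SOD2 transcript identifiers and the 5/8/3 split of the 16 probes come from the text.

```python
from collections import defaultdict

def group_probes_by_transcript_set(probe_to_transcripts):
    """Group probes that match exactly the same set of transcripts."""
    groups = defaultdict(list)
    for probe, transcripts in probe_to_transcripts.items():
        # A frozenset is hashable, so the whole transcript set is the key.
        groups[frozenset(transcripts)].append(probe)
    return dict(groups)

# Toy mapping for the 16 probes of 34666_at (probe names are placeholders).
mapping = {}
for i in range(1, 6):
    mapping[f"34666_at-p{i}"] = {"SOD2.bAug05", "SOD2.cAug05", "SOD2.iAug05"}
for i in range(6, 14):
    mapping[f"34666_at-p{i}"] = {"SOD2.bAug05", "SOD2.cAug05"}
for i in range(14, 17):
    mapping[f"34666_at-p{i}"] = {"SOD2.bAug05"}

probe_sets = group_probes_by_transcript_set(mapping)
sizes = sorted(len(probes) for probes in probe_sets.values())
print(sizes)
```

Because each probe has exactly one transcript set, it lands in exactly one group, which is why the resulting probe sets never share probes even though their transcript sets overlap.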
Because of the large transcript-to-gene ratio in AceView, a majority of the probe sets match more than one AceView transcript. Interestingly there is an inverse relationship between probe set size and the number of transcripts a probe set targets. The inverse relationship is especially strong in probe sets derived by splitting, i.e. those probe sets smaller than the standard Affymetrix probe sets (16 probes for U95A, 11 for U133A), but it is also apparent in merged probe sets, i.e. those larger than the standard Affymetrix probe sets.
By regrouping probes into homogeneous probe sets using the AceView mapping, gene expression is examined at the transcript level rather than the gene level. Higher resolution probe set definitions allowed us to identify specific transcript variants that were undetectable within the original heterogeneous probe sets. In an earlier experiment using the Affymetrix platform, we compared pancreatic tumor cells before and after serum removal to study early events accompanying islet cell differentiation. Here, this data set was reanalyzed with both the original and the new probe set definitions, and the two lists of differentially expressed genes were compared. In this analysis, we consider a gene or a probe set differentially expressed if the false discovery rate (FDR) adjusted p-value is less than 0.05 and the fold-change (up- or down-regulated) is greater than 1.7.
Increased transcript specificity by using the newly defined probe sets
Affymetrix Probe Sets
Redefined Probe Sets
No. of variants
No. of variants
BNIP3: a, b
ANKMY2: a, b
TRIP13: a, e
PGRMC2: a, d
TXNL4A: d, e
BEXL1: a, c
TPX2: a, b
LMNB1: a, c
SEPP1: c, f
KRT15: a, c
MGC2574: a, c
NBL1: a, b
FEN1: a, c
LGALS3: a, c
ACTN1: a, c
HBP1: a, b
PCNA: a, b
GABARAPL2: a, c
BYSL: a, b
VARS: a, g
Comparing the two gene lists also shows that about 13–16% of genes/probe sets are identified as differentially expressed with only one of the two definitions, either the original or the new. The majority of the genes missed by either definition have p-values and fold-changes close to the thresholds chosen above (data not shown), suggesting that the results using the two probe set definitions are largely similar. However, in a few cases, the higher resolution of the new probe set definition uncovers new transcript-level changes. One example is shown in Figure 1, where the probe-level signals were plotted (bottom panel). The original Affymetrix probe set "34666_at" is not considered to be differentially expressed at the 5% significance level (FDR adjusted p-value = 0.44). However, in our new AceView-based definition, the 16 probes in this probe set (shown in Figure 1) were divided into 3 new probe sets. One probe set (b0805_9681), which maps to transcript variants b, c, and i of SOD2, appears to be significantly downregulated (FDR adjusted p-value = 0.04, fold-change = 1.9); the other two new probe sets map to variants b and c (b0805_616) or to b only (b0805_11137), and neither is significantly changed (data not shown). From these results, we can infer that variant i might be significantly differentially expressed, since it is interrogated only by b0805_9681. Another example is the Affymetrix probe set "37513_at" representing Stearoyl-CoA desaturase (SCD). This gene was found to be differentially expressed only by using the new definition (FDR adjusted p-value = 0.02, fold-change = 2.3), and this change was validated by real-time PCR (2.5-fold, p < 0.05).
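The inference about SOD2 variant i rests on simple set reasoning: a transcript is "uniquely interrogated" by a probe set if no other probe set targets it. A minimal Python sketch of that reasoning follows; the function name is hypothetical, and the short variant labels b, c and i stand in for the AceView identifiers.

```python
def uniquely_interrogated(probe_set_targets):
    """For each probe set, return the transcripts targeted by no other probe set."""
    unique = {}
    for name, targets in probe_set_targets.items():
        # Union of the targets of every other probe set.
        others = set().union(
            *(t for n, t in probe_set_targets.items() if n != name)
        )
        unique[name] = set(targets) - others
    return unique

# Targets of the three redefined SOD2 probe sets, as given in the text.
sod2_targets = {
    "b0805_9681": {"b", "c", "i"},
    "b0805_616": {"b", "c"},
    "b0805_11137": {"b"},
}
unique_targets = uniquely_interrogated(sod2_targets)
print(unique_targets["b0805_9681"])
```

Only b0805_9681 has a transcript of its own, so a change seen in that probe set but not in the other two points to variant i. This is a suggestive inference rather than a proof, as the hedged wording in the text indicates.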
Pearson's correlations between Affymetrix and Codelink data
Probe sets with size ≥ 4
Probe sets with size < 4
In this report we present a new approach to integrating an up-to-date probe annotation into routine Affymetrix array analysis. Although the Affymetrix GeneChip arrays are not particularly designed to detect alternative transcripts, with careful transcript-level annotation we have demonstrated that specificity can be achieved by using the new probe set definition. One of the advantages of using the newly redefined probe sets is that it allows the examination of gene expression in-depth at the transcript level, providing a level of clarity in data interpretation unavailable at the gene level or even at the RefSeq transcript level. With the total number of AceView transcripts at 243,707 compared with 39,115 in RefSeq, probes from all chips examined matched approximately four times the number of transcripts in AceView relative to ones annotated in RefSeq. In addition, ~80% of all U133 Plus 2.0 array probes matched AceView transcripts, which was ~50% more than the number that matched to RefSeq. Such a detailed view is necessary if one needs to design primers or probes for quantitative-PCR verification. Moreover, our method naturally separates the ambiguous and cross-hybridizing probes and automatically groups gene specific probes.
Although our approach to grouping probes into probe sets is independent of the particular transcript database being used, we consider AceView to be the most comprehensive and accurate database publicly available for conducting such transcript-level reannotation of probes. In comparison to RefSeq, which is a highly curated yet incomplete mapping of the transcriptome, AceView annotations identify, on average, 5.0 transcripts per gene, greatly exceeding RefSeq's 1.3 per gene. Furthermore, in annotating the ENCODE region, the quality of AceView transcript annotation has been shown to be comparable with the gold-standard manual Havana annotation. When overall depth and quality are considered, among the 16 annotation approaches compared, AceView is "by far the closest match" to the painstaking manual transcript annotation.
As a result of maintaining homogeneous probe sets and excluding ambiguous and cross-hybridizing probes, this new redefinition often results in small probe sets (i.e. having fewer than 4 probes). Using a random sampling of probes from the original Affymetrix probe sets, we demonstrate that, without considering the annotation issue, at least 4 probes may be required for deriving reliable expression measurements. Across all the arrays studied, these adequately sized probe sets comprise 58% of all new probe sets. Our observation that probe sets with fewer than 4 probes yield poor data may arise from a number of factors. Non-functioning probes may exist for certain probe sets: for instance, on the U95A chip, a number of probe pairs for probe sets 407_at and 36889_at were found to perform poorly. Deviation of probe length on the array from the designed 25-mer, due to synthesis inefficiency, may also contribute to both variability and poor probe performance, including array-to-array variation. Non-functioning probes of this second kind are particularly difficult to trace, and the problem is probably only circumvented by integrating data from multiple probes.
A recent paper by Dai et al. provided a method for redefining Affymetrix probe sets using several gene and transcript databases. Their regrouping strategy is fundamentally different from the current method: all probes that match a single transcript or gene are simply grouped into a probe set. As a result, their method does not generate "transcript-specific" probe sets for genes with multiple transcripts, and does not eliminate probe sets with multiple targets. Hence, some probes within a newly regrouped probe set may actually cross-hybridize to a different transcript. Figure 1 provides an example. According to their method, transcript b (NM_000636) would utilize all probes from the original Affymetrix probe set. With our redefinition, only the last 3 probes (b0805_11137) are specific for this transcript. Furthermore, with their method, transcript c of SOD2 would be represented by merging our newly redefined probe sets b0805_9681 and b0805_616. It is clear that the probes from these different probe sets show markedly different gene expression profiles. Thus, we expect that the specificity and homogeneity within our probe sets will result in more accurate gene expression measurements, as has recently been suggested. To demonstrate, using the RefSeq-based remapping of Dai et al., there were clear differences in the relative gene expression changes obtained, examples of which are presented in Supplemental Figures 1 (SOD2) and 2 (TXNL4A) [See Additional file 1]. However, while these examples demonstrate differences in individual results, they did not translate into global improvements in the cross-platform correlation using our current method over RefSeq-mapped probe sets. A possible probe selection bias towards abundant transcripts through the use of RefSeq-based probe sets may account for this lack of difference.
The quality of the new probe set definition depends on a number of factors. It is notable that 2–4% of probes on the human arrays studied are ambiguous (i.e. they align to multiple genes), and the resulting probe sets should be used with caution. The gene(s) targeted by each new probe set are made available in the annotation files downloadable from . In addition, because of the relative lack of information on poly-A sites, it should be stressed that the current probe sets may not accurately reflect the regulation that occurs at the level of alternative poly-adenylation. For instance, regrouping of probes derived from more than one Affymetrix probe set may have resulted from poly-A sites currently unannotated in AceView. Conversely, there may have been some probe sets which are split by the presence of partial cDNAs in AceView that do not clearly define a poly-A site. As greater sequence coverage and refinement of the human genome become available, a strategy such as described here would permit continuous updating and refinement of probe sets, and better interpretation of results, based on the latest knowledge . While we used AceView for redefining probe sets, the method of regrouping probes can be applied using any public or "in-house" database, and the guidelines provided here for creating a viable "probe set" should be generally applicable. This method is also particularly relevant with the recently developed exon arrays which have genome-wide probe content specific to individual exons, observed or predicted. A method to estimate quantitative expression data at the gene-level is suggested in . This approach employs a variety of annotations for grouping probes into sets, followed by summarization with the PLIER algorithm  or a derivative of it. However, we note that while transcript level annotations can be derived from naturally homogeneous exon-level probe sets, preliminary examination indicates that not all probe sets are actually homogeneous. 
Exon array probes are based on probe selection regions, or PSR, which are built around "exon clusters" or overlapping exons that may or may not share similar splice sites . Hence exon arrays, while providing a significant improvement over 3' expression arrays towards transcript specificity, may continue to heterogeneously target multiple transcript variants. Since an array design of 4 probes per single exon minimally satisfies the requirements for a summarized expression value, splitting these into smaller sets might further degrade the accuracy of these probe sets. With the rising number of alternative variants annotated in AceView and elsewhere, transcript-specific arrays would require much higher densities to achieve even greater resolution while maintaining an adequate number of probes from which to extract accurate expression data. As such, probes on whole genome tiling arrays designed for transcript mapping could be grouped de novo based on AceView transcripts and are a viable platform for this strategy.
In conclusion, our transcript-level reannotation and redefinition of probe sets complement the original Affymetrix design. Redefinitions introduce probe sets whose sizes may not support reliable statistical summarization; therefore, we advocate using our transcript-level mapping redefinition in a secondary analysis step rather than as a replacement. Knowing which specific transcripts are differentially expressed is important to properly design probe/primer pairs for validation purposes. The custom chip-description-files (CDFs) and annotation files for our new probe set definitions  are compatible with Bioconductor, with Affymetrix's Expression Console or third party software.
We regrouped probes into probe sets based on AceView, a comprehensive human transcript annotation database . The AceView transcripts are reconstructed from mRNAs in three databases: GenBank, dbEST and RefSeq; therefore, AceView shows a broader coverage and identifies many more transcript variants than RefSeq alone . Affymetrix probe sequences for the various types of GeneChips were downloaded from . Each probe sequence was then matched against transcripts in AceView (Release August 2005; human 35.4/hg17; non-cloud genes). Here we named a probe by its Affymetrix probe set identifier and the interrogation position (seen in downloaded probe sequence files) separated by '-'. A probe is considered to match a transcript if the probe shares 22 or more contiguous base pairs (bps) with that transcript sequence. The length cutoff of 22 was chosen based on our empirical observation that in the Affymetrix U95A spike-in dataset (available at ), probes matching 22 bases of a transcript are capable of detecting 2-fold differences (data not shown). Through this mapping procedure, we constructed a hash table where the keys and values are probe sequences and sets of AceView transcript identifiers, respectively. Next, probes are grouped into a probe set if they all match exactly the same set of transcripts (as shown in Figure 1). If a probe does not share transcript mapping with any other probe, it is assigned as an independent probe set. The naming of newly defined probe sets is somewhat arbitrary. A set of tab-delimited files containing the annotation of newly defined probe sets, including probe set names, the original Affymetrix probe set definition, gene symbol(s) and description, are available for download at . 
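The matching rule and the hash table described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: sequences are compared on a single strand only, the transcript names (`GENE.a` and so on) and sequences are invented, and the `N` characters in the toy probes are placeholder mismatches.

```python
MIN_MATCH = 22  # contiguous-base cutoff stated in the text

def matches(probe, transcript, k=MIN_MATCH):
    """True if probe and transcript share at least k contiguous bases."""
    # Sharing >= k contiguous bases is equivalent to some k-mer of the
    # probe occurring as a substring of the transcript.
    return any(probe[i:i + k] in transcript for i in range(len(probe) - k + 1))

def build_probe_table(probes, transcripts, k=MIN_MATCH):
    """Hash table: probe sequence -> set of matched transcript identifiers."""
    return {
        seq: frozenset(tid for tid, tseq in transcripts.items()
                       if matches(seq, tseq, k))
        for seq in probes
    }

# Invented sequences: three "variants" sharing a 5' end of different lengths.
full = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTTGCA"
transcripts = {"GENE.a": full, "GENE.b": full[:35], "GENE.c": full[:24]}

probe_exact = full[5:30]            # 25 of 25 bases match
probe_22 = full[5:27] + "NNN"       # exactly 22 contiguous bases match
probe_21 = full[5:26] + "NNNN"      # only 21 contiguous bases match

table = build_probe_table([probe_exact, probe_22, probe_21], transcripts)
for seq, tids in table.items():
    print(seq, sorted(tids))
```

Probes whose table values are identical would then be grouped into one probe set, and a probe with an empty set matches no transcript under this cutoff.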
The chip description files (CDFs) required for mapping the probe positions on the chips to the sequence annotation were made using the R package "altcdfenvs" [22, 36]. These CDF files and the corresponding CDF packages are compatible with other Bioconductor packages, such as "affy", which can be used to derive expression summary values for the newly defined probe sets; they are available for download as well. In addition, custom CDF files that are compatible with third party software are also available for download.
To evaluate how many probes in a probe set are required to derive a robust expression measurement, we ran a simulation to test the accuracy and consistency of a standard data set where probe sets are redefined based on having different sizes (i.e. having different numbers of probes). To do this, all probe sets for the U133A Genechip were artificially redefined by size (denoted as d1, d2, ..., d10), by randomly sampling various numbers of probes from the original probe sets. For example, the original Affymetrix probe set on GeneChip U133A has 11 probes; however, in our artificial probe set definition, say d2, each probe set only contains 2 probes which are randomly drawn from the corresponding original Affymetrix probe sets. Next, using the R package altcdfenvs, we built 10 chip design files (CDF), with each corresponding to a probe set size-based definition.
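The construction of the size-based definitions d1 to d10 can be sketched in Python as below. The authors built their CDFs with the R package altcdfenvs; this sketch only illustrates the random sampling step, and the probe set names, probe identifiers, function name and seeding scheme are all hypothetical.

```python
import random

def subsample_probe_sets(probe_sets, size, seed=0):
    """Draw `size` probes at random, without replacement, from each probe set."""
    rng = random.Random(seed)  # fixed seed so the definition is reproducible
    return {
        name: sorted(rng.sample(probes, size))
        for name, probes in probe_sets.items()
        if len(probes) >= size  # skip sets too small to sample from
    }

# Two placeholder U133A-style probe sets with 11 probes each.
original = {
    "setA_at": [f"setA_at-{i}" for i in range(1, 12)],
    "setB_at": [f"setB_at-{i}" for i in range(1, 12)],
}

# One artificial definition per size: d1, d2, ..., d10.
definitions = {d: subsample_probe_sets(original, d, seed=d) for d in range(1, 11)}
print(definitions[2])
```

Each definition keeps the original probe set names, so summaries from different sizes can be compared set by set.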
Using these CDF files and the standard RMA summarization approach, we generated 10 artificial data sets from the original U133A Spike-In data sets downloaded from the Affymetrix website. The "affy" package in Bioconductor was used to read and process the raw .cel files. To make comparisons consistent, the array preprocessing was the same for all simulated data sets, using the CDF file from Affymetrix (the default in affy). We chose RMA background correction and quantile normalization. The normalized probe-level expression data were then used to derive gene expression summary values. Since a set of standard evaluation tools is available in Bioconductor's "affycomp" package [9, 11] for generating a series of comparison plots and summarization tables, we used it to compare the gene expression summaries derived from different-sized probe sets. The 10 sets of expression measurements from the simulation study are available for download at our website.
The expression data for Affymetrix and Codelink were obtained as described in . In the cross-platform comparison, we compared RNAs from 6 samples: 3 technical replicates from PANC-1 cells grown in serum-rich medium (the control group) and 3 replicates from cells one day after the serum was removed (the treatment group). Identical RNA samples were applied to the Affymetrix U95Av.2 arrays and the Codelink UniSet Human I Bioarrays from Amersham (30 mer oligonucleotide probes). The raw expression data from both platforms are available at . For the Affymetrix platform, data were pre-processed and normalized using the RMA method available in the bioconductor "affy" package . For the Codelink data we used quantile normalization as used in RMA, and only probes with measurements labeled as "Good" across all 6 samples were included in our analysis. Next, the three individual log2 ratios of expression values for the treatment versus control samples were calculated, where the pairing of a sample in the control group with a sample in the treatment group is arbitrary. These log-ratios were used as recommended  to calculate and compare the Pearson's correlations for data from the two platforms. The probe identifiers from the two platforms were cross-mapped by two methods: the UniGene IDs  and the AceView transcripts. First, RESOURCERER  (version July 2005) was used to carry out the UniGene-based mapping between the Codelink identifiers and the original Affymetrix probe sets. The AceView-based mapping is straightforward: a Codelink probe is considered matching a newly defined Affymetrix probe set if both are mapped to the same set of AceView transcripts.
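The AceView-based cross-mapping and the correlation of log-ratios can be sketched in Python as follows. The numbers are invented for illustration, the transcript names are placeholders, and the sketch assumes expression values are already on the log2 scale (as RMA output is), so a log2 ratio is a simple difference.

```python
import math

def log2_ratios(control, treatment):
    """Per-feature log2 ratio, assuming inputs are already log2 expression values."""
    return {k: treatment[k] - control[k] for k in control.keys() & treatment.keys()}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Features on both platforms are keyed by the set of transcripts they map to.
A, B, C, D = (frozenset({"G1.a"}), frozenset({"G2.a", "G2.b"}),
              frozenset({"G3.c"}), frozenset({"G4.a"}))

affy_ratio = log2_ratios({A: 8.0, B: 9.0, C: 7.0},
                         {A: 9.0, B: 8.5, C: 9.0})
codelink_ratio = log2_ratios({A: 5.0, B: 6.0, C: 4.0, D: 3.0},
                             {A: 5.5, B: 5.75, C: 5.0, D: 6.0})

# A Codelink probe matches a probe set when both map to the same transcript set.
shared = sorted(affy_ratio.keys() & codelink_ratio.keys(), key=sorted)
r = pearson([affy_ratio[k] for k in shared], [codelink_ratio[k] for k in shared])
print(len(shared), round(r, 6))
```

In this toy example the Codelink ratios are exactly half the Affymetrix ratios, so the correlation is 1 up to floating-point error; feature D is ignored because it has no Affymetrix counterpart.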
We conducted a comparison between two groups (cells with serum versus cells one day after serum removal) using our newly defined probe sets and the original Affymetrix probe sets. For simplicity we averaged data from 3 technical replicates into one biological replicate in each group (so each group contains three biological replicates). For the new probe set definition, probe sets with 3 or fewer probes were excluded from the analysis. The empirical Bayes method was applied to calculate t-statistics and p-values, and the p-values were further adjusted by the False Discovery Rate (FDR) approach using the "p.adjust" function in the "limma" package.
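The selection rule used in this analysis (FDR adjusted p-value below 0.05 and fold-change above 1.7 in either direction, as stated earlier in the article) can be sketched in Python. The text does not name the FDR procedure, so the Benjamini-Hochberg step-up adjustment is an assumption here; the p-values and log2 fold-changes are invented, and the empirical Bayes t-statistics themselves are not reproduced.

```python
import math

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

def select_differential(pvalues, log2_fc, fdr=0.05, fold=1.7):
    """Indices passing both the FDR and the fold-change thresholds."""
    adj = bh_adjust(pvalues)
    cutoff = math.log2(fold)  # fold-change threshold on the log2 scale
    return [i for i in range(len(pvalues))
            if adj[i] < fdr and abs(log2_fc[i]) > cutoff]

pvals = [0.01, 0.04, 0.03, 0.5]      # invented raw p-values
log2_fc = [-0.93, 1.2, 0.1, 2.0]     # invented log2 fold-changes
adjusted = bh_adjust(pvals)
selected = select_differential(pvals, log2_fc)
print([round(a, 4) for a in adjusted], selected)
```

Taking the absolute log2 fold-change treats up- and down-regulation symmetrically, matching the "up or down-regulated" wording of the threshold.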
The authors would like to acknowledge Danielle and Jean Thierry-Mieg for valuable discussions and input regarding the AceView and RefSeq database probe mappings and the writing of the manuscript, and for providing figure 1. The authors would like to acknowledge David Wheeler for his help in creating the cdf packages, and Mark Reimers for his scientific and editorial input. This research was supported by the Intramural Research Program of the NIH, NIDDK.
Disclaimer: Certain commercial equipment, instruments, or materials are identified in this document. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the products identified are necessarily the best available for the purpose.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.