Avoiding inconsistencies over time and tracking difficulties in Applied Biosystems AB1700™/Panther™ probe-to-gene annotations
© Noth et al; licensee BioMed Central Ltd. 2005
Received: 26 July 2005
Accepted: 22 December 2005
Published: 22 December 2005
Significant inconsistencies in probe-to-gene annotations between different releases of probe set identifiers by commercial microarray platform solutions have been reported. Such inconsistencies lead to misleading or ambiguous interpretation of published gene expression results.
We report here similar inconsistencies in the probe-to-gene annotation of Applied Biosystems AB1700 data, demonstrating that this is not an isolated concern. Moreover, the online information source PANTHER does not provide the information required to track such inconsistencies. Hence, even correctly annotated datasets, when resubmitted after PANTHER has been updated to a new probe-to-gene annotation release, will generate differing results without any feedback on the origin of the change.
The importance of unequivocal annotation of microarray experiments cannot be overestimated. Inconsistencies greatly diminish the usefulness of the technology. Novel methods in the analysis of transcriptome profiles often rely on large disparate datasets stemming from multiple sources. The predictive and analytic power of such approaches rapidly diminishes if only least-common subsets can be used for analysis. We present here the information that needs to be provided together with the raw AB1700 data, and the information required together with the biological interpretation of such data, to avoid inconsistencies and tracking difficulties.
Studying the cellular transcriptome and its dynamics using microarray technology has become a commonplace application in modern biomedical research. Dedicated databases [e.g., 2–4] store several hundred individual microarray datasets and are growing exponentially. Many different commercial and research-originated microarray formats and platforms are in use [2–4]. Since microarray technology currently cannot be used to determine absolute expression levels of genes, comparative analysis of transcriptome data across different biological conditions is challenging. Cross-platform comparisons can only be carried out if coherent mapping of genes between the platforms and their particular probe-to-gene and gene-to-genome annotations can be achieved. The public microarray databases are very valuable because they impose standard formatting and annotation procedures, thereby rendering the individual microarray experiments useful beyond the immediate purpose for which they were acquired [4, 5].
Since both the genome sequences and the individual gene annotations are subject to constant discovery-driven change, probe-to-gene annotations undergo frequent revisions. Such revisions of the initial probe-to-gene mapping lead to inconsistencies if they are not well controlled. Today, any dataset needs to be published together with the probe-to-gene annotation used for biological analysis, or with an explicit reference to a static, openly accessible source of it. Unfortunately, this is not yet standard practice, and we join the call for greater awareness of this issue.
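How a revision changes the mapping can be made explicit by comparing two annotation releases directly. The following is a minimal sketch, assuming each release has already been loaded as a dictionary from probe ID to gene ID; the function name and all identifiers are invented for illustration and do not reflect the actual AB1700 file layout.

```python
def diff_annotations(old, new):
    """Compare two probe-to-gene mappings (dict: probe ID -> gene ID)."""
    shared = old.keys() & new.keys()
    # Probes present in both releases whose gene assignment differs.
    changed = {p: (old[p], new[p]) for p in shared if old[p] != new[p]}
    # Probes that exist in only one of the two releases.
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    return {"changed": changed, "removed": removed, "added": added}


# Invented identifiers, for illustration only.
release_1 = {"probe_A": "gene_1", "probe_B": "gene_2", "probe_C": "gene_3"}
release_2 = {"probe_A": "gene_1", "probe_B": "gene_9", "probe_D": "gene_4"}

report = diff_annotations(release_1, release_2)
print(report)
```

A report of this kind, stored next to the dataset, documents which probes would be interpreted differently under the newer release.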
We show here that for Applied Biosystems AB1700 Genome Survey Arrays the same concern about potential inconsistencies in the probe-to-gene annotations applies unless several precautions are taken.
The probe sets as well as the probe-to-gene annotations (PGAs) for AB1700 technology are revised regularly. In order to assure correct and unambiguous interpretation of AB1700 data, we draw particular attention to the following aspects:
PGA revisions, unless tracked and conserved by the user, lead to inconsistencies and ambiguities in the interpretation of AB1700 data
Significant changes between probe-to-gene releases and microarray versions are also observed in Applied Biosystems AB1700 data. The conclusions drawn from the very same experimental dataset will differ depending on the particular PGA version used. In order to retain transparency, the microarray data thus need to be annotated with the PGA version used for interpretation. Currently, AB1700 PGA files are not date-stamped and are simply replaced by revised versions in the AB Gene Expression System Software once those become available. The AB1700 user therefore has to keep copies of earlier versions and a mapping of datasets to PGA files. This problem has been previously discussed for an analogous case.
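One way a user might keep such copies is sketched below: the PGA file is copied into an archive under a name that carries the date and a checksum, and a small manifest records which datasets were interpreted with which copy. This is an illustrative sketch only; the function name, the manifest layout, and the file names are assumptions and are not part of any Applied Biosystems software.

```python
import hashlib
import json
import shutil
import tempfile
from datetime import date
from pathlib import Path


def archive_pga(pga_file, archive_dir, dataset_ids, manifest_name="manifest.json"):
    """Archive a date- and checksum-stamped copy of a PGA file and record
    which datasets were interpreted with it."""
    pga_file = Path(pga_file)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # The checksum identifies the file content even if the name never changes.
    digest = hashlib.sha256(pga_file.read_bytes()).hexdigest()
    stamped = archive_dir / f"{date.today().isoformat()}_{digest[:12]}_{pga_file.name}"
    if not stamped.exists():
        shutil.copy2(pga_file, stamped)

    # The manifest maps dataset ID -> archived PGA copy (hypothetical layout).
    manifest_path = archive_dir / manifest_name
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    for dataset_id in dataset_ids:
        manifest[dataset_id] = {"pga_copy": stamped.name, "sha256": digest}
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest


# Demonstration with a throwaway file; real paths would point at the user's PGA file.
workdir = Path(tempfile.mkdtemp())
pga_path = workdir / "pga_export.txt"  # hypothetical file name
pga_path.write_text("probe_A\tgene_1\n")
manifest = archive_pga(pga_path, workdir / "pga_archive", ["dataset_01"])
print(manifest)
```

Because the checksum is stored with each dataset entry, a later reader can verify that the archived copy is the one actually used, independently of file names or dates.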
AB1700 PGAs are not publicly available, and the PANTHER web-source cannot be used for third party verifications
Third parties have no direct access to the AB1700 PGA files. Independent verification of published results could thus only be achieved using the Applied Biosystems PANTHER web-source, which is similar in function to NetAffx [13–16]. Here the user can upload, temporarily store, and analyze datasets containing gene or probe IDs and associated signal measurements. PANTHER's use of Hidden Markov Models (HMMs) and protein-family trees provides significant insights into the nature of the biological problem studied, and it is frequently used by researchers.
PANTHER internally operates on a continuously updated probe-to-gene annotation table, whereas the PGA releases are communicated discontinuously. Today there is no way of tracking or recording the continuous updates. PANTHER also contains only information on protein coding genes. The completeness of the internal PGA is further compromised by the exclusion of GenBank-only mRNAs, as they usually do not contain associated protein information. Consequently, published lists of gene or probe IDs with associated PANTHER pathway analyses are ambiguous and potentially incomplete, and the interpretation is not necessarily verifiable by a third party.
Several probes for a single gene
Probe design for the AB1700 has been guided by the idea of having a single probe target all isoforms of a given gene. Due to alternative splicing, alternative promoter usage, and probably also annotation errors, this is not possible for about 13% of all genes represented on HGS V2.0, which are consequently quantified using up to ten different probes on the array [9–11]. The probe signal intensities are determined individually and need to be kept separate during primary analysis of the data. This is a challenge for automated data analysis, as the existence of multiple probes for a single gene is not evident from the gene ID or probe ID alone. For every new PGA release, a look-up table needs to be compiled in order to establish proper mapping between any given gene ID and its single or multiple probes [see Additional files 1 and 2]. Finally, the logic used to integrate the different signals for a single gene needs to be communicated in order to achieve transparency for third parties.
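The look-up table described above could be compiled along the following lines. This is a minimal sketch, assuming the PGA has been reduced to (probe ID, gene ID) pairs; the function name and the identifiers are invented and do not correspond to real AB1700 entries.

```python
from collections import defaultdict


def build_gene_to_probes(pga_rows):
    """Map each gene ID to the sorted list of probe IDs that target it."""
    lookup = defaultdict(set)
    for probe_id, gene_id in pga_rows:
        lookup[gene_id].add(probe_id)
    return {gene: sorted(probes) for gene, probes in lookup.items()}


# Invented (probe ID, gene ID) pairs, for illustration only.
pga_rows = [
    ("probe_A", "gene_1"),
    ("probe_B", "gene_2"),
    ("probe_C", "gene_2"),
]

gene_to_probes = build_gene_to_probes(pga_rows)
# Genes measured by more than one probe need an explicit integration rule.
multi_probe_genes = {g: p for g, p in gene_to_probes.items() if len(p) > 1}
print(multi_probe_genes)
```

The table has to be rebuilt for each PGA release, since the set of multi-probe genes may change between releases.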
Several gene IDs for a single probe
Many spotted probes are also not mono-specific for a single gene. In cases of significant cross-reactivity, e.g. with closely related members of a gene family, the PGA tables list not only a primary gene ID but also, as a separate entry, a list of alternate gene IDs with which the probe cross-hybridizes. The existence of multiple transcripts hybridizing to a single probe is not systematic, and it is not apparent from the probe ID nomenclature. Hence a dedicated look-up table needs to be generated.
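Such a dedicated look-up table might be generated as sketched here. The sketch assumes that each PGA row provides a probe ID, a primary gene ID, and a delimiter-separated field of alternate gene IDs; the delimiter, the function names, and the identifiers are hypothetical, since the actual PGA file layout is not described here.

```python
def build_probe_to_genes(pga_rows, sep=";"):
    """Map each probe ID to its primary gene ID and its alternate gene IDs."""
    lookup = {}
    for probe_id, primary_gene, alternates in pga_rows:
        # Split the alternates field and drop empty entries.
        alt_ids = [a.strip() for a in alternates.split(sep) if a.strip()]
        lookup[probe_id] = {"primary": primary_gene, "alternates": alt_ids}
    return lookup


def cross_reactive_probes(lookup):
    """Return the probe IDs that list at least one alternate gene ID."""
    return sorted(p for p, entry in lookup.items() if entry["alternates"])


# Invented rows: (probe ID, primary gene ID, alternate gene IDs).
pga_rows = [
    ("probe_A", "gene_1", ""),
    ("probe_B", "gene_2", "gene_3; gene_4"),
]

probe_to_genes = build_probe_to_genes(pga_rows)
print(cross_reactive_probes(probe_to_genes))
```

Flagging cross-reactive probes in this way lets their signals be reported with all candidate gene IDs instead of only the primary one.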
Obsolete gene IDs and nomenclature suffixes
PANTHER annotation versus disseminated annotation releases 1 & 2. The data from the Supplementary Files 1 & 2 were submitted to the PANTHER website either using the probe IDs or the gene IDs. The differences in identification, display, and absentee calls, between the datasets and the annotation releases are due to the fact that PANTHER is continuously updated, only considers protein coding genes, excludes GenBank-only annotated mRNAs, and retains only the gene ID with the highest suffix.
Columns: Probe ID/Gene ID set; number of unique entries; found by PANTHER; displayed by PANTHER; not found by PANTHER; not accounted for.
Probe/Gene IDs for MAPKK3. The different probe and gene IDs corresponding to the two annotation releases (V1 and V2) were submitted to PANTHER with the listed results. For a single gene, this table summarizes the differences and potential ambiguities that arise when gene IDs rather than probe IDs are submitted, and illustrates the effect of curation status (obsolete vs. valid) and of annotation release. Note that >70% of all gene IDs in the current HGS V2 PGA table carry suffixes, that ~13% of all represented genes have more than one associated probe ID, and that probes often have secondary hits. All of these require the user to generate different look-up tables in order to achieve coherency and transparency in the data analysis process.
if Probe IDs (V1) are submitted:
if the Gene IDs (V1) are submitted:
Probe ID (V1)
Gene ID (V1)
PANTHER Gene ID
PANTHER Gene ID
2 × 10E-23
2 × 10E-23
6 × 10E-80
3 × 10E-70
6 × 10E-58
if Probe IDs (V2) are submitted:
if the Gene IDs (V2) are submitted:
Probe ID (V2)
Gene ID (V2)
PANTHER Gene ID
PANTHER Gene ID
2 × 10E-23
2 × 10E-23
6 × 10E-80
3 × 10E-70
6 × 10E-58
The importance of unequivocal annotation of microarray experiments is evident. The analytical power of novel technologies such as the AB1700 platform from Applied Biosystems could certainly be curtailed by incorrect annotation and by failure to statically associate the correct annotations with the data.
In order to maintain a maximum of transparency and consistency, we conclude that such microarray data can only be analyzed based on probe IDs rather than gene IDs, and need to be supplemented with the following information for purposes of publication and reanalysis: (i) the microarray version, (ii) the probe-to-gene annotation release used, (iii) how multiple probes for a single gene have been handled, and (iv) how probe cross-reactivity has been considered. If PANTHER was used for analysis: (v) which probe and gene correspond to the HMM scores, and finally (vi) a submission record-file generated by PANTHER at the same time the statistical and biological analysis was performed.
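Items (i) to (vi) can be captured as a small structured record that accompanies a dataset. The following is a sketch under stated assumptions: the class, its field names, and the example values other than "HGS V2.0" are invented for illustration and do not correspond to any existing submission format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotationRecord:
    microarray_version: str          # (i)
    pga_release: str                 # (ii)
    multi_probe_handling: str        # (iii)
    cross_reactivity_handling: str   # (iv)
    panther_used: bool = False
    hmm_score_ids: Optional[str] = None              # (v) probe/gene behind each HMM score
    panther_submission_record: Optional[str] = None  # (vi) e.g. path to the record file

    def missing_items(self):
        """List the required fields that are still empty."""
        required = ["microarray_version", "pga_release",
                    "multi_probe_handling", "cross_reactivity_handling"]
        # Items (v) and (vi) only apply if PANTHER was used for the analysis.
        if self.panther_used:
            required += ["hmm_score_ids", "panther_submission_record"]
        return [name for name in required if not getattr(self, name)]


# Example values are placeholders, not a real submission.
record = AnnotationRecord(
    microarray_version="HGS V2.0",
    pga_release="release 2",
    multi_probe_handling="probe signals kept separate",
    cross_reactivity_handling="alternate gene IDs reported",
    panther_used=True,
)
print(record.missing_items())
```

A check of this kind before publication would flag a PANTHER-based analysis that lacks the probe and gene assignment of the HMM scores or the submission record.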
The authors are grateful to the Applied Biosystems team for useful insights into the annotation and curation process of AB1700 data, and to the unknown referees for very helpful comments. We thank all members of the systems epigenomics group for stimulating discussions. This work was supported by the European Hematology Association – José Carreras Foundation, the Région Nord, the Institut des Hautes Etudes Scientifiques, the Institut de Recherches Interdisciplinaires, the Centre National de la Recherche Scientifique (CNRS), the Institut National de la Santé Et de la Recherche Médicale (INSERM), and the French Ministry of Research through the "Complexité du Vivant – Action STICS-Santé" program (all to A.B.).
- Stoughton RB: Applications of DNA microarrays in biology. Annu Rev Biochem 2005, 74: 53–82. 10.1146/annurev.biochem.74.082803.133212
- Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R: NCBI GEO: mining millions of expression profiles – database and tools. Nucleic Acids Res 2005, (33 Database):D562–6.
- Ball CA, Awad IA, Demeter J, Gollub J, Hebert JM, Hernandez-Boussard T, Jin H, Matese JC, Nitzberg M, Wymore F, Zachariah ZK, Brown PO, Sherlock G: The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res 2005, (33 Database):D580–2.
- Koehler J, Rawlings C, Verrier P, Mitchell R, Skusa A, Ruegg A, Philippi S: Linking experimental results, biological networks and sequence analysis methods using Ontologies and Generalised Data Structures. In Silico Biol 2005, 5: 33–44.
- Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, Martinez-Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott A, Wilson M, Yang Y, Ye SQ, Yu W: Multiple-laboratory comparison of microarray platforms. Nat Methods 2005, 2: 345–50. Erratum in: Nat Methods. 2005, 2:477 10.1038/nmeth756
- Perez-Iratxeta CP, Andrade MA: Inconsistencies over time in 5% of NetAffx probe-to-gene annotations. BMC Bioinformatics 2005, 6: 183. 10.1186/1471-2105-6-183
- AB1700 platform, Applied Biosystems [http://www.appliedbiosystems.com/catalog/myab/StoreCatalog/products/CategoryDetails.jsp?hierarchyID=101&category3rd=112395&trail=no]
- Stefano GB, Burrill JD, Labur S, Blake J, Cadet P: Regulation of various genes in human leukocytes acutely exposed to morphine: expression microarray analysis. Med Sci Monit 2005, 11: MS35–42.
- Human Genome Survey Array 2.0, Applied Biosystems [http://docs.appliedbiosystems.com/pebiodocs/00114084.pdf]
- Mouse Genome Survey Array 1.0, Applied Biosystems [http://docs.appliedbiosystems.com/pebiodocs/00113400.pdf]
- Rat Genome Survey Array 1.0, Applied Biosystems [http://docs.appliedbiosystems.com/pebiodocs/00114006.pdf]
- Expression Array System Software V1.1.1. (current version 07/2005), Applied Biosystems: ProdNo: 4366478
- Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: A library of protein families and subfamilies indexed by function. Genome Res 2003, 13: 2129–41. 10.1101/gr.772403
- Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-Lazareva B, Muruganujan A, Rabkin S, Vandergriff JA, Doremieux O: PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nuc Acids Res 2003, 31: 334–341. Corrigendum: Nuc Acids Res 2003, 31: 2024 10.1093/nar/gkg115
- Mi H, Ulitsky-Lazareva B, Loo R, Kejariwal A, Vandergriff JA, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, Thomas PD: The PANTHER database of protein families, subfamilies, functions and pathways. Nuc Acids Res 2005, 33: D284–288. 10.1093/nar/gki078
- The PANTHER website, Applied Biosystems [https://panther.appliedbiosystems.com/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.