Inconsistencies over time in 5% of NetAffx probe-to-gene annotations
© Perez-Iratxeta and Andrade; licensee BioMed Central Ltd. 2005
Received: 30 March 2005
Accepted: 20 July 2005
Published: 20 July 2005
DNA microarray probes are designed to match particular mRNA transcripts, often on the basis of expressed sequences such as ESTs or cDNAs, which are frequently incomplete. As a result, the relations between probes and genes can change as the sequence data are updated. However, the results of microarray analyses are frequently reported just as lists of genes, without any reference to the underlying probes.
We show for a particular commercial microarray design that the number of probes associated with some genes changes over time. These changes affect approximately 5% of the probe sets across the history of annotation releases over a two-year span.
We recommend reporting probe set identifiers when publishing microarray results, and submitting those analyses to public microarray databases, to ensure that the interpretation of the data is updated with the latest set of annotations.
[Table: Example of a pair of probe sets inconsistently annotated. Only the column header "probe set id" survives; the table body is missing.]
Affymetrix DNA microarrays include probes for the detection of target sequences that are mainly based on UniGene clusters. UniGene is a database of gene-oriented clusters of GenBank sequences in which, in addition to the sequences of well-characterized genes, hundreds of thousands of novel expressed sequence tags (ESTs) have been included. Affymetrix probe sets are annotated according to their current related records in UniGene and LocusLink, including genomic location, gene symbol, and function description, when available (NetAffx database).
We obtained all 8 NetAffx releases for the MOE430A/B microarray, dated from 17 March 2003 to 6 June 2004 (kindly provided by Marco Raposo, Affymetrix). First, we observed that there was at least one gene name change for 13,699 of the approximately 45,000 probe sets included in the MOE430A/B chips. Many of these changes simply involved probe sets that initially had no gene name and were eventually associated with one. This reflects a general improvement in the functional annotation of the human genome. Other changes could be explained by the use of synonymous gene symbols. However, after checking against a table of synonymous gene symbols that we extracted from the LocusLink gene database, there remained a total of 2277 probe sets with gene name changes that could not be explained by the use of a synonym. This represents about 5% of all probe sets on the chips.
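The filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the analysis: the data model (one dictionary per release mapping a probe set ID to a gene symbol, with `None` for unannotated probe sets), the function names, and the toy identifiers are all hypothetical, and the choice to ignore transitions to or from an empty annotation is an assumption as well.

```python
from typing import Dict, List, Optional, Set

# Hypothetical data model: one dict per release, probe set ID -> gene symbol
# (None when the probe set has no gene name in that release).
Release = Dict[str, Optional[str]]


def same_gene(a: str, b: str, synonyms: Dict[str, Set[str]]) -> bool:
    """True if two symbols are identical or listed as synonyms of each other."""
    return a == b or b in synonyms.get(a, set()) or a in synonyms.get(b, set())


def unexplained_changes(releases: List[Release],
                        synonyms: Dict[str, Set[str]]) -> Set[str]:
    """Probe sets whose gene name changes between consecutive releases in a
    way that is explained neither by a missing name nor by a synonym."""
    flagged: Set[str] = set()
    for old, new in zip(releases, releases[1:]):
        for probe_set, old_name in old.items():
            new_name = new.get(probe_set)
            if old_name is None or new_name is None:
                continue  # assumption: gaining or lacking a name is not counted
            if not same_gene(old_name, new_name, synonyms):
                flagged.add(probe_set)
    return flagged


# Toy data, invented for illustration (not taken from NetAffx).
releases = [
    {"ps_1": None, "ps_2": "GeneA", "ps_3": "GeneB", "ps_4": "GeneC"},
    {"ps_1": "GeneD", "ps_2": "GeneA2", "ps_3": "GeneE", "ps_4": "GeneC"},
]
synonyms = {"GeneA": {"GeneA2"}}

flagged = unexplained_changes(releases, synonyms)

# The proportion quoted in the text: 2277 of roughly 45,000 probe sets.
fraction = 2277 / 45000
```

In the toy data only `ps_3` is flagged: `ps_1` merely gains a name, `ps_2` changes to a listed synonym, and `ps_4` is unchanged. The final line reproduces the arithmetic behind the "about 5%" figure.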
[Table: Number of split and joined probe set pairs between consecutive versions of NetAffx, for the version transitions 1 → 2 through 7 → 8. The counts are missing.]
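One way to count split and joined pairs between two annotation versions is sketched below. The definitions are assumptions made for this illustration: a pair of probe sets is taken to be "split" if both share a gene name in the older version but not in the newer one, and "joined" in the opposite case. The function names and toy identifiers are hypothetical, and only probe sets annotated in both versions are compared.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical data model: probe set ID -> gene name (annotated probe sets only).
Release = Dict[str, str]


def co_annotated_pairs(release: Release) -> Set[FrozenSet[str]]:
    """All unordered pairs of probe sets that carry the same gene name."""
    by_gene = defaultdict(list)
    for probe_set, gene in release.items():
        by_gene[gene].append(probe_set)
    pairs: Set[FrozenSet[str]] = set()
    for members in by_gene.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add(frozenset((a, b)))
    return pairs


def split_and_joined(old: Release, new: Release) -> Tuple[int, int]:
    """Return (split, joined) pair counts between two versions."""
    common = old.keys() & new.keys()  # compare only probe sets present in both
    old_pairs = co_annotated_pairs({p: old[p] for p in common})
    new_pairs = co_annotated_pairs({p: new[p] for p in common})
    split = len(old_pairs - new_pairs)   # together before, apart now
    joined = len(new_pairs - old_pairs)  # apart before, together now
    return split, joined


# Toy data, invented for illustration.
v1 = {"ps_a": "G1", "ps_b": "G1", "ps_c": "G2", "ps_d": "G3"}
v2 = {"ps_a": "G1", "ps_b": "G4", "ps_c": "G2", "ps_d": "G2"}

counts = split_and_joined(v1, v2)
```

Because the comparison is made on pairs rather than on names, a gene that is merely renamed does not produce a split or a join as long as all of its probe sets receive the same new name.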
The design of DNA microarray probe sets is often based on assembled groups of expressed sequences observed as ESTs or cDNAs, which might represent partial transcripts. Additional evidence in the form of new sequences, or even new gene predictions, can modify the preliminary assignment (for example, by revealing that two ESTs that were thought to represent different mRNA transcripts are actually part of the same one). Therefore, information assigned to a probe on the basis of gene predictions (such as a gene name) should be considered non-static and might change over time. Although annotations can be expected to improve as genomic assemblies become more accurate, changes will continue to occur for some time, since a large fraction of genes are still only predicted. Probe sequences constitute the only static information attached to the microarray: this information is inherent to the design of the microarray and will not change over time. This was pointed out in the manuscript that describes the NetAffx annotation files, but currently there is no visible warning or reminder on the Affymetrix website.
Although these facts are implicitly well known in the bioinformatics community, experimental users of microarrays are less aware of the problem, probably because the surprisingly large extent of these changes has not been pointed out before. For example, the recent letter from the Microarray Gene Expression Data Society explains that deposition of microarray data in public databases ensures data persistence, integration, accessibility, and standardization, but it does not mention the problem of variable gene structure. Recent publications analyze the relations between Affymetrix probe sets and gene sequences [6–8], but they do not report the extent to which these relations vary over time, as we have done here. This variation, which could convince many microarray users to send their data to public databases, has not been well publicized.
Depositing microarray data in public databases does much more than make the data public: it makes them genuinely useful to the scientific community. Those databases include the descriptions of the probe sequences and constantly update the non-static information associated with them, thus allowing the data to be re-interpreted and solving the problem we have presented here.
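The practical benefit of keeping probe set identifiers can be illustrated with a minimal sketch. It assumes a result list stored as probe set IDs and two annotation tables held as dictionaries; the names and toy values are hypothetical and do not reflect the format of any particular database.

```python
from typing import Dict, List, Optional, Tuple


def annotate(probe_sets: List[str],
             annotation: Dict[str, Optional[str]]) -> List[Tuple[str, Optional[str]]]:
    """Attach the gene name from a given annotation release to each probe set."""
    return [(ps, annotation.get(ps)) for ps in probe_sets]


# A result list kept as probe set IDs (toy values, invented for illustration).
result = ["ps_1", "ps_2"]

# Two hypothetical annotation releases for the same array design.
release_old = {"ps_1": "GeneA", "ps_2": None}
release_new = {"ps_1": "GeneB", "ps_2": "GeneC"}

as_published = annotate(result, release_old)
re_interpreted = annotate(result, release_new)
```

The same stored result yields a different gene list under each release. Had only the gene names from the older release been reported, the probe sets behind them could not be looked up again under newer annotations.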
The authors are grateful to Marco Raposo (Affymetrix) for providing all NetAffx annotations for the MOE430A/B microarray, to the members of the OGIC's Bioinformatics group for helpful discussions, and in particular to Gareth Palidwor for his work in the maintenance of our computer systems.
- Perez-Iratxeta C, Palidwor G, Porter CG, Sanche NA, Huska MR, Suomela BP, Muro EM, Krzyzanowski P, Hughes E, Campbell PA, Rudnicki MA, Andrade MA: Study of stem cell function using microarray experiments. FEBS Letters 2005, 579: 1795–1801. doi:10.1016/j.febslet.2005.02.020
- Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res 2003, 31: 82–86. doi:10.1093/nar/gkg121
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pontius JU, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2005, 33(Database issue): D39–45. doi:10.1093/nar/gki062
- Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Suzek TO, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 2004, 32(Database issue): D35–40. doi:10.1093/nar/gkh073
- Ball C, Brazma A, Causton H, Chervitz S, Edgar R, Hingamp P, Matese JC, Icahn C, Parkinson H, Quackenbush J, Ringwald M, Sansone SA, Sherlock G, Spellman P, Stoeckert C, Tateno Y, Taylor R, White J, Winegarden N: An open letter on microarray data from the MGED Society. Microbiology 2004, 150: 3522–3524. doi:10.1099/mic.0.27637-0
- Harbig J, Sprinkle R, Enkemann SA: A sequence-based identification of the genes detected by probesets on the Affymetrix U133 plus 2.0 array. Nucleic Acids Res 2005, 33: e31. doi:10.1093/nar/gni027
- Roche FM, Hokamp K, Acab M, Babiuk LA, Hancock RE, Brinkman FS: ProbeLynx: a tool for updating the association of microarray probes to genes. Nucleic Acids Res 2004, 32: W471–4. doi:10.1093/nar/gkh123
- Leong HS, Yates T, Wilson C, Miller CJ: ADAPT: A Database of Affymetrix Probesets and Transcripts. Bioinformatics 2005.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.